MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Apache License 2.0

Special token treatment. #81

Closed prwoolley closed 1 month ago

prwoolley commented 2 months ago

By default, the tokenizer adds special tokens to the "input_ids", specifically [CLS] at the beginning and [SEP] at the end of each token array. Was DNABERT-2 trained with these tokens present? If so, has the [CLS] token been used for fine-tuning, as an alternative to mean pooling?

Thanks for the model!

Zhihan1996 commented 1 month ago

Sorry for the late reply!

Yes, DNABERT-2 was trained with these tokens present. By default, it is fine-tuned using the [CLS] token embedding rather than mean pooling.
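
For readers comparing the two options, here is a minimal plain-Python sketch of the difference between [CLS] pooling and mean pooling. It is an illustration only, not the DNABERT-2 code: the helper names `cls_pool` and `mean_pool`, the token strings, and the two-dimensional hidden states are all made up for the example. The mean here is taken over every non-padding position, including [CLS] and [SEP]; excluding the special tokens is an equally plausible choice.

```python
from typing import List

Vector = List[float]


def cls_pool(hidden_states: List[Vector]) -> Vector:
    """Return the embedding at position 0, where the tokenizer places [CLS]."""
    return list(hidden_states[0])


def mean_pool(hidden_states: List[Vector], attention_mask: List[int]) -> Vector:
    """Average the embeddings of all positions whose mask value is 1."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no positions")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy sequence (hypothetical tokens): [CLS], two sequence tokens, [SEP], one pad.
tokens = ["[CLS]", "ACGT", "TTGA", "[SEP]", "[PAD]"]

# Made-up per-token hidden states, one 2-d vector per position.
hidden_states = [
    [1.0, 0.0],  # [CLS]
    [2.0, 4.0],  # ACGT
    [4.0, 2.0],  # TTGA
    [1.0, 2.0],  # [SEP]
    [9.0, 9.0],  # [PAD], ignored by mean_pool via the mask
]
attention_mask = [1, 1, 1, 1, 0]

cls_embedding = cls_pool(hidden_states)                     # [1.0, 0.0]
mean_embedding = mean_pool(hidden_states, attention_mask)   # [2.0, 2.0]
```

Either vector can then be passed to a classification head. With [CLS] pooling the padding never matters because only position 0 is read, whereas mean pooling needs the attention mask so that padded positions do not skew the average.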