MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Apache License 2.0

Is it necessary to train a specific BPE tokenizer on our own datasets? #88

Closed amssljc closed 1 month ago

Zhihan1996 commented 1 month ago

In most cases, if your data distribution is not very different from our training data (reference genomes from GenBank), using the BPE tokenizer should be fine.
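To make the trade-off concrete, here is a minimal, self-contained sketch of what training a custom BPE tokenizer on DNA sequences would involve (in practice one would use a library such as HuggingFace `tokenizers`; this toy implementation and its function names are illustrative, not part of DNABERT-2):

```python
from collections import Counter

def train_bpe(sequences, num_merges):
    """Learn BPE merge rules from raw DNA strings (A/C/G/T).

    Each sequence starts as single-character tokens; each step
    merges the most frequent adjacent token pair in the corpus.
    """
    corpus = Counter(tuple(seq) for seq in sequences)
    merges = []
    for _ in range(num_merges):
        # Count adjacent token pairs, weighted by sequence frequency.
        pairs = Counter()
        for toks, freq in corpus.items():
            for a, b in zip(toks, toks[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = best[0] + best[1]
        # Rewrite every sequence with the new merge applied.
        new_corpus = Counter()
        for toks, freq in corpus.items():
            out, i = [], 0
            while i < len(toks):
                if i + 1 < len(toks) and (toks[i], toks[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(toks[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

def tokenize(seq, merges):
    """Apply learned merges, in order, to a new sequence."""
    toks = list(seq)
    for a, b in merges:
        out, i = [], 0
        while i < len(toks):
            if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(toks[i])
                i += 1
        toks = out
    return toks

merges = train_bpe(["ACGTACGT", "ACGTTT"], num_merges=2)
print(tokenize("ACGT", merges))  # → ['ACG', 'T']
```

The point of the answer above is that this retraining step is usually unnecessary: if your genomes resemble the GenBank reference genomes the released tokenizer was trained on, the learned merges already match your data's subword statistics.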