MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Apache License 2.0

Pre-training Data #51

Closed leannmlindsey closed 6 months ago

leannmlindsey commented 8 months ago

In the paper you state, "In order to facilitate further research on large-scale genome foundational models, we have collated and made available multi-species genome datasets for both pre-training of models (Sec. 4.1) and benchmarking (Sec. 4.2)."

but I cannot find these datasets anywhere; I have looked on both Hugging Face and your GitHub.

Have I overlooked them somewhere?

emerson-h commented 7 months ago

I would also be interested in the dataset that is mentioned as released in the paper.

Zhihan1996 commented 6 months ago

Sorry for the confusion. I will organize the data and share it soon. I will let you know when it's ready.