MAGICS-LAB / DNABERT_2

[ICLR 2024] DNABERT-2: Efficient Foundation Model and Benchmark for Multi-Species Genome
Apache License 2.0

Pretraining, Pretraining, Pretraining!!! #76

Closed: multydoffer closed this issue 1 month ago

multydoffer commented 3 months ago

Please, please, please release the code for pretraining. I am dying for it.

Zhihan1996 commented 3 months ago

Sorry for the delay in sharing the pretraining code. We used a slightly modified version of the MosaicBERT implementation (https://github.com/mosaicml/examples/tree/main/examples/benchmarks/bert) for DNABERT-2. You should be able to replicate the model training by following the instructions there.

Alternatively, you can use run_mlm.py from https://github.com/huggingface/transformers/tree/main/examples/pytorch/language-modeling and import the BertModelForMaskedLM class from https://huggingface.co/zhihan1996/DNABERT-2-117M/blob/main/bert_layers.py. This should produce a very similar model.

The training data is available here: https://drive.google.com/file/d/1dSXJfwGpDSJ59ry9KAp8SugQLK35V83f/view?usp=sharing
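To make the Hugging Face route concrete, here is a minimal sketch of the same idea using the standard Trainer API. It is not the exact script used for DNABERT-2: the data file `dna_sequences.txt` and the hyperparameters are placeholders, and it assumes the model repo exposes a masked-LM head through `AutoModelForMaskedLM` with `trust_remote_code=True`; if it does not, import the class from bert_layers.py directly as described above.

```python
# Hedged sketch of MLM pretraining with the Hugging Face Trainer.
# Assumptions (not from this thread): "dna_sequences.txt" holds one DNA
# sequence per line, and the DNABERT-2 repo's auto_map provides a
# masked-LM class; hyperparameters below are illustrative only.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

model_name = "zhihan1996/DNABERT-2-117M"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForMaskedLM.from_pretrained(model_name, trust_remote_code=True)

# One DNA sequence per line in a plain-text file (hypothetical path).
dataset = load_dataset("text", data_files={"train": "dna_sequences.txt"})

def tokenize(batch):
    # BPE-tokenize sequences; truncate long ones to a fixed length.
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Standard 15% token masking for masked language modeling.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="dnabert2-mlm",
    per_device_train_batch_size=32,
    learning_rate=5e-4,
    num_train_epochs=1,
    save_steps=10_000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```

Note that this sketch continues training from the released checkpoint; to replicate pretraining from scratch you would initialize the model from its config instead of the pretrained weights, and scale the batch size, learning rate, and training steps to the multi-species corpus linked above.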

multydoffer commented 2 months ago

Thanks a lot!