HazyResearch / hyena-dna

Official implementation for HyenaDNA, a long-range genomic foundation model built with Hyena
https://arxiv.org/abs/2306.15794
Apache License 2.0

CUDA out of memory with hyena-1m on A100-80G #15

Closed: pone7 closed this issue 1 year ago

pone7 commented 1 year ago

Hello,

Thank you for sharing such great work! I tried to use hyena-1m on a species classification task on an A100-80G, but got a "CUDA out of memory" error. Here is my training command:

python -m train wandb=null experiment=hg38/species dataset.species=[human,mouse,hippo,pig,lemur] train.global_batch_size=256 optimizer.lr=6e-5 trainer.devices=4 dataset.batch_size=1 dataset.max_length=1000000 dataset.species_dir=/data/species_cls/ model.layer.l_max=1000002 model.d_model=256 model.n_layer=8 trainer.max_epochs=150 decoder.mode=last train.pretrained_model_path=null train.pretrained_model_state_hook=null

I noticed that the A100-80G should be able to train 1m models. Is there anything extra I should be aware of?
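For readers wondering why a batch size of 1 can still exhaust 80 GB, here is a rough back-of-the-envelope sketch. Only the tensor shape (batch size, sequence length, d_model) and the layer count come from the command above; the number of tensors kept per layer for the backward pass (`SAVED_PER_LAYER`) is a hypothetical placeholder, not a measured property of HyenaDNA.

```python
# Rough activation-size arithmetic for the settings in the command above.
# The shape values come from the thread; SAVED_PER_LAYER is an assumption
# chosen only to illustrate how quickly the total grows.

def tensor_bytes(batch: int, seq_len: int, d_model: int, bytes_per_elem: int) -> int:
    """Size in bytes of one dense (batch, seq_len, d_model) tensor."""
    return batch * seq_len * d_model * bytes_per_elem

BATCH = 1             # dataset.batch_size
SEQ_LEN = 1_000_000   # dataset.max_length
D_MODEL = 256         # model.d_model
N_LAYER = 8           # model.n_layer
SAVED_PER_LAYER = 10  # hypothetical number of tensors kept per layer for backward

GIB = 1024 ** 3

fp32_one = tensor_bytes(BATCH, SEQ_LEN, D_MODEL, 4)  # 4 bytes per fp32 element
bf16_one = tensor_bytes(BATCH, SEQ_LEN, D_MODEL, 2)  # 2 bytes per bf16 element
fp32_total = fp32_one * SAVED_PER_LAYER * N_LAYER

print(f"one fp32 activation tensor: {fp32_one / GIB:.2f} GiB")
print(f"one bf16 activation tensor: {bf16_one / GIB:.2f} GiB")
print(f"fp32, {SAVED_PER_LAYER} tensors x {N_LAYER} layers: {fp32_total / GIB:.1f} GiB")
```

A single activation tensor at this sequence length is already close to a gibibyte in fp32, so anything that reduces the number of stored tensors or their element size matters a great deal.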

exnx commented 1 year ago

Thank you!

See the README for gradient checkpointing, which you need to turn on.
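As background on what checkpointing trades off, the following is a toy sketch of the general idea on a chain of scalar layers. It is not the HyenaDNA implementation, and the function names and the 16-layer chain are made up for illustration: instead of storing every layer input for the backward pass, only the input of each segment is stored and the rest is recomputed when needed.

```python
# Toy gradient (activation) checkpointing on a chain of layers y = tanh(w * x).
# Illustrative only: all names here are hypothetical, not from the repository.
import math

def forward_layer(w, x):
    return math.tanh(w * x)

def grad_wrt_input(w, x):
    # d/dx tanh(w * x) = w * (1 - tanh(w * x)^2)
    return w * (1.0 - math.tanh(w * x) ** 2)

def backward_full(weights, x0):
    """Store every layer input during forward, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad *= grad_wrt_input(w, x_in)
    return grad, len(inputs)

def backward_checkpointed(weights, x0, segment):
    """Store only each segment's input; recompute inside a segment on backward."""
    checkpoints = []
    x = x0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    peak_stored = len(checkpoints)
    for seg_idx in reversed(range(len(checkpoints))):
        start = seg_idx * segment
        seg_weights = weights[start:start + segment]
        # Recompute this segment's layer inputs from its checkpoint.
        seg_inputs = []
        x = checkpoints[seg_idx]
        for w in seg_weights:
            seg_inputs.append(x)
            x = forward_layer(w, x)
        # Simple upper bound: all checkpoints plus one recomputed segment.
        peak_stored = max(peak_stored, len(checkpoints) + len(seg_inputs))
        for w, x_in in zip(reversed(seg_weights), reversed(seg_inputs)):
            grad *= grad_wrt_input(w, x_in)
    return grad, peak_stored

weights = [0.9 + 0.01 * i for i in range(16)]
g_full, stored_full = backward_full(weights, 0.5)
g_ckpt, stored_ckpt = backward_checkpointed(weights, 0.5, segment=4)
print(f"full:         grad={g_full:.6g}, values stored={stored_full}")
print(f"checkpointed: grad={g_ckpt:.6g}, peak values stored={stored_ckpt}")
```

Both paths produce the same gradient; the checkpointed one holds fewer values at its peak and pays for that with an extra forward pass per segment, which is why training gets slower but fits in memory.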

pone7 commented 1 year ago

Thanks a lot!

It works now, but the loss goes to NaN after 3 epochs. Is this normal when training from scratch?

exnx commented 1 year ago

You also need to set the precision to bf16, and play with the learning rate.

train.precision=bf16
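One general reason bf16 is often more stable than fp16 is its much wider exponent range, at the cost of fewer mantissa bits. The thread does not say which precision the earlier run used, so the sketch below is only a standalone illustration of the two formats, not a diagnosis. bf16 is simulated here by truncating a float32 to its top 16 bits, which is a simplification (real hardware typically rounds to nearest).

```python
# Compare the range and precision of fp16 and (simulated) bf16 using only struct.
import math
import struct

def to_fp16(x: float) -> float:
    """Round-trip through IEEE half precision; out-of-range values become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)

def to_bf16(x: float) -> float:
    """Simulate bfloat16 by keeping the top 16 bits of the float32 encoding.

    This truncates instead of rounding to nearest, which is a simplification.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

for value in (1.001, 70000.0, 1e30):
    print(f"{value:>10g}  fp16={to_fp16(value)!r:<22} bf16={to_bf16(value)!r}")
```

Values above the fp16 maximum of 65504 overflow to infinity, which can then turn into NaN in later arithmetic, while bf16 keeps them finite but coarser. This is also why a learning-rate adjustment is still worth trying alongside the precision change.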

pone7 commented 1 year ago

I see. BTW, have you tried Hyena on the LRA benchmarks? I went to the safari repo and couldn't find a corresponding config, and setting one up myself didn't work well. Do you have any suggestions?

exnx commented 1 year ago

No, we did not; we focus on DNA here :)

pone7 commented 1 year ago

OK, thanks for your timely response!