aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0
2.8k stars · 540 forks

Question regarding training time and resources #99

Open Meteord opened 2 years ago

Meteord commented 2 years ago

Hi :),

First of all, thanks for your great work!

I'm trying to train OpenFold (with just 10 Evoformer blocks) on ProteinNet with two Quadro RTX 8000 GPUs.

My loss isn't decreasing much after about 5 epochs, and the results are far from useful:

[loss curve plot]

Now I am wondering whether I can expect the loss to converge sometime in the near future with more training, or whether it seems like I am doing something completely wrong. I'm asking because the original AlphaFold training used substantial hardware (several weeks on about 128 TPUv3 cores), which I don't currently have access to.

Which hardware did you use for training and how long did it take until you got useful results?

Thanks in advance!

gahdritz commented 2 years ago

We're using approx. 45 A100s ATM, and it takes a very long time (weeks) to get good results. Our loss curves look pretty much the same---they all have a rapid initial learning phase followed by an extremely gradual (but steady) decrease in the loss. With just 2 GPUs, it might be worth considering trying to finetune the AlphaFold weights instead of training from scratch.