aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0
2.8k stars · 540 forks

Question regarding training time and resources #99

Open Meteord opened 2 years ago

Meteord commented 2 years ago

Hi :),

First of all, thanks for your great work!

I'm trying to train OpenFold (with just 10 Evoformer blocks) on ProteinNet with two Quadro RTX 8000 GPUs.

My loss isn't decreasing much after about 5 epochs, and the results are far from useful:

[loss curve plot]

Now I am wondering whether I can expect the loss to converge sometime in the near future with more training, or whether it seems like I am doing something completely wrong. I'm asking because the original AlphaFold training used substantial hardware (several weeks on about 128 TPUv3 cores), which I don't currently have access to.

Which hardware did you use for training and how long did it take until you got useful results?

Thanks in advance!

gahdritz commented 2 years ago

We're using approx. 45 A100s ATM, and it takes a very long time (weeks) to get good results. Our loss curves look pretty much the same---they all have a rapid initial learning phase followed by an extremely gradual (but steady) decrease in the loss. With just 2 GPUs, it might be worth considering trying to finetune the AlphaFold weights instead of training from scratch.