aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

OOM after reducing number of evoformer blocks #67

Open lhatsk opened 2 years ago

lhatsk commented 2 years ago

Just hoping someone has an idea of what might be going on, because I'm completely puzzled by this behavior.

After reducing "no_blocks" from 48 to 12, I get an OOM error. It's the only thing I change between runs. The network is in FP32.

The OOM occurs in `self.optimizer.backward(loss)`:

RuntimeError: CUDA out of memory. Tried to allocate 4.50 GiB (GPU 0; 31.75 GiB total capacity; 23.94 GiB already allocated; 1.81 GiB free; 28.35 GiB reserved in total by PyTorch)
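
For reference, this is the kind of override I mean (a minimal sketch, assuming OpenFold's `model_config()` helper and an `evoformer_stack.no_blocks` entry in the config tree; the preset name and exact paths may differ between versions):

```python
# Minimal sketch of the change, not a verbatim copy of my setup.
# Assumes OpenFold's model_config() helper and a config tree exposing
# model.evoformer_stack.no_blocks; the preset name is just an example.
from openfold.config import model_config

config = model_config("model_1")              # example preset
config.model.evoformer_stack.no_blocks = 12   # reduced from the default 48
# everything else, including FP32 precision, is left untouched between runs
```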

gahdritz commented 2 years ago

Are you saying that it didn't OOM with the full 48 blocks?

lhatsk commented 2 years ago

Yes

gahdritz commented 2 years ago

Is this the 256 or 384 setting?

lhatsk commented 2 years ago

384, but I kept "max_extra_msa": 1024 and "max_msa_clusters": 128 because everything else is too expensive. With 256 it works fine.

I guess reducing no_blocks won't actually have a big effect because we are checkpointing anyway, right? Still, the memory increase is weird.
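
To spell out why I'd expect checkpointing to dampen the effect of no_blocks, here's a toy sketch with plain `torch.utils.checkpoint` (not OpenFold's own wrapper): each block's internal activations are recomputed during backward, so what scales with the number of blocks is mostly the saved block inputs, not the large intermediates.

```python
# Toy illustration (not OpenFold code): with activation checkpointing, only the
# inputs to each checkpointed block are saved for backward, and the block's
# internal activations are recomputed, so peak memory is roughly
# "one block's activations + N saved inputs" rather than "N blocks' activations".
import torch
from torch.utils.checkpoint import checkpoint

blocks = torch.nn.ModuleList(
    [torch.nn.Linear(384, 384) for _ in range(12)]  # stand-in for Evoformer blocks
).cuda()
x = torch.randn(128, 384, device="cuda", requires_grad=True)

for block in blocks:
    x = checkpoint(block, x)  # stores x, recomputes block(x) during backward

x.sum().backward()
print(torch.cuda.max_memory_allocated() / 2**20, "MiB peak")
```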

gahdritz commented 2 years ago

Does torch allocate more total memory when you run it with 48 blocks? Sometimes torch seems to opportunistically reserve less than it ultimately ends up needing, and my hunch is that memory fragmentation is to blame here: your traceback shows about 4.4 GiB reserved but not allocated, plus 1.8 GiB free, yet apparently no single region large enough for the 4.5 GiB request.
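
Something like this, run right before the backward call in both configurations, would show the gap between what the allocator has reserved and what is actually in use (a big gap, plus many small free segments in the summary, is the usual fragmentation signature):

```python
import torch

# Compare these numbers between the 12-block and 48-block runs, just before
# loss.backward(). A large reserved-minus-allocated gap suggests the caching
# allocator is holding fragmented memory it can't reuse for a big allocation.
print(f"allocated: {torch.cuda.memory_allocated() / 2**30:.2f} GiB")
print(f"reserved:  {torch.cuda.memory_reserved() / 2**30:.2f} GiB")
print(torch.cuda.memory_summary(abbreviated=True))
```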

lhatsk commented 2 years ago

I will check. There might also be a regression between pytorch 1.9 and 1.10: https://github.com/pytorch/pytorch/issues/67680

lhatsk commented 2 years ago

Yep, the 48-block network allocates roughly 1 GB more. Downgrading to pytorch 1.9.1 didn't help.
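
If fragmentation really is the problem, one more thing I can try is capping the caching allocator's split size (exposed via the `PYTORCH_CUDA_ALLOC_CONF` environment variable in recent PyTorch releases; untested on my side, and the 128 MB value below is just an example):

```python
import os

# Must be set before the first CUDA allocation (safest: before importing torch).
# Available in recent PyTorch releases; the value is an example, not a recommendation.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:128"

import torch  # noqa: E402  (imported after setting the env var)
```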

gahdritz commented 2 years ago

All I can say on this for now is that we're working on more memory-efficient attention. In principle, there's no reason why we shouldn't be able to get it as efficient as AlphaFold's.
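
For the curious, the general idea is the standard chunking trick: never materialize the full attention matrix at once, at the cost of some extra overhead. A generic sketch (not our actual implementation) looks like this:

```python
import torch

def chunked_attention(q, k, v, chunk_size=256):
    """Generic query-chunked attention sketch; not OpenFold's implementation.

    Processes queries in chunks of `chunk_size` so only a
    [chunk_size, num_keys] slice of the attention matrix exists at a time.
    """
    scale = q.shape[-1] ** -0.5
    out = []
    for start in range(0, q.shape[-2], chunk_size):
        q_chunk = q[..., start:start + chunk_size, :]
        attn = torch.softmax((q_chunk @ k.transpose(-1, -2)) * scale, dim=-1)
        out.append(attn @ v)
    return torch.cat(out, dim=-2)

# Example: 4096 queries/keys, head dim 64.
q = k = v = torch.randn(1, 4096, 64)
out = chunked_attention(q, k, v)
```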