gcorso / DiffDock

Implementation of DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking
https://arxiv.org/abs/2210.01776
MIT License

Float division by zero on CUDA memory outage #234

Open Linus-XZX opened 2 weeks ago

Linus-XZX commented 2 weeks ago

After CUDA runs out of memory, the loss calculation and summary fail with a float division by zero error. It looks like the batch-skipping functionality is not entirely working. (The attached screenshot was taken at batch size 16.)

The pip and Conda environments are in the attached pip_env.txt and conda_env.txt.

While using a smaller batch size is a valid workaround, any help will be appreciated here.

Edit: At batch size 4 the skip seems to work properly, but it interferes with validation, possibly because batches of size 1 are being skipped.
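
To illustrate the failure mode described above, here is a minimal sketch in plain Python. It is not DiffDock's actual training code. `RunningAverage`, `run_epoch`, and `fake_step` are hypothetical names, and it assumes that out-of-memory failures surface as a `RuntimeError` whose message contains "out of memory". If every batch in an epoch is skipped, the sample count stays at zero, so the averaging step needs a guard before it divides.

```python
import math


class RunningAverage:
    """Accumulates a weighted sum of loss values and the number of samples seen."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def add(self, value, n=1):
        self.total += value * n
        self.count += n

    def summary(self):
        # Guard: if every batch was skipped, count is 0 and a plain
        # total / count would raise ZeroDivisionError.
        if self.count == 0:
            return math.nan
        return self.total / self.count


def run_epoch(batches, step):
    """Run `step` on each batch, skipping batches that hit an out-of-memory error."""
    meter = RunningAverage()
    skipped = 0
    for batch in batches:
        try:
            loss = step(batch)
        except RuntimeError as err:
            # Assumption: OOM is reported as a RuntimeError with this message.
            if "out of memory" not in str(err):
                raise
            skipped += 1
            continue
        meter.add(loss, n=len(batch))
    return meter.summary(), skipped


def fake_step(batch):
    """Hypothetical stand-in for a training step: fails on batches larger than 2."""
    if len(batch) > 2:
        raise RuntimeError("CUDA out of memory (simulated)")
    return sum(batch) / len(batch)


if __name__ == "__main__":
    # Every batch is too large, so all are skipped and the average is NaN
    # instead of a crash.
    print(run_epoch([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], fake_step))
```

Returning NaN together with the skipped-batch count keeps the epoch summary from crashing while still making it obvious that no batch contributed to the loss. Whether NaN, `None`, or an explicit error is the right signal is a design choice for the maintainers.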