chaidiscovery / chai-lab

Chai-1, SOTA model for biomolecular structure prediction
https://www.chaidiscovery.com

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU #17

Open cpplyqz opened 1 week ago

cpplyqz commented 1 week ago

Hello, I really appreciate chai_lab, which you and your team developed. I have tried the server and compared it with AF3, and now I want to deploy it locally. The installation succeeded, but my GPU memory is not sufficient, and I get `torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 100.00 MiB. GPU`. I have four 2080 Ti cards with 10 GB of memory each. Does the model support multi-GPU parallel inference? Can I modify the `cuda:` parameter in predict_structure.py to achieve this? Thanks!!!
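As far as I can tell, inference runs on a single GPU, so four 10 GB cards won't pool their memory into 40 GB. What you can do is pin the process to one specific card before importing torch, so the model at least lands on the GPU you intend. A minimal sketch (the choice of GPU index 2 is hypothetical, and `predict_structure.py` refers to your own entry point):

```python
import os

# Pin this process to one physical GPU *before* torch is imported,
# so PyTorch enumerates only that card, as cuda:0.
# (GPU index "2" is a hypothetical choice -- pick your least-loaded card.)
os.environ["CUDA_VISIBLE_DEVICES"] = "2"


def pick_device():
    """Return the single visible GPU if one is available, else the CPU."""
    import torch  # imported after CUDA_VISIBLE_DEVICES is set

    if torch.cuda.is_available():
        # cuda:0 now refers to the physical GPU selected above.
        return torch.device("cuda:0")
    return torch.device("cpu")
```

Note that this only selects which card is used; it does not reduce the model's memory footprint, so a 10 GB card may still be too small.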

stan1233 commented 1 week ago

I didn't encounter this issue on my RTX 4090.

kimdn commented 1 day ago

I also ran into a similar error:

[Screenshot: CUDA out-of-memory traceback, 2024-09-20 8:28 PM]

This is odd, since I used an A100 with 40 GB of GPU memory:

[Screenshot: GPU configuration, 2024-09-20 8:26 PM]

For comparison, the GPUs mentioned above are a 2080 Ti (11 GB) and an RTX 4090 (24 GB).

Something unclear is going on: either I misused my GPU, or there is some race-condition bug/mismatch.

kimdn commented 1 day ago

OK, at least for me, using the a100_shared partition (which charges me only for the number of GPUs I actually request via sbatch) caused this memory issue.

Using the a100 partition (which charges me for all 8 GPUs) did not cause the GPU memory issue.

I confirmed this with repeated runs using different input files at different times.

[Screenshots, 2024-09-20 9:39 PM]
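A plausible reading of the shared-partition behavior above is that another job was already holding memory on the same physical card. If your scheduler allows it, requesting the GPU without sharing the node avoids this; a sketch of such a submission script (partition name and flags are assumptions about your specific cluster):

```shell
#!/bin/bash
#SBATCH --partition=a100_shared   # partition name from the comment above (site-specific)
#SBATCH --gres=gpu:1              # request one GPU
#SBATCH --exclusive               # do not share the node with other jobs

# Check what is already resident on the card before launching:
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

python predict_structure.py
```

If `--exclusive` is not permitted on the shared partition, checking `nvidia-smi` at job start at least reveals whether a co-scheduled job is already consuming memory.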