baker-laboratory / RoseTTAFold-All-Atom

CUDA Out of Memory Error Predicting Protein with 1200 Amino Acids #63

Open RichenLee opened 4 months ago

RichenLee commented 4 months ago

Issue Description:

I encountered a CUDA out-of-memory error when trying to predict the structure of a protein containing 1200 amino acids. I did not encounter this issue when running the provided example, which leads me to believe it is not related to my environment configuration. The error message is as follows:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.23 GiB (GPU 0; 23.65 GiB total capacity; 22.00 GiB already allocated; 241.06 MiB free; 22.95 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
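The numbers in this message can be read off directly to judge whether fragmentation is the likely cause. The sketch below is only an illustration: `parse_oom` is a hypothetical helper, not part of PyTorch or RoseTTAFold-All-Atom, and it assumes the message layout shown above. It compares the requested allocation with the memory that is free or reserved by PyTorch but not currently allocated.

```python
import re

# The error message quoted above, copied verbatim.
MESSAGE = (
    "CUDA out of memory. Tried to allocate 1.23 GiB (GPU 0; 23.65 GiB total "
    "capacity; 22.00 GiB already allocated; 241.06 MiB free; 22.95 GiB "
    "reserved in total by PyTorch)"
)

UNIT_TO_GIB = {"MiB": 1.0 / 1024.0, "GiB": 1.0}

# Hypothetical helper: extracts the sizes from a message with this layout.
def parse_oom(message):
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"([\d.]+) (MiB|GiB) total capacity",
        "allocated": r"([\d.]+) (MiB|GiB) already allocated",
        "free": r"([\d.]+) (MiB|GiB) free",
        "reserved": r"([\d.]+) (MiB|GiB) reserved",
    }
    stats = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match is None:
            raise ValueError(f"could not find '{name}' in the message")
        stats[name] = float(match.group(1)) * UNIT_TO_GIB[match.group(2)]
    return stats

stats = parse_oom(MESSAGE)

# Memory held by PyTorch's caching allocator but not backing any live tensor.
cached_unused = stats["reserved"] - stats["allocated"]

# Upper bound on what could be handed out even with no fragmentation at all.
reclaimable = cached_unused + stats["free"]

print(f"requested:     {stats['requested']:.2f} GiB")
print(f"cached unused: {cached_unused:.2f} GiB")
print(f"reclaimable:   {reclaimable:.2f} GiB")
```

Here reserved minus allocated is 0.95 GiB and free memory is about 0.24 GiB, so together they fall short of the 1.23 GiB request. Under that reading, the card is close to genuinely full, and allocator tuning alone would not be expected to help.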

Following the suggestion in the error message, I tried setting max_split_size_mb through the PYTORCH_CUDA_ALLOC_CONF environment variable, with values ranging from 32 to 1024, but this did not resolve the issue. I also tried setting the loader_params.MAXCYCLE parameter to 1, which did not solve the problem either.
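For readers trying the same setting, a minimal sketch of how the variable is usually set from Python follows. `set_alloc_conf` is a hypothetical helper name. The `max_split_size_mb:<value>` format is the one PyTorch documents for this variable; the value has to be in place before PyTorch initializes CUDA, so the simplest approach is to set it before importing torch.

```python
import os

# Hypothetical helper: writes the allocator setting into the environment.
def set_alloc_conf(max_split_size_mb):
    if max_split_size_mb <= 0:
        raise ValueError("max_split_size_mb must be a positive integer")
    value = f"max_split_size_mb:{int(max_split_size_mb)}"
    os.environ["PYTORCH_CUDA_ALLOC_CONF"] = value
    return value

# Call this before `import torch` so the caching allocator picks it up.
conf = set_alloc_conf(128)
print(conf)
```

The same effect can be had by exporting the variable in the shell before launching the prediction script.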

System Information:
Operating System: Ubuntu 20.04
GPU Model: NVIDIA GeForce RTX 4090 24GB
PyTorch Version: 2.0.1
CUDA Version: 11.8

Considering that the provided example runs smoothly without encountering memory issues, I'm curious to know what approximate memory capacity would be necessary to successfully predict the structure of a protein with 1200 amino acids.
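As a rough way to see why sequence length matters so much, the sketch below estimates the size of a single pairwise feature tensor of shape L x L x C. This is an assumption-laden illustration, not a measurement of RoseTTAFold-All-Atom: the channel count of 128 and the float32 storage are placeholders chosen for the example, `pair_tensor_gib` is a hypothetical helper, and a real model holds many such tensors plus intermediate activations.

```python
BYTES_PER_GIB = 2 ** 30

# Hypothetical helper: size in GiB of one L x L x C tensor.
def pair_tensor_gib(length, channels, bytes_per_value=4):
    # bytes_per_value=4 assumes float32 storage.
    return length * length * channels * bytes_per_value / BYTES_PER_GIB

# 128 channels is an assumed placeholder, not a value taken from the model.
long_protein = pair_tensor_gib(1200, 128)
half_length = pair_tensor_gib(600, 128)

print(f"L=1200: {long_protein:.3f} GiB per tensor")
print(f"L=600:  {half_length:.3f} GiB per tensor")
print(f"ratio:  {long_protein / half_length:.1f}x")
```

The point of the sketch is the scaling rather than the absolute number: doubling the length quadruples the size of every tensor of this shape, which is consistent with a short example running fine while a 1200-residue protein does not.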

Gibo-Chun commented 4 months ago

I ran into a similar problem and am also waiting for an answer.

sherryliu987 commented 3 months ago

I am also getting CUDA memory issues, sometimes for quite short sequences (a few hundred residues).

ullahsamee commented 3 months ago

@RichenLee Regarding "what approximate memory capacity would be necessary to successfully predict the structure of a protein with 1200 amino acids": I think this benchmark (NVIDIA V100 vs A800 vs H800) will help you a lot.

[Image: RFAA]

RichenLee commented 3 months ago

> @RichenLee Regarding "what approximate memory capacity would be necessary to successfully predict the structure of a protein with 1200 amino acids": I think this benchmark (NVIDIA V100 vs A800 vs H800) will help you a lot.

Thank you for your suggestion and the information! Based on the benchmarks you provided, it seems that my RTX 4090 alone may not be sufficient for predicting the structure of a protein with 1200 amino acids. I look forward to future technologies that allow memory to be shared across multiple GPUs for such large-scale computational tasks. Thanks again for your help!

he-hai commented 2 months ago

I'm running into the same problem when modeling a homodimer in which each chain has a small substrate and an NAD+, on a 40 GB A100. I still haven't resolved it, but this tutorial, https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html, seems to cover the multi-GPU setup that @RichenLee asked about.
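One caveat about the linked tutorial: data parallelism of this kind copies the whole model onto each GPU and splits the input along the batch dimension, so it does not pool memory across cards. The sketch below is plain Python, not PyTorch code. `scatter_sizes` is a hypothetical helper that mimics how a batch is divided into chunks per device, and the batch size of 1 for a single structure prediction is an assumption.

```python
# Hypothetical helper: chunk sizes when a batch is split across devices,
# using ceil-sized chunks in the same way as chunking along the batch axis.
def scatter_sizes(batch_size, n_devices):
    if n_devices <= 0:
        raise ValueError("n_devices must be positive")
    chunk = -(-batch_size // n_devices)  # ceiling division
    sizes = []
    remaining = batch_size
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take)
        remaining -= take
    return sizes

# A batch of 8 samples on 2 GPUs: each GPU handles 4 samples.
print(scatter_sizes(8, 2))

# A single complex (assumed batch size 1) on 2 GPUs: one GPU still
# receives the entire sample, so its memory requirement is unchanged.
print(scatter_sizes(1, 2))
```

If that assumption holds, a single large complex would still have to fit on one GPU. Fitting it across cards would need some form of model or tensor parallelism, which this thread does not show RoseTTAFold-All-Atom to support.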