HeliXonProtein / OmegaFold

OmegaFold Release Code
Apache License 2.0

Issues with running model2 #62

Open s-kyungyong opened 1 year ago

s-kyungyong commented 1 year ago

Hi!

It looks like the run was killed due to issues with GPU memory usage when model 2 is used. However, the same input sequence runs fine with model 1. Do you have any clues?


 nvidia-smi
Thu Feb 23 16:38:21 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA TITAN RTX    On   | 00000000:1A:00.0 Off |                  N/A |
| 41%   39C    P2    71W / 280W |   7148MiB / 24220MiB |      2%      Default |
|                               |                      |                  N/A |

omegafold --num_cycle 1 --model 1 gene_5088_NI907.fasta test4
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Finished prediction in 23.76 seconds.
INFO:root:Saving prediction to test4/gene_5088_NI907.pdb
INFO:root:Saved
INFO:root:Done!

omegafold --num_cycle 1 --model 2 gene_5088_NI907.fasta test5
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
Killed
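
For what it's worth, a bare "Killed" with no Python traceback usually means the process was terminated by the Linux OOM killer (host RAM exhaustion) rather than by a CUDA allocation failure, especially since it happened during model construction. This is a generic check, not anything OmegaFold-specific, and reading the kernel log may require elevated permissions:

# Look for OOM-killer entries around the time of the failed run
dmesg -T | grep -i -E "killed process|out of memory" | tail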

Using a better GPU didn't help.


 nvidia-smi
Thu Feb 23 16:39:10 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A40          Off  | 00000000:41:00.0 Off |                    0 |
|  0%   32C    P8    30W / 300W |     23MiB / 45634MiB |      0%      Default |
|                               |                      |                  N/A |

omegafold --num_cycle 1 --model 1 gene_5088_NI907.fasta test10
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Finished prediction in 12.72 seconds.
INFO:root:Saving prediction to test10/gene_5088_NI907.pdb
INFO:root:Saved
INFO:root:Done!

omegafold --num_cycle 1 --model 2 gene_5088_NI907.fasta test11
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Failed to generate test11/gene_5088_NI907.pdb due to CUDA out of memory. Tried to allocate 10.67 GiB (GPU 0; 44.56 GiB total capacity; 32.65 GiB already allocated; 9.25 GiB free; 33.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!

Using --subbatch_size 1 also didn't help:

omegafold --subbatch_size 1 --num_cycle 1 --model 2 gene_5088_NI907.fasta test11
INFO:root:Loading weights from /global/scratch/users/skyungyong/omegafold/omegafold_ckpt/model2.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading gene_5088_NI907.fasta
INFO:root:Predicting 1th chain in gene_5088_NI907.fasta
INFO:root:365 residues in this chain.
INFO:root:Failed to generate test11/gene_5088_NI907.pdb due to CUDA out of memory. Tried to allocate 10.67 GiB (GPU 0; 44.56 GiB total capacity; 32.65 GiB already allocated; 9.25 GiB free; 33.13 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!
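
The error text itself suggests trying max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. Since reserved (33.13 GiB) and allocated (32.65 GiB) memory are nearly equal here, fragmentation is probably not the root cause, but it may still be worth a try; the 128 below is only an illustrative value:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 omegafold --subbatch_size 1 --num_cycle 1 --model 2 gene_5088_NI907.fasta test11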

Thanks!

bzhousd commented 1 year ago

I got the same error:

INFO:root:Failed to generate my_output4/ranked_0.pdb due to CUDA out of memory. Tried to allocate 7.80 GiB (GPU 0; 31.75 GiB total capacity; 24.66 GiB already allocated; 5.82 GiB free; 24.92 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

My command is: omegafold a.fa my_output4 --model 2 --subbatch_size 1 --num_cycle 1

Model 1 works just fine, and the length of my sequence is 311. Any suggestions? Thanks.

Edit: the OOM message was printed from within RecycleEmbedder in my case; however, I couldn't find anywhere that this class uses subbatch_size.
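
If RecycleEmbedder really does ignore subbatch_size, its peak memory use would be unaffected by that flag. One generic way to confirm where usage climbs is to log GPU memory alongside the run (plain nvidia-smi polling, nothing OmegaFold-specific):

# Sample GPU memory once per second into a log file while the job runs
nvidia-smi --query-gpu=timestamp,memory.used --format=csv -l 1 > gpu_mem.log &
omegafold a.fa my_output4 --model 2 --subbatch_size 1 --num_cycle 1
kill %1   # stop the background sampler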