Bug description
When I use ESMFold, adding the "--cpu-offload" flag does not solve the out-of-memory problem for long sequences. My GPU is an A100 32GB. Please help me. The error output is as follows:
22/11/10 13:56:33 | INFO | root | Reading sequences from hipAB.fasta
22/11/10 13:56:33 | INFO | root | Loaded 5 sequences from hipAB.fasta
22/11/10 13:56:33 | INFO | root | Loading model
22/11/10 13:57:23 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /tmp/tmp4ekwnkqe
22/11/10 13:57:23 | INFO | torch.distributed.nn.jit.instantiator | Writing /tmp/tmp4ekwnkqe/_remote_module_non_scriptable.py
22/11/10 13:57:23 | INFO | torch.distributed.distributed_c10d | Added key: store_based_barrier_key:1 to store for rank: 0
22/11/10 13:57:23 | INFO | torch.distributed.distributed_c10d | Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
22/11/10 13:57:26 | INFO | root | Starting Predictions
22/11/10 13:59:12 | INFO | root | Predicted structure for hipAB-750 with length 765, pLDDT 51.2, pTM 0.264 in 105.6s. 1 / 5 completed.
22/11/10 13:59:15 | INFO | root | Failed (CUDA out of memory) on sequence hipAB-900 of length 918.
22/11/10 13:59:17 | INFO | root | Failed (CUDA out of memory) on sequence hipAB-1050 of length 1071.
22/11/10 13:59:19 | INFO | root | Failed (CUDA out of memory) on sequence hipAB-1200 of length 1224.
22/11/10 13:59:20 | INFO | root | Failed (CUDA out of memory) on sequence hipAB-1350 of length 1377.
/home/houj21/miniconda3/envs/esm/lib/python3.7/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py:930: UserWarning: Module is put on CPU and will thus have flattening and sharding run on CPU, which is less efficient than on GPU. We recommend passing in device_id argument which will enable FSDP to put module on GPU device, module must also be on GPU device to work with sync_module_states=True flag which requires GPU communication.
"Module is put on CPU and will thus have flattening and sharding"