Cuda - the memory issue

DmitrySemchonok commented 1 year ago

Dear colleagues,

I hit the following error when running ModelAngelo with a map and a mask, but no protein sequence.

Any ideas?

thanks in advance.

Sincerely, Dmitry

=================================================================
100%|█████████▉| 2.05G/2.05G [38:53<00:00, 923kB/s]
100%|█████████▉| 2.05G/2.05G [38:53<00:00, 957kB/s]
100%|█████████▉| 2.05G/2.05G [38:53<00:00, 985kB/s]
100%|█████████▉| 2.05G/2.05G [38:53<00:00, 970kB/s]
100%|██████████| 2.05G/2.05G [38:53<00:00, 944kB/s]
2022-10-25 at 10:25:29 | INFO | ModelAngelo with args: {'volume_path': 'Runs/007501_ProtCryoSparc3DHomogeneousRefine/extra/cryosparc_P30_J158_007_volume_map.mrc', 'output_dir': 'Runs/007658_ProtModelAngelo/extra', 'mask_path': 'Runs/007626_ProtImportMask/extra/cryosparc_P30_J158_007_volume_mask_refine.mrc', 'device': 'cuda:0', 'config_path': None, 'model_bundle_name': 'original_no_seq', 'model_bundle_path': None, 'pipeline_control': False, 'func': <function main at 0x7f461f121ee0>}
2022-10-25 at 10:25:29 | INFO | Input volume preprocessing with args: {'target_voxel_size': 1.5, 'crop_z': 0, 'bfactor_to_apply': 0, 'auto_mask': False, 'input_path': 'Runs/007501_ProtCryoSparc3DHomogeneousRefine/extra/cryosparc_P30_J158_007_volume_map.mrc', 'output_path': 'Runs/007658_ProtModelAngelo/extra'}
2022-10-25 at 10:25:34 | INFO | Initial C-alpha prediction with args: {'model_checkpoint': 'chkpt.torch', 'bfactor': 0, 'batch_size': 4, 'stride': 16, 'dont_mask_input': True, 'threshold': 0.05, 'save_real_coordinates': False, 'save_cryo_em_grid': False, 'do_nucleotides': False, 'save_backbone_trace': False, 'save_ca_grid': False, 'crop': 6, 'log_dir': '/home/user/Data/Software/scipion3/software/em/modelangelomodels-0.1/hub/checkpoints/model_angelo/original_no_seq/c_alpha', 'map_path': 'Runs/007658_ProtModelAngelo/extra/cryosparc_P30_J158_007_volume_map_fixed.mrc', 'output_path': 'Runs/007658_ProtModelAngelo/extra/see_alpha_output', 'mask_path': 'Runs/007626_ProtImportMask/extra/cryosparc_P30_J158_007_volume_mask_refine.mrc', 'device': 'cuda:0', 'auto_mask': False}
2022-10-25 at 10:25:38 | INFO | Using model file /home/user/Data/Software/scipion3/software/em/modelangelomodels-0.1/hub/checkpoints/model_angelo/original_no_seq/c_alpha/model.py
2022-10-25 at 10:25:38 | INFO | Using checkpoint file /home/user/Data/Software/scipion3/software/em/modelangelomodels-0.1/hub/checkpoints/model_angelo/original_no_seq/c_alpha/chkpt.torch
2022-10-25 at 10:25:46 | INFO | Input structure has shape: (368, 368, 368)
2022-10-25 at 10:25:46 | INFO | Running with these arguments:
2022-10-25 at 10:25:46 | INFO | {'model_checkpoint': 'chkpt.torch', 'bfactor': 0, 'batch_size': 4, 'stride': 16, 'dont_mask_input': True, 'threshold': 0.05, 'save_real_coordinates': False, 'save_cryo_em_grid': False, 'do_nucleotides': False, 'save_backbone_trace': False, 'save_ca_grid': False, 'crop': 6, 'log_dir': '/home/user/Data/Software/scipion3/software/em/modelangelomodels-0.1/hub/checkpoints/model_angelo/original_no_seq/c_alpha', 'map_path': 'Runs/007658_ProtModelAngelo/extra/cryosparc_P30_J158_007_volume_map_fixed.mrc', 'output_path': 'Runs/007658_ProtModelAngelo/extra/see_alpha_output', 'mask_path': 'Runs/007626_ProtImportMask/extra/cryosparc_P30_J158_007_volume_mask_refine.mrc', 'device': 'cuda:0', 'auto_mask': False}
2022-10-25 at 10:25:46 | INFO | Model has these arguments:
2022-10-25 at 10:25:46 | INFO | Namespace(dataset_list='/ssd/see-alpha-phosphorus-unmasked/train.txt', log_dir='/ssd/train_std', valid_dataset_list='/ssd/see-alpha-phosphorus-unmasked/test.txt', validation_ratio=500, checkpoint_ratio=10000, num_steps=400000, box_size=64, batch_size=2, accumulate_grad_steps=1, lr=0.0001, use_cosine_annealing=True, use_focal_loss=True, use_tversky_loss=True, use_dice_loss=False, use_weighted_loss=True, use_backbone_trace_loss=True, debug=False, dont_load=False, image_ratio=100, clip_grad_norm=10.0, weight_decay=0.0, max_noise=0.5, dont_use_data_augmentation=False, positional_encoding_dim=0, use_global_normalization=False, match_model=True)
2022-10-25 at 12:44:18 | INFO | Model prediction done, took 8311.91 seconds for 8000 sliding windows
2022-10-25 at 12:44:18 | INFO | Average time is 1038.989 ms
2022-10-25 at 12:44:19 | INFO | Starting Cα grid to points...
2022-10-25 at 12:44:23 | INFO | Have 61453 Cα points before pruning and 50669 after pruning
2022-10-25 at 12:44:30 | INFO | Finished inference!
2022-10-25 at 12:44:30 | INFO | GNN model refinement round 1 with args: {'num_rounds': 3, 'crop_length': 200, 'repeat_per_residue': 3, 'esm_model': 'esm1b_t33_650M_UR50S', 'aggressive_pruning': False, 'seq_attention_batch_size': 200, 'map': 'Runs/007658_ProtModelAngelo/extra/cryosparc_P30_J158_007_volume_map_fixed.mrc', 'struct': 'Runs/007658_ProtModelAngelo/extra/see_alpha_output/see_alpha_output_ca.cif', 'output_dir': 'Runs/007658_ProtModelAngelo/extra/gnn_output_round_1', 'model_dir': '/home/user/Data/Software/scipion3/software/em/modelangelomodels-0.1/hub/checkpoints/model_angelo/original_no_seq/gnn', 'device': 'cuda:0'}
2022-10-25 at 12:44:33 | INFO | Loaded module from step: 529999
2022-10-25 at 12:44:36 | ERROR | Error in ModelAngelo
Traceback (most recent call last):

  File "/home/user/Data/Software/miniconda/envs/modelangelo-git/bin/model_angelo", line 33, in <module>
    sys.exit(load_entry_point('model-angelo', 'console_scripts', 'model_angelo')())
    │ │ └ <function importlib_load_entry_point at 0x7f472ca5b280>
    │ └
    └ <module 'sys' (built-in)>

  File "/home/user/Data/Software/scipion3/software/em/modelangelo-git/model-angelo/model_angelo/main.py", line 51, in main
    args.func(args)
    │ │ └ Namespace(volume_path='Runs/007501_ProtCryoSparc3DHomogeneousRefine/extra/cryosparc_P30_J158_007_volume_map.mrc', output_dir=...
    │ └ <function main at 0x7f461f121ee0>
    └ Namespace(volume_path='Runs/007501_ProtCryoSparc3DHomogeneousRefine/extra/cryosparc_P30_J158_007_volume_map.mrc', output_dir=...

> File "/home/user/Data/Software/scipion3/software/em/modelangelo-git/model-angelo/model_angelo/apps/build_no_seq.py", line 205, in main
    gnn_output = gnn_no_seq_infer(gnn_infer_args)
    │ └ {'num_rounds': 3, 'crop_length': 200, 'repeat_per_residue': 3, 'esm_model': 'esm1b_t33_650M_UR50S', 'aggressive_pruning': Fal...
    └ <function infer at 0x7f461f121e50>

  File "/home/user/Data/Software/scipion3/software/em/modelangelo-git/model-angelo/model_angelo/gnn/inference_no_seq.py", line 263, in infer
    collated_results = init_empty_collate_results(
    └ <function init_empty_collate_results at 0x7f461f121ca0>

  File "/home/user/Data/Software/scipion3/software/em/modelangelo-git/model-angelo/model_angelo/gnn/inference_no_seq.py", line 112, in init_empty_collate_results
    result["edge_counts"] = torch.zeros(num_residues, num_residues, device=device)
    │ │ │ │ │ └ device(type='cuda', index=0)
    │ │ │ │ └ 50669
    │ │ │ └ 50669
    │ │ └ <built-in method zeros of type object at 0x7f46f3643200>
    │ └ <module 'torch' from '/home/user/Data/Software/miniconda/envs/modelangelo-git/lib/python3.9/site-packages/torch/__init__.py'>
    └ {'counts': tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0')}

RuntimeError: CUDA out of memory. Tried to allocate 9.56 GiB (GPU 0; 10.76 GiB total capacity; 536.01 MiB already allocated; 9.15 GiB free; 622.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Traceback (most recent call last):
  File "/home/user/Data/Software/miniconda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 202, in run
    self._run()
  File "/home/user/Data/Software/miniconda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 253, in _run
    resultFiles = self._runFunc()
  File "/home/user/Data/Software/miniconda/envs/scipion3/lib/python3.8/site-packages/pyworkflow/protocol/protocol.py", line 249, in _runFunc
    return self._func(*self._args)
  File "/home/user/Data/Software/miniconda/envs/scipion3/lib/python3.8/site-packages/modelangelo/protocols/protocol_model_angelo.py", line 148, in predictStep
    raise ChildProcessError("Model angelo has failed: %s. See error log for more details." % line) from None
ChildProcessError: Model angelo has failed: RuntimeError: CUDA out of memory. Tried to allocate 9.56 GiB (GPU 0; 10.76 GiB total capacity; 536.01 MiB already allocated; 9.15 GiB free; 622.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF. See error log for more details.
Protocol failed: Model angelo has failed: RuntimeError: CUDA out of memory. Tried to allocate 9.56 GiB (GPU 0; 10.76 GiB total capacity; 536.01 MiB already allocated; 9.15 GiB free; 622.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF. See error log for more details.

jamaliki commented 1 year ago

Hi Dmitry, this is a known issue. The model was too large and the network ran out of memory. I am currently working on a fix that will be included in the main repository soon. In the meantime, you can use the branch "minimize-memory-usage".

Would you like help in installing this version?

Best, Kiarash.
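
For reference, the size of the failing allocation can be reproduced with a little arithmetic. The traceback shows torch.zeros(num_residues, num_residues) with num_residues = 50669; assuming the tensor uses PyTorch's default float32 dtype (4 bytes per element), a rough Python check looks like this. The helper name dense_matrix_gib is only illustrative and is not part of ModelAngelo.

```python
def dense_matrix_gib(num_rows: int, num_cols: int, bytes_per_element: int = 4) -> float:
    """Size in GiB of a dense matrix (4 bytes per element assumes float32)."""
    return num_rows * num_cols * bytes_per_element / 2**30


num_residues = 50669  # Cα points after pruning, taken from the log
free_gib = 9.15       # "9.15 GiB free", taken from the error message

needed_gib = dense_matrix_gib(num_residues, num_residues)
print(f"edge_counts needs about {needed_gib:.2f} GiB; the GPU has {free_gib:.2f} GiB free")
```

This gives roughly 9.56 GiB, the figure in the error message, which is more than the 9.15 GiB that was free. Because the matrix grows with the square of the number of Cα points, large maps exhaust GPU memory quickly.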

DmitrySemchonok commented 1 year ago

hi Kiarash, @jamaliki

Sure, I'll do all I can :)

Sincerely, Dmitry

jamaliki commented 1 year ago

Navigate to the directory that model-angelo is in, then run the following commands:

git fetch
git checkout minimize-memory-usage
git pull
conda activate model_angelo
python setup.py install

This should update your code so that this does not happen.

DmitrySemchonok commented 1 year ago

hello Kiarash @jamaliki

When I run git fetch, I get the following:

[user@dataanalysisserver1 ~]$ git fetch
fatal: Not a git repository (or any parent up to mount point /home)
Stopping at filesystem boundary (GIT_DISCOVERY_ACROSS_FILESYSTEM not set).
[user@dataanalysisserver1 ~]$

jamaliki commented 1 year ago

Hi Dmitry, you need to be in the model-angelo directory, the one where you originally ran git clone.
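
If it is unclear whether a given directory is the clone, one way to check before running the git commands is to look for a .git entry in the current directory or one of its parents, which is roughly what git itself does. Below is a minimal Python sketch of that check; find_git_root is a hypothetical helper, and unlike git it does not stop at filesystem boundaries.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains .git, or None."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        # .git is normally a directory, but it can be a file (worktrees, submodules)
        if (candidate / ".git").exists():
            return candidate
    return None


root = find_git_root(Path.cwd())
if root is None:
    print("Not inside a git repository; cd into the model-angelo clone first.")
else:
    print(f"Inside the git repository at {root}")
```

Run from the home directory in the failing example, this would report that no repository was found, which matches the "Not a git repository" message from git fetch.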

DmitrySemchonok commented 1 year ago

it seems to work :) thank you!

3dem / model-angelo

Cuda - the memory issue #20