[Question] GPU not utilized

ww2283 commented 1 year ago

First, thanks for this great resource! I have run into a problem: my GPU is not being utilized. I set up af2complex in the same conda env as AlphaFold. The examples and my own complex predictions run without problems, except that the GPU does not seem to be used.

Info: input feature directory is af2c_fea
Info: result output directory is af2c_mod
Info: model preset is multimer_np
Info: using preset economy
Info: set num_ensemble = 1
Info: set max_recyles = 3
Info: set recycle_tol = 0.1
Info: mas_pairing mode is all
I0907 11:16:47.981306 140113952493952 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0907 11:16:47.981455 140113952493952 xla_bridge.py:353] Unable to initialize backend 'cuda': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0907 11:16:47.981494 140113952493952 xla_bridge.py:353] Unable to initialize backend 'rocm': module 'jaxlib.xla_extension' has no attribute 'GpuAllocatorConfig'
I0907 11:16:47.981989 140113952493952 xla_bridge.py:353] Unable to initialize backend 'tpu': INVALID_ARGUMENT: TpuPlatform is not available.
I0907 11:16:47.982033 140113952493952 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
W0907 11:16:47.982073 140113952493952 xla_bridge.py:360] No GPU/TPU found, falling back to CPU. (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0907 11:16:48.739086 140113952493952 run_af2c_mod.py:495] Have 2 models: ['model_1_multimer_v3_p1', 'model_3_multimer_v3_p1']
# ...

May I know how to get the GPU into play?

FreshAirTonight commented 1 year ago

Did you run it in a Docker container? If so, make sure that you use the option --gpus all. Otherwise, check whether you have the correct jaxlib installed. The following shell command might be helpful:

CUDA=11.1.1
pip3 install --upgrade --no-cache-dir jax==0.2.14 \
      jaxlib==0.1.69+cuda$(cut -f1,2 -d. <<< ${CUDA} | sed 's/\.//g') \
      -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
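
The `$(cut ... | sed ...)` substitution in the command above only turns CUDA=11.1.1 into the suffix 111, so the pinned wheel is jaxlib==0.1.69+cuda111. As a rough sketch, the same mapping can be written in Python, together with a check of which backend JAX actually selected. The helper names below are made up for illustration, and not every CUDA/jaxlib combination produced this way exists as a published wheel. `jax.default_backend()` is a standard JAX call; the check returns None if JAX cannot be imported or initialized.

```python
def cuda_tag(cuda_version: str) -> str:
    # Keep only major and minor, then drop the dot: "11.1.1" -> "111".
    major_minor = cuda_version.split(".")[:2]
    return "".join(major_minor)


def jaxlib_requirement(cuda_version: str, jaxlib_version: str = "0.1.69") -> str:
    # Hypothetical helper: builds the pip requirement string used above.
    return f"jaxlib=={jaxlib_version}+cuda{cuda_tag(cuda_version)}"


def jax_backend():
    # Returns "gpu", "cpu", etc., or None if JAX is missing or broken.
    try:
        import jax
        return jax.default_backend()
    except Exception:
        return None


if __name__ == "__main__":
    print(jaxlib_requirement("11.1.1"))
    print("JAX backend:", jax_backend())
```

If the second line reports cpu (or None) on a machine with an NVIDIA card, the installed jaxlib most likely does not match the local CUDA setup, which is consistent with the "falling back to CPU" warning in the log.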

ww2283 commented 1 year ago

Thank you for the information. I solved it, but actually in the opposite direction: my cards are Ada 6000, so I first had to update CUDA to 11.8, which is the minimum version that supports Ada-generation cards. Then I updated jax following https://github.com/google/jax/issues/13570. Everything seems to be working, except for a memory usage warning:

Info: input feature directory is af2c_fea
Info: result output directory is af2c_mod
Info: model preset is multimer_np
2023-09-10 17:56:23.661448: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Info: using preset economy
Info: set num_ensemble = 1
Info: set max_recyles = 3
Info: set recycle_tol = 0.1
Info: mas_pairing mode is all
I0910 17:56:24.905803 140173545832832 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0910 17:56:24.906221 140173545832832 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0910 17:56:25.976805 140173545832832 run_af2c_mod.py:495] Have 2 models: ['model_1_multimer_v3_p1', 'model_3_multimer_v3_p1']
Info: working on target b15g21e1
I0910 17:56:26.978667 140173545832832 run_af2c_mod.py:526] Using random seed 3042885909085202102 for the data pipeline
Info: b15g21e1 found monomer best1_327 msa_depth = 8812, seq_len = 327, num_templ = 6
Info: best1_327 reducing the number of structural templates to 4
Info: b15g21e1 found monomer gad2_86 msa_depth = 34406, seq_len = 500, num_templ = 20
Info: gad2_86 MSA size is too large, reducing to 10000
Info: gad2_86 reducing the number of structural templates to 4
Info: 6 chain(s) to model {'A': 'best1_327_1', 'B': 'best1_327_1', 'C': 'best1_327_1', 'D': 'best1_327_1', 'E': 'best1_327_1', 'F': 'gad2_86_1'}
Info: modeling b15g21e1 with msa_depth = 7491, seq_len = 2135, num_templ = 24
I0910 17:56:28.890576 140173545832832 run_af2c_mod.py:220] Running model model_1_multimer_v3_p1_230910_202102
I0910 17:56:28.890965 140173545832832 model.py:204] Running predict with shape(feat) = {'msa': (7491, 2135), 'bert_mask': (7491, 2135), 'num_alignments': (), 'aatype': (2135,), 'seq_length': (), 'template_aatype': (24, 2135), 'template_all_atom_mask': (24, 2135, 37), 'template_all_atom_positions': (24, 2135, 37, 3), 'all_atom_positions': (2135, 37, 3), 'template_domain_names': (24,), 'asym_id': (2135,), 'sym_id': (2135,), 'entity_id': (2135,), 'residue_index': (2135,), 'deletion_matrix': (7491, 2135), 'seq_mask': (2135,), 'msa_mask': (7491, 2135), 'cluster_bias_mask': (7491,), 'pdb_residue_index': (2135,)}
2023-09-10 17:58:37.133976: W external/xla/xla/service/hlo_rematerialization.cc:2202] Can't reduce memory use below 35.63GiB (38255886336 bytes) by rematerialization; only reduced to 37.00GiB (39730967133 bytes), down from 37.36GiB (40112345117 bytes) originally

This memory warning caused a crash in a previous run, so I consulted oligomer predictions and trimmed the low-confidence regions from the input sequence before feature generation. Is there anything I missed that causes the large GPU memory usage? I thought 2135 residues was not absurdly large.

FreshAirTonight commented 1 year ago

You may try reducing the MSA input size, for example to 5000:

--max_mono_msa_depth=5000

Or use fewer structural templates, such as 2, if necessary: --max_template_hits=2

Also, disable intermediate recycle metric calculations with --save_recycled=0

If it runs successfully, try more recycles, such as 8 or above, which could give you a better model.
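
The three memory-related options suggested above can be collected in one place when the run is launched from a wrapper script. The following is only a sketch: the flag names and values are the ones quoted in this reply and the script name comes from the log, but the helper `memory_saving_flags` is hypothetical and the other arguments the script requires (feature directory, output directory, targets, and so on) are deliberately left out.

```python
import shlex


def memory_saving_flags(max_mono_msa_depth=5000, max_template_hits=2, save_recycled=0):
    # Hypothetical helper: returns the three flags discussed in this thread.
    return [
        f"--max_mono_msa_depth={max_mono_msa_depth}",
        f"--max_template_hits={max_template_hits}",
        f"--save_recycled={save_recycled}",
    ]


# Required arguments of run_af2c_mod.py are omitted here on purpose.
base_cmd = ["python", "run_af2c_mod.py"]
cmd = base_cmd + memory_saving_flags()

if __name__ == "__main__":
    # Print the command for inspection instead of executing it.
    print(shlex.join(cmd))
```

Keeping the values as function arguments makes it easy to loosen them again (for example, a deeper MSA) once a run fits in memory.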

ww2283 commented 1 year ago

Thank you! I can see that with those settings the OOM problem is alleviated. I also set TF_FORCE_UNIFIED_MEMORY=1 so that, hopefully, TF is not squeezing the VRAM at the same time. I'd like some more information regarding the first two examples. Example 1 used multimer_np and example 2 used monomer_ptm for model_preset, yet both work for predicting a complex structure. Does the usage in example 2 reduce the computing resources needed, i.e. is it suitable for folding larger complex structures? Also, I'm curious: do the two ways of prediction generally give the same answer for the same targets? Another question concerns the 'preset' variable: of all the presets, which one is recommended in terms of the ability to catch any possible interactions?
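
On the TF_FORCE_UNIFIED_MEMORY setting mentioned above: in the AlphaFold Docker launcher this variable is set to 1 together with XLA_PYTHON_CLIENT_MEM_FRACTION=4.0, which is described there as enabling unified memory so that allocations can exceed physical VRAM and spill into host RAM. A minimal sketch of setting both from Python follows. It assumes the variables are set before JAX initializes its backend; the helper name is hypothetical, and the value 4.0 is borrowed from AlphaFold rather than from this thread.

```python
import os


def enable_unified_memory(mem_fraction="4.0"):
    # Must run before JAX/XLA initializes the GPU backend to have any effect.
    os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"
    # Assumption: value taken from AlphaFold's Docker launcher, not from this thread.
    os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = mem_fraction
    return {
        name: os.environ[name]
        for name in ("TF_FORCE_UNIFIED_MEMORY", "XLA_PYTHON_CLIENT_MEM_FRACTION")
    }


if __name__ == "__main__":
    print(enable_unified_memory())
```

Spilling into host memory can keep a large target from crashing, though it is likely to be noticeably slower than a run that fits entirely in VRAM.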

ww2283 commented 1 year ago

I'm thinking of modifying the script so that the variables that could lead to different prediction results can be tested sequentially and automatically. Would you mind pointing out a list of variables, including modes, presets, etc., that should be included in a batch test? Thank you.

FreshAirTonight commented 1 year ago

Use the expert preset if you would like to explore different configurations. For complex modeling, try the latest AF2 multimer models (v3) first, which were trained on more complexes and are also computationally more efficient than previous multimer models. The MSA input is the most important factor, so make sure that your sequences have species specifiers added for pairing, if you can find them. Also, try multiple runs, longer recycles, etc. If you know the specific domains that interact, modeling those domains instead of the full-length sequences is also a good idea.

For some challenging cases, the odds of getting a good model could be really small, like < 1%. But if you have enough computing resources and keep trying, you could be rewarded with a surprising success.
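
For the batch test asked about earlier in the thread, one simple approach is to enumerate every combination of the settings mentioned here and run them one after another. The sketch below only builds the list of configurations; it does not call af2complex. The values for model_preset, preset, and max_mono_msa_depth are the ones that appear in this thread, while the key "recycles" is a hypothetical placeholder name, and whether each preset honors every setting is not confirmed here.

```python
from itertools import product

# Values taken from this thread; "recycles" is a hypothetical key name.
SEARCH_SPACE = {
    "model_preset": ["multimer_np", "monomer_ptm"],
    "preset": ["economy", "expert"],
    "max_mono_msa_depth": [5000, 10000],
    "recycles": [3, 8],
}


def config_grid(space):
    # Cartesian product of all option lists, one dict per combination.
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


grid = config_grid(SEARCH_SPACE)

if __name__ == "__main__":
    print(len(grid), "configurations")
    for config in grid[:3]:
        print(config)
```

Since good models can be rare for hard targets, repeating each configuration with several random seeds is probably as important as widening the grid itself.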

FreshAirTonight / af2complex

[Question] GPU not utilized #22