Open ww2283 opened 1 year ago
Did you run it in a docker container? If so, make sure that you use the option --gpus all.
Or check whether you have the correct jaxlib installed. The following shell commands might be helpful:
CUDA=11.1.1
pip3 install --upgrade --no-cache-dir jax==0.2.14 \
jaxlib==0.1.69+cuda$(cut -f1,2 -d. <<< ${CUDA} | sed 's/\.//g') \
-f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
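You can also quickly verify whether jax actually sees your GPU. The check below is just a minimal sketch; if the setup is correct, jax.devices() should list a CUDA/GPU device rather than only CPU:
nvidia-smi
python3 -c "import jax; print(jax.devices())"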
Thank you for the information. I solved it, but actually in the opposite direction: my cards are Ada 6000s, so I first had to update CUDA to 11.8, the minimum version that supports Ada-generation cards. Then I updated jax, consulting https://github.com/google/jax/issues/13570. Everything seems to be working, except that memory usage triggers a warning:
Info: input feature directory is af2c_fea
Info: result output directory is af2c_mod
Info: model preset is multimer_np
2023-09-10 17:56:23.661448: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Info: using preset economy
Info: set num_ensemble = 1
Info: set max_recyles = 3
Info: set recycle_tol = 0.1
Info: mas_pairing mode is all
I0910 17:56:24.905803 140173545832832 xla_bridge.py:622] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA
I0910 17:56:24.906221 140173545832832 xla_bridge.py:622] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0910 17:56:25.976805 140173545832832 run_af2c_mod.py:495] Have 2 models: ['model_1_multimer_v3_p1', 'model_3_multimer_v3_p1']
Info: working on target b15g21e1
I0910 17:56:26.978667 140173545832832 run_af2c_mod.py:526] Using random seed 3042885909085202102 for the data pipeline
Info: b15g21e1 found monomer best1_327 msa_depth = 8812, seq_len = 327, num_templ = 6
Info: best1_327 reducing the number of structural templates to 4
Info: b15g21e1 found monomer gad2_86 msa_depth = 34406, seq_len = 500, num_templ = 20
Info: gad2_86 MSA size is too large, reducing to 10000
Info: gad2_86 reducing the number of structural templates to 4
Info: 6 chain(s) to model {'A': 'best1_327_1', 'B': 'best1_327_1', 'C': 'best1_327_1', 'D': 'best1_327_1', 'E': 'best1_327_1', 'F': 'gad2_86_1'}
Info: modeling b15g21e1 with msa_depth = 7491, seq_len = 2135, num_templ = 24
I0910 17:56:28.890576 140173545832832 run_af2c_mod.py:220] Running model model_1_multimer_v3_p1_230910_202102
I0910 17:56:28.890965 140173545832832 model.py:204] Running predict with shape(feat) = {'msa': (7491, 2135), 'bert_mask': (7491, 2135), 'num_alignments': (), 'aatype': (2135,), 'seq_length': (), 'template_aatype': (24, 2135), 'template_all_atom_mask': (24, 2135, 37), 'template_all_atom_positions': (24, 2135, 37, 3), 'all_atom_positions': (2135, 37, 3), 'template_domain_names': (24,), 'asym_id': (2135,), 'sym_id': (2135,), 'entity_id': (2135,), 'residue_index': (2135,), 'deletion_matrix': (7491, 2135), 'seq_mask': (2135,), 'msa_mask': (7491, 2135), 'cluster_bias_mask': (7491,), 'pdb_residue_index': (2135,)}
2023-09-10 17:58:37.133976: W external/xla/xla/service/hlo_rematerialization.cc:2202] Can't reduce memory use below 35.63GiB (38255886336 bytes) by rematerialization; only reduced to 37.00GiB (39730967133 bytes), down from 37.36GiB (40112345117 bytes) originally
This memory warning caused a crash in a previous run, so I consulted oligomer predictions and trimmed the low-confidence regions from the input sequences before feature generation. Is there anything I missed that is causing the large GPU memory usage? I did not think 2135 residues was absurdly large.
You may try reducing the MSA input size, for example to 5000:
--max_mono_msa_depth=5000
Or use fewer structural templates, such as 2, if necessary:
--max_template_hits=2
Also, disable intermediate recycle metric calculations with
--save_recycled=0
If it runs successfully, try more recycles, such as 8 or above, which could give you a better model.
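For example, a single run combining these options could look roughly like the sketch below. Only the three flags above are from this thread; run_af2c_mod.py is the script name taken from your log, and --max_recycles is my guess at the recycle flag, so please verify the names against the script's help and substitute your usual input/output arguments for the placeholder:
# sketch only; <your usual arguments> is a placeholder for your normal target/feature/output flags
python run_af2c_mod.py <your usual arguments> \
  --max_mono_msa_depth=5000 \
  --max_template_hits=2 \
  --save_recycled=0 \
  --max_recycles=8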
Thank you! I can see that with those settings the OOM problem is alleviated. I also set TF_FORCE_UNIFIED_MEMORY=1 so that TensorFlow is, hopefully, not squeezing the VRAM at the same time. I'd like some more information regarding the first two examples. Example 1 uses multimer_np and example 2 uses monomer_ptm for model_preset, yet both work for predicting a complex structure. Does the approach in example 2 reduce the computing resources required, i.e., is it more suitable for folding larger complex structures? Also, I'm curious whether the two ways of prediction generally give the same answer for the same targets. Another question relates to the 'preset' variable: of all the presets, which one is recommended in terms of the ability to catch any possible interactions?
I'm thinking of modifying the script so that the variables that can potentially contribute to different prediction results can be tested sequentially and automatically. Would you mind pointing out a list of variables, including modes, presets, etc., that should be included in a batch test? Thank you
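P.S. To be explicit about the memory setting I mentioned, I simply export it before launching the run. Whether adding XLA_PYTHON_CLIENT_MEM_FRACTION, as the stock AlphaFold docker script does, also helps here is my assumption, not something I have verified:
export TF_FORCE_UNIFIED_MEMORY=1
# assumption: lets XLA spill beyond physical VRAM via unified memory, as in the AlphaFold run script
export XLA_PYTHON_CLIENT_MEM_FRACTION=2.0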
Use the expert preset if you would like to explore different configurations. For complex modeling, try the latest AF2 multimer models (v3) first, which were trained on more complexes and are also computationally more efficient than the previous multimer models. The MSA input is the most important factor; make sure that your sequences have species identifiers added for pairing, if you can find them. Also, try multiple runs, longer recycles, etc. If you know which specific domains interact, modeling those domains instead of the full-length sequences is also a good idea.
For some challenging cases, the odds of getting a good model can be really small, e.g., < 1%. But if you have enough computing resources and keep trying, you could be rewarded with a surprising success.
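If you do script a batch test as you describe, a thin shell loop over the settings is enough. The sketch below is only illustrative: apart from the flags already mentioned in this thread, the --preset and --random_seed names are assumptions (your log does print the preset and a per-run random seed) and should be checked against run_af2c_mod.py:
# sketch: sweep presets and random seeds, keeping other inputs fixed
for preset in economy expert; do
  for seed in 1 2 3; do
    python run_af2c_mod.py <your usual arguments> \
      --preset=${preset} --random_seed=${seed} --save_recycled=0
  done
done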
First, thanks for this great resource! I have run into a problem where my GPU is not utilized. I configured AF2Complex in the same conda environment as AlphaFold. I can run the examples and my own complex predictions with no problem, except that the GPU does not appear to be used.
May I know how to get the GPU into play?