run_multimer_jobs issue

J-Held commented 4 months ago

I am trying to run the run_multimer_jobs script on GPU using this command:

run_multimer_jobs.py \ --mode=all_vs_all \ --num_cycle=3 \ --num_predictions_per_model=1 \ --output_path=/storage/home/jbh249/scratch/output/models/ --data_dir=/storage/home/jbh249/scratch/alphaDatabase/ \ --protein_lists=/storage/home/jbh249/scratch/candidates.txt \ --monomer_objects_dir=/storage/home/jbh249/scratch/output/features

The job terminates almost immediately with this error:

/storage/home/jbh249/micromamba/envs/AlphaPulldown/lib/python3.10/site-packages/Bio/Data/SCOPData.py:18: BiopythonDeprecationWarning: The 'Bio.Data.SCOPData' module will be deprecated in a future release of Biopython in favor of 'Bio.Data.PDBData.
  warnings.warn(
2024-05-20 16:09:31.137214: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-05-20 16:09:35.260966: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I0520 16:09:35.261087 23134891243328 utils.py:271] checking if output_dir exists /storage/home/jbh249/scratch/output/models/
Traceback (most recent call last):
  File "/storage/home/jbh249/micromamba/envs/AlphaPulldown/bin/run_multimer_jobs.py", line 462, in <module>
    app.run(main)
  File "/storage/home/jbh249/micromamba/envs/AlphaPulldown/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/storage/home/jbh249/micromamba/envs/AlphaPulldown/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/storage/home/jbh249/micromamba/envs/AlphaPulldown/bin/run_multimer_jobs.py", line 437, in main
    all_proteins = read_all_proteins(FLAGS.protein_lists[0])
TypeError: 'NoneType' object is not subscriptable

Qrouger commented 4 months ago

Hi @J-Held, the first part of your log says the GPU can't be used because of a problem with your TensorRT installation. But the script isn't crashing because of that; it is probably crashing because of your command. Watch your backslashes. Personally, I prefer to write the command on a single line, with one space between arguments, to avoid typing errors. Like this:

run_multimer_jobs.py --mode=all_vs_all --num_cycle=3 --num_predictions_per_model=1 --output_path=/storage/home/jbh249/scratch/output/models/ --data_dir=/storage/home/jbh249/scratch/alphaDatabase/ --protein_lists=/storage/home/jbh249/scratch/candidates.txt --monomer_objects_dir=/storage/home/jbh249/scratch/output/features

Quentin
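
One way to sidestep backslash continuations entirely is to assemble the command as a list of arguments and print it as a single line. The sketch below is only an illustration: `build_command` is a hypothetical helper, not part of AlphaPulldown, and the flag values are copied from the command above.

```python
import shlex


def build_command(script, options):
    """Return an argv list: the script name plus one --key=value entry per option."""
    argv = [script]
    for key, value in options.items():
        argv.append(f"--{key}={value}")
    return argv


# Flag names and values taken from the command in this thread.
options = {
    "mode": "all_vs_all",
    "num_cycle": 3,
    "num_predictions_per_model": 1,
    "output_path": "/storage/home/jbh249/scratch/output/models/",
    "data_dir": "/storage/home/jbh249/scratch/alphaDatabase/",
    "protein_lists": "/storage/home/jbh249/scratch/candidates.txt",
    "monomer_objects_dir": "/storage/home/jbh249/scratch/output/features",
}

argv = build_command("run_multimer_jobs.py", options)

# shlex.join produces one shell-safe line with no continuation characters.
command_line = shlex.join(argv)
print(command_line)
```

The printed line can be pasted into a job script as-is. Alternatively, `subprocess.run(argv, check=True)` launches the script from Python without going through the shell, so line continuations never come into play.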

dingquanyu commented 4 months ago

Hi @J-Held

I agree with @Qrouger's suggestion. Your command was most likely not formatted correctly, so --protein_lists never reached the script. That is why FLAGS.protein_lists is None in the traceback and indexing it with [0] raises the TypeError. What you wrote after the \ is not parsed at all.

Yours Dingquan
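
To illustrate the failure mode described above, here is a minimal sketch that uses the standard library's argparse as a stand-in for the absl flags the real script uses. `parse_flags` and `first_protein_list` are hypothetical names, and only a few of the flags are modelled. It shows that a flag which never arrives stays None, that indexing None reproduces the TypeError from the traceback, and that an explicit check gives a clearer message.

```python
import argparse


def parse_flags(argv):
    """Parse a small subset of the flags; argparse stands in for absl here."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--mode")
    parser.add_argument("--output_path")
    # Comma-separated list of files; stays None when the flag is absent.
    parser.add_argument("--protein_lists", type=lambda s: s.split(","), default=None)
    return parser.parse_args(argv)


def first_protein_list(flags):
    """Return the first protein list, failing with a readable message if missing."""
    if flags.protein_lists is None:
        raise ValueError("--protein_lists was not received; check the command line")
    return flags.protein_lists[0]


complete = parse_flags([
    "--mode=all_vs_all",
    "--output_path=/storage/home/jbh249/scratch/output/models/",
    "--protein_lists=/storage/home/jbh249/scratch/candidates.txt",
])

# Simulates a command that was cut off before --protein_lists.
truncated = parse_flags([
    "--mode=all_vs_all",
    "--output_path=/storage/home/jbh249/scratch/output/models/",
])

print(first_protein_list(complete))

try:
    truncated.protein_lists[0]  # same pattern as FLAGS.protein_lists[0]
except TypeError as err:
    print(f"unguarded access: {err}")

try:
    first_protein_list(truncated)
except ValueError as err:
    print(f"guarded access: {err}")
```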

J-Held commented 4 months ago

Yes, that was it. Thank you @Qrouger and @dingquanyu!

Regarding the GPU, it looks like I'm getting many of the error messages brought up in #339, but the job appears to still be running. Is it just going to time out? Output log below:

I0521 10:54:40.655257 22582644975424 run_multimer_jobs.py:389] Modeling new interaction for /storage/home/jbh249/scratch/output/models/HrpN_and_WAK3
I0521 10:54:41.184001 22582644975424 xla_bridge.py:660] Unable to initialize backend 'cuda': Unable to load cuDNN. Is it installed?
I0521 10:54:41.203725 22582644975424 xla_bridge.py:660] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0521 10:54:41.204897 22582644975424 xla_bridge.py:660] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
W0521 10:54:41.205006 22582644975424 xla_bridge.py:724] CUDA backend failed to initialize: Unable to load cuDNN. Is it installed? (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
I0521 10:54:43.223712 22582644975424 utils.py:378] Model model_1_multimer_v3 is running 0 prediction with default MSA depth
I0521 10:54:44.160407 22582644975424 utils.py:378] Model model_2_multimer_v3 is running 0 prediction with default MSA depth
I0521 10:54:45.103848 22582644975424 utils.py:378] Model model_3_multimer_v3 is running 0 prediction with default MSA depth
I0521 10:54:46.035488 22582644975424 utils.py:378] Model model_4_multimer_v3 is running 0 prediction with default MSA depth
I0521 10:54:46.962665 22582644975424 utils.py:378] Model model_5_multimer_v3 is running 0 prediction with default MSA depth
I0521 10:54:46.962839 22582644975424 utils.py:384] Using random seed 1682205902281770834 for the data pipeline
I0521 10:54:47.012253 22582644975424 run_multimer_jobs.py:323] now running prediction on HrpN_and_WAK3
I0521 10:54:47.012355 22582644975424 run_multimer_jobs.py:324] output_path is /storage/home/jbh249/scratch/output/models/HrpN_and_WAK3
I0521 10:54:47.012434 22582644975424 predict_structure.py:125] Checking for existing results
I0521 10:54:47.012791 22582644975424 predict_structure.py:139] Running model model_1_multimer_v3_pred_0 on HrpN_and_WAK3
I0521 10:54:47.013137 22582644975424 model.py:165] Running predict with shape(feat) = {'aatype': (1144,), 'residue_index': (1144,), 'seq_length': (), 'msa': (2257, 1144), 'num_alignments': (), 'template_aatype': (4, 1144), 'template_all_atom_mask': (4, 1144, 37), 'template_all_atom_positions': (4, 1144, 37, 3), 'asym_id': (1144,), 'sym_id': (1144,), 'entity_id': (1144,), 'deletion_matrix': (2257, 1144), 'deletion_mean': (1144,), 'all_atom_mask': (1144, 37), 'all_atom_positions': (1144, 37, 3), 'assembly_num_chains': (), 'entity_mask': (1144,), 'num_templates': (), 'cluster_bias_mask': (2257,), 'bert_mask': (2257, 1144), 'seq_mask': (1144,), 'msa_mask': (2257, 1144)}

Qrouger commented 4 months ago

No, it will not time out; it will just run slowly on the CPU.

Quentin.

dingquanyu commented 4 months ago

Available platform names are: CUDA

Hi @J-Held

Glad it worked. These messages are not actually errors, just logs that reflect the status of your modelling job. Since "Available platform names are: CUDA" was printed, it should be running successfully on your GPU. But I would still suggest running nvidia-smi to double-check whether the programme is actually consuming your GPU RAM.

Yours Dingquan
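
Since the two replies above disagree on whether the job is using the GPU, a quick check from the same environment can help settle it. The sketch below is a suggestion rather than anything AlphaPulldown provides: `jax_backend` and `gpu_memory_report` are hypothetical helper names. It asks JAX which backend it selects in a fresh process and, if nvidia-smi is on the PATH, reports per-GPU memory use. It does not inspect the job that is already running.

```python
import shutil
import subprocess


def jax_backend():
    """Return the backend JAX selects (for example 'cpu' or 'gpu'), or None if JAX is not importable."""
    try:
        import jax
    except ImportError:
        return None
    return jax.default_backend()


def gpu_memory_report():
    """Return nvidia-smi's per-GPU memory usage as text, or None if it is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,memory.used,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()


print("JAX backend:", jax_backend())
print("GPU memory:", gpu_memory_report())
```

If the first line reports a CPU backend, the prediction is most likely running on the CPU regardless of what nvidia-smi shows. If it reports a GPU backend, rising memory use in the nvidia-smi output while the job runs is a reasonable sign that the GPU is being used.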

KosinskiLab / AlphaPulldown

run_multimer_jobs issue #342