KosinskiLab / AlphaPulldown

https://doi.org/10.1093/bioinformatics/btac749
GNU General Public License v3.0

GPU not working when running run_multimer_jobs.py #339

Open polya18 opened 1 month ago

polya18 commented 1 month ago

Hello, I have AlphaFold2 installed in another conda environment, and the GPU works fine there. In the AlphaPulldown conda environment, however, the GPU is not used when run_multimer_jobs.py is running, and the script is very slow (only 4 PPI pairs were produced in one week).

There were some warnings. Could you please help me check what happened? Thanks very much.

run_multimer_jobs.py --mode=pulldown --num_cycle=3 --num_predictions_per_model=1 --output_path=../alphaIP_out/ --data_dir=../alphafold_db/ --protein_lists=./baits.txt,./candidates_reduced.txt --monomer_objects_dir=../alphaIP_out/

2024-05-06 14:41:02.899909: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
2024-05-06 14:41:03.811569: W tensorflow/core/common_runtime/gpu/gpu_device.cc:2251] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
I0506 14:41:03.811740 139891085715264 utils.py:271] checking if output_dir exists ../alphaIP_out/
I0506 14:41:03.812948 139891085715264 run_multimer_jobs.py:229] All pickle files have been found
I0506 14:41:05.139197 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_O76094
I0506 14:41:06.071167 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P05386
I0506 14:41:07.021520 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P05387
I0506 14:41:08.148681 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P05388
I0506 14:41:09.929722 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P08240
I0506 14:41:10.864954 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P08708
I0506 14:41:11.798123 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P09132
I0506 14:41:13.058462 139891085715264 run_multimer_jobs.py:236] done creating multimer P78344_and_P15880
......
I0506 14:42:29.906417 139891085715264 run_multimer_jobs.py:387] object: P78344_and_O76094
I0506 14:42:29.906486 139891085715264 run_multimer_jobs.py:389] Modeling new interaction for ../alphaIP_out/P78344_and_O76094
I0506 14:42:30.114167 139891085715264 xla_bridge.py:660] Unable to initialize backend 'cuda': Unable to load cuDNN. Is it installed?
I0506 14:42:30.114634 139891085715264 xla_bridge.py:660] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: CUDA
I0506 14:42:30.115345 139891085715264 xla_bridge.py:660] Unable to initialize backend 'tpu': INTERNAL: Failed to open libtpu.so: libtpu.so: cannot open shared object file: No such file or directory
W0506 14:42:30.115505 139891085715264 xla_bridge.py:724] CUDA backend failed to initialize: Unable to load cuDNN. Is it installed? (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)

Qrouger commented 1 month ago

Hi @polya18, this looks like a compatibility problem between TensorRT and CUDA. Which CUDA version do you have? (You can check it with the command nvidia-smi.) It looks similar to #237.

Quentin
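
Besides nvidia-smi (which reports the highest CUDA version the driver supports, not the toolkit installed in the environment), a quick check from inside the AlphaPulldown environment can show whether cuDNN is loadable and which backend JAX ends up using. The following is only a minimal sketch: the helper names find_cudnn, can_load and jax_backend are made up for illustration, and ctypes.util.find_library does not necessarily search the same locations JAX does, so a None result is a hint rather than proof that cuDNN is missing.

```python
import ctypes
import ctypes.util
import importlib.util


def find_cudnn():
    """Return the cuDNN library name the system linker can locate, or None."""
    return ctypes.util.find_library("cudnn")


def can_load(library_name):
    """Return True if the shared library can actually be dlopen'ed."""
    try:
        ctypes.CDLL(library_name)
        return True
    except OSError:
        return False


def jax_backend():
    """Return JAX's default backend name, or None if JAX is missing or fails to import."""
    if importlib.util.find_spec("jax") is None:
        return None
    try:
        import jax
        return jax.default_backend()
    except Exception:
        # A broken installation can fail at import or backend initialisation.
        return None


if __name__ == "__main__":
    cudnn = find_cudnn()
    print("cuDNN found by the linker:", cudnn)
    if cudnn is not None:
        print("cuDNN loadable:", can_load(cudnn))
    print("JAX default backend:", jax_backend())
```

If the reported backend is "cpu" on a machine with a working GPU, JAX fell back to the CPU, which would match the "CUDA backend failed to initialize" warning in the log above.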

polya18 commented 1 month ago

It's CUDA 11.4.

Qrouger commented 1 month ago

AlphaPulldown recommends CUDA 11.8.0, which works with all the package versions installed in the "classic" conda environment. If you don't want to change your CUDA version, you could try installing versions of the packages that are compatible with your CUDA version in the conda environment (GPU driver, JAX, cuDNN, CUDA toolkit, TensorRT).

If all of your packages are compatible with your CUDA version, make sure that the command is picking up the packages installed in the conda environment rather than packages installed locally. This can happen with conda.
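
To see which versions are actually installed in the active environment, something like the sketch below can help. It only sees Python distributions: packages installed by conda as non-Python libraries (for example the CUDA toolkit or cuDNN) will not show up here and are better checked with conda list. The distribution names in PACKAGES are assumptions chosen for illustration and may differ in your setup.

```python
from importlib import metadata

# Assumed distribution names; adjust to whatever your environment uses.
PACKAGES = ("jax", "jaxlib", "tensorflow", "nvidia-cudnn-cu11")


def installed_versions(names=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

For some jaxlib releases the version string of a CUDA build carries a suffix naming the CUDA and cuDNN versions it was built against, so a jaxlib version without such a suffix may indicate a CPU-only build.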

Quentin.
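
One way to check the last point is to look at where Python would import the relevant modules from. This is a rough sketch, not an official AlphaPulldown tool: module_origin and is_inside_env are hypothetical helper names, and the module list is an assumption.

```python
import importlib.util
import os
import site
import sys


def module_origin(name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        return None
    return getattr(spec, "origin", None) if spec else None


def is_inside_env(path, prefix=sys.prefix):
    """Return True if path lies under the active environment's prefix."""
    if path is None:
        return False
    prefix = os.path.realpath(prefix)
    return os.path.commonpath([os.path.realpath(path), prefix]) == prefix


if __name__ == "__main__":
    print("Environment prefix:", sys.prefix)
    print("User site enabled:", site.ENABLE_USER_SITE)
    print("User site directory:", site.getusersitepackages())
    # Assumed module names; extend the tuple as needed.
    for name in ("jax", "jaxlib", "tensorflow"):
        origin = module_origin(name)
        print(f"{name}: {origin} (inside env: {is_inside_env(origin)})")
```

If a module resolves to the user site directory (typically under ~/.local) instead of the environment prefix, setting the environment variable PYTHONNOUSERSITE=1 before running the script makes Python ignore the user site directory.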