aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

CUDA error during inference #342

Open Baldwin-disso opened 1 year ago

Baldwin-disso commented 1 year ago

I installed OpenFold on a local server machine using the recommended installation steps:

scripts/install_third_party_dependencies.sh
source scripts/deactivate_conda_env.sh
python3 setup.py install
scripts/install_hh_suite.sh
bash scripts/download_alphafold_dbs.sh data/

Then, when running inference with:

python3 run_pretrained_openfold.py \
    ../data/fastas/nrt14 \
    data/flattened/ \
    --uniref90_database_path data/uniref90/uniref90.fasta \
    --mgnify_database_path data/mgnify/mgy_clusters_2018_12.fa \
    --pdb70_database_path data/pdb70/pdb70 \
    --uniclust30_database_path data/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --output_dir ../results/pdbs_openfold_predicted/ \
    --bfd_database_path data/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --jackhmmer_binary_path lib/conda/envs/openfold_venv/bin/jackhmmer \
    --hhblits_binary_path lib/conda/envs/openfold_venv/bin/hhblits \
    --hhsearch_binary_path lib/conda/envs/openfold_venv/bin/hhsearch \
    --kalign_binary_path lib/conda/envs/openfold_venv/bin/kalign \
    --config_preset "model_1_ptm" \
    --jax_param_path openfold/resources/params/params_model_1.npz \
    --model_device "cuda:0" 

I get the following error:

INFO:/media/honeypot/baldwin/openfold/openfold/utils/script_utils.py:Successfully loaded JAX parameters at openfold/resources/params/params_model_1.npz...
INFO:/media/honeypot/baldwin/openfold/run_pretrained_openfold.py:Using precomputed alignments for nrt14 at ../results/pdbs_openfold_predicted/alignments...
INFO:/media/honeypot/baldwin/openfold/openfold/utils/script_utils.py:Running inference for nrt14...
Traceback (most recent call last):
  File "/media/honeypot/baldwin/openfold/run_pretrained_openfold.py", line 401, in <module>
    main(args)
  File "/media/honeypot/baldwin/openfold/run_pretrained_openfold.py", line 254, in main
    out = run_model(model, processed_feature_dict, tag, args.output_dir)
  File "/media/honeypot/baldwin/openfold/openfold/utils/script_utils.py", line 159, in run_model
    out = model(batch)
  File "/media/honeypot/baldwin/openfold/lib/conda/envs/openfold_venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/honeypot/baldwin/openfold/openfold/model/model.py", line 512, in forward
    outputs, m_1_prev, z_prev, x_prev = self.iteration(
  File "/media/honeypot/baldwin/openfold/openfold/model/model.py", line 245, in iteration
    m, z = self.input_embedder(
  File "/media/honeypot/baldwin/openfold/lib/conda/envs/openfold_venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/honeypot/baldwin/openfold/openfold/model/embedders.py", line 116, in forward
    tf_emb_i = self.linear_tf_z_i(tf)
  File "/media/honeypot/baldwin/openfold/lib/conda/envs/openfold_venv/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/media/honeypot/baldwin/openfold/lib/conda/envs/openfold_venv/lib/python3.9/site-packages/torch/nn/modules/linear.py", line 114, in forward
    return F.linear(input, self.weight, self.bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

I am using CUDA 11.7. I tried reinstalling OpenFold, but that didn't fix it. Any ideas?

Paulie-ai commented 11 months ago

I also hit this error, using CUDA 11.6 with GCC 11.6 on A100 cards. Have you fixed it?

quailwwk commented 11 months ago

I also hit this error, using CUDA 11.6 with GCC 11.6 on A100 cards. Have you fixed it?

In my experience, this error can be caused by a dimension mismatch in an nn.Linear or nn.Embedding call. Running the script on the CPU usually gives a much clearer error message showing what actually went wrong.
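
To illustrate the kind of problem this points at, here is a minimal sketch with made-up toy shapes (not OpenFold code): the kind of shape mismatch that can surface on the GPU only as a generic cuBLAS error is reported explicitly when the same call runs on the CPU.

import torch
import torch.nn as nn

# Toy stand-in for the linear_tf_z_i layer in the traceback; the in/out
# dimensions are invented for this example.
linear = nn.Linear(22, 128)

# Deliberately wrong last dimension (21 instead of 22).
bad_input = torch.randn(8, 21)

try:
    linear(bad_input)
except RuntimeError as e:
    # On CPU this prints a readable "mat1 and mat2 shapes cannot be multiplied"
    # message with the offending shapes, which is far easier to act on than a
    # generic CUBLAS_STATUS_INVALID_VALUE from the GPU path.
    print("CPU error:", e)

For the full pipeline, rerunning run_pretrained_openfold.py with --model_device "cpu" should surface the same underlying error with a readable message.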

John-D-Boom commented 9 months ago

I've run into this issue, and there are a few steps you can take to troubleshoot it.

First, make sure you always export the conda library paths before running OpenFold. I would put the following in your .bashrc or .bash_profile so it runs every time:

export LIBRARY_PATH=$CONDA_PREFIX/lib:$LIBRARY_PATH
export LD_LIBRARY_PATH=$CONDA_PREFIX/lib:$LD_LIBRARY_PATH
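
If you are unsure whether those exports are actually active in the session that launches OpenFold, a quick check from inside Python (a minimal sketch, not part of the OpenFold scripts) is:

import os

# Verify that the conda library path exports above reached this process;
# "<not set>" here means the exports did not take effect in this shell.
for var in ("LIBRARY_PATH", "LD_LIBRARY_PATH"):
    print(var, "=", os.environ.get(var, "<not set>"))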

Second, make sure you've run the scripts/install_third_party_dependencies.sh script as described in the README. If you've done that and the issue persists, try running python3 setup.py install. After running either, restart your session so the changes take effect, and remember to export the library paths again after restarting.

You should now be able to pass the unit tests with bash scripts/run_unit_tests.sh.
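
As an additional sanity check, independent of the OpenFold test suite, you can confirm that PyTorch itself can run a cuBLAS matmul on the GPU once the paths are exported. A minimal sketch (not part of the OpenFold scripts):

import torch
import torch.nn.functional as F

# Confirm the CUDA runtime is visible to PyTorch at all.
print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # A tiny F.linear on the GPU exercises the same cublasSgemm path as the
    # failing call in the traceback above; if this alone raises
    # CUBLAS_STATUS_INVALID_VALUE, the problem most likely lies in the
    # CUDA/cuBLAS environment rather than in OpenFold itself.
    x = torch.randn(4, 8, device="cuda:0")
    w = torch.randn(16, 8, device="cuda:0")
    out = F.linear(x, w)
    print("F.linear on cuda:0 OK, output shape:", tuple(out.shape))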

These steps resolved the issue for me. Hope this helps!