facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

CUDA error #479

Closed sbassi closed 1 year ago

sbassi commented 1 year ago

Bug description

I tried the esmfold_inference.py script with the included data and got a CUDA error.

Reproduction steps

After installing the package following the instructions in the repo, I ran:

python scripts/esmfold_inference.py -i examples/data/few_proteins.fasta -o /home/ubuntu/

Expected behavior

The script runs without errors.

Logs

(esm) ubuntu@ip-10-0-0-77:~/esm$ python scripts/esmfold_inference.py -i examples/data/few_proteins.fasta -o /home/ubuntu/
23/02/15 22:31:53 | INFO | root | Reading sequences from examples/data/few_proteins.fasta
23/02/15 22:31:53 | INFO | root | Loaded 3 sequences from examples/data/few_proteins.fasta
23/02/15 22:31:53 | INFO | root | Loading model
23/02/15 22:31:55 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /tmp/tmpwt8fglax
23/02/15 22:31:55 | INFO | torch.distributed.nn.jit.instantiator | Writing /tmp/tmpwt8fglax/_remote_module_non_scriptable.py
23/02/15 22:33:43 | INFO | root | Starting Predictions
Traceback (most recent call last):
  File "/home/ubuntu/esm/scripts/esmfold_inference.py", line 157, in <module>
    output = model.infer(sequences, num_recycles=args.num_recycles)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/esm/esmfold/v1/esmfold.py", line 282, in infer
    output = self.forward(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/esm/esmfold/v1/esmfold.py", line 163, in forward
    esm_s = self._compute_language_model_representations(esmaa)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/esm/esmfold/v1/esmfold.py", line 108, in _compute_language_model_representations
    res = self.esm(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/esm/model/esm2.py", line 112, in forward
    x, attn = layer(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/esm/modules.py", line 125, in forward
    x, attn = self.self_attn(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/esm/multihead_attention.py", line 357, in forward
    attn_weights = torch.bmm(q, k.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
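
A note on reading this trace: CUDA errors can be reported asynchronously, so the line PyTorch blames is not always the kernel that actually failed. A minimal sketch of the standard PyTorch workaround (nothing esm-specific) is to force synchronous launches before CUDA is initialized:

import os

# Force synchronous kernel launches so the traceback stops at the kernel
# that actually failed; must be set before torch initializes CUDA.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch
print(torch.cuda.is_available())

The same effect can be had by exporting CUDA_LAUNCH_BLOCKING=1 in the shell before running scripts/esmfold_inference.py.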

Additional context

Ran on a g4dn.2xlarge EC2 instance with 32 GB RAM; everything installed without errors. Also:

/usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
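
Note that nvcc reports the locally installed CUDA toolkit, which is not necessarily what PyTorch runs on: pip and conda wheels bundle their own CUDA runtime. A quick check of what torch itself sees (plain torch calls, nothing esm-specific):

import torch

print(torch.__version__)                 # wheel build, e.g. 1.13.1+cu117
print(torch.version.cuda)                # CUDA runtime bundled with the wheel
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0))     # a g4dn instance should report a Tesla T4
print(torch.cuda.get_device_capability(0))
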
sbassi commented 1 year ago

The same error occurs when running one of the sample programs. This is the code:

import torch
import esm

model = esm.pretrained.esmfold_v1()
model = model.eval().cuda()

# Optionally, uncomment to set a chunk size for axial attention. This can help reduce memory.
# Smaller chunk sizes have lower memory requirements at the cost of slower inference.
# model.set_chunk_size(128)

sequence = "MKTVRQERLKSIVRILERSKEPVSGAQLAEELSVSRQVIVQDIAYLRSLGYNIVATPRGYVLAGG"
# Multimer prediction can be done with chains separated by ':'

with torch.no_grad():
    output = model.infer_pdb(sequence)

with open("result.pdb", "w") as f:
    f.write(output)

import biotite.structure.io as bsio
struct = bsio.load_structure("result.pdb", extra_fields=["b_factor"])
print(struct.b_factor.mean())  # this will be the pLDDT
# 88.3

And here is the run:

(esm) ubuntu@ip-10-0-0-77:~/esm$ python p2.py
Traceback (most recent call last):
  File "/home/ubuntu/esm/p2.py", line 15, in <module>
    output = model.infer_pdb(sequence)
  File "/home/ubuntu/esm/esm/esmfold/v1/esmfold.py", line 305, in infer_pdb
    return self.infer_pdbs([sequence], *args, **kwargs)[0]
  File "/home/ubuntu/esm/esm/esmfold/v1/esmfold.py", line 300, in infer_pdbs
    output = self.infer(seqs, *args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ubuntu/esm/esm/esmfold/v1/esmfold.py", line 277, in infer
    output = self.forward(
  File "/home/ubuntu/esm/esm/esmfold/v1/esmfold.py", line 156, in forward
    esm_s = self._compute_language_model_representations(esmaa)
  File "/home/ubuntu/esm/esm/esmfold/v1/esmfold.py", line 103, in _compute_language_model_representations
    res = self.esm(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/esm/esm/model/esm2.py", line 112, in forward
    x, attn = layer(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/esm/esm/modules.py", line 125, in forward
    x, attn = self.self_attn(
  File "/home/ubuntu/miniconda3/envs/esm/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1194, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/esm/esm/multihead_attention.py", line 357, in forward
    attn_weights = torch.bmm(q, k.transpose(1, 2))
RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`
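
To isolate this from ESMFold: the failing call in multihead_attention.py line 357 is just a half-precision batched matmul on the GPU. Below is a minimal repro sketch with arbitrary illustrative shapes; if it raises the same CUBLAS_STATUS_INVALID_VALUE, the problem is the CUDA/PyTorch environment rather than the model code:

import torch

# Arbitrary shapes for illustration: 8 heads, sequence length 64, head dim 32.
q = torch.randn(8, 64, 32, dtype=torch.float16, device="cuda")
k = torch.randn(8, 64, 32, dtype=torch.float16, device="cuda")

# The same operation that fails inside esm/multihead_attention.py.
attn_weights = torch.bmm(q, k.transpose(1, 2))
print(attn_weights.shape)  # torch.Size([8, 64, 64]) on a healthy install
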
nikitos9000 commented 1 year ago

Hi @sbassi, could you please provide your CUDA and PyTorch version numbers?

sbassi commented 1 year ago

Here is what the AMI reports:

PyTorch 1.11.0 (Ubuntu 20.04)
CUDA version: 11.5
NVIDIA driver version: 510.47.03

And this is what I am actually using:

CUDA:

(esm) ubuntu@ip-10-xx:~$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0

Pytorch:

(esm) ubuntu@ip-10-0-0-77:~$ python
Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:39:03)
[GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.13.1+cu117
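
For a fuller picture, PyTorch ships a helper that summarizes the whole environment (torch build, CUDA runtime, driver, cuDNN, OS); maintainers usually ask for its output on issues like this:

# Equivalent to running: python -m torch.utils.collect_env
from torch.utils.collect_env import main

main()  # prints torch/CUDA/cuDNN/driver versions plus OS and pip packages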

From this listing, which one do you recommend? https://docs.aws.amazon.com/dlami/latest/devguide/appendix-ami-release-notes.html

MOTD:

[screenshot: instance MOTD, 2023-02-20]
sbassi commented 1 year ago

Looks like it was my error: I made my own conda env instead of using the pre-built environment that is activated with "source activate pytorch". I am now using a new AMI (Deep Learning AMI GPU PyTorch 1.12.1 (Ubuntu 20.04) 20220926) with that conda env, and it worked, so I will close this issue. Thank you very much for your help.