fhalab / MLDE

A machine-learning package for navigating combinatorial protein fitness landscapes.

Blas GEMM launch failed #4

Open · Zyj061 opened this issue 2 years ago

Zyj061 commented 2 years ago

When testing the script `generate_encoding.py`, it always crashes with the error: Blas GEMM launch failed: a.shape=(56, 512), b.shape=(512, 512), m=56, k=512. We are testing the code on a 3090 GPU that has 24 GB of free memory, so I do not think the problem is out-of-memory.

brucejwittmann commented 2 years ago

Could you provide a little more information on how you are running the script? I haven't seen this error before, so it is likely a problem with one of the dependencies (most likely tensorflow). Information that would help me assist:

  1. Conda environment used to run the code (including CUDA version)
  2. Specific commands sent to the script
  3. The full traceback
Zyj061 commented 2 years ago

Thank you for your reply. Please see below for information about the script we run:

  1. The Conda environment is mlde, created from mlde.yml following your tutorial.
  2. The command is

python generate_encoding.py transformer GB1_T2Q --fasta ./code/validation/basic_test_data/2GI9.fasta --positions V39 D40 --batches 1

  3. The full traceback: [screenshots attached]
brucejwittmann commented 2 years ago

Ok, so it looks like the error begins while calling run_embed.py from TAPE. The error you're seeing is well documented on the tensorflow GitHub (link here). The most commonly recommended solutions seem to be:

  1. Restarting the computer you're seeing the issue on.
  2. Closing all other programs that are using the same GPU. You can check what other programs are using your GPU by calling nvidia-smi.
  3. Allowing GPU memory to grow during code execution.

I always find that 1 is a good strategy whenever something strange is going on with CUDA, pytorch, or tensorflow. I do not believe that 2 or 3 should be a problem in MLDE and TAPE (otherwise many other people would have run into this problem already), but adding code to allow memory growth is always an option to try; a minimal sketch is below.
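For reference, here is roughly what option 3 looks like. This is just a sketch, not code from MLDE or TAPE; which variant applies depends on the tensorflow version pinned in mlde.yml, and it would need to go wherever TAPE builds its session (e.g., at the top of run_embed.py).

```python
import tensorflow as tf

# TF1-style (if the mlde environment pins tensorflow 1.x, as the
# TensorFlow version of TAPE does): build the session with allow_growth
# so tensorflow claims GPU memory incrementally instead of all at once.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)

# TF2-style equivalent, should the environment ever move to tensorflow 2.x:
# for gpu in tf.config.experimental.list_physical_devices("GPU"):
#     tf.config.experimental.set_memory_growth(gpu, True)
```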

Some other tests that you can run to try and narrow down where the problem is coming from:

  1. MLDE should have produced the fasta files for embedding before it crashed. Try running tape-embed outside of MLDE to see if you get the same error. Instructions for TAPE and the tape-embed command are here. You should run this in the mlde environment.
  2. Try embedding with different TAPE models. See if you get the same error for all models.
  3. Try embedding with the ESM models instead. ESM relies on torch, so you would not expect to see any problems if tensorflow is indeed the culprit. To do this, activate the mlde2 environment and run
python generate_encoding.py esm1_t6_43M_UR50S GB1_T2Q --fasta ./code/validation/basic_test_data/2GI9.fasta --positions V39 D40
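
As a quick sanity check alongside test 3, you could also reproduce the exact GEMM from your error message directly in torch (same shapes as the traceback reports). This is just a diagnostic sketch; if it runs cleanly in the mlde2 environment, the GPU and driver are fine and the problem is almost certainly on the tensorflow side.

```python
import torch

# Reproduce the matmul that failed under tensorflow:
# a.shape=(56, 512), b.shape=(512, 512) from the error message.
assert torch.cuda.is_available(), "CUDA is not visible to torch"
a = torch.randn(56, 512, device="cuda")
b = torch.randn(512, 512, device="cuda")
c = a @ b
torch.cuda.synchronize()  # force the kernel to actually launch
print(c.shape)  # expected: torch.Size([56, 512])
```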