YoshitakaMo / localcolabfold

ColabFold on your local PC

Question: Could not run on a cluster #147

Open hypothalamus01 opened 1 year ago

hypothalamus01 commented 1 year ago

What is your question? I have successfully run localcolabfold on our workstation with an A4000 GPU. I tried to install it on a cluster with an A100 80GB GPU to get more GPU memory. The installation succeeded, but I got the error message below when I tried to predict a 156 a.a. protein. Could anyone help me troubleshoot this? Many thanks!

2023-04-04 08:00:06,146 Running colabfold 1.5.2 (05c0cb38d002180da3b58cdc53ea45a6b2a62d31)
2023-04-04 08:00:43,447 Running on GPU
2023-04-04 08:00:46,987 Found 4 citations for tools or databases
2023-04-04 08:00:46,988 Query 1/1: R1 (length 156)
2023-04-04 08:00:47,211 Setting max_seq=512, max_extra_seq=2735
2023-04-04 08:00:55,676 Could not predict R1. Not Enough GPU memory? FAILED_PRECONDITION: Couldn't get ptxas version string: INTERNAL: Running ptxas --version returned 32512
2023-04-04 08:00:55,678 Done

Computational environment

To Reproduce:

module load conda/latest
conda activate colab
export PATH="/home/c845w740/colab/localcolabfold/colabfold-conda/bin:$PATH"

colabfold_batch R1.fasta R1
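
Editor's note on the error (a sketch, not part of the original report): a return value of 32512 from running ptxas --version is 127 << 8, i.e. the shell could not find ptxas at all, which usually means no CUDA toolkit is visible inside the job environment. A quick check on the compute node, assuming standard CUDA tools are provided via the module system:

which ptxas || echo "ptxas not on PATH - load a CUDA toolkit module"
ptxas --version    # should print the CUDA toolkit version
nvidia-smi         # confirms the driver can see the A100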

hypothalamus01 commented 1 year ago

I added "module load cuda/11.2" and get a different error message.

2023-04-04 09:39:01,305 Running colabfold 1.5.2 (05c0cb38d002180da3b58cdc53ea45a6b2a62d31)
2023-04-04 09:40:01,679 Running on GPU
2023-04-04 09:40:03,580 Found 4 citations for tools or databases
2023-04-04 09:40:03,581 Query 1/1: R1 (length 156)

  0%|          | 0/150 [elapsed: 00:00 remaining: ?]
SUBMIT:    0%|          | 0/150 [elapsed: 00:00 remaining: ?]
PENDING:   0%|          | 0/150 [elapsed: 00:00 remaining: ?]
RUNNING:   0%|          | 0/150 [elapsed: 00:08 remaining: ?]
RUNNING:   5%|▍         | 7/150 [elapsed: 00:08 remaining: 02:50]
COMPLETE:  5%|▍         | 7/150 [elapsed: 00:16 remaining: 02:50]
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:18 remaining: 00:00]

2023-04-04 09:40:22.173961: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 47988080640 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174004: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 43189272576 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174025: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 38870343680 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174042: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 34983309312 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174058: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 31484977152 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174082: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 28336478208 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174099: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 25502830592 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174114: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 22952546304 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174129: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 20657291264 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:22.174144: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 18591561728 bytes unified memory; result: CUDA_ERROR_INVALID_VALUE: invalid argument
2023-04-04 09:40:25.540195: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 16732404736 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-04-04 09:40:29.437371: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 15059164160 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-04-04 09:40:33.333199: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 13553247232 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-04-04 09:40:37.230327: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:767] failed to alloc 12197921792 bytes unified memory; result: CUDA_ERROR_OUT_OF_MEMORY: out of memory
2023-04-04 09:41:45.306115: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:219] Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal: INTERNAL: All algorithms tried for
%cublas-gemm.1 = bf16[624,256]{1,0} custom-call(bf16[624,64]{1,0} %bitcast.653, bf16[64,256]{1,0} %concatenate.66), custom_call_target="__cublas$gemm", metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/msa_row_attention_with_pair_bias/while/body/attention/bqa,ahc->bqhc/jit(_einsum)/dot_general[dimension_numbers=(((2,), (0,)), ((), ())) precision=None preferred_element_type=None]" source_file="/home/c845w740/colab/localcolabfold/colabfold-conda/bin/colabfold_batch" source_line=8}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"1\"],\"rhs_contracting_dimensions\":[\"0\"],\"lhs_batch_dimensions\":[],\"rhs_batch_dimensions\":[]},\"precision_config\":{\"operand_precision\":[\"DEFAULT\",\"DEFAULT\"]},\"epilogue\":\"DEFAULT\"}" failed. Falling back to default algorithm. Per-algorithm errors:

2023-04-04 09:41:45.310335: W external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gemm_algorithm_picker.cc:219] Failed to find best cuBLAS algorithm, GEMM performance might be suboptimal: INTERNAL: All algorithms tried for %cublas-batch-gemm.1 = bf16[4,8,156,156]{3,2,1,0} custom-call(bf16[4,8,156,8]{3,2,1,0} %transpose.405, bf16[4,8,8,156]{3,2,1,0} %transpose.406), custom_call_target="__cublas$gemm", metadata={op_name="jit(apply_fn)/jit(main)/alphafold/alphafold_iteration/evoformer/__layer_stack_no_state/while/body/extra_msa_stack/msa_row_attention_with_pair_bias/while/body/attention/bqhc,bkhc->bhqk/jit(_einsum)/dot_general[dimension_numbers=(((3,), (3,)), ((0, 2), (0, 2))) precision=None preferred_element_type=None]" source_file="/home/c845w740/colab/localcolabfold/colabfold-conda/bin/colabfold_batch" source_line=8}, backend_config="{\"alpha_real\":1,\"alpha_imag\":0,\"beta\":0,\"dot_dimension_numbers\":{\"lhs_contracting_dimensions\":[\"3\"],\"rhs_contracting_dimensions\":[\"2\"],\"lhs_batch_dimensions\":[\"0\",\"1\"],\"rhs_batch_dimensions\":[\"0\",\"1\"]},\"precision_config\":{\"operand_precision\":[\"DEFAULT\",\"DEFAULT\"]},\"epilogue\":\"DEFAULT\"}" failed. Falling back to default algorithm. Per-algorithm errors:
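
Editor's note (not part of the original thread): colabfold_batch runs AlphaFold under JAX, and the very large "unified memory" allocations in this log come from the JAX/XLA client preallocating a big memory pool; ColabFold enables unified memory and a large XLA memory fraction by default, so when the CUDA toolkit and driver in the job environment do not match, these allocations fail long before the model itself needs memory. If you want to rule out the preallocation behaviour itself, the standard JAX environment variables can be overridden before the run; the values below are only an illustration, not the fix reported in this thread:

export TF_FORCE_UNIFIED_MEMORY=0           # keep allocations on the GPU only
export XLA_PYTHON_CLIENT_PREALLOCATE=false # allocate on demand instead of one big pool
export XLA_PYTHON_CLIENT_MEM_FRACTION=0.9  # cap the pool at ~90% of GPU memory
colabfold_batch R1.fasta R1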

hypothalamus01 commented 1 year ago

I solved it by including these two lines:

module load cuda/11.7
module load compiler/gcc/11.1
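
Putting the fix together, a cluster submission might look roughly like the sketch below. This is an editor's illustration only: the SLURM directives, job name, and GPU request syntax are assumptions and will differ between sites; the module names and paths are taken from this thread.

#!/bin/bash
#SBATCH --job-name=colabfold_R1   # hypothetical job name
#SBATCH --gres=gpu:a100:1         # request one A100 (site-specific syntax)
#SBATCH --time=02:00:00

# Modules reported in this thread to resolve the ptxas / unified-memory errors
module load conda/latest
module load cuda/11.7
module load compiler/gcc/11.1

conda activate colab
export PATH="/home/c845w740/colab/localcolabfold/colabfold-conda/bin:$PATH"

colabfold_batch R1.fasta R1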

GriffinYoung commented 11 months ago

I also ran into this issue and solved it by adding cuda-nvcc to my conda environment.
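
For the conda route, cuda-nvcc (which provides ptxas) is distributed on NVIDIA's conda channel; the exact channel and version pin may vary between setups, so the command below is only a typical invocation:

conda activate colab
conda install -c nvidia cuda-nvcc   # installs nvcc/ptxas into the active environment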