CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Computational environment

OS: Ubuntu 20.04.1
CUDA version: Cuda compilation tools, release 12.4, V12.4.131 (Build cuda_12.4.r12.4/compiler.34097967_0)

Hi,

My job failed with an error message like "CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered".

Is this due to a shortage of GPU memory?

I ran the job on a server with two Quadro RTX 8000 GPUs. Because I was allowed to use only one of the two GPUs, I ran the command below before running colabfold_batch.

export CUDA_VISIBLE_DEVICES=0

My main command is below.

nohup colabfold_batch Hexamer.faa Hexamer.ColabFold --num-recycle 3 > nohup.log 2>&1 &
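For reference, the same two steps could be scripted so that the GPU restriction travels with the job. This is only a sketch: the helper name `build_job` is hypothetical, and it just assembles the environment and argument list from the commands above without launching anything.

```python
import os


def build_job(gpu_index, fasta, out_dir, num_recycle=3):
    """Return (env, argv) for a colabfold_batch run restricted to one GPU.

    build_job is a hypothetical helper; it only mirrors the shell commands
    shown in this post and does not start the process.
    """
    env = dict(os.environ)
    # Same effect as `export CUDA_VISIBLE_DEVICES=0` in the launching shell.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    argv = ["colabfold_batch", fasta, out_dir, "--num-recycle", str(num_recycle)]
    return env, argv


env, argv = build_job(0, "Hexamer.faa", "Hexamer.ColabFold")
print(env["CUDA_VISIBLE_DEVICES"], " ".join(argv))
```

The returned pair can be handed to `subprocess.Popen(argv, env=env, ...)` with stdout and stderr redirected to a log file, which is roughly what the `nohup ... > nohup.log 2>&1 &` line does from the shell.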

Below is the whole "log.txt" file created within the "Hexamer.ColabFold" directory.

2024-06-01 13:39:23,688 Running colabfold 1.5.5 (1648d2335943f9a483b6a803ebaea3e76162c788)
2024-06-01 13:39:23,887 Running on GPU
2024-06-01 13:39:24,307 Found 5 citations for tools or databases
2024-06-01 13:39:24,307 Query 1/1: Hexamer (length 5856)
2024-06-01 13:39:25,934 Setting max_seq=508, max_extra_seq=828
2024-06-01 13:59:21,926 Could not predict Hexamer. Not Enough GPU memory? INTERNAL: Failed to enqueue async memset operation: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-06-01 13:59:21,926 Done
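To show how long the job ran before failing, here is a small sketch that reads the timestamps from two of the log lines above. The helper name `parse_ts` is hypothetical, and the code assumes every log line starts with a 23-character timestamp in the format shown.

```python
from datetime import datetime


def parse_ts(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS,mmm' timestamp of a log line."""
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")


start = "2024-06-01 13:39:25,934 Setting max_seq=508, max_extra_seq=828"
fail = "2024-06-01 13:59:21,926 Could not predict Hexamer. Not Enough GPU memory?"

# Time between the last message before prediction and the failure message.
elapsed = (parse_ts(fail) - parse_ts(start)).total_seconds()
print(f"{elapsed:.3f} s (~{elapsed / 60:.1f} min)")
```

So the error appeared roughly 20 minutes after the "Setting max_seq" line, not immediately at start-up.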

Below is the whole "nohup.log" file.

nohup: ignoring input

WARNING: You are welcome to use the default MSA server, however keep in mind that it's a
limited shared resource only capable of processing a few thousand MSAs per day. Please
submit jobs only from a single IP address. We reserve the right to limit access to the
server case-by-case when usage exceeds fair use. If you require more MSAs: You can 
precompute all MSAs with `colabfold_search` or host your own API and pass it to `--host-url`

2024-06-01 13:39:23,688 Running colabfold 1.5.5 (1648d2335943f9a483b6a803ebaea3e76162c788)
2024-06-01 13:39:23,887 Running on GPU
2024-06-01 13:39:24,307 Found 5 citations for tools or databases
2024-06-01 13:39:24,307 Query 1/1: Hexamer (length 5856)

  0%|          | 0/150 [elapsed: 00:00 remaining: ?]
SUBMIT:   0%|          | 0/150 [elapsed: 00:00 remaining: ?]
COMPLETE:   0%|          | 0/150 [elapsed: 00:00 remaining: ?]
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:00 remaining: 00:00]
COMPLETE: 100%|██████████| 150/150 [elapsed: 00:00 remaining: 00:00]
E0601 13:59:21.874842   93333 gpu_timer.cc:156] INTERNAL: Could not synchronize CUDA stream: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0601 13:59:21.874864   93333 gpu_timer.cc:162] INTERNAL: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0601 13:59:21.874866   93333 gpu_timer.cc:168] INTERNAL: Error destroying CUDA event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
E0601 13:59:21.895475   93333 se_gpu_pjrt_client.cc:634] Failed to query available memory for GPU 0
E0601 13:59:21.895972   93333 se_gpu_pjrt_client.cc:634] Failed to query available memory for GPU 1
2024-06-01 13:39:25,934 Setting max_seq=508, max_extra_seq=828
2024-06-01 13:59:21,926 Could not predict 2901385346_Hexamer. Not Enough GPU memory? INTERNAL: Failed to enqueue async memset operation: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-06-01 13:59:21,926 Done

The input was a homohexamer with a total length of 5,856 aa. A job with a homopentamer of the same protein (4,880 aa) finished successfully.
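To put the two sizes side by side, here is a rough sketch of the arithmetic. It assumes that the memory for pairwise features grows with the square of the total length; that is a simplification for comparison only, not a measured figure for ColabFold. The function names are hypothetical.

```python
def chain_length(total_residues, copies):
    """Length of one chain in a homo-oligomer with `copies` identical chains."""
    length, remainder = divmod(total_residues, copies)
    if remainder:
        raise ValueError("total length is not divisible by the number of copies")
    return length


def pair_elements(total_residues):
    """Number of residue pairs (L x L), assuming quadratic scaling."""
    return total_residues ** 2


hexamer, pentamer = 5856, 4880

# Both inputs are built from the same chain.
print(chain_length(hexamer, 6), chain_length(pentamer, 5))

# Relative size of the pairwise term, hexamer vs. pentamer.
ratio = pair_elements(hexamer) / pair_elements(pentamer)
print(f"{ratio:.2f}")
```

Under that assumption the hexamer is 1.2 times longer but needs about 1.44 times as much room for pairwise data as the pentamer that completed.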

Thanks.

YoshitakaMo / localcolabfold

CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #239