Hello, I ran into the "Not Enough GPU memory" problem after I solved the problem of jax not recognizing the GPU device.
Here is the process that led to the error.
I installed LocalColabFold using "install_colabbatch_linux.sh". When I ran "colabfold_batch", the error "no GPU detected, will be using CPU" occurred. I then checked whether jax could recognize the GPU device (refer to #209).
$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
# CUDA backend failed to initialize: Unable to load cuDNN. Is it installed? (Set TF_CPP_MIN_LOG_LEVEL=0 and rerun for more info.)
# cpu
I downgraded jax to "jax==0.4.7, jaxlib==0.4.7+cuda11.cudnn86" (refer to #209). After that, jax could recognize the GPU device.
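For reference, the downgrade command was roughly the following (only a sketch; the pip path assumes the default LocalColabFold install location, and the wheel index is the official jax CUDA release page):
# sketch of the jax/jaxlib downgrade described above (pip path assumed from my install)
$HOME/software/localcolabfold/colabfold-conda/bin/pip install --upgrade \
    "jax==0.4.7" "jaxlib==0.4.7+cuda11.cudnn86" \
    -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html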
$HOME/software/localcolabfold/colabfold-conda/bin/python3.10
# Python 3.10.13 | packaged by conda-forge | (main, Dec 23 2023, 15:36:39) [GCC 12.3.0] on linux
>>> import jax
>>> print(jax.local_devices()[0].platform)
# gpu
Then colabfold_batch hit the problem "No module named 'jax.extend'" (refer to #224), so I reinstalled "dm-haiku==0.0.10". After that, colabfold_batch could run on the GPU device. However, I met a new problem: "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details.".
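The dm-haiku reinstall mentioned above was along these lines (again only a sketch, with the same assumed pip path):
# sketch: pin dm-haiku back to 0.0.10 (pip path assumed from my install)
$HOME/software/localcolabfold/colabfold-conda/bin/pip install --force-reinstall "dm-haiku==0.0.10"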
I have two RTX 2080 Ti GPUs (11 GB each).
$ nvidia-smi
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.113.01 Driver Version: 535.113.01 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 2080 Ti Off | 00000000:17:00.0 Off | N/A |
| 38% 41C P0 52W / 250W | 0MiB / 11264MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
| 1 NVIDIA GeForce RTX 2080 Ti Off | 00000000:25:00.0 Off | N/A |
| 25% 30C P0 21W / 250W | 0MiB / 11264MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
My fasta file contains 450 amino acids. Is this problem caused by insufficient GPU memory? It seems that even 40 GB of GPU memory can hit this problem (refer to #90).
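In case it is relevant, my understanding is that ColabFold/AlphaFold stretch GPU memory through the JAX/TF environment variables below (a sketch based on the public AlphaFold/ColabFold docs; I assume the colabfold_batch wrapper may already export them):
# assumption: memory-related settings the wrapper is expected to use
export TF_FORCE_UNIFIED_MEMORY=1           # allow spilling GPU memory into host RAM
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0  # let the JAX allocator oversubscribe GPU memory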
In addition, given that I added CUDA 12.1 to my $PATH, I also tried modifying "install_colabbatch_linux.sh" as suggested by A-Talavera (refer to #210). I changed
"$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda11_pip]==0.4.23"
to
"$COLABFOLDDIR/colabfold-conda/bin/pip" install --upgrade "jax[cuda12_pip]==0.4.23"
With that change, jaxlib-0.4.23+cuda12.cudnn89 is installed by default. I then downgraded to "jaxlib-0.4.7+cuda12.cudnn88" following the same process as above. I can run colabfold_batch on the GPU, but it still tells me "Could not predict HNUJ.ctg90.87. Not Enough GPU memory? FAILED_PRECONDITION: DNN library initialization failed. Look at the errors above for more details". Also, in #224 you said "jax-0.4.23+cuda11.cudnn86" was also OK for CUDA 12.1.
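If it helps with diagnosis, would a quick check like the one below separate "out of GPU memory" from "cuDNN failed to initialize"? (Only a sketch: the convolution is an arbitrary op chosen to force DNN-library initialization, and the python path is the one from my install.)
$HOME/software/localcolabfold/colabfold-conda/bin/python3.10 - << 'EOF'
# sketch: force cuDNN initialization with a tiny convolution on the default device;
# a FAILED_PRECONDITION here would point to the cuDNN/CUDA setup rather than memory
import jax
import jax.numpy as jnp
print(jax.__version__, jax.local_devices())
x = jnp.ones((1, 8, 8, 1))   # NHWC input
k = jnp.ones((3, 3, 1, 1))   # HWIO kernel
y = jax.lax.conv_general_dilated(x, k, (1, 1), "SAME",
                                 dimension_numbers=("NHWC", "HWIO", "NHWC"))
print("conv ok:", y.shape)
EOF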
Computational environment
$ /usr/local/cuda/bin/nvcc --version
Since LocalColabFold requires CUDA 11.8+, I added CUDA 12.1 to the environment variable $PATH.
Could the problem be that the program calls CUDA 11.3 (/usr/local/cuda/bin/nvcc) by default, instead of the CUDA 12.1 in my $PATH?
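For context, this is how I would check which CUDA installation actually gets picked up first (just a sketch using standard shell commands):
# sketch: see which nvcc/CUDA directories win on this machine
which nvcc                                    # should resolve into the CUDA 12.1 bin dir if $PATH is right
echo "$PATH" | tr ':' '\n' | grep -i cuda     # order of CUDA entries on $PATH
echo "$LD_LIBRARY_PATH" | tr ':' '\n' | grep -i cuda
ls -l /usr/local/cuda                         # the symlink behind /usr/local/cuda/bin/nvcc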
Looking forward to your reply. Thank you.
Yulong Li