cuda issues - Githubissues

DerLorenz commented 1 year ago

Hi,

I installed ModelAngelo on our HPC following the README instructions using Anaconda3. When trying to run a prediction (submitted via Slurm), the C-alpha prediction takes very long. Checking the slurm.out file, I see that CUDA is not available. Thus, I have no GPU usage during C-alpha prediction.

/path/to/Anaconda3/envs/model_angelo/lib/python3.9/site-packages/torch/amp/autocast_mode.py:204: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

Checking torch availability seems ok.

Python 3.9.17 (main, Jul  5 2023, 20:41:20) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available
<function is_available at 0x7ff00b1e5280>

I reinstalled it several times following either the personal use or the shared computational environment instructions. I also tried different workarounds suggested in previous threads (https://github.com/3dem/model-angelo/issues/24), but nothing works.

I am a bit confused and not sure what I could do to fix it. Thus, any help would be very much appreciated!

Best, Lorenz

jamaliki commented 1 year ago

Hi Lorenz,

This is strange. The check you did, however, is incorrect. Do you mind running the following and letting me know the output?

import torch
print(torch.cuda.is_available())

Note how torch.cuda.is_available() has to be run as a function.
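The difference is easy to miss because a bare function name is a valid Python expression. Here is a minimal sketch of what happens, using a hypothetical stand-in function instead of torch itself so that it runs on any machine:

```python
def is_available() -> bool:
    """Hypothetical stand-in for torch.cuda.is_available on a machine without a GPU."""
    return False


# Without parentheses, the name evaluates to the function object itself.
# Function objects are always truthy, so this says nothing about CUDA.
reference = is_available
print(callable(reference), bool(reference))

# With parentheses, the function is actually called and returns the real answer.
result = is_available()
print(result)
```

Seeing `<function is_available at 0x...>` in the interpreter therefore only confirms that the function exists, not that CUDA is usable.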

Best, Kiarash.

DerLorenz commented 1 year ago

Dear Kiarash,

Thanks for your quick reply! Oh sorry for my mistake here. Here is the correct check.

>>> import torch
>>> print(torch.cuda.is_available())
False

I see there is an issue with torch, though I am not sure why. I was able to successfully install ModelAngelo on my workstation before. Here is a list of all packages in my conda environment.

# packages in environment at /path/to/Anaconda3/envs/model_angelo:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
biopython                 1.81                     pypi_0    pypi
blas                      1.0                         mkl  
brotlipy                  0.7.0           py39h27cfd23_1003  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.05.30           h06a4308_0  
certifi                   2023.7.22        py39h06a4308_0  
cffi                      1.15.1           py39h5eee18b_3  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
contourpy                 1.1.0                    pypi_0    pypi
cryptography              41.0.2           py39h22a60cf_0  
cuda-cudart               11.7.99                       0    nvidia
cuda-cupti                11.7.101                      0    nvidia
cuda-libraries            11.7.1                        0    nvidia
cuda-nvrtc                11.7.99                       0    nvidia
cuda-nvtx                 11.7.91                       0    nvidia
cuda-runtime              11.7.1                        0    nvidia
cudatoolkit               11.7.0              hd8887f6_10    nvidia
cycler                    0.11.0                   pypi_0    pypi
einops                    0.6.1                    pypi_0    pypi
fair-esm                  2.0.0                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.9.0            py39h06a4308_0  
fonttools                 4.41.1                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0  
giflib                    5.2.1                h5eee18b_3  
gmp                       6.2.1                h295c915_3  
gmpy2                     2.1.2            py39heeb90bb_0  
gnutls                    3.6.15               he1e5248_0  
idna                      3.4              py39h06a4308_0  
importlib-resources       6.0.0                    pypi_0    pypi
intel-openmp              2023.1.0         hdb19cb5_46305  
jinja2                    3.1.2            py39h06a4308_0  
jpeg                      9e                   h5eee18b_1  
kiwisolver                1.4.4                    pypi_0    pypi
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libcublas                 11.10.3.66                    0    nvidia
libcufft                  10.7.2.124           h4fbf590_0    nvidia
libcufile                 1.7.1.12                      0    nvidia
libcurand                 10.3.3.129                    0    nvidia
libcusolver               11.4.0.1                      0    nvidia
libcusparse               11.7.4.91                     0    nvidia
libdeflate                1.17                 h5eee18b_0  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
libnpp                    11.7.4.75                     0    nvidia
libnvjpeg                 11.8.0.2                      0    nvidia
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.0                h6a678d5_2  
libunistring              0.9.10               h27cfd23_0  
libwebp                   1.2.4                h11a3e52_1  
libwebp-base              1.2.4                h5eee18b_1  
loguru                    0.7.0                    pypi_0    pypi
lz4-c                     1.9.4                h6a678d5_0  
markupsafe                2.1.1            py39h7f8727e_0  
matplotlib                3.7.2                    pypi_0    pypi
mkl                       2023.1.0         h6d00ec8_46342  
mkl-service               2.4.0            py39h5eee18b_1  
mkl_fft                   1.3.6            py39h417a72b_1  
mkl_random                1.2.2            py39h417a72b_1  
model-angelo              1.0.1                    pypi_0    pypi
mpc                       1.1.0                h10f8cd9_1  
mpfr                      4.0.2                hb69a4c5_1  
mpmath                    1.3.0            py39h06a4308_0  
mrcfile                   1.4.3                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nettle                    3.7.3                hbbd107a_1  
networkx                  3.1              py39h06a4308_0  
numpy                     1.25.0           py39h5f9d8c6_0  
numpy-base                1.25.0           py39hb5e798b_0  
openh264                  2.1.1                h4ff587b_0  
openssl                   3.0.9                h7f8727e_0  
packaging                 23.1                     pypi_0    pypi
pandas                    2.0.3                    pypi_0    pypi
pillow                    9.4.0            py39h6a678d5_0  
pip                       23.2.1           py39h06a4308_0  
psutil                    5.9.5                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pyhmmer                   0.8.2                    pypi_0    pypi
pyopenssl                 23.2.0           py39h06a4308_0  
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1            py39h06a4308_0  
python                    3.9.17               h955ad1f_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   2.0.1           py3.9_cuda11.7_cudnn8.5.0_0    pytorch
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3                   pypi_0    pypi
readline                  8.2                  h5eee18b_0  
requests                  2.31.0           py39h06a4308_0  
scipy                     1.11.1                   pypi_0    pypi
setuptools                68.0.0           py39h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0  
sympy                     1.11.1           py39h06a4308_0  
tbb                       2021.8.0             hdb19cb5_0  
tk                        8.6.12               h1ccaba5_0  
torchaudio                2.0.2                py39_cu117    pytorch
torchtriton               2.0.0                      py39    pytorch
torchvision               0.15.2               py39_cu117    pytorch
tqdm                      4.65.0                   pypi_0    pypi
typing_extensions         4.7.1            py39h06a4308_0  
tzdata                    2023.3                   pypi_0    pypi
urllib3                   1.26.16          py39h06a4308_0  
wheel                     0.38.4           py39h06a4308_0  
xz                        5.4.2                h5eee18b_0  
zipp                      3.16.2                   pypi_0    pypi
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0

This one I installed as suggested in issue https://github.com/3dem/model-angelo/issues/24 with some adaptations:

$ conda create -n model_angelo python=3.9 -y
$ conda activate model_angelo
(model_angelo) $ conda install -y pytorch pytorch-cuda=11.7 torchvision torchaudio cudatoolkit=11.7 -c nvidia -c pytorch
(model_angelo) $ python3 -m pip install -r requirements.txt
(model_angelo) $ python3 setup.py install
(model_angelo) $ export TORCH_HOME=/path/to/weights
(model_angelo) $ conda env config vars set TORCH_HOME="$TORCH_HOME"
(model_angelo) $ conda deactivate && conda activate model_angelo

I had already tried installing it the way(s) suggested in the README, with the same outcome: the C-alpha prediction takes forever (minutes/iteration) and CUDA is not available.

Best, Lorenz

DerLorenz commented 1 year ago

I reinstalled model_angelo again following the README instructions and only changed the install script to check for and create the environment model_angelo_1. Here is the check again:

>>> import torch
>>> print(torch.cuda.is_available())
False

Also here is the list of installed packages:

# packages in environment at /software/extra/the_real_lorenz/Anaconda3/envs/model_angelo_1:
#
# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main  
_openmp_mutex             5.1                       1_gnu  
biopython                 1.81                     pypi_0    pypi
blas                      1.0                         mkl  
brotlipy                  0.7.0           py310h7f8727e_1002  
bzip2                     1.0.8                h7b6447c_0  
ca-certificates           2023.05.30           h06a4308_0  
certifi                   2023.7.22       py310h06a4308_0  
cffi                      1.15.1          py310h5eee18b_3  
charset-normalizer        2.0.4              pyhd3eb1b0_0  
contourpy                 1.1.0                    pypi_0    pypi
cryptography              41.0.2          py310h22a60cf_0  
cuda-cudart               11.7.99                       0    nvidia
cuda-cupti                11.7.101                      0    nvidia
cuda-libraries            11.7.1                        0    nvidia
cuda-nvrtc                11.7.99                       0    nvidia
cuda-nvtx                 11.7.91                       0    nvidia
cuda-runtime              11.7.1                        0    nvidia
cycler                    0.11.0                   pypi_0    pypi
einops                    0.6.1                    pypi_0    pypi
fair-esm                  2.0.0                    pypi_0    pypi
ffmpeg                    4.3                  hf484d3e_0    pytorch
filelock                  3.9.0           py310h06a4308_0  
fonttools                 4.41.1                   pypi_0    pypi
freetype                  2.12.1               h4a9f257_0  
giflib                    5.2.1                h5eee18b_3  
gmp                       6.2.1                h295c915_3  
gmpy2                     2.1.2           py310heeb90bb_0  
gnutls                    3.6.15               he1e5248_0  
idna                      3.4             py310h06a4308_0  
intel-openmp              2023.1.0         hdb19cb5_46305  
jinja2                    3.1.2           py310h06a4308_0  
jpeg                      9e                   h5eee18b_1  
kiwisolver                1.4.4                    pypi_0    pypi
lame                      3.100                h7b6447c_0  
lcms2                     2.12                 h3be6417_0  
ld_impl_linux-64          2.38                 h1181459_1  
lerc                      3.0                  h295c915_0  
libcublas                 11.10.3.66                    0    nvidia
libcufft                  10.7.2.124           h4fbf590_0    nvidia
libcufile                 1.7.1.12                      0    nvidia
libcurand                 10.3.3.129                    0    nvidia
libcusolver               11.4.0.1                      0    nvidia
libcusparse               11.7.4.91                     0    nvidia
libdeflate                1.17                 h5eee18b_0  
libffi                    3.4.4                h6a678d5_0  
libgcc-ng                 11.2.0               h1234567_1  
libgomp                   11.2.0               h1234567_1  
libiconv                  1.16                 h7f8727e_2  
libidn2                   2.3.4                h5eee18b_0  
libnpp                    11.7.4.75                     0    nvidia
libnvjpeg                 11.8.0.2                      0    nvidia
libpng                    1.6.39               h5eee18b_0  
libstdcxx-ng              11.2.0               h1234567_1  
libtasn1                  4.19.0               h5eee18b_0  
libtiff                   4.5.0                h6a678d5_2  
libunistring              0.9.10               h27cfd23_0  
libuuid                   1.41.5               h5eee18b_0  
libwebp                   1.2.4                h11a3e52_1  
libwebp-base              1.2.4                h5eee18b_1  
loguru                    0.7.0                    pypi_0    pypi
lz4-c                     1.9.4                h6a678d5_0  
markupsafe                2.1.1           py310h7f8727e_0  
matplotlib                3.7.2                    pypi_0    pypi
mkl                       2023.1.0         h6d00ec8_46342  
mkl-service               2.4.0           py310h5eee18b_1  
mkl_fft                   1.3.6           py310h1128e8f_1  
mkl_random                1.2.2           py310h1128e8f_1  
model-angelo              1.0.1                    pypi_0    pypi
mpc                       1.1.0                h10f8cd9_1  
mpfr                      4.0.2                hb69a4c5_1  
mpmath                    1.3.0           py310h06a4308_0  
mrcfile                   1.4.3                    pypi_0    pypi
ncurses                   6.4                  h6a678d5_0  
nettle                    3.7.3                hbbd107a_1  
networkx                  3.1             py310h06a4308_0  
numpy                     1.25.0          py310h5f9d8c6_0  
numpy-base                1.25.0          py310hb5e798b_0  
openh264                  2.1.1                h4ff587b_0  
openssl                   3.0.9                h7f8727e_0  
packaging                 23.1                     pypi_0    pypi
pandas                    2.0.3                    pypi_0    pypi
pillow                    9.4.0           py310h6a678d5_0  
pip                       23.2.1          py310h06a4308_0  
psutil                    5.9.5                    pypi_0    pypi
pycparser                 2.21               pyhd3eb1b0_0  
pyhmmer                   0.8.2                    pypi_0    pypi
pyopenssl                 23.2.0          py310h06a4308_0  
pyparsing                 3.0.9                    pypi_0    pypi
pysocks                   1.7.1           py310h06a4308_0  
python                    3.10.12              h955ad1f_0  
python-dateutil           2.8.2                    pypi_0    pypi
pytorch                   2.0.1           py3.10_cuda11.7_cudnn8.5.0_0    pytorch
pytorch-cuda              11.7                 h778d358_5    pytorch
pytorch-mutex             1.0                        cuda    pytorch
pytz                      2023.3                   pypi_0    pypi
readline                  8.2                  h5eee18b_0  
requests                  2.31.0          py310h06a4308_0  
scipy                     1.11.1                   pypi_0    pypi
setuptools                68.0.0          py310h06a4308_0  
six                       1.16.0                   pypi_0    pypi
sqlite                    3.41.2               h5eee18b_0  
sympy                     1.11.1          py310h06a4308_0  
tbb                       2021.8.0             hdb19cb5_0  
tk                        8.6.12               h1ccaba5_0  
torchaudio                2.0.2               py310_cu117    pytorch
torchtriton               2.0.0                     py310    pytorch
torchvision               0.15.2              py310_cu117    pytorch
tqdm                      4.65.0                   pypi_0    pypi
typing_extensions         4.7.1           py310h06a4308_0  
tzdata                    2023.3                   pypi_0    pypi
urllib3                   1.26.16         py310h06a4308_0  
wheel                     0.38.4          py310h06a4308_0  
xz                        5.4.2                h5eee18b_0  
zlib                      1.2.13               h5eee18b_0  
zstd                      1.5.5                hc292b87_0

Maybe that helps. CUDA normally works fine on our cluster with other software.

jamaliki commented 1 year ago

Right, so the new environment definitely has PyTorch with CUDA installed. When you run the check, in the same environment, are you able to run nvidia-smi? Does it show your GPUs?
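The build string in the `conda list` output is what shows this: the `pytorch` row reads `py3.10_cuda11.7_cudnn8.5.0_0`. As a rough sketch, the same check can be scripted. The helper name `pytorch_build_kind` is hypothetical, and treating a build string that contains `cpu` as a CPU-only build is an assumption about how the pytorch channel names its packages:

```python
from typing import Optional


def pytorch_build_kind(conda_list_output: str) -> Optional[str]:
    """Return 'cuda', 'cpu', or None based on the pytorch row of `conda list`."""
    for line in conda_list_output.splitlines():
        fields = line.split()
        # Rows are: name, version, build, [channel]. Comment and short rows are skipped.
        if len(fields) < 3 or fields[0] != "pytorch":
            continue
        build = fields[2].lower()
        if "cuda" in build:
            return "cuda"
        if "cpu" in build:  # assumption: CPU-only builds carry "cpu" in the build string
            return "cpu"
        return None
    return None


sample = """\
python                    3.10.12              h955ad1f_0
pytorch                   2.0.1           py3.10_cuda11.7_cudnn8.5.0_0    pytorch
pytorch-cuda              11.7                 h778d358_5    pytorch
"""
print(pytorch_build_kind(sample))
```

A CUDA build only means the package can use a GPU. Whether a GPU is visible at runtime is a separate question, which is why the next step is nvidia-smi.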

DerLorenz commented 1 year ago

With just the environment activated, nvidia-smi doesn't show me the GPUs.

nvidia-smi
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

When activating the environment with GPUs allocated using tmux, I get a different error:

nvidia-smi
No devices were found
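The two messages point in different directions. The following is a hedged sketch that maps each one to a likely cause. The function name `interpret_nvidia_smi` is hypothetical, and the interpretations are common explanations on shared clusters, not something the tool itself reports:

```python
def interpret_nvidia_smi(output: str) -> str:
    """Map the two nvidia-smi messages seen above to a likely (not certain) cause."""
    text = output.lower()
    if "couldn't communicate with the nvidia driver" in text:
        # Typical on machines without a loaded NVIDIA driver, such as a login node.
        return "no driver reachable: probably a node without GPUs"
    if "no devices were found" in text:
        # The driver responds, but this session cannot see any GPU.
        return "driver present but no GPU visible: possibly none was allocated"
    return "no known error message found"


print(interpret_nvidia_smi("No devices were found"))
```

Either way, a failing nvidia-smi means the problem sits below PyTorch, so reinstalling the conda environment is unlikely to help.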

DerLorenz commented 1 year ago

Wait, I am an idiot. I found the issue in my Slurm submission script. While asking for resources on the GPU nodes, I didn't actually get a GPU allocated. I fixed this now and I get the correct output for all checks.

>>> import torch
>>> print(torch.cuda.is_available())
True

and

nvidia-smi
Thu Aug  3 10:06:03 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   25C    P0    23W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

I restarted the predictions and they should now run as expected. I will give a short update once the job has started running.
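The submission script itself is not shown in this thread, so the exact fix is unknown. On many Slurm clusters a GPU has to be requested explicitly, for example with `--gres=gpu:1`, but the right option depends on the site. To catch this kind of mistake early, a job could run a small pre-flight check before the prediction starts. The sketch below is one possible version. The function names are hypothetical, and it assumes the scheduler exports `CUDA_VISIBLE_DEVICES` for jobs that hold a GPU, which is site-dependent:

```python
import os
import shutil
import subprocess
from typing import List, Mapping, Optional


def gpu_preflight(env: Mapping[str, str]) -> List[str]:
    """Return warnings about GPU visibility; an empty list means nothing looked wrong."""
    warnings = []
    # Assumption: the scheduler exports CUDA_VISIBLE_DEVICES for jobs that hold a GPU.
    if not env.get("CUDA_VISIBLE_DEVICES", "").strip():
        warnings.append("CUDA_VISIBLE_DEVICES is unset or empty; was a GPU requested?")
    return warnings


def list_gpus() -> Optional[str]:
    """Return the output of `nvidia-smi -L`, or None if the tool is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        done = subprocess.run([exe, "-L"], capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return None
    return done.stdout.strip() if done.returncode == 0 else None


for warning in gpu_preflight(os.environ):
    print("WARNING:", warning)
print("GPUs seen by nvidia-smi:", list_gpus() or "none")
```

Running this at the top of the job writes the result into slurm.out, so a missing GPU shows up within seconds instead of as a slow C-alpha prediction.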

jamaliki commented 1 year ago

Excellent! Was just typing a reply and saw your message changed :)

Let me know when to mark this issue as resolved!

DerLorenz commented 1 year ago

Yes, it works now. Sorry for my stupidity and for bothering you in the first place. Thank you very much (and congratulations) for this awesome software and the support! The initial version already served me well and I am now looking forward to seeing the rRNA prediction results.

jamaliki commented 1 year ago

No problem! I'm glad things are running now :)

3dem / model-angelo

cuda issues #60