DerLorenz commented 1 year ago


I installed ModelAngelo on our HPC cluster following the README instructions, using Anaconda3. When I run a prediction (submitted via Slurm), the C-alpha prediction takes very long. The slurm.out file shows that CUDA is not available, so there is no GPU usage during C-alpha prediction.

/path/to/Anaconda3/envs/model_angelo/lib/python3.9/site-packages/torch/amp/autocast_mode.py:204: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn('User provided device_type of \'cuda\', but CUDA is not available. Disabling')

Checking torch availability seems ok.

Python 3.9.17 (main, Jul  5 2023, 20:41:20) 
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available
<function is_available at 0x7ff00b1e5280>

I reinstalled it several times, following either the personal-use or the shared computational environment instructions. I also tried the workarounds suggested in previous threads (https://github.com/3dem/model-angelo/issues/24), but nothing works.

I am a bit confused and not sure what I could do to fix it. Thus, any help would be very much appreciated!

Best, Lorenz

jamaliki commented 1 year ago

Hi Lorenz,

This is strange. The check you did, however, is incorrect. Do you mind running the following and letting me know the output?

import torch
print(torch.cuda.is_available())

Note how torch.cuda.is_available() has to be run as a function.
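For later readers, here is a minimal sketch of why the original check was misleading. It uses a hypothetical stand-in function instead of torch (so it runs anywhere); the same distinction applies to `torch.cuda.is_available`.

```python
def is_available() -> bool:
    """Hypothetical stand-in for torch.cuda.is_available, so this runs without torch."""
    return False  # pretend no GPU is visible


reference = is_available   # no parentheses: the function object itself
result = is_available()    # parentheses: the function is called and returns the answer

print(reference)           # prints something like <function is_available at 0x...>
print(bool(reference))     # a function object is always truthy, whatever it would return
print(result)              # the actual result of the check
```

Seeing `<function is_available at 0x...>` in the REPL therefore only shows that the function exists, not that CUDA is usable.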

Best, Kiarash.

DerLorenz commented 1 year ago

Dear Kiarash,

Thanks for your quick reply! Oh sorry for my mistake here. Here is the correct check.

>>> import torch
>>> print(torch.cuda.is_available())

I see there is an issue with torch, though I am not sure why. I was able to successfully install ModelAngelo on my workstation before. Here is a list of all packages in my conda environment.

I installed this one as suggested in issue https://github.com/3dem/model-angelo/issues/24, with some adaptations:

$ conda create -n model_angelo python=3.9 -y
$ conda activate model_angelo
(model_angelo) $ conda install -y pytorch pytorch-cuda=11.7 torchvision torchaudio cudatoolkit=11.7 -c nvidia -c pytorch
(model_angelo) $ python3 -m pip install -r requirements.txt
(model_angelo) $ python3 setup.py install
(model_angelo) $ export TORCH_HOME=/path/to/weights
(model_angelo) $ conda env config vars set TORCH_HOME="$TORCH_HOME"
(model_angelo) $ conda deactivate && conda activate model_angelo 

I had already tried installing it the way(s) suggested in the README, with the same outcome: the C-alpha prediction takes forever (minutes per iteration) and CUDA is not available.

Best, Lorenz

DerLorenz commented 1 year ago

I reinstalled model_angelo again following the README instructions, changing the install script only so that it checks for and creates the environment model_angelo_1. Here is the check again:

>>> import torch
>>> print(torch.cuda.is_available())
False

Also here is the list of installed packages:

Maybe that helps. CUDA normally works fine on our cluster with other software.

jamaliki commented 1 year ago

Right, so the new environment definitely has pytorch with cuda installed. When you run the check, in the same environment, are you able to run nvidia-smi? Does it show your GPUs?
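A rough way to tell "CPU-only PyTorch build" apart from "CUDA build, but no GPU visible" from inside Python is sketched below. The helper name `describe_torch_build` is made up for illustration; it only uses standard PyTorch attributes and returns `None` if torch is not installed in the active environment.

```python
import importlib.util


def describe_torch_build():
    """Summarise the installed PyTorch build, or return None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None
    import torch

    return {
        "torch_version": torch.__version__,
        # CUDA version the package was built against; None for CPU-only builds.
        "built_with_cuda": torch.version.cuda,
        # Whether a usable GPU and driver are visible to this process right now.
        "cuda_available": torch.cuda.is_available(),
        "visible_gpus": torch.cuda.device_count(),
    }


print(describe_torch_build())
```

If `built_with_cuda` is set but `cuda_available` is `False`, the installation itself is probably fine and the problem lies with the driver or with the GPU allocation on the node.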

DerLorenz commented 1 year ago

With just the environment activated, nvidia-smi doesn't show me the GPUs.

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.

When activating the environment with GPUs allocated using tmux, I get a different error:

No devices were found

DerLorenz commented 1 year ago

Wait, I am an idiot. I found the issue in my Slurm submission script. Although I was asking for resources on the GPU nodes, I didn't actually get a GPU allocated. I fixed this now and I get the correct outputs for all checks.

>>> import torch
>>> print(torch.cuda.is_available())


Thu Aug  3 10:06:03 2023       
| NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla P100-PCIE...  On   | 00000000:00:08.0 Off |                    0 |
| N/A   25C    P0    23W / 250W |      0MiB / 12288MiB |      0%      Default |
|                               |                      |                  N/A |

| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|  No running processes found                                                 |

I restarted the predictions and they should now run as expected. I will give a short update once the job has started running.
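In case it helps others who land here: the checks from this thread could be bundled into a small pre-flight script that is run at the top of the Slurm job, before the prediction starts. This is only a sketch under assumptions. The function name `gpu_preflight` is hypothetical, and it assumes the cluster's Slurm configuration exports `CUDA_VISIBLE_DEVICES` for jobs that were granted a GPU (common, but site-dependent).

```python
import importlib.util
import os
import shutil


def gpu_preflight(env=None):
    """Return a list of detected problems; an empty list means none were found."""
    env = os.environ if env is None else env
    problems = []

    # Assumption: Slurm sets CUDA_VISIBLE_DEVICES when a GPU was actually allocated,
    # for example via a request such as --gres=gpu:1 in the submission script.
    if "SLURM_JOB_ID" in env and not env.get("CUDA_VISIBLE_DEVICES"):
        problems.append(
            "Inside a Slurm job but CUDA_VISIBLE_DEVICES is empty; "
            "was a GPU requested in the submission script?"
        )

    # nvidia-smi is missing on nodes without the NVIDIA driver tools installed.
    if shutil.which("nvidia-smi") is None:
        problems.append("nvidia-smi not found on PATH; is this a GPU node?")

    # Only import torch if it is installed in the active environment.
    if importlib.util.find_spec("torch") is None:
        problems.append("torch is not importable in this environment.")
    else:
        import torch

        if not torch.cuda.is_available():
            problems.append("torch.cuda.is_available() returned False.")

    return problems


if __name__ == "__main__":
    for problem in gpu_preflight():
        print("WARNING:", problem)
```

Failing fast on a non-empty list would have caught the missing GPU allocation before the C-alpha prediction silently fell back to the CPU.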

jamaliki commented 1 year ago

Excellent! Was just typing a reply and saw your message changed :)

Let me know when to mark this issue as resolved!

DerLorenz commented 1 year ago

Yes, it works now. Sorry for my stupidity and for bothering you in the first place. Thank you very much (and congratulations) for this awesome software and the support! The initial version already served me well and I am now looking forward to seeing the rRNA prediction results.

jamaliki commented 1 year ago

No problem! I'm glad things are running now :)