3dem / model-angelo

Automatic atomic model building program for cryo-EM maps
MIT License

ModelAngelo not using GPUs (2080 Ti, CUDA 11.4) #24

Open marinegor opened 1 year ago

marinegor commented 1 year ago

Hi everyone,

I installed ModelAngelo on our server (as described in the README) and noticed that Cα building took ~5 hours while GPU utilization stayed at 0 the whole time. Afterwards, I ran python manually and checked torch's CUDA availability:

>>> import torch
>>> torch.cuda.is_available()
False

which seems to be the reason. CUDA, however, seems to be working fine on the server:

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Mon_Oct_11_21:27:02_PDT_2021
Cuda compilation tools, release 11.4, V11.4.152
Build cuda_11.4.r11.4/compiler.30521435_0

(also for other packages, e.g. relion/cryosparc/you name it).
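A fuller check than torch.cuda.is_available() can show whether the installed wheel was built with CUDA support at all. This is a diagnostic sketch (the helper name is made up; it only assumes torch may or may not be importable):

```python
import importlib.util

def torch_cuda_report():
    """Summarize the installed torch build.

    CPU-only wheels report torch.version.cuda as None, which is the
    usual sign that conda resolved the cpu build instead of the CUDA one.
    """
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed"
    import torch
    return (f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
            f"cuda available: {torch.cuda.is_available()}")

print(torch_cuda_report())
```

If the report says `built for CUDA None`, the problem is the installed package itself, not the driver, so nvcc working is consistent with is_available() being False.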

Could you please point me to what I should do to fix that?

jamaliki commented 1 year ago

Hi,

Thank you for your report. This is interesting and should not be happening. I think something may have gone wrong in the torch installation. Could you please run this command after activating the conda environment?

conda install -y pytorch torchvision torchaudio cudatoolkit=11.4 -c pytorch

I don't know why this would be an issue, but maybe it is the slight mismatch in the CUDA version.

jasonkey commented 1 year ago

I ran into this issue as well. I don't think conda is resolving the torch + CUDA dependencies correctly at installation: it installs only the CPU libraries, not the required "libtorch_cuda*.so" libraries.

I was able to work around this by using mamba instead of conda which seems to handle the dependencies correctly.

jamaliki commented 1 year ago

Interesting! @jasonkey do you have the diffs you made to the installation script somewhere?

marinegor commented 1 year ago

@jamaliki cudatoolkit=11.4 is not available from the configured conda channels:

PackagesNotFoundError: The following packages are not available from current channels:

  - cudatoolkit=11.4

Current channels:

  - https://conda.anaconda.org/pytorch/linux-64
  - https://conda.anaconda.org/pytorch/noarch
  - https://repo.anaconda.com/pkgs/main/linux-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/linux-64
  - https://repo.anaconda.com/pkgs/r/noarch
jasonkey commented 1 year ago

No, I just run the installation manually without using the script.

conda install mamba 
mamba install cudatoolkit=11.3 pytorch torchvision torchaudio -c pytorch

worked for me and installed the missing "libtorch_cuda*.so" libraries.

I happen to still have it in my scrollback. In this case I downgraded pytorch intentionally, but you can see that the packages conda included are the cpu packages. These are replaced with the correct cuda versions with mamba.

  - pytorch            1.13.0  py3.9_cpu_0
  + pytorch            1.12.1  py3.9_cuda11.3_cudnn8.3.2_0
  - torchaudio         0.13.0  py39_cpu
  + torchaudio         0.12.1  py39_cu113
  - torchvision        0.14.0  py39_cpu
  + torchvision        0.13.1  py39_cu113

I don't know why conda and mamba behave differently here.
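One way to see which build conda actually resolved, without waiting for a full ModelAngelo run, is to inspect the package's build string (the env name below is a placeholder for whatever you used):

```shell
# List the installed pytorch package and its build string (env name is hypothetical).
# A CPU-only build looks like "py3.9_cpu_0"; a CUDA build like "py3.9_cuda11.3_cudnn8.3.2_0".
conda list -n model_angelo '^pytorch$'
```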

marinegor commented 1 year ago

@jasonkey I tried mamba -- didn't work for me either, surprisingly.

Namely, I did:

conda install mamba  # btw, this one took sooo long
mamba install cudatoolkit=11.3 pytorch torchvision torchaudio -c pytorch

and then still have this:

$ python
>>> import torch
>>> torch.cuda.is_available()
False
jamaliki commented 1 year ago

Hmpf. This is really strange. Are you able to install PyTorch with GPU normally?

marinegor commented 1 year ago

> Hmpf. This is really strange. Are you able to install PyTorch with GPU normally?

Yes. If I install fresh virtual environment, I can see that CUDA is available:

$ python3 -m venv venv
$ source venv/bin/activate
$ # source: pytorch documentation: https://pytorch.org/get-started/locally/
(venv) $ python3 -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu114
# long output
(venv) $ python3
>>> import torch
>>> torch.cuda.is_available()
True
jamaliki commented 1 year ago

Interesting, does it just not work with the conda install? Maybe that is the issue

zhihao-2022 commented 1 year ago

Hi, where can I choose which GPU to use?

jamaliki commented 1 year ago

> Hi, where can I choose which GPU to use?

You can specify the GPU with the --device flag. If you type model_angelo --help it should give you all of the options.
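For example, an invocation pinned to the second GPU might look like this (the map and sequence paths are placeholders; only the `--device` flag is confirmed above):

```shell
# Run ModelAngelo on a specific GPU; input paths here are placeholders.
model_angelo build -v map.mrc -pf sequence.fasta -o output --device cuda:1
# model_angelo --help lists all available flags.
```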

marinegor commented 1 year ago

> Interesting, does it just not work with the conda install? Maybe that is the issue

Sorry, I didn't quite understand what you mean here :)

marinegor commented 1 year ago

Ok, so it seems that there's no need for conda here -- installation with python venv works just fine:

$ # in model_angelo github folder
$ python3 -m venv env
$ source env/bin/activate
(env) $ python3 -m pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu114
(env) $ python3 -m pip install -r requirements.txt
(env) $ python3 setup.py install
jamaliki commented 1 year ago

That's awesome! Yeah that's what I meant :)

It is strange that the conda install did not work, but I'm glad you were able to install it anyway!

marinegor commented 1 year ago

Ah, my bad -- after installing into the virtual environment, I got:

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED

when model_angelo entered the GNN refinement stage (I assume because pip couldn't install cudatoolkit properly).

After a few hours, the solution that worked was this:

$ conda create -n model_angelo python=3.9 -y
$ conda activate model_angelo
(model_angelo) $ conda install -y pytorch pytorch-cuda=11.6 torchvision torchaudio cudatoolkit=11.6 -c nvidia -c pytorch
(model_angelo) $ python3 -m pip install -r requirements.txt
(model_angelo) $ python3 setup.py install
(model_angelo) $ export TORCH_HOME=/path/to/weights
(model_angelo) $ conda env config vars set TORCH_HOME="$TORCH_HOME"
(model_angelo) $ conda deactivate && conda activate model_angelo  # necessary to enable TORCH_HOME in current session 

after that, model_angelo works as expected (at least goes into GNN refinement stage, which wasn't happening before).
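Before launching a multi-hour job, a quick smoke test inside the activated environment can confirm the build is CUDA-enabled (a sketch; it only assumes torch is importable):

```shell
# Should print a CUDA version (not "None") and True if the GPU build is installed.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```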

jamaliki commented 1 year ago

Interesting, so was it the cudatoolkit=11.6?

marinegor commented 1 year ago

Yep. There's no 11.4 on the pytorch channel, and pytorch kept being installed without GPU support. I had to specifically ask for pytorch-cuda (and hence -c nvidia), and for a higher cudatoolkit version for compatibility with pytorch (since pytorch-cuda=11.3 is, as far as I recall, unavailable there).

jamaliki commented 1 year ago

Thank you this is very useful. I will test to see if this change works on our cluster and then I will push it to the repo!

zhihao-2022 commented 1 year ago

My GPU also doesn't get used when I follow the README.

jamaliki commented 1 year ago

@zhihao-2022 could you make a new issue and add some information about how you installed the program, what kind of machine with what kind of operating system you have, and also whether pytorch is able to see the GPU?

DmitrySemchonok commented 9 months ago

Hello @jamaliki ,

I have a similar issue (CUDA is not available), but this fix doesn't work for me.

In more detail: ModelAngelo runs, but without CUDA. (I have CUDA 11.8; CentOS 7.)

The error:

(model_angelo) [caroline@lvx0862 model-angelo]$ model_angelo build -v /home/caroline/Documents/Phenix/new_heptamer_extraction_maskwithbest__J1427/classes_ofheptamersfrom_cl_J1514fromJ1513/J1631__cl0/cryosparc_P4_J1631_007_volume_map.mrc -pf /home/caroline/Documents/Phenix/sequence/P11076.fasta -o output
---------------------------- ModelAngelo -----------------------------
By Kiarash Jamali, Scheres Group, MRC Laboratory of Molecular Biology
--------------------- Initial C-alpha prediction ---------------------
  0%| | 0/9261 [00:00<?, ?it/s]
/home/caroline/miniconda3/envs/model_angelo/lib/python3.9/site-packages/torch/amp/autocast_mode.py:250: UserWarning: User provided device_type of 'cuda', but CUDA is not available. Disabling
  warnings.warn(
/home/caroline/miniconda3/envs/model_angelo/lib/python3.9/site-packages/torch/utils/checkpoint.py:429: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
  warnings.warn(
^Z
[2]+ Stopped

(model_angelo) [caroline@lvx0862 model-angelo]$ python
Python 3.9.18 (main, Sep 11 2023, 13:41:44) [GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
False

Moreover, when I type the import torch command inside the modelangelo conda env, it doesn't respond; a cross appears instead of the cursor. Nothing changes until I click the mouse, and when I do, the command disappears.

Could you please help?

thank you.

sincerely, Dmitry

jamaliki commented 9 months ago

Hi @DmitrySemchonok ,

Are you able to verify that the server has access to GPUs?

When you install torch alone in an environment, does torch.cuda.is_available() give you True?

If that is the case, could you try installing torch with CUDA and then installing ModelAngelo with the new pip install command and let me know if you still have issues?