Gaius-Augustus / learnMSA

Learning and Aligning Large Protein Families with support of protein language models.
MIT License

It seems like no GPU is installed. Running on CPU instead. #4

Closed No-drama-llama closed 4 months ago

No-drama-llama commented 1 year ago

Hi,

First of all, thank you for this wonderful tool. I love it, it works great for my sequences and it is fast and aligns them correctly.

However, I keep having a problem: the program doesn't recognize the available GPUs on our cluster. I have a large number of sequences (around 11,000), and using a GPU would really cut my computation time. I am attaching the output from the nvidia-smi command:

    +-----------------------------------------------------------------------------+
    | NVIDIA-SMI 525.85.12    Driver Version: 525.85.12    CUDA Version: 12.0     |
    |-------------------------------+----------------------+----------------------+
    | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
    | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
    |                               |                      |               MIG M. |
    |===============================+======================+======================|
    |   0  NVIDIA A100-SXM...  On   | 00000000:01:00.0 Off |                    0 |
    | N/A   27C    P0    50W / 400W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+
    |   1  NVIDIA A100-SXM...  On   | 00000000:C1:00.0 Off |                    0 |
    | N/A   25C    P0    50W / 400W |      0MiB / 40960MiB |      0%      Default |
    |                               |                      |             Disabled |
    +-------------------------------+----------------------+----------------------+

As you can see, there are GPUs available. We have CUDA and cuDNN installed (via module load), and I run this in the dedicated GPU partition on Slurm (srun --partition=gpu --qos=gpu --gres=gpu:A100:2 --mem=30G learnMSA -i input.fasta -d 0,1 -o output.fasta). I am a bit lost as to what the problem might be or what I am doing wrong.
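One way to narrow down a setup like the one described above is to look at what the job itself sees, since nvidia-smi on a login shell and the environment inside srun can differ. The following is a minimal sketch, not part of learnMSA: the helper name gpu_environment_report is hypothetical, and it assumes the cluster exposes allocated GPUs through CUDA_VISIBLE_DEVICES and makes the CUDA/cuDNN libraries visible through LD_LIBRARY_PATH, which depends on how Slurm and the modules are configured.

```python
import os
import shutil


def gpu_environment_report(env=None):
    """Collect the environment settings that commonly affect GPU detection.

    `env` defaults to the current process environment; passing a dict makes
    the function easy to try out without a real Slurm job.
    """
    env = os.environ if env is None else env
    library_path = env.get("LD_LIBRARY_PATH", "")
    return {
        # None means the variable is unset; an empty string hides all GPUs.
        "CUDA_VISIBLE_DEVICES": env.get("CUDA_VISIBLE_DEVICES"),
        # Directories searched for shared libraries such as libcudart/libcudnn.
        "LD_LIBRARY_PATH": [p for p in library_path.split(os.pathsep) if p],
        # Whether the nvidia-smi executable is on the PATH of this process.
        "nvidia_smi_on_path": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for key, value in gpu_environment_report().items():
        print(f"{key}: {value}")
```

Running this script through the same srun command line as learnMSA would show whether the GPU allocation and the loaded modules actually reach the job.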

Can you help me with this, please?

Thanks again for such a wonderful tool and have a great day! Anamarija

felbecker commented 1 year ago

Hi!

From what you wrote I assume that you don't get any error message and learnMSA runs correctly other than the fact that it doesn't recognize the A100.

Could you try to use a single GPU instead (-d 0)? There might be a problem with multi-GPU support that I have to fix first.

If a single GPU is still not detected, can you verify that your tensorflow and CUDA/cuDNN versions are compatible? (They have to be, unfortunately, and it can be a hassle.) There is a table here that will help.

Since in your case CUDA Version: 12.0 seems to be installed, upgrading tensorflow to the newest version could help.
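To compare the installed tensorflow against the CUDA/cuDNN versions on the system, it can help to ask tensorflow directly what it was built for and which GPUs it sees. This is a minimal sketch using public TensorFlow calls (tf.sysconfig.get_build_info, tf.test.is_built_with_cuda, tf.config.list_physical_devices); the function name tensorflow_gpu_report is hypothetical, and on CPU-only builds the CUDA and cuDNN entries may simply be missing, which is reported as None here.

```python
def tensorflow_gpu_report():
    """Summarize the TensorFlow build and the GPUs it can see."""
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is not installed in this environment.
        return {"tensorflow": None}

    build = tf.sysconfig.get_build_info()
    return {
        "tensorflow": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        # Versions TensorFlow was compiled against, not the ones installed.
        "cuda_version": build.get("cuda_version"),
        "cudnn_version": build.get("cudnn_version"),
        # An empty list corresponds to the "no GPU" fallback to CPU.
        "gpus": [device.name for device in tf.config.list_physical_devices("GPU")],
    }


if __name__ == "__main__":
    for key, value in tensorflow_gpu_report().items():
        print(f"{key}: {value}")
```

If the reported CUDA version differs from what the cluster modules provide, or the GPU list is empty inside a GPU job, a version mismatch is a plausible cause.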

No-drama-llama commented 1 year ago

Hi,

Yes, learnMSA runs correctly otherwise. I tried with one GPU but still the same error. I think the problem might be the version of tensorflow and CUDA on our system, I am trying to fix it now. Hopefully it doesn't give me too much trouble.

felbecker commented 1 year ago

By chance, we also have A100 GPUs with CUDA 12.0 installed on our cluster. I can confirm that with tensorflow==2.13.0 the GPUs are not recognized, but everything works well under tensorflow==2.10.0.

We should be able to solve your problem this way:

    conda install mamba
    mamba create -n learnMSA_env tensorflow==2.10.0 learnMSA
    mamba activate learnMSA_env

You can also skip the first step and replace mamba with conda, but solving the environment is then usually much slower.

No-drama-llama commented 1 year ago

Hi again,

I tried what you suggested, but it does not work for me. I get this error:

    /pasteur/appa/homes/abutkovi/anaconda3/envs/learmsa2/lib/python3.11/site-packages/conda_package_streaming/package_streaming.py:19: UserWarning: zstandard could not be imported. Running without .conda support.
      warnings.warn("zstandard could not be imported. Running without .conda support.")
    /pasteur/appa/homes/abutkovi/anaconda3/envs/learmsa2/lib/python3.11/site-packages/conda_package_handling/api.py:29: UserWarning: Install zstandard Python bindings for .conda support
      _warnings.warn("Install zstandard Python bindings for .conda support")

    Looking for: ['tensorflow==2.10.0', 'learnmsa', 'famsa']

    pkgs/main/noarch          851.6kB @ 4.4MB/s 0.2s
    pkgs/r/linux-64             1.4MB @ 4.7MB/s 0.3s
    pkgs/r/noarch               1.3MB @ 2.8MB/s 0.3s
    pkgs/main/linux-64          5.9MB @ 5.5MB/s 1.1s
    conda-forge/noarch         13.4MB @ 5.2MB/s 2.6s
    conda-forge/linux-64       33.2MB @ 5.5MB/s 6.3s
    Could not solve for environment specs
    Encountered problems while solving:

    The environment can't be solved, aborting the operation

I also tried pip install tensorflow==2.10.0 logomaker networkx seaborn, but it doesn't find tensorflow 2.10:

    ERROR: Could not find a version that satisfies the requirement tensorflow==2.10.0 (from versions: 2.12.0rc0, 2.12.0rc1, 2.12.0, 2.12.1, 2.13.0rc0, 2.13.0rc1, 2.13.0rc2, 2.13.0)
    ERROR: No matching distribution found for tensorflow==2.10.0

Do you maybe have an idea why mamba doesn't find learnmsa?
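A possible explanation for the pip error, not confirmed anywhere in this thread: the paths in the log point to python3.11, and TensorFlow 2.10 wheels were published for Python 3.7 to 3.10, which would match pip offering nothing older than 2.12. A minimal sketch of that check follows; the function name and the SUPPORTED_RANGE constant are illustrative and encode that assumption only.

```python
import sys

# Assumption: TensorFlow 2.10 wheels target Python 3.7 through 3.10.
SUPPORTED_RANGE = ((3, 7), (3, 10))


def python_supported(version=None, supported=SUPPORTED_RANGE):
    """Return True if the (major, minor) Python version is within the range."""
    major_minor = tuple((version or sys.version_info)[:2])
    lowest, highest = supported
    return lowest <= major_minor <= highest


if __name__ == "__main__":
    current = ".".join(str(part) for part in sys.version_info[:2])
    if python_supported():
        print(f"Python {current} is within the assumed range for tensorflow 2.10")
    else:
        print(f"Python {current} is outside the assumed range for tensorflow 2.10")
```

If the check fails, pinning the interpreter when creating the environment (for example by adding python=3.10 to the create command) may let the solver find tensorflow 2.10.0.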

No-drama-llama commented 1 year ago

I tried installing the zstandard library, but conda says it is already installed.

felbecker commented 1 year ago

Have you set up Bioconda channels (see here)? You only have to do this once.

No-drama-llama commented 1 year ago

Yes, there was a different problem with conda but I managed to fix it. Let's see if the program uses the GPUs now. Thanks for your help!

No-drama-llama commented 1 year ago

Hi,

Unfortunately, I couldn't get learnMSA to run on the GPU, but it seems to be an issue with our cluster. Since everybody is on vacation, I will try to fix it when people come back; for now I am running it on CPU.

Thanks for your help and have a lovely August, Anamarija

felbecker commented 1 year ago

Thanks for reporting back!

Let me know how it turns out and if I can further assist with anything.

Have a great summer as well!

felbecker commented 4 months ago

I will close this issue. Feel free to reopen it if necessary.