Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

"MisconfigurationException: No supported gpu backend found!" with multi gpu training in jupyter notebooks #15254

Open vacmar01 opened 2 years ago

vacmar01 commented 2 years ago

Bug description

When trying to train on two GPUs in a Jupyter notebook environment on jarvislabs.ai with the ddp_notebook strategy, I get the following error: "MisconfigurationException: No supported gpu backend found!".

I'm trying to train on two RTX 5000 GPUs. On a Kaggle GPU the same code runs without any problem.

Any ideas?

How to reproduce the bug

import pytorch_lightning as pl

# model, train_dl, and val_dl are defined elsewhere
trainer = pl.Trainer(
    max_epochs=2,
    accelerator="gpu",
    devices=2,
    strategy="ddp_notebook",  # fork-based launcher for Jupyter, as described above
    precision=16,
    accumulate_grad_batches=2,
)
trainer.fit(model, train_dl, val_dl)

Error messages and logs

"MisconfigurationException: No supported gpu backend found!"

Environment


#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): 1.7.7
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10): 1.11
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version: V11.6.55
#- GPU models and configuration: 2x RTX 5000
#- How you installed Lightning(`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): jarvislabs.ai 

More info

No response

cc @justusschock @awaelchli

rohitgr7 commented 2 years ago

cc: @awaelchli

awaelchli commented 2 years ago

@vacmar01 Was your PyTorch installed with GPU support? I suspect that it was not. Please check what

import torch
print(torch.cuda.is_available())

returns for you. If it prints False, please install a GPU-enabled build like so: pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
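
For a fuller picture, a short diagnostic along these lines can also help (a minimal sketch using only standard torch attributes):

import torch

print(torch.__version__)          # e.g. 1.12.1+cu113; a "+cpu" suffix means no CUDA support
print(torch.version.cuda)         # CUDA version torch was built against, or None for CPU-only builds
print(torch.cuda.is_available())  # True only if a usable driver and device are present
print(torch.cuda.device_count())  # number of GPUs torch can see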

vacmar01 commented 2 years ago

It is installed with GPU support: torch.cuda.is_available() returns True and torch.cuda.device_count() returns 2. But after I call these functions I can't start training anymore because of the "process forking" error.

awaelchli commented 2 years ago

@vacmar01 I cannot reproduce this on our multi-GPU machine with a Jupyter notebook. I haven't tried jarvislabs.ai, but I don't think it is related. There was an issue in 1.7.7 with precision=16 and the "process forking" error you described, which we have since fixed.

Would you mind checking again by installing our development version from master to see if your GPUs get properly detected? To install from master, simply run:

pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U

vacmar01 commented 2 years ago

I suppose it has something to do with the GPU or the CUDA version, since on Kaggle the exact same code ran with no problem.

I will install the development version of lightning and try again. Thank you!

vacmar01 commented 2 years ago

Okay, so I installed the dev version of lightning and now I get a new error (with precision=32 or precision=16; it doesn't matter):

RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!

What's odd is that starting the trainer now logs the following:

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W CUDAFunctions.cpp:112] Warning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (function operator())

So it seems to recognize the GPU but it still doesn't work.

vacmar01 commented 1 year ago

Any updates on this?

awaelchli commented 1 year ago

[W CUDAFunctions.cpp:112] Warning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (function operator())

This warning comes from torch. Can you make sure that you have the latest driver installed? Perhaps a different CUDA version works?

Have you run any plain PyTorch examples? I expect you will hit the same error there too.
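
As a concrete stand-in for such a test (a minimal sketch, not from the original thread), touching every visible device should surface the same driver failure without Lightning involved:

import torch

# allocate a tiny tensor on every visible GPU; if CUDA driver
# initialization is broken, this fails outside of Lightning too
for i in range(torch.cuda.device_count()):
    x = torch.ones(1, device=f"cuda:{i}")
    print(i, torch.cuda.get_device_name(i), x.item())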

RylanSchaeffer commented 1 year ago

@awaelchli I'm getting the same error, albeit with plain Python (no jupyter).

$ python3
Python 3.8.13 (default, Mar 28 2022, 11:38:47) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1+cu113'
>>> import pytorch_lightning as pl
>>> pl.__version__
'1.8.6'

Your suggested command print(torch.cuda.is_available()) prints True, but I still get:

lightning_lite.utilities.exceptions.MisconfigurationException: No supported gpu backend found!

I'm using Cuda 11.3. Any suggestions for identifying the cause?

RylanSchaeffer commented 1 year ago

Downgrading to PyTorch Lightning 1.7.7 works for me. I don't know what the cause of the problem is!

awaelchli commented 1 year ago

@RylanSchaeffer I honestly have no idea. I suspect that our CUDA availability check returns False on your system, for whatever reason. If you want to help investigate, you could set a breakpoint in the debugger at this line of code:

https://github.com/Lightning-AI/lightning/blob/fc195b95405e9e2629466e5b28c6a9243209d596/src/pytorch_lightning/trainer/connectors/accelerator_connector.py#L533-L538

Then step into CUDAAccelerator.is_available() to see which condition makes it return False. It may be that the underlying device-count check returns 0.
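
Equivalently, without a debugger, the same checks can be probed directly in a REPL (a sketch; assumes a 1.8-style layout where CUDAAccelerator is importable from pytorch_lightning.accelerators):

import torch
from pytorch_lightning.accelerators import CUDAAccelerator

print(torch.cuda.is_available())            # raw torch check
print(CUDAAccelerator.is_available())       # the check Lightning actually runs
print(CUDAAccelerator.auto_device_count())  # device count Lightning would use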

keeganq commented 1 year ago

I've been having the same MisconfigurationException('No supported gpu backend found!') error with 1.9.0 and 1.8.0. It occurs when running with the hydra submitit-launcher plugin on a SLURM cluster. When running on a single node with two GPUs and without the plugin, pytorch_lightning works fine and doesn't throw the MisconfigurationException.

Downgrading to 1.7.7 also fixed the problem for me, and I can train using the plugin.

So I'm guessing this problem has something to do with how the plugin interacts with newer versions of PyTorch Lightning, although I don't know if that's the only failure case. @RylanSchaeffer have you also been using the submitit-launcher plugin?

trias702 commented 1 year ago

I am also experiencing this issue all of a sudden after migrating from PTL 1.6.5 to 1.9.0.

However, my colleagues and I solved it by exporting CUDA_VISIBLE_DEVICES=XXX as an environment variable on each of our nodes (we use 4 nodes with 8 GPUs each, combined with mpirun), where XXX is the GPU configuration for that node. In my case, since each node has 8 GPUs, that's export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. Make sure you export this env var on every node, including the primary node.
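
For reference, an in-script equivalent (a hedged sketch; the device list below is specific to an 8-GPU node) sets the variable before anything can initialize CUDA:

import os

# must execute before torch or lightning initialize CUDA,
# so keep it at the very top of the entry script
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7")

import torch  # imported only after the env var is set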

SerezD commented 1 year ago

(quoting @trias702's CUDA_VISIBLE_DEVICES workaround above)

This worked for me too!

trias702 commented 1 year ago

Great to hear it!

I would like to add that there is definitely a bug somewhere in PTL 1.9.0 that is causing this. I have used PTL for years for multi-node training and never needed to set CUDA_VISIBLE_DEVICES for it to work, so I urge the PTL developers to look into what changed in 1.9.0.

awaelchli commented 1 year ago

@trias702 In 1.9 we changed the CUDA detection a bit to support forking: #14631. The latest code is here: https://github.com/Lightning-AI/lightning/blob/1b1241ceb12fce0e30b4eb8bdb54779995a42e0a/src/lightning/fabric/accelerators/cuda.py#L158

A change was also made in PyTorch: https://github.com/pytorch/pytorch/pull/84879. There is a follow-up bug in PyTorch regarding the parsing of CUDA_VISIBLE_DEVICES; you may be affected by it if your multi-node cluster sets this env var in advance: https://github.com/pytorch/pytorch/issues/90543
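
For context, the idea behind the fork-friendly detection is to query the device count in a freshly spawned child process, so that the parent never initializes CUDA itself. A minimal sketch of that idea (not Lightning's exact implementation):

import multiprocessing as mp

def _count_devices(queue):
    # runs in a fresh child process, so CUDA is only initialized there
    import torch
    queue.put(torch.cuda.device_count())

def fork_safe_cuda_device_count() -> int:
    ctx = mp.get_context("spawn")  # "spawn" starts from a clean interpreter
    queue = ctx.SimpleQueue()
    proc = ctx.Process(target=_count_devices, args=(queue,))
    proc.start()
    proc.join()
    return queue.get()

if __name__ == "__main__":
    print(fork_safe_cuda_device_count())  # parent stays CUDA-free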

Saydemr commented 9 months ago

(quoting @keeganq's report above about the hydra submitit-launcher plugin on SLURM)

Can confirm this. The same error is thrown when a hydra launcher (submitit or local) is used. Removing the plugin makes it work. Setting CUDA_VISIBLE_DEVICES does not help...

Env:

# Python 3.10, pip dependencies
hydra-core==1.3.2
hydra-submitit-launcher==1.2.0
lightning==2.1.2
lightning-utilities==0.10.0
pytorch-lightning==2.1.2
submitit==1.5.1
torch==2.1.1+cu118