Open · vacmar01 opened this issue 2 years ago
cc: @awaelchli
@vacmar01 Was your PyTorch installed with GPU support? I suspect it was not. Please check what

import torch
print(torch.cuda.is_available())

returns for you. If it returns False, please install PyTorch with CUDA support like so:

pip3 install torch --extra-index-url https://download.pytorch.org/whl/cu116
It is installed with GPU support: torch.cuda.is_available() returns True and torch.cuda.device_count() returns 2. But after I call these functions I can't start training anymore because of the "process forking" error.
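For reference, the pattern that typically produces the "process forking" error in a notebook looks like the following; this is a minimal illustrative sketch, not code taken from the report above:

import torch

# Any CUDA query in the notebook initializes CUDA in the parent (notebook) process.
torch.cuda.is_available()
torch.cuda.device_count()

# A strategy that forks worker processes afterwards (ddp_notebook forks rather than
# spawns) then fails, because CUDA cannot be re-initialized in a forked subprocess.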
@vacmar01 I cannot reproduce this on our multi-GPU machine with a Jupyter notebook. I haven't tried jarvislabs.ai, but I don't think it is related. There was an issue in 1.7.7 with precision=16 and the "process forking" error you describe, which we have since fixed.
Would you mind checking again by installing our development version from master to see if your GPUs get properly detected? To install from master, simply run:
pip install https://github.com/Lightning-AI/lightning/archive/refs/heads/master.zip -U
I suppose it has something to do with the GPU or the CUDA version, since on Kaggle the exact same code ran with no problem.
I will install the development version of lightning and try again. Thank you!
Okay so I installed the dev version of lightning and I get a new error (with precision=32 or precision=16, doesn't matter):
RuntimeError: ProcessGroupNCCL is only supported with GPUs, no GPUs found!
What's funny is that starting the trainer now logs the following:
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/2
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[W CUDAFunctions.cpp:112] Warning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (function operator())
So it seems to recognize the GPU but it still doesn't work.
Any updates on this?
[W CUDAFunctions.cpp:112] Warning: CUDA initialization: CUDA driver initialization failed, you might not have a CUDA gpu. (function operator())
This warning is from torch. Can you make sure that you have the latest driver installed? Perhaps a different CUDA version works?
Have you run any plain PyTorch examples? I'm sure you will get the same error there too.
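A quick way to confirm whether the failure sits at the PyTorch/driver level rather than in Lightning is a standalone check along these lines (an illustrative sketch, not from this thread):

import torch

print(torch.__version__)
print(torch.cuda.is_available())   # does torch see a usable CUDA driver?
print(torch.cuda.device_count())   # how many devices does torch report?

# Touching a device forces CUDA initialization; if the driver warning above is
# real, the same failure should surface here as well.
x = torch.randn(2, 2, device="cuda")
print(x @ x)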
@awaelchli I'm getting the same error, albeit with plain Python (no jupyter).
$ python3
Python 3.8.13 (default, Mar 28 2022, 11:38:47)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.12.1+cu113'
>>> import pytorch_lightning as pl
>>> pl.__version__
'1.8.6'
Your suggested command print(torch.cuda.is_available()) prints True, but I still get:

lightning_lite.utilities.exceptions.MisconfigurationException: No supported gpu backend found!

I'm using CUDA 11.3. Any suggestions for identifying the cause?
Downgrading to PyTorch Lightning 1.7.7 works for me. I don't know what the cause of the problem is!
@RylanSchaeffer I honestly have no idea. I suspect that our cuda_available check returns False on your system, for whatever reason. If you want to help investigate, you could set a breakpoint in the debugger at this line of code and step into CUDAAccelerator.is_available() to see which conditions make it return False. It could potentially be that the underlying device count returns 0.
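Without a debugger, the same signals can be printed directly; a small sketch, assuming pytorch_lightning >= 1.8 where CUDAAccelerator is exposed under pytorch_lightning.accelerators:

import torch
from pytorch_lightning.accelerators import CUDAAccelerator

print(torch.cuda.is_available())       # what torch reports directly
print(torch.cuda.device_count())       # 0 here would also mean "no supported gpu backend"
print(CUDAAccelerator.is_available())  # what Lightning's own accelerator check concludes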
I've been having the same MisconfigurationException('No supported gpu backend found!') error with 1.9.0 and 1.8.0. I get this error when running with the hydra submitit-launcher plugin on a SLURM cluster. When running on a single node with two GPUs and without the plugin, pytorch_lightning works fine and doesn't throw the MisconfigurationException.
Downgrading to 1.7.7 also fixed the problem for me, and I can train using the plugin.
So I'm guessing this problem has something to do with how the plugin interacts with newer versions of pytorch-lightning, although I don't know if that's the only failure case. @RylanSchaeffer have you also been using the submitit-launcher plugin?
I am also experiencing this issue all of a sudden after migrating from PTL 1.6.5 to 1.9.0.
However, my colleagues and I solved it by exporting CUDA_VISIBLE_DEVICES=XXX as an environment variable on each of our nodes (we use 4 nodes with 8 GPUs each, combined with mpirun), where XXX is the GPU configuration for that node. In my case each node has 8 GPUs, so it is export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7. Make sure you export this env var on every node, including the primary node.
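If editing the job script on every node is inconvenient, the same workaround can also be applied at the top of the training script before torch is imported; a sketch, with the device list assumed for an 8-GPU node:

import os

# Must be set before CUDA is initialized (i.e. before the first CUDA call),
# otherwise the value is ignored for the current process.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0,1,2,3,4,5,6,7")

import torch

print(torch.cuda.device_count())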
> I am also experiencing this issue all of a sudden after migrating from PTL 1.6.5 to 1.9.0. However, my colleagues and I solved it by exporting CUDA_VISIBLE_DEVICES=XXX as an environment variable on each of our nodes [...]
This worked for me too!
Great to hear it!
I would like to add that there is definitely a bug somewhere in PTL 1.9.0 that is causing this. I have used PTL for years for multi-node training and never needed to set CUDA_VISIBLE_DEVICES for it to work, so I urge the PTL developers to take a look at what may have changed in 1.9.0 to cause this.
@trias702 In 1.9 we changed the CUDA detection a bit to support forking: #14631. The latest code is here: https://github.com/Lightning-AI/lightning/blob/1b1241ceb12fce0e30b4eb8bdb54779995a42e0a/src/lightning/fabric/accelerators/cuda.py#L158
A change was also made in PyTorch: https://github.com/pytorch/pytorch/pull/84879. There is a follow-up bug in PyTorch regarding the parsing of CUDA_VISIBLE_DEVICES; you may be affected by it if your multi-node cluster sets this env var in advance: https://github.com/pytorch/pytorch/issues/90543
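The idea behind fork-safe detection is to count devices without creating a CUDA context in the parent process. A rough sketch of that idea, using NVML directly via the pynvml package; this is only illustrative and not the actual Lightning implementation:

import pynvml  # third-party NVML bindings

def nvml_device_count() -> int:
    """Count GPUs via NVML without initializing a CUDA context."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return 0  # no driver or no GPUs visible
    try:
        return pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()

# Note: NVML counts physical GPUs and does not apply CUDA_VISIBLE_DEVICES by itself,
# which is why the env-var parsing mentioned above matters.
print(nvml_device_count())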
> I've been having the same MisconfigurationException('No supported gpu backend found!') error with 1.9.0 and 1.8.0 when running with the hydra submitit-launcher plugin on a slurm cluster [...] @RylanSchaeffer have you also been using the submitit-launcher plugin?
Can confirm this. The same error is thrown when a hydra launcher (submitit or local) is used. Removing the plugin works fine. Setting CUDA_VISIBLE_DEVICES does not help...
Env:
# Python 3.10, pip dependencies
hydra-core==1.3.2
hydra-submitit-launcher==1.2.0
lightning==2.1.2
lightning-utilities==0.10.0
pytorch-lightning==2.1.2
submitit==1.5.1
torch==2.1.1+cu118
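For reference, the setup described here boils down to launching a Lightning trainer through a Hydra entry point with the submitit launcher selected; a minimal sketch, where the entry point, config values, and the omitted model/datamodule are all hypothetical:

import hydra
import pytorch_lightning as pl
from omegaconf import DictConfig

@hydra.main(config_path=None, config_name=None, version_base=None)
def main(cfg: DictConfig) -> None:
    trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
    print(trainer.num_devices)
    # trainer.fit(model, datamodule)  # model and datamodule omitted here

if __name__ == "__main__":
    main()

# Launched e.g. with: python train.py -m hydra/launcher=submitit_slurm
# On the affected versions this reportedly raises
# "MisconfigurationException: No supported gpu backend found!" inside the submitit job.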
Bug description
When trying to train on two GPUs in a Jupyter notebook environment on jarvislabs.ai with ddp_notebook, I get the following error: "MisconfigurationException: No supported gpu backend found!". I'm trying to train on two RTX 5000 GPUs. On a Kaggle GPU the exact same code runs without any problem.
Any ideas?
How to reproduce the bug
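Reconstructed from the description above (two GPUs, a Jupyter notebook, the ddp_notebook strategy); the model and data are placeholders, not the original code:

import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class PlaceholderModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 2)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.cross_entropy(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

dataset = TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,)))
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy="ddp_notebook",  # run from a notebook cell
    max_epochs=1,
)
trainer.fit(PlaceholderModel(), DataLoader(dataset, batch_size=8))
# Fails with: MisconfigurationException: No supported gpu backend found!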
Error messages and logs
"MisconfigurationException: No supported gpu backend found!"
Environment
More info
No response
cc @justusschock @awaelchli