Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai

`CUDAAccelerator` can not run on your system since the accelerator is not available. #16590

Closed: shenoynikhil closed this issue 1 year ago

shenoynikhil commented 1 year ago

Bug description

In my environment, torch.cuda.is_available() is True but torch.cuda.device_count() is 0. This is probably linked to an upstream PyTorch issue. Since I was planning to use Lightning for a new project, this means I cannot use the GPU via pl.Trainer(accelerator='cuda', devices=1).
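Roughly the check I am running (a minimal sketch, nothing project-specific):

```python
import torch
import pytorch_lightning as pl

# CUDA is reported as available, but no devices are enumerated.
print(torch.cuda.is_available())   # -> True
print(torch.cuda.device_count())   # -> 0

# Asking the Trainer for a GPU then fails with the error in the title,
# because the CUDAAccelerator finds zero usable devices.
trainer = pl.Trainer(accelerator="cuda", devices=1)
```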

I am not sure if this is a bug on your end. Any suggestions on how to go about this would be great.

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment

```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 1.10):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```

More info

No response

cc @tchaton

carmocca commented 1 year ago

You could use an earlier PyTorch version that does not have that issue until they publish a fix.

shenoynikhil commented 1 year ago

Sounds good. Feel free to close the issue.

shenoynikhil commented 1 year ago

I also had to downgrade to pytorch-lightning==1.7.0 for it to work. Was there a change in how the CUDA backend gets checked in versions >1.7.0?

awaelchli commented 1 year ago

Yes. The issue you referenced on the PyTorch side concerns a piece of parsing logic that was introduced in torch >= 1.13. We then ported this code into Lightning so that users on torch < 1.13 also get this new way of parsing. That is likely why the issue went away when you downgraded Lightning.

We need to take that fix and apply it to our code as well, in https://github.com/Lightning-AI/lightning/blob/ccd2a481d0fdcf757124e43e58cf0bffc8d68594/src/lightning/fabric/accelerators/cuda.py#L229
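To illustrate the general idea (a rough sketch with a hypothetical helper name, not the actual code in cuda.py): the ported logic tries to count usable devices by parsing the CUDA_VISIBLE_DEVICES environment variable instead of initializing a CUDA context, roughly along these lines.

```python
import os
from typing import Optional

def _sketch_visible_device_count() -> Optional[int]:
    """Purely illustrative: count devices by parsing CUDA_VISIBLE_DEVICES.

    The real torch/Lightning implementation also handles GPU UUID entries
    and queries NVML for the total device count.
    """
    visible = os.environ.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        return None  # no restriction set; the caller falls back to another method
    count = 0
    for entry in visible.split(","):
        entry = entry.strip()
        # Following CUDA semantics, an entry that cannot be interpreted as a
        # device index hides it and everything after it.
        if not entry.isdigit():
            break
        count += 1
    return count
```

A bug in this kind of parsing can make torch.cuda.device_count() return 0 even though a GPU is present, which matches the symptom reported above.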

shenoynikhil commented 1 year ago

For existing pytorch-lightning and pytorch versions (where the change from https://github.com/pytorch/pytorch/issues/90543 has not landed), is there a way to still use GPUs? I can use the GPU by calling .to(torch.device('cuda')), but I want to be able to use pytorch_lightning.
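To make that concrete, plain PyTorch like the following works for me (toy model just for illustration), and I would like the equivalent to work through the Trainer:

```python
import torch

# Manually placing a model and data on the GPU works fine.
device = torch.device("cuda")
model = torch.nn.Linear(4, 2).to(device)
x = torch.randn(8, 4, device=device)
out = model(x)
print(out.shape)  # torch.Size([8, 2])

# But the equivalent through Lightning currently fails on this setup:
# pl.Trainer(accelerator="cuda", devices=1) reports that the CUDAAccelerator
# is not available, because torch.cuda.device_count() returns 0.
```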

awaelchli commented 1 year ago

@shenoynikhil #16795 should fix this issue. It will be included in 1.9.4 in 1-2 days. Thanks for your patience!