Closed: laserkelvin closed this issue 2 months ago
Hi @laserkelvin
The `DDPStrategy` does not support XLA, nor does the DDP implementation in PyTorch. For distributed training with XLA, please use `Trainer(accelerator="tpu", devices=8)`.
Docs: https://lightning.ai/docs/pytorch/stable/accelerators/tpu.html
We won't be able to support XLA+DDP like you requested.
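For reference, a minimal end-to-end version of that suggestion (a sketch; `BoringModel` is just a stand-in module from Lightning's demos):

```python
from lightning.pytorch import Trainer
from lightning.pytorch.demos.boring_classes import BoringModel

# On a TPU host, "tpu" selects the XLA accelerator and Lightning picks the
# matching XLA-based strategy automatically; DDPStrategy is not involved.
trainer = Trainer(accelerator="tpu", devices=8)
trainer.fit(BoringModel())
```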
Bug description
When configuring a `DDPStrategy` with multiple devices that do not use the `torch.cuda` API, we trigger an exception: the `_setup_model` method of `DDPStrategy` raises it, as `torch.cuda.stream` is hardcoded whenever `device_ids` are passed. I've reproduced the snippet below, but here is a permalink. The snippet relies on an `XPUAccelerator` registered here, but I would assume this might trigger for other accelerators as well.

A potential solution could be checking the target device, or even just checking `torch.cuda.is_available()` for the condition. Removing the `torch.cuda.Stream()` call and just using `nullcontext()` works perfectly fine otherwise; a sketch of that change follows.
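A minimal sketch of what that guard could look like inside `_setup_model` (paraphrased; the exact surrounding code in `lightning` may differ):

```python
from contextlib import nullcontext

import torch
from torch.nn.parallel import DistributedDataParallel


def _setup_model(self, model):
    device_ids = self.determine_ddp_device_ids()
    # Only enter a CUDA side stream when CUDA is actually usable; on XPU (or
    # any other non-CUDA backend) fall back to a no-op context instead of
    # calling torch.cuda.Stream(), which is what currently raises.
    if device_ids is not None and torch.cuda.is_available():
        ctx = torch.cuda.stream(torch.cuda.Stream())
    else:
        ctx = nullcontext()
    with ctx:
        return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
```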
What version are you seeing the problem on?
v2.1, v2.2
How to reproduce the bug
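The original snippet did not survive here; the following is a hypothetical reconstruction of its shape, with a stub `XPUAccelerator` standing in for the registered one:

```python
import torch
from lightning.pytorch import Trainer
from lightning.pytorch.accelerators import Accelerator
from lightning.pytorch.demos.boring_classes import BoringModel
from lightning.pytorch.strategies import DDPStrategy


class XPUAccelerator(Accelerator):
    """Stub accelerator targeting XPU devices (hypothetical stand-in)."""

    def setup_device(self, device: torch.device) -> None:
        pass

    def teardown(self) -> None:
        pass

    @staticmethod
    def parse_devices(devices):
        # Normalize an int count to a list of device indices.
        return list(range(devices)) if isinstance(devices, int) else devices

    @staticmethod
    def get_parallel_devices(devices):
        return [torch.device("xpu", idx) for idx in devices]

    @staticmethod
    def auto_device_count() -> int:
        return 2

    @staticmethod
    def is_available() -> bool:
        return True

    def get_device_stats(self, device):
        return {}


if __name__ == "__main__":
    # DDPStrategy passes device_ids when wrapping the model, which makes
    # _setup_model enter torch.cuda.stream(...) even though no CUDA device exists.
    trainer = Trainer(
        accelerator=XPUAccelerator(),
        strategy=DDPStrategy(),
        devices=2,
        max_epochs=1,
    )
    trainer.fit(BoringModel())
```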
Error messages and logs
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0): 2.2.1
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0): 2.0.1
#- Python version (e.g., 3.9): 3.10
#- OS (e.g., Linux): Linux
#- CUDA/cuDNN version: N/A
#- GPU models and configuration: Intel 1550 Data Center GPU Max
#- How you installed Lightning (`conda`, `pip`, source): pip
#- Running environment of LightningApp (e.g. local, cloud): Managed Slurm cluster
```
More info
No response
cc @justusschock @awaelchli