Open waynemystir opened 4 months ago
Did you set a different NODE_RANK on each node? I currently run multi-node training with Lightning v2.2.0 + DeepSpeed on Azure's GPU cluster successfully, without manually setting any environment variables (maybe they are set by the cluster system).
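If it helps to rule this out, here is a minimal sketch of a per-node sanity check you could run before launching training. It assumes the rendezvous relies on `MASTER_ADDR`, `MASTER_PORT`, and `NODE_RANK`; the exact set of variables depends on your launcher and cluster, so treat that list as an assumption. `check_node_env` and `REQUIRED_VARS` are hypothetical names for illustration, not part of Lightning.

```python
import os

# Assumed set of variables for a manually launched multi-node run; the real
# set depends on the launcher and cluster plugin in use.
REQUIRED_VARS = ("MASTER_ADDR", "MASTER_PORT", "NODE_RANK")


def check_node_env(env=None, num_nodes=2):
    """Return a list of human-readable problems found in the environment."""
    env = os.environ if env is None else env

    # Report every required variable that is missing or empty.
    problems = [f"{name} is not set" for name in REQUIRED_VARS if not env.get(name)]

    # NODE_RANK must be an integer in [0, num_nodes) and differ on each node.
    rank = env.get("NODE_RANK")
    if rank:
        if not rank.isdecimal():
            problems.append(f"NODE_RANK={rank!r} is not a non-negative integer")
        elif int(rank) >= num_nodes:
            problems.append(f"NODE_RANK={rank} is out of range for {num_nodes} nodes")
    return problems


if __name__ == "__main__":
    # Run this on every node; an empty output means the basic variables look sane.
    for problem in check_node_env():
        print(problem)
```

Running it on each node and comparing the printed `NODE_RANK` values should show quickly whether two nodes ended up with the same rank or with none at all.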
Bug description
I am trying to run a very simple training script on 2 nodes, and I always get this error:
Output:
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Error messages and logs
Environment
Current environment
```
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
```
More info
No response