Open vasudev-sharma opened 8 months ago
I have a use case on a SLURM cluster where I plan to use `torchrun` or `torch.distributed` with `submitit`. The goal is distributed training using `torchrun` or `torch.distributed` with `submitit` + PyTorch Lightning.
How should I go ahead with it?
Thanks for the help in advance!!!
Hey @vasudev-sharma, have you found a solution for this setup? I'm also interested in this. Any insights you could share would be greatly appreciated!
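Not a confirmed answer, but one possible starting point, sketched under assumptions: `submitit` launches one process per SLURM task, so `torchrun` may not be needed at all, and each task can derive the variables that `torch.distributed` reads with `init_method="env://"` from its SLURM environment. The names `torch_env_from_slurm`, `train`, and `submit`, the `logs` folder, the port, and the resource numbers below are all hypothetical. Recent `submitit` releases also appear to ship `submitit.helpers.TorchDistributedEnvironment().export()`, which does a similar mapping; check your installed version.

```python
import os
from typing import Dict, Mapping


def torch_env_from_slurm(
    slurm_env: Mapping[str, str], master_addr: str, master_port: int = 29500
) -> Dict[str, str]:
    """Map per-task SLURM variables to the ones read by init_method='env://'."""
    return {
        "RANK": slurm_env["SLURM_PROCID"],        # global rank of this task
        "WORLD_SIZE": slurm_env["SLURM_NTASKS"],  # total number of tasks
        "LOCAL_RANK": slurm_env["SLURM_LOCALID"], # rank within the node
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),          # assumed free on the first node
    }


def train() -> None:
    """Hypothetical entry point that every submitit task runs."""
    import submitit
    import torch.distributed as dist

    # Assumption: the first host of the allocation acts as the rendezvous master.
    master_addr = submitit.JobEnvironment().hostnames[0]
    os.environ.update(torch_env_from_slurm(os.environ, master_addr))
    dist.init_process_group(backend="nccl", init_method="env://")
    # ... build the model and run the training loop here ...
    dist.destroy_process_group()


def submit() -> None:
    """Submit `train` as a 2-node job with 4 tasks (one per GPU) per node."""
    import submitit

    executor = submitit.AutoExecutor(folder="logs")
    executor.update_parameters(
        nodes=2, tasks_per_node=4, gpus_per_node=4, timeout_min=60
    )
    job = executor.submit(train)
    print(job.job_id)
```

Call `submit()` from a login node to queue the job. With PyTorch Lightning, the manual `init_process_group` call may be unnecessary: as far as I know, a `Trainer` whose `num_nodes` and `devices` match the SLURM allocation detects the SLURM environment on its own, so `train` would only need to build the `Trainer` and call `fit`.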