Has anyone else run into this issue? Any suggestions?
After the first epoch completes, somewhere in the middle of the second epoch, all GPUs get stuck at 100% utilization with no error message, and training hangs. It seems to be a common bug in pytorch_lightning when using DDP, but I have not found a solution yet.
https://github.com/Lightning-AI/lightning/issues/11242
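
In case it helps narrow this down, here is a minimal sketch for finding where each process is stuck, using only Python's standard `faulthandler` module. It is a debugging aid, not a fix. The helper names `open_hang_log` and `arm_watchdog`, the log file name, and the 600-second timeout are my own placeholders, and I am assuming the launcher sets a `LOCAL_RANK` environment variable for each process (falling back to "0" otherwise).

```python
import faulthandler
import os


def open_hang_log(log_dir="."):
    # Hypothetical helper: one log file per process, named after LOCAL_RANK
    # (assumed to be set by the launcher; defaults to "0" if it is not).
    rank = os.environ.get("LOCAL_RANK", "0")
    path = os.path.join(log_dir, f"hang_rank{rank}.log")
    # The file must stay open for as long as the watchdog is armed.
    return open(path, "w")


def arm_watchdog(log_file, timeout_s=600.0):
    # Hypothetical helper: (re)start the timer. Calling this again replaces
    # the previous timer, so if it is called at the start of every batch,
    # a dump is written only when a single batch exceeds timeout_s.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=log_file)


def disarm_watchdog():
    # Stop the timer, for example once training has finished.
    faulthandler.cancel_dump_traceback_later()
```

The idea would be to call `open_hang_log()` once per process and `arm_watchdog(log_file)` at the start of every training batch. If one rank hangs, its log should show the Python stack of every thread at that moment, and comparing the logs across ranks may show which rank stopped at a different point from the others.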