Closed — turbosonics closed this issue 3 months ago
After reading the docs (https://pytorch.org/docs/stable/elastic/rendezvous.html#torch.distributed.elastic.rendezvous.api.RendezvousTimeoutError), I think the error is caused by your local cluster or by PyTorch itself. Sorry, I cannot help you in more detail. Multi-GPU training of SevenNet is not an experimental or prototype feature. We have used it rigorously, even in multi-node, multi-GPU setups.
I recommend you debug with a small model, batch size, and dataset. Note that SevenNet uses only one CPU core (one task) per GPU.
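To check whether the rendezvous problem comes from the cluster/PyTorch setup rather than from SevenNet, you could run a minimal torch.distributed smoke test on the same nodes. This is only a sketch: the script name, GPU count, and the assumption that you launch it with `torchrun` using the NCCL backend are mine, not part of SevenNet.

```python
# ddp_smoke_test.py -- minimal rendezvous/all-reduce check, independent of SevenNet.
# Example launch (placeholder values):
#   torchrun --standalone --nproc_per_node=2 ddp_smoke_test.py
import os

import torch
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE and LOCAL_RANK for each spawned process.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # If this call times out with RendezvousTimeoutError, the issue is in the
    # cluster / PyTorch distributed setup, not in SevenNet itself.
    dist.init_process_group(backend="nccl")

    # Tiny all-reduce to confirm inter-GPU communication works.
    t = torch.ones(1, device=f"cuda:{local_rank}")
    dist.all_reduce(t)
    print(f"rank {dist.get_rank()}/{dist.get_world_size()}: all_reduce -> {t.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this script also hangs or raises the same RendezvousTimeoutError, the problem is in your cluster environment (network, firewall, scheduler task layout) or the PyTorch install, and SevenNet-level debugging will not help.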
Hello,
I installed SevenNet in a virtual environment with CUDA 11.8 and PyTorch 2.3.0 on our local GPU cluster. Training on a single GPU runs well, but when I attempt multi-GPU training, the job crashes with the following error:
How can I resolve this crash?