All of our experiments using SGP to train a transformer used a multi-node setup (multiple servers). I don't believe we tested it with multiple GPUs on a single node, and I'm not sure it would run. Thanks for asking this question - we're going to update transformer/Readme.md to include this info. Also, our experiments using SGP to train a transformer used the previous version of PyTorch (0.4, not 1.0), so some other things may need to change besides what's listed in the readme to get this working with the latest SGP code.
I'm not certain, but my guess is that this could be hanging at the point where torch.distributed.init_process_group() is called, because it believes the world size is 4 but you've only launched one process.
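To illustrate the rendezvous behavior (a minimal sketch with a placeholder host/port, not taken from the SGP scripts): init_process_group() blocks until world_size processes have connected to the same address, so a single process configured for a world size of 4 will wait forever.

```python
import torch.distributed as dist

# Sketch only: init_process_group() is a rendezvous and returns only after
# all `world_size` ranks have connected to the init-method address.
# If this is the only process ever launched, the call below never returns.
dist.init_process_group(
    backend='nccl',
    init_method='tcp://localhost:23456',  # placeholder host/port
    world_size=4,
    rank=0,
)
```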
I'm a novice, and I could train a transformer using the original fairseq toolkit. Now I want to use this stochastic gradient push code, so I followed all the setup in the readme here, from step 1 to 4.
Since I only have a single machine with 8 GTX1080Ti GPUs, I used this command here to try to run the SGP large code on 4 GPUs. I copied the command from submit_sgp_large.sh, added distributed-init-method, and removed 'distributed-port' since I read the main code and it seems to only apply to SLURM.
It gets stuck here forever.
I guessed this was related to the no_c10d arg, so I removed it, and then it said 'tcp is deprecated'. So I changed tcp to nccl, but it also gets stuck here forever. Am I getting some basic command or usage wrong here?
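For context, my (possibly wrong) understanding is that 'tcp' was a distributed backend in PyTorch 0.4 that no longer exists in 1.x, while tcp:// is still valid as the init-method URL, and that with nccl each GPU needs its own process. Here is a toy sketch of what I think a single-node, 4-GPU launch boils down to (the host/port and structure are my own placeholders, not taken from the SGP or fairseq code):

```python
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    # One process per GPU: NCCL expects each rank to own a single device.
    torch.cuda.set_device(rank)
    dist.init_process_group(
        backend='nccl',                       # 'tcp' is no longer a valid backend in PyTorch 1.x
        init_method='tcp://localhost:23456',  # tcp:// is still fine as the rendezvous address
        world_size=world_size,
        rank=rank,
    )
    # ... run the per-rank training (e.g. fairseq train) here ...
    dist.destroy_process_group()

if __name__ == '__main__':
    mp.spawn(worker, args=(4,), nprocs=4)  # 4 processes for 4 GPUs
```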
still stuck:
Thanks a lot if you read this far; any help would be highly appreciated.
Python 3.6.8, torch 1.2.0, CUDA 9.0, fairseq 0.7.2, apex 0.1, NVIDIA driver 384.130