facebookresearch / stochastic_gradient_push

Stochastic Gradient Push for Distributed Deep Learning

How to train a transformer with Stochastic Gradient Push on a single machine with multiple GPUs? #3

Closed. Anderbone closed this issue 5 years ago.

Anderbone commented 5 years ago

I'm a novice, and I was able to train a transformer using the original fairseq toolkit. Now I want to use this Stochastic Gradient Push code, so I followed all of the setup in the README here, steps 1 through 4.

Since I only have a single machine with 8 GTX 1080 Ti GPUs, I used the command below to try to run the SGP large configuration on 4 GPUs. I copied this command from submit_sgp_large.sh, added --distributed-init-method, and removed --distributed-port since, from reading the main code, it seems to only apply to SLURM.

CUDA_VISIBLE_DEVICES=4,5,6,7 python train.py data-bin/wmt16_en_de_bpe32k \
  --max-tokens 3500 --ddp-backend no_c10d --dist_avg 'gossip' --distributed-backend tcp --update-freq 16 \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings \
  --optimizer adam --adam-betas '(0.9, 0.98)' \
  --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 --warmup-updates 4000 \
  --lr 0.0005 --min-lr 1e-09 --clip-norm 0.0 \
  --dropout 0.3 --weight-decay 0.0 --criterion label_smoothed_cross_entropy \
  --label-smoothing 0.1 --log-format simple --distributed-init-method tcp://localhost:16344

It gets stuck here forever:

-1 tcp://localhost:16344
gossip tcp tcp://localhost:16344 4 0
| distributed init (pci-SYS-4028GR-TR rank 0 device_id 0): tcp://localhost:16344

I guessed this was related to the no_c10d argument, so I removed it, and then it said 'tcp is deprecated'. So I changed tcp to nccl, but it also gets stuck forever. Am I getting some basic command or usage wrong here?

CUDA_VISIBLE_DEVICES=4,5,6,7 python train.py data-bin/wmt16_en_de_bpe32k \
  --max-tokens 3500 --dist_avg 'gossip' --distributed-backend nccl --update-freq 16 \
  --arch transformer_vaswani_wmt_en_de_big --share-all-embeddings --optimizer adam \
  --adam-betas '(0.9, 0.98)' --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
  --warmup-updates 4000 --lr 0.0005 --min-lr 1e-09 --clip-norm 0.0 --dropout 0.3 \
  --weight-decay 0.0 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --log-format simple --distributed-init-method tcp://localhost:16344

It is still stuck:

-1 tcp://localhost:16344
gossip nccl tcp://localhost:16344 4 0
| distributed init (pci-SYS-4028GR-TR rank 0 device_id 0): tcp://localhost:16344

Thanks a lot if you read this far; any help would be highly appreciated. Environment: Python 3.6.8, torch 1.2.0, CUDA 9.0, fairseq 0.7.2, apex 0.1, NVIDIA driver 384.130.

mikerabbat commented 5 years ago

All of our experiments using SGP to train a transformer used a multi-node setup (multiple servers). I don't believe we tested it with multiple GPUs on a single node, and I'm not sure it would run. Thanks for asking this question; we're going to update transformer/Readme.md to include this info. Also, our SGP transformer experiments used the previous version of PyTorch (0.4, not 1.0), so some other things may need to be changed besides what's listed in the README to get this working with the latest SGP code.

I'm not certain, but my guess is that this could be hanging at the point where torch.distributed.init_process_group() is called because it believes the world size is 4 but you've only launched one process.
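
For reference, here is a minimal sketch in plain PyTorch (not the SGP/fairseq launcher; the address and port are simply the ones from the commands above) of why a single process blocks in init_process_group() when the world size is 4, and how launching one process per rank lets the rendezvous complete:

# Minimal illustration only: standard torch.distributed API, not the
# fairseq/SGP entry point. init_process_group() with world_size=4 blocks
# until all 4 ranks have connected to the rendezvous address, so a single
# process waits forever.
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank, world_size):
    # Each process calls init_process_group with its own rank; the call
    # returns only once all `world_size` ranks have joined.
    dist.init_process_group(
        backend="gloo",                       # gloo runs on CPU; nccl needs one GPU per rank
        init_method="tcp://localhost:16344",  # same rendezvous address as in the issue
        world_size=world_size,
        rank=rank,
    )
    print(f"rank {rank}/{world_size} initialized")
    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 4
    # Launch 4 processes. Launching only one (as in the single train.py
    # invocation above) leaves init_process_group waiting for the other 3.
    mp.spawn(worker, args=(world_size,), nprocs=world_size)

The general idea for a single-node run would therefore be one training process per GPU, each with its own rank, so that all four ranks actually reach the rendezvous at tcp://localhost:16344.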