ruipeterpan closed this issue 3 years ago
@pleasantrabbit Can you please take a look? This seems related to byteps.torch.parallel.
Hey, so I looked into this a bit and realized that some changes were made to byteps/byteps/torch/ops.cc
for averaging in bps.push_pull (push_pull_async) after I forked the repo. I updated my byteps fork and regenerated the byteps training curve, but the curve still looks weird...
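(For context on why the averaging change matters: if a push-pull reduction returns the raw sum across workers instead of the mean, the effective step size is multiplied by the worker count, which alone can distort the loss curve. A byteps-free sketch of the difference -- the function names here are hypothetical stand-ins, not the real byteps API:)

```python
# Pure-Python sketch of sum-reduce vs. average-reduce on one scalar gradient.
# "push_pull_*" are hypothetical stand-ins, not the actual byteps functions.

def push_pull_sum(grads):
    """Sum-reduce: every worker receives the sum of all workers' gradients."""
    return sum(grads)

def push_pull_avg(grads):
    """Averaging variant: sum-reduce, then divide by the worker count."""
    return sum(grads) / len(grads)

# Four workers compute gradients for the same parameter.
worker_grads = [0.1, 0.2, 0.3, 0.4]

summed = push_pull_sum(worker_grads)    # ~1.0, i.e. roughly 4x the step size
averaged = push_pull_avg(worker_grads)  # ~0.25, matches single-worker scale

print(summed, averaged)
```

With the sum variant, each optimizer step is effectively scaled by the number of workers, so two setups that differ only in this detail will trace visibly different loss curves.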
@ymjiang @ruipeterpan I am looking into it.
@ruipeterpan
For all_reduce, start four containers on each of the 4 nodes and run the training scripts.
Just wanted to make sure, you used 4 containers in total, only 1 container was started on each node, and each container has 1 GPU, right?
I ran your script with slight modifications (here); here's the loss I got. The loss from byteps is fairly close to that of PyTorch DDP. It's expected that they don't produce exactly the same numbers.
In both tests I used 4 worker containers, each container has 1 GPU.
# byteps
Worker None job_id 53706 epoch 0 loss 0.97005
Worker None job_id 53706 epoch 1 loss 0.44175
Worker None job_id 53706 epoch 2 loss 0.35767
Worker None job_id 53706 epoch 3 loss 0.30529
Worker None job_id 53706 epoch 4 loss 0.26517
Worker None job_id 53706 epoch 5 loss 0.25007
Worker None job_id 53706 epoch 6 loss 0.23308
Worker None job_id 53706 epoch 7 loss 0.22774
Worker None job_id 53706 epoch 8 loss 0.21719
Worker None job_id 53706 epoch 9 loss 0.20118
Worker None job_id 53706 epoch 10 loss 0.19615
Worker None job_id 53706 epoch 11 loss 0.18659
Worker None job_id 53706 epoch 12 loss 0.18569
Worker None job_id 53706 epoch 13 loss 0.18023
Worker None job_id 53706 epoch 14 loss 0.17755
Worker None job_id 53706 epoch 15 loss 0.16427
Worker None job_id 53706 epoch 16 loss 0.15958
Worker None job_id 53706 epoch 17 loss 0.1618
Worker None job_id 53706 epoch 18 loss 0.15625
Worker None job_id 53706 epoch 19 loss 0.15476
Worker None job_id 53706 epoch 20 loss 0.15157
Worker None job_id 53706 epoch 21 loss 0.1415
Worker None job_id 53706 epoch 22 loss 0.14043
Worker None job_id 53706 epoch 23 loss 0.13888
Worker None job_id 53706 epoch 24 loss 0.14122
Worker None job_id 53706 epoch 25 loss 0.13685
Worker None job_id 53706 epoch 26 loss 0.13594
Worker None job_id 53706 epoch 27 loss 0.13268
Worker None job_id 53706 epoch 28 loss 0.1333
Worker None job_id 53706 epoch 29 loss 0.13674
# pytorch ddp
Worker 0 job_id 53706 epoch 0 loss 0.97005
Worker 0 job_id 53706 epoch 1 loss 0.44009
Worker 0 job_id 53706 epoch 2 loss 0.36083
Worker 0 job_id 53706 epoch 3 loss 0.30743
Worker 0 job_id 53706 epoch 4 loss 0.26728
Worker 0 job_id 53706 epoch 5 loss 0.24862
Worker 0 job_id 53706 epoch 6 loss 0.23519
Worker 0 job_id 53706 epoch 7 loss 0.22722
Worker 0 job_id 53706 epoch 8 loss 0.21581
Worker 0 job_id 53706 epoch 9 loss 0.19869
Worker 0 job_id 53706 epoch 10 loss 0.19582
Worker 0 job_id 53706 epoch 11 loss 0.18607
Worker 0 job_id 53706 epoch 12 loss 0.18921
Worker 0 job_id 53706 epoch 13 loss 0.17617
Worker 0 job_id 53706 epoch 14 loss 0.17317
Worker 0 job_id 53706 epoch 15 loss 0.16502
Worker 0 job_id 53706 epoch 16 loss 0.16283
Worker 0 job_id 53706 epoch 17 loss 0.15986
Worker 0 job_id 53706 epoch 18 loss 0.15734
Worker 0 job_id 53706 epoch 19 loss 0.1613
Worker 0 job_id 53706 epoch 20 loss 0.15552
Worker 0 job_id 53706 epoch 21 loss 0.14287
Worker 0 job_id 53706 epoch 22 loss 0.14508
Worker 0 job_id 53706 epoch 23 loss 0.13977
Worker 0 job_id 53706 epoch 24 loss 0.14282
Worker 0 job_id 53706 epoch 25 loss 0.14104
Worker 0 job_id 53706 epoch 26 loss 0.13794
Worker 0 job_id 53706 epoch 27 loss 0.14073
Worker 0 job_id 53706 epoch 28 loss 0.13636
Worker 0 job_id 53706 epoch 29 loss 0.12924
Just wanted to make sure, you used 4 containers in total, only 1 container was started on each node, and each container has 1 GPU, right?
Ah, sorry for the wording -- yes, 4 worker containers (one on each node) are started and each container is exposed to 1 GPU only.
Thanks for the script, I was able to replicate the result! So the key reason behind this issue is the removal of broadcast_parameters(), i.e. not synchronizing the model parameters & optimizer state before training begins. This is really interesting... any idea why we needed them in the first place?
Thanks again!
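(For anyone else hitting this: the effect of dropping the initial parameter broadcast can be reproduced without byteps at all. A minimal pure-Python sketch, assuming workers average gradients every step but start from different random initializations; all names here are hypothetical, this is not the byteps API:)

```python
import random

def train(inits, steps=5, lr=0.1, sync_init=False):
    """Each worker holds one scalar parameter; gradients are averaged each step.

    sync_init=True mimics an initial broadcast_parameters()-style sync:
    every worker starts from worker 0's parameter.
    """
    params = list(inits)
    if sync_init:
        params = [params[0]] * len(params)  # broadcast worker 0's weights
    for _ in range(steps):
        # Gradient of the loss 0.5 * (p - 1)^2 is (p - 1); averaged across workers.
        avg_grad = sum(p - 1.0 for p in params) / len(params)
        params = [p - lr * avg_grad for p in params]
    return params

random.seed(0)
inits = [random.uniform(-1, 1) for _ in range(4)]  # 4 differently-seeded workers

unsynced = train(inits)                 # replicas never collapse to one model
synced = train(inits, sync_init=True)   # replicas stay bitwise identical

print(unsynced)
print(synced)
```

Without the initial sync, every replica is shifted by the same averaged gradient each step, so the initial differences between workers persist for the whole run -- the replicas behave like four slightly different models, which is exactly the kind of loss-curve mismatch seen above.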
@ruipeterpan ah, that's where you got them. With the DDP API there's no need to manually broadcast model parameters: the forward() step already does the broadcast. This behavior is the same as PyTorch DDP.
I see, thanks for the explanation! Closing this issue.
Hey @pleasantrabbit, sorry if this is nitpicking, but here's a weird thing: I was not able to replicate the result using the exact same setup from four days ago. Previously, the per-epoch training loss difference between PyTorch DDP and bps DDP was ~0.01, but now it's around ~0.03. This is not that big of an issue, but if you want to take a look, can you confirm that the run-to-run variation due to randomness is smaller than 0.01?
Worker 1 job_id 53706 epoch 0 loss 0.96994
Worker 1 job_id 53706 epoch 1 loss 0.44412
Worker 1 job_id 53706 epoch 2 loss 0.36226
Worker 1 job_id 53706 epoch 3 loss 0.30746
Worker 1 job_id 53706 epoch 4 loss 0.2686
Worker 1 job_id 53706 epoch 5 loss 0.24376
Worker 1 job_id 53706 epoch 6 loss 0.23426
Worker 1 job_id 53706 epoch 7 loss 0.22783
Worker 1 job_id 53706 epoch 8 loss 0.21471
Worker 1 job_id 53706 epoch 9 loss 0.2
Worker 1 job_id 53706 epoch 10 loss 0.1967
Worker 1 job_id 53706 epoch 11 loss 0.19032
Worker 1 job_id 53706 epoch 12 loss 0.1842
Worker 1 job_id 53706 epoch 13 loss 0.1791
Worker 1 job_id 53706 epoch 14 loss 0.167
Worker 1 job_id 53706 epoch 15 loss 0.16194
Worker 1 job_id 53706 epoch 16 loss 0.15595
Worker 1 job_id 53706 epoch 17 loss 0.15577
Worker 1 job_id 53706 epoch 18 loss 0.16591
Worker 1 job_id 53706 epoch 19 loss 0.15847
Worker 1 job_id 53706 epoch 20 loss 0.14782
Worker 1 job_id 53706 epoch 21 loss 0.14284
Worker 1 job_id 53706 epoch 22 loss 0.14065
Worker 1 job_id 53706 epoch 23 loss 0.14049
Worker 1 job_id 53706 epoch 24 loss 0.13772
Worker 1 job_id 53706 epoch 25 loss 0.13597
Worker 1 job_id 53706 epoch 26 loss 0.13846
Worker 1 job_id 53706 epoch 27 loss 0.14307
Worker 1 job_id 53706 epoch 28 loss 0.13831
Worker 1 job_id 53706 epoch 29 loss 0.13162
Worker 1 job_id 53706 epoch 0 loss 0.99761
Worker 1 job_id 53706 epoch 1 loss 0.46318
Worker 1 job_id 53706 epoch 2 loss 0.38711
Worker 1 job_id 53706 epoch 3 loss 0.33093
Worker 1 job_id 53706 epoch 4 loss 0.29031
Worker 1 job_id 53706 epoch 5 loss 0.26667
Worker 1 job_id 53706 epoch 6 loss 0.25335
Worker 1 job_id 53706 epoch 7 loss 0.25247
Worker 1 job_id 53706 epoch 8 loss 0.23936
Worker 1 job_id 53706 epoch 9 loss 0.22375
Worker 1 job_id 53706 epoch 10 loss 0.21695
Worker 1 job_id 53706 epoch 11 loss 0.20835
Worker 1 job_id 53706 epoch 12 loss 0.20624
Worker 1 job_id 53706 epoch 13 loss 0.20149
Worker 1 job_id 53706 epoch 14 loss 0.20252
Worker 1 job_id 53706 epoch 15 loss 0.19436
Worker 1 job_id 53706 epoch 16 loss 0.18817
Worker 1 job_id 53706 epoch 17 loss 0.18962
Worker 1 job_id 53706 epoch 18 loss 0.18058
Worker 1 job_id 53706 epoch 19 loss 0.18404
Worker 1 job_id 53706 epoch 20 loss 0.17468
Worker 1 job_id 53706 epoch 21 loss 0.16521
Worker 1 job_id 53706 epoch 22 loss 0.17198
Worker 1 job_id 53706 epoch 23 loss 0.16653
Worker 1 job_id 53706 epoch 24 loss 0.17104
Worker 1 job_id 53706 epoch 25 loss 0.16565
Worker 1 job_id 53706 epoch 26 loss 0.15876
Worker 1 job_id 53706 epoch 27 loss 0.16959
Worker 1 job_id 53706 epoch 28 loss 0.16414
Worker 1 job_id 53706 epoch 29 loss 0.17172
Describe the bug
I'm trying to do a cross-comparison of the performance of BytePS (the PS architecture) in PyTorch against torch.distributed.all_reduce. The throughput matches up as expected, but I noticed that the training loss curve for bps isn't even close to the one for all_reduce. Is this expected?
Expected behavior
The expected behavior is for the training loss curves of bps and all_reduce to match up. At the very least, the bps curve should keep dropping rather than getting stuck (or even going up) after a few epochs.
Training loss curve
Allreduce
BytePS
To Reproduce
Steps to reproduce the behavior:
building the docker image
allreduce set up
byteps set up
Environment (please complete the following information):
A few other things