bytedance / byteps

A high performance and generic framework for distributed DNN training

[Question] Does replacing torch.distributed.all_reduce with BytePS impact the training curve? #356

Closed. ruipeterpan closed this issue 3 years ago.

ruipeterpan commented 3 years ago

Describe the bug

I'm trying to do a cross-comparison between PS-based training (BytePS) and torch.distributed.all_reduce in PyTorch. The throughput matches up as expected, but the training loss curve for bps isn't even close to the one for all_reduce. Is this expected?
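
For reference, here is a rough sketch of the two gradient-synchronization paths being compared (the actual bps_issue.py is not shown here, and the bps.push_pull call is an assumption based on the Horovod-style byteps.torch API):

import torch.distributed as dist
import byteps.torch as bps

def average_gradients_allreduce(model, world_size):
    # baseline path: sum the gradients across workers, then divide by the world size
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

def average_gradients_byteps(model):
    # BytePS path (sketch): push_pull is the push/pull analogue of all_reduce;
    # with average=True the returned tensor should already be the mean across workers
    for name, p in model.named_parameters():
        if p.grad is not None:
            p.grad.copy_(bps.push_pull(p.grad, average=True, name=name))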

Expected behavior

The expected behavior is for the training loss curves for bps and all_reduce to match up. At the very least, the bps curve should keep dropping rather than getting stuck (or even going up) after a few epochs.

Training loss curve

Allreduce

Worker 1 job_id 53706 epoch 0 loss 0.96994
Worker 1 job_id 53706 epoch 1 loss 0.4443
Worker 1 job_id 53706 epoch 2 loss 0.36194
Worker 1 job_id 53706 epoch 3 loss 0.30613
Worker 1 job_id 53706 epoch 4 loss 0.26075
Worker 1 job_id 53706 epoch 5 loss 0.24754
Worker 1 job_id 53706 epoch 6 loss 0.2346
Worker 1 job_id 53706 epoch 7 loss 0.22846
Worker 1 job_id 53706 epoch 8 loss 0.21421
Worker 1 job_id 53706 epoch 9 loss 0.20123
Worker 1 job_id 53706 epoch 10 loss 0.20055
Worker 1 job_id 53706 epoch 11 loss 0.19156
Worker 1 job_id 53706 epoch 12 loss 0.18519
Worker 1 job_id 53706 epoch 13 loss 0.18175
Worker 1 job_id 53706 epoch 14 loss 0.17038
Worker 1 job_id 53706 epoch 15 loss 0.16778
Worker 1 job_id 53706 epoch 16 loss 0.15655
Worker 1 job_id 53706 epoch 17 loss 0.16151
Worker 1 job_id 53706 epoch 18 loss 0.16402
Worker 1 job_id 53706 epoch 19 loss 0.1625
Worker 1 job_id 53706 epoch 20 loss 0.15039
Worker 1 job_id 53706 epoch 21 loss 0.14519
Worker 1 job_id 53706 epoch 22 loss 0.14297
Worker 1 job_id 53706 epoch 23 loss 0.14169
Worker 1 job_id 53706 epoch 24 loss 0.14341
Worker 1 job_id 53706 epoch 25 loss 0.13347
Worker 1 job_id 53706 epoch 26 loss 0.13467
Worker 1 job_id 53706 epoch 27 loss 0.14464
Worker 1 job_id 53706 epoch 28 loss 0.13385
Worker 1 job_id 53706 epoch 29 loss 0.13195

BytePS

Worker 1 job_id 53706 epoch 0 loss 0.99562
Worker 1 job_id 53706 epoch 1 loss 0.47527
Worker 1 job_id 53706 epoch 2 loss 0.39485
Worker 1 job_id 53706 epoch 3 loss 0.33387
Worker 1 job_id 53706 epoch 4 loss 0.33007
Worker 1 job_id 53706 epoch 5 loss 0.32855
Worker 1 job_id 53706 epoch 6 loss 0.3048
Worker 1 job_id 53706 epoch 7 loss 0.32054
Worker 1 job_id 53706 epoch 8 loss 0.31929
Worker 1 job_id 53706 epoch 9 loss 0.31583
Worker 1 job_id 53706 epoch 10 loss 0.32275
Worker 1 job_id 53706 epoch 11 loss 0.31855
Worker 1 job_id 53706 epoch 12 loss 0.30443
Worker 1 job_id 53706 epoch 13 loss 0.29691
Worker 1 job_id 53706 epoch 14 loss 0.30042
Worker 1 job_id 53706 epoch 15 loss 0.30244
Worker 1 job_id 53706 epoch 16 loss 0.27807
Worker 1 job_id 53706 epoch 17 loss 0.27728
Worker 1 job_id 53706 epoch 18 loss 0.27495
Worker 1 job_id 53706 epoch 19 loss 0.26966
Worker 1 job_id 53706 epoch 20 loss 0.26847
Worker 1 job_id 53706 epoch 21 loss 0.26734
Worker 1 job_id 53706 epoch 22 loss 0.26443
Worker 1 job_id 53706 epoch 23 loss 0.25856
Worker 1 job_id 53706 epoch 24 loss 0.2549
Worker 1 job_id 53706 epoch 25 loss 0.2487
Worker 1 job_id 53706 epoch 26 loss 0.25067
Worker 1 job_id 53706 epoch 27 loss 0.26313
Worker 1 job_id 53706 epoch 28 loss 0.28409
Worker 1 job_id 53706 epoch 29 loss 0.30988

To Reproduce

Steps to reproduce the behavior:

  1. Build the docker image using a local Dockerfile, modified from the one provided in the repo. The BytePS used in the Dockerfile is vanilla BytePS (69a3d), only with some extra print statements.
  2. Set up a 4-node cluster. In my case, I used Azure ML for a 4-node cluster (STANDARD_NC24, 4 x NVIDIA Tesla K80).
  3. For all_reduce, start four containers on each of the 4 nodes and run the training scripts. For the exact steps, see below.
  4. For BytePS, similarly start 4 worker containers, 1 server container, and 1 scheduler container. For the exact steps, see below.

building the docker image

# cd into the directory with Dockerfile and do
sudo docker build -t bytepsimage/pytorch . -f Dockerfile --build-arg FRAMEWORK=pytorch

allreduce set up

# start the docker containers, one on each node
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=0 --env CUDA_VISIBLE_DEVICES=0 --name worker bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=1 --env CUDA_VISIBLE_DEVICES=0 --name worker bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=2 --env CUDA_VISIBLE_DEVICES=0 --name worker bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=3 --env CUDA_VISIBLE_DEVICES=0 --name worker bytepsimage/pytorch

# for each node, go inside the container
sudo docker exec -it worker bash

# specify the environment variables and start training; replace MASTER_ADDR with the IP of your master (rank 0) node
MASTER_ADDR=10.0.0.4 MASTER_PORT=12345 RANK=0 WORLD_SIZE=4 python3.8 bps_issue.py --all_reduce -pindex 0 
MASTER_ADDR=10.0.0.4 MASTER_PORT=12345 RANK=1 WORLD_SIZE=4 python3.8 bps_issue.py --all_reduce -pindex 1 
MASTER_ADDR=10.0.0.4 MASTER_PORT=12345 RANK=2 WORLD_SIZE=4 python3.8 bps_issue.py --all_reduce -pindex 2 
MASTER_ADDR=10.0.0.4 MASTER_PORT=12345 RANK=3 WORLD_SIZE=4 python3.8 bps_issue.py --all_reduce -pindex 3 
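
For reference, the all_reduce branch of bps_issue.py presumably initializes the default process group from these environment variables; a minimal sketch under that assumption (the real script may differ):

# env://-based init assumed by the commands above
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl", init_method="env://")  # reads MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
torch.cuda.set_device(0)  # each container only sees one GPU (CUDA_VISIBLE_DEVICES=0)
print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")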

byteps set up

# scheduler on node 0
sudo docker run -dt --net=host --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=scheduler --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=0 --name scheduler bytepsimage/pytorch

# server on node 0
sudo docker run -dt --net=host --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=server --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=0 --name server bytepsimage/pytorch

# 4 workers on node 0-3
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=0 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=0 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=0 --name worker0 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=1 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=1 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=0 --name worker1 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=2 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=2 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=0 --name worker2 bytepsimage/pytorch
sudo nvidia-docker run -dt --net=host --env NVIDIA_VISIBLE_DEVICES=3 --env CUDA_VISIBLE_DEVICES=0 --env DMLC_WORKER_ID=3 --env DMLC_NUM_WORKER=4 --env DMLC_ROLE=worker --env DMLC_NUM_SERVER=1 --env DMLC_PS_ROOT_URI=10.0.0.4 --env DMLC_PS_ROOT_PORT=12345 --env BYTEPS_ENABLE_ASYNC=0 --name worker3 bytepsimage/pytorch

# for each node, go inside the container (server, scheduler, worker 0, ..., worker 3)
sudo docker exec -it {replace_me_with_name_of_container} bash

# use bpslaunch to start the server/scheduler. For the workers, run the matching command (one per worker):
bpslaunch python3.8 bps_issue.py --bps -pindex 0
bpslaunch python3.8 bps_issue.py --bps -pindex 1
bpslaunch python3.8 bps_issue.py --bps -pindex 2
bpslaunch python3.8 bps_issue.py --bps -pindex 3
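
For reference, the worker side launched by bpslaunch presumably starts with the usual byteps.torch initialization; a minimal sketch under that assumption (the real bps_issue.py may differ):

# BytePS worker-side init assumed by the bpslaunch commands above
import torch
import byteps.torch as bps

bps.init()                               # reads the DMLC_* variables set on each container
torch.cuda.set_device(bps.local_rank())  # each container exposes a single GPU, so this is 0
print(f"worker rank {bps.rank()} of {bps.size()} started")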

ymjiang commented 3 years ago

@pleasantrabbit Can you please take a look? This seems related to byteps.torch.parallel.

ruipeterpan commented 3 years ago

Hey, so I looked into this a bit and realized that, after I forked the repo, some changes were made to byteps/byteps/torch/ops.cc for averaging in bps.push_pull (push_pull_async). I updated my byteps fork and refreshed the byteps training curve above, but it still looks odd...

pleasantrabbit commented 3 years ago

@ymjiang @ruipeterpan I am looking into it.

pleasantrabbit commented 3 years ago

@ruipeterpan

For all_reduce, start four containers on each of the 4 nodes and run the training scripts.

Just wanted to make sure, you used 4 containers in total, only 1 container was started on each node, and each container has 1 GPU, right?

I ran your script with slight modifications (here); here's the loss I got. The loss from byteps is fairly close to PyTorch DDP's. It's expected that they won't produce exactly the same numbers.

In both tests I used 4 worker containers, each with 1 GPU.

# byteps

Worker None job_id 53706 epoch 0 loss 0.97005
Worker None job_id 53706 epoch 1 loss 0.44175
Worker None job_id 53706 epoch 2 loss 0.35767
Worker None job_id 53706 epoch 3 loss 0.30529
Worker None job_id 53706 epoch 4 loss 0.26517
Worker None job_id 53706 epoch 5 loss 0.25007
Worker None job_id 53706 epoch 6 loss 0.23308
Worker None job_id 53706 epoch 7 loss 0.22774
Worker None job_id 53706 epoch 8 loss 0.21719
Worker None job_id 53706 epoch 9 loss 0.20118
Worker None job_id 53706 epoch 10 loss 0.19615
Worker None job_id 53706 epoch 11 loss 0.18659
Worker None job_id 53706 epoch 12 loss 0.18569
Worker None job_id 53706 epoch 13 loss 0.18023
Worker None job_id 53706 epoch 14 loss 0.17755
Worker None job_id 53706 epoch 15 loss 0.16427
Worker None job_id 53706 epoch 16 loss 0.15958
Worker None job_id 53706 epoch 17 loss 0.1618
Worker None job_id 53706 epoch 18 loss 0.15625
Worker None job_id 53706 epoch 19 loss 0.15476
Worker None job_id 53706 epoch 20 loss 0.15157
Worker None job_id 53706 epoch 21 loss 0.1415
Worker None job_id 53706 epoch 22 loss 0.14043
Worker None job_id 53706 epoch 23 loss 0.13888
Worker None job_id 53706 epoch 24 loss 0.14122
Worker None job_id 53706 epoch 25 loss 0.13685
Worker None job_id 53706 epoch 26 loss 0.13594
Worker None job_id 53706 epoch 27 loss 0.13268
Worker None job_id 53706 epoch 28 loss 0.1333
Worker None job_id 53706 epoch 29 loss 0.13674

# pytorch ddp

Worker 0 job_id 53706 epoch 0 loss 0.97005
Worker 0 job_id 53706 epoch 1 loss 0.44009
Worker 0 job_id 53706 epoch 2 loss 0.36083
Worker 0 job_id 53706 epoch 3 loss 0.30743
Worker 0 job_id 53706 epoch 4 loss 0.26728
Worker 0 job_id 53706 epoch 5 loss 0.24862
Worker 0 job_id 53706 epoch 6 loss 0.23519
Worker 0 job_id 53706 epoch 7 loss 0.22722
Worker 0 job_id 53706 epoch 8 loss 0.21581
Worker 0 job_id 53706 epoch 9 loss 0.19869
Worker 0 job_id 53706 epoch 10 loss 0.19582
Worker 0 job_id 53706 epoch 11 loss 0.18607
Worker 0 job_id 53706 epoch 12 loss 0.18921
Worker 0 job_id 53706 epoch 13 loss 0.17617
Worker 0 job_id 53706 epoch 14 loss 0.17317
Worker 0 job_id 53706 epoch 15 loss 0.16502
Worker 0 job_id 53706 epoch 16 loss 0.16283
Worker 0 job_id 53706 epoch 17 loss 0.15986
Worker 0 job_id 53706 epoch 18 loss 0.15734
Worker 0 job_id 53706 epoch 19 loss 0.1613
Worker 0 job_id 53706 epoch 20 loss 0.15552
Worker 0 job_id 53706 epoch 21 loss 0.14287
Worker 0 job_id 53706 epoch 22 loss 0.14508
Worker 0 job_id 53706 epoch 23 loss 0.13977
Worker 0 job_id 53706 epoch 24 loss 0.14282
Worker 0 job_id 53706 epoch 25 loss 0.14104
Worker 0 job_id 53706 epoch 26 loss 0.13794
Worker 0 job_id 53706 epoch 27 loss 0.14073
Worker 0 job_id 53706 epoch 28 loss 0.13636
Worker 0 job_id 53706 epoch 29 loss 0.12924

ruipeterpan commented 3 years ago

Just wanted to make sure, you used 4 containers in total, only 1 container was started on each node, and each container has 1 GPU, right?

Ah, sorry for the wording -- yes, 4 worker containers (one on each node) are started and each container is exposed to 1 GPU only.

Thanks for the script, I was able to replicate the result! So the key reason behind this issue is the removal of the broadcast_parameters() calls, i.e. the model parameters & optimizer state were not synchronized before training began. This is really interesting... any idea why we needed those calls in the first place?
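
For reference, with the Horovod-style byteps.torch API the synchronization in question would be a one-time broadcast from rank 0 before the training loop, roughly like this (a sketch; model and optimizer are assumed to already exist, and this is not copied from bps_issue.py):

# one-time sync before the training loop (sketch); without it every worker starts
# from differently initialized weights, so the averaged gradients no longer
# correspond to a single consistent model
import byteps.torch as bps

bps.broadcast_parameters(model.state_dict(), root_rank=0)
bps.broadcast_optimizer_state(optimizer, root_rank=0)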

Thanks again!

pleasantrabbit commented 3 years ago

@ruipeterpan ah, that's where you got them. In the DDP API there's no need to manually broadcast the model parameters; the forward() step already does the broadcast. This behavior is the same as PyTorch DDP.
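
For example, with the DDP-style wrapper a minimal setup would look roughly like this (a sketch; the exact constructor arguments of byteps.torch.parallel.DistributedDataParallel are assumptions here):

# DDP-style usage (sketch): the wrapper synchronizes parameters itself, so no
# explicit broadcast_parameters()/broadcast_optimizer_state() calls are needed
import byteps.torch as bps
from byteps.torch.parallel import DistributedDataParallel as DDP

bps.init()
model = DDP(model.cuda(), device_ids=[0])  # forward() takes care of the initial broadcast, as noted above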

ruipeterpan commented 3 years ago

I see, thanks for the explanation! Closing this issue.

ruipeterpan commented 3 years ago

Hey @pleasantrabbit, sorry if this is nitpicking, but here's something odd: I was not able to replicate the result using the exact same setup from four days ago. Previously, the per-epoch training loss difference between PyTorch DDP and bps DDP was ~0.01, but now it's around ~0.03. This is not that big of an issue, but if you want to take a look, can you confirm that the run-to-run difference due to randomness is smaller than 0.01?

pytorch ddp

Worker 1 job_id 53706 epoch 0 loss 0.96994
Worker 1 job_id 53706 epoch 1 loss 0.44412
Worker 1 job_id 53706 epoch 2 loss 0.36226
Worker 1 job_id 53706 epoch 3 loss 0.30746
Worker 1 job_id 53706 epoch 4 loss 0.2686
Worker 1 job_id 53706 epoch 5 loss 0.24376
Worker 1 job_id 53706 epoch 6 loss 0.23426
Worker 1 job_id 53706 epoch 7 loss 0.22783
Worker 1 job_id 53706 epoch 8 loss 0.21471
Worker 1 job_id 53706 epoch 9 loss 0.2
Worker 1 job_id 53706 epoch 10 loss 0.1967
Worker 1 job_id 53706 epoch 11 loss 0.19032
Worker 1 job_id 53706 epoch 12 loss 0.1842
Worker 1 job_id 53706 epoch 13 loss 0.1791
Worker 1 job_id 53706 epoch 14 loss 0.167
Worker 1 job_id 53706 epoch 15 loss 0.16194
Worker 1 job_id 53706 epoch 16 loss 0.15595
Worker 1 job_id 53706 epoch 17 loss 0.15577
Worker 1 job_id 53706 epoch 18 loss 0.16591
Worker 1 job_id 53706 epoch 19 loss 0.15847
Worker 1 job_id 53706 epoch 20 loss 0.14782
Worker 1 job_id 53706 epoch 21 loss 0.14284
Worker 1 job_id 53706 epoch 22 loss 0.14065
Worker 1 job_id 53706 epoch 23 loss 0.14049
Worker 1 job_id 53706 epoch 24 loss 0.13772
Worker 1 job_id 53706 epoch 25 loss 0.13597
Worker 1 job_id 53706 epoch 26 loss 0.13846
Worker 1 job_id 53706 epoch 27 loss 0.14307
Worker 1 job_id 53706 epoch 28 loss 0.13831
Worker 1 job_id 53706 epoch 29 loss 0.13162

bps ddp

Worker 1 job_id 53706 epoch 0 loss 0.99761
Worker 1 job_id 53706 epoch 1 loss 0.46318
Worker 1 job_id 53706 epoch 2 loss 0.38711
Worker 1 job_id 53706 epoch 3 loss 0.33093
Worker 1 job_id 53706 epoch 4 loss 0.29031
Worker 1 job_id 53706 epoch 5 loss 0.26667
Worker 1 job_id 53706 epoch 6 loss 0.25335
Worker 1 job_id 53706 epoch 7 loss 0.25247
Worker 1 job_id 53706 epoch 8 loss 0.23936
Worker 1 job_id 53706 epoch 9 loss 0.22375
Worker 1 job_id 53706 epoch 10 loss 0.21695
Worker 1 job_id 53706 epoch 11 loss 0.20835
Worker 1 job_id 53706 epoch 12 loss 0.20624
Worker 1 job_id 53706 epoch 13 loss 0.20149
Worker 1 job_id 53706 epoch 14 loss 0.20252
Worker 1 job_id 53706 epoch 15 loss 0.19436
Worker 1 job_id 53706 epoch 16 loss 0.18817
Worker 1 job_id 53706 epoch 17 loss 0.18962
Worker 1 job_id 53706 epoch 18 loss 0.18058
Worker 1 job_id 53706 epoch 19 loss 0.18404
Worker 1 job_id 53706 epoch 20 loss 0.17468
Worker 1 job_id 53706 epoch 21 loss 0.16521
Worker 1 job_id 53706 epoch 22 loss 0.17198
Worker 1 job_id 53706 epoch 23 loss 0.16653
Worker 1 job_id 53706 epoch 24 loss 0.17104
Worker 1 job_id 53706 epoch 25 loss 0.16565
Worker 1 job_id 53706 epoch 26 loss 0.15876
Worker 1 job_id 53706 epoch 27 loss 0.16959
Worker 1 job_id 53706 epoch 28 loss 0.16414
Worker 1 job_id 53706 epoch 29 loss 0.17172