bytedance / byteps

A high performance and generic framework for distributed DNN training

how to reduce the overhead of bytescheduler? #370

Closed gbxu closed 2 years ago

gbxu commented 3 years ago

hardware:

2 NVIDIA GTX 1080 Ti GPUs connected via PCIe.

As mentioned in #339, "To see similar performance of ByteScheduler as baseline, just disable partitioning and scheduling by setting BYTESCHEDULER_CREDIT and BYTESCHEDULER_PARTITION to infinity.", I ran the script as follows.

export BYTESCHEDULER_CREDIT_TUNING=0
export BYTESCHEDULER_CREDIT=100000000000
export BYTESCHEDULER_PARTITION_TUNING=0
export BYTESCHEDULER_PARTITION=1000000000000
export BYTESCHEDULER_DEBUG=0
export USE_BYTESCHEDULER=1
horovodrun -np 2 python pytorch_horovod_benchmark.py

The throughput is Img/sec per GPU: 169.2 +-5.9. Without ByteScheduler (i.e., export USE_BYTESCHEDULER=0 && horovodrun -np 2 python pytorch_horovod_benchmark.py), I get Img/sec per GPU: 182.4 +-1.0. Any idea how to avoid the overhead?

To Reproduce
Steps to reproduce the behavior:

  1. build the environment via this dockerfile
  2. run the aforementioned commands.

Expected behavior
I noticed that the overhead of ByteScheduler hurts performance. Hi @pengyanghua, could you give me some advice?

Environment (please complete the following information):
Just the dockerfile you provided.

pengyanghua commented 3 years ago

@gbxu Could you provide the Horovod timeline traces of the two cases? Let's figure out where the overhead comes from.

gbxu commented 3 years ago

timeline.zip includes bsc.json for the case USE_BYTESCHEDULER=1 and hvd.json for USE_BYTESCHEDULER=0 respectively.
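
For anyone digging into the traces, here is a minimal sketch of how the two files could be compared offline (an illustrative script, not part of the repo; it assumes bsc.json and hvd.json are in the Chrome trace event format that Horovod's timeline emits):

import json
import sys
from collections import defaultdict

def load_events(path):
    # Horovod streams the timeline, so the closing bracket may be missing.
    with open(path) as f:
        text = f.read().strip()
    if not text.endswith("]"):
        text = text.rstrip(",") + "]"
    return json.loads(text)

def total_time_per_op(path):
    # Sum the time spent in each named event, in microseconds.
    totals = defaultdict(float)
    stacks = defaultdict(list)  # (pid, tid) -> stack of (name, begin timestamp)
    for ev in load_events(path):
        ph, key = ev.get("ph"), (ev.get("pid"), ev.get("tid"))
        if ph == "X":  # complete event carrying its own duration
            totals[ev.get("name", "?")] += ev.get("dur", 0)
        elif ph == "B":  # begin/end pair
            stacks[key].append((ev.get("name", "?"), ev["ts"]))
        elif ph == "E" and stacks[key]:
            name, ts = stacks[key].pop()
            totals[name] += ev["ts"] - ts
    return totals

if __name__ == "__main__":
    # e.g. python compare_timelines.py bsc.json hvd.json
    for path in sys.argv[1:]:
        print(path)
        top = sorted(total_time_per_op(path).items(), key=lambda kv: -kv[1])[:10]
        for name, us in top:
            print(f"  {name:<40} {us / 1e3:10.1f} ms")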

pengyanghua commented 3 years ago

@gbxu It seems that ByteScheduler takes 10ms more to finish one iteration. Could you try setting --cycle-time-ms to a small value (e.g., 1 ms or 2 ms) to see whether the performance improves?

pengyanghua commented 3 years ago

The main overhead comes from parameter updates and locks. As you can see here, once a gradient's all-reduce is finished, we run the SGD update for that parameter, and we do so for each parameter separately. In contrast, PyTorch calls optimizer.step() once to update all parameters. Besides, our implementation uses locks here to ensure correct dependencies, which could be reimplemented to avoid the overhead. Check here for a better implementation.
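
To make the contrast concrete, here is a simplified sketch (the callback name and the lock placement are illustrative only, not ByteScheduler's actual code):

import threading
import torch

model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lock = threading.Lock()

# Pattern A: a callback fired when one gradient's all-reduce finishes updates
# that single parameter, holding a lock so the update cannot race with the next
# forward/backward pass. One lock acquisition and one tiny kernel per parameter.
def on_allreduce_done(param, lr=0.01):
    with lock:
        param.data.add_(param.grad, alpha=-lr)
        param.grad = None

# Pattern B: plain PyTorch waits until all gradients are ready and then updates
# every parameter with a single optimizer.step() call, no per-parameter locking.
x = torch.randn(8, 1024)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad()

Running Pattern A once per parameter adds lock traffic and many small update kernels on a device that is still busy with backward, which is the overhead described above.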

gbxu commented 3 years ago

@gbxu It seems that ByteScheduler takes 10ms more to finish one iteration. Could you try setting --cycle-time-ms to a small value (e.g., 1 ms or 2 ms) to see whether the performance improves?

It doesn't help. No obvious improvement.

gbxu commented 3 years ago

The main overhead comes from parameter updates and locks. As you can see here, once a gradient's all-reduce is finished, we run the SGD update for that parameter, and we do so for each parameter separately. In contrast, PyTorch calls optimizer.step() once to update all parameters. Besides, our implementation uses locks here to ensure correct dependencies, which could be reimplemented to avoid the overhead. Check here for a better implementation.

Thanks for your reply. Does the analysis come from the timeline files? Could you please show some screenshots? I don't quite get it.

pengyanghua commented 3 years ago

@gbxu From the timeline we can see that the backward computation with ByteScheduler takes 10ms longer to finish than the baseline. So I guess the gradient update operations interfere with the backward computation, and hence the backward pass takes longer to finish.
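
One way to cross-check this outside the timeline is to time the backward pass directly on each rank (an illustrative helper, not part of the benchmark; it assumes a CUDA build of PyTorch and would be called from the training loop of pytorch_horovod_benchmark.py):

import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

def timed_backward(loss):
    # Returns the wall-clock time of one backward pass in milliseconds.
    torch.cuda.synchronize()
    start.record()
    loss.backward()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end)

Comparing the reported number with USE_BYTESCHEDULER=1 and USE_BYTESCHEDULER=0 would show directly whether the extra ~10ms sits in the backward pass.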