@gbxu Could you provide the Horovod timeline traces of the two cases? Let's figure out where the overhead comes from.
timeline.zip includes bsc.json for the USE_BYTESCHEDULER=1 case and hvd.json for USE_BYTESCHEDULER=0, respectively.
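(For reference, Horovod timelines are typically captured by pointing the HOROVOD_TIMELINE environment variable at an output file before launching, e.g. HOROVOD_TIMELINE=/tmp/hvd.json horovodrun -np 2 python pytorch_horovod_benchmark.py; the exact command used for these traces is an assumption.)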
@gbxu It seems that ByteScheduler takes 10 ms more to finish one iteration. Could you try setting --cycle-time-ms to a small value (e.g., 1 ms or 2 ms) to see whether the performance improves?
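(Presumably something like export USE_BYTESCHEDULER=1 && horovodrun -np 2 python pytorch_horovod_benchmark.py --cycle-time-ms 1, assuming --cycle-time-ms is a command-line flag of pytorch_horovod_benchmark.py.)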
The main overhead comes from parameter updates and locks. As you can see here, once the all-reduce for a gradient finishes, we call the SGD update, and we do so for each parameter individually. In contrast, plain PyTorch calls optimizer.step() once to update all parameters. Besides, our implementation uses locks here to ensure correct dependencies; this could be reimplemented to avoid the overhead. Check here for a better implementation.
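To make the contrast concrete, here is a minimal, hypothetical Python sketch of the two update styles (not ByteScheduler's actual code; the hook name, lock usage, and learning rate are made up for illustration):

```python
import threading

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
lock = threading.Lock()

# Style described above: when the all-reduce of one gradient finishes,
# immediately update that single parameter, holding a lock so the update
# does not race with other work. Doing this once per parameter puts
# per-tensor Python and locking overhead on the critical path.
def on_allreduce_done(param: torch.nn.Parameter, lr: float = 0.01) -> None:
    with lock:
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # per-parameter SGD step

# Plain PyTorch/Horovod style: wait until all gradients are ready, then
# update every parameter in a single optimizer.step() call, with no
# per-parameter locking.
def update_all(optimizer: torch.optim.Optimizer) -> None:
    optimizer.step()
    optimizer.zero_grad()
```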
> @gbxu It seems that ByteScheduler takes 10 ms more to finish one iteration. Could you try setting --cycle-time-ms to a small value (e.g., 1 ms or 2 ms) to see whether the performance improves?
It doesn't work; no obvious improvement.
> The main overhead comes from parameter updates and locks. As you can see here, once the all-reduce for a gradient finishes, we call the SGD update, and we do so for each parameter individually. In contrast, plain PyTorch calls optimizer.step() once to update all parameters. Besides, our implementation uses locks here to ensure correct dependencies; this could be reimplemented to avoid the overhead. Check here for a better implementation.
Thanks for your reply. Does the analysis come from the timeline files? Could you please show some screenshots? I didn't quite get it.
@gbxu From the timeline we can see that the backward computation with ByteScheduler takes 10 ms longer to finish than the baseline. So I suspect the gradient update operations interfere with the backward computation, which is why the backward computation takes longer to finish.
Hardware: 2 NVIDIA 1080 Ti GPUs connected by PCIe.
As mentioned in #339 ("To see similar performance of ByteScheduler as baseline, just disable partitioning and scheduling by setting BYTESCHEDULER_CREDIT and BYTESCHEDULER_PARTITION to infinity."), I ran the script as follows.
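(Presumably the command was along the lines of export USE_BYTESCHEDULER=1 && BYTESCHEDULER_CREDIT=1000000000 BYTESCHEDULER_PARTITION=1000000000 horovodrun -np 2 python pytorch_horovod_benchmark.py, assuming BYTESCHEDULER_CREDIT and BYTESCHEDULER_PARTITION are environment variables and a very large value stands in for "infinity".)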
The throughput is `Img/sec per GPU: 169.2 +-5.9`. Without ByteScheduler (i.e., `export USE_BYTESCHEDULER=0 && horovodrun -np 2 python pytorch_horovod_benchmark.py`), I get `Img/sec per GPU: 182.4 +-1.0`.
Any idea to avoid the overhead?

To Reproduce
Steps to reproduce the behavior: see the commands above.
Expected behavior
I noticed that the overhead of ByteScheduler hurts performance. Hi @pengyanghua, could you give me some advice?
Environment (please complete the following information)
Just the Dockerfile you provided.