wuyujiji opened this issue 3 years ago
The compute engines (e.g., TF/PyTorch) do not guarantee the execution order, so they can also lead to unordered start times of different tensors in BytePS.
Thanks for your reply! I also captured the Horovod timeline with the same model. For Horovod, the tensor start times look ordered. Based on this, I have two questions: 1. Is the Horovod timeline ordered because Horovod enables the NEGOTIATE mechanism? 2. How is the NEGOTIATE mechanism of BytePS implemented?
For intra-machine GPU negotiation, BytePS maintains a ready table to record which tensors are ready (see byteps/common/ready_table.cc). The sequence can differ across GPUs, so you will find the tensors quite unordered.
I am not quite sure about how Horovod negotiates. Perhaps you can ask the Horovod community about that.
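To illustrate the idea, here is a conceptual Python sketch of what such a ready table does; this is not the actual C++ implementation in ready_table.cc, and the method names are only illustrative:

```python
import threading

class ReadyTable:
    """Conceptual sketch: count per-tensor readiness across local GPUs.

    A tensor (identified by its key) is considered ready for the next
    stage once all `local_size` GPUs have registered it. Because GPUs
    finish the backward pass for different tensors at different times,
    the order in which keys become ready is not deterministic.
    """
    def __init__(self, local_size):
        self._local_size = local_size
        self._counts = {}
        self._lock = threading.Lock()

    def add_ready_count(self, key):
        """Called when one local GPU reports that tensor `key` is ready."""
        with self._lock:
            self._counts[key] = self._counts.get(key, 0) + 1

    def is_key_ready(self, key):
        """True once every local GPU has reported tensor `key`."""
        with self._lock:
            return self._counts.get(key, 0) == self._local_size

    def clear_key(self, key):
        """Reset the entry after the tensor has been processed."""
        with self._lock:
            self._counts.pop(key, None)
```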
Thanks for your detailed explanation! If I want to get the pure backward computation time and the pure communication time, how can I analyze them from the BytePS timeline? Or can these two times be obtained in other ways?
BytePS cannot capture the backward time. You should use the profilers of TF/PyTorch to get it.
What is your definition of pure communication time? I think you already have it in the timeline you dumped.
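For example, a minimal sketch with the PyTorch profiler, assuming a recent PyTorch with torch.profiler and a CUDA device (the model and input here are placeholders):

```python
import torch
from torch.profiler import profile, ProfilerActivity

model = torch.nn.Linear(1024, 1024).cuda()          # placeholder model
x = torch.randn(64, 1024, device="cuda")            # placeholder input

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    loss = model(x).sum()
    loss.backward()   # backward ops typically show up with "Backward" in their names

# Inspect per-op times; backward ops can be filtered by name in this table.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))

# Or export a Chrome trace and read it side by side with the BytePS timeline.
prof.export_chrome_trace("backward_trace.json")
```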
Okay, I will use the profilers of TF/PyTorch to get the backward time.
The pure communication time means the time spent executing the NCCL API. For example, a Horovod allreduce op is composed of wait_for_data, wait_for_other_data, Queue, ... and ncclAllreduce in the Horovod timeline. I want to get only the ncclAllreduce time.
@joapolarbear: Is this supported in BytePS profiler?
@wuyujiji In BytePS, tensor communication can be divided into the following steps:
"COORDINATE_REDUCE",
"REDUCE",
"COPYD2H",
"PCIE_REDUCE",
"COORDINATE_PUSH",
"PUSH",
"PULL",
"COPYH2D",
"COORDINATE_BROADCAST",
"BROADCAST"
Here REDUCE and BROADCAST correspond to the intra-machine GPU synchronization. Does this meet your requirement?
BTW, Horovod also cannot guarantee the tensor order. With the default settings of HOROVOD_CYCLE_TIME (5 ms) and HOROVOD_FUSION_THRESHOLD (64 MB), I found that tensors are not always fused in the same order as backpropagation.
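For reference, these are ordinary environment variables; one way to pin them explicitly, assuming they are set in the same process before Horovod is initialized, is:

```python
import os

# Tensor Fusion knobs mentioned above (values are the defaults cited in this thread):
# cycle time in milliseconds, fusion buffer threshold in bytes.
os.environ["HOROVOD_CYCLE_TIME"] = "5"
os.environ["HOROVOD_FUSION_THRESHOLD"] = str(64 * 1024 * 1024)

import horovod.torch as hvd  # initialize only after the environment is set
hvd.init()
```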
@joapolarbear Thanks for your explanation. In my opinion, the pure communication time of BytePS should be the sum of the intra-machine and inter-machine communication time. The intra-machine communication time equals REDUCE time + BROADCAST time, and the inter-machine communication time equals PUSH time + PULL time. Therefore, the total communication time is REDUCE time + BROADCAST time + PUSH time + PULL time. Is my understanding correct?
@wuyujiji Right. For the GPU responsible for synchronizing with the PS, the communication time is REDUCE time + BROADCAST time + PUSH time + PULL time; for the other GPUs, the communication time is REDUCE time + BROADCAST time.
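If it helps, here is a rough Python sketch of that bookkeeping. It assumes the timeline dumped per docs/timeline.md is in Chrome trace format, with complete events ("ph": "X") whose names contain the phase strings listed above; the exact field layout may differ, so treat it only as a starting point.

```python
import json
from collections import defaultdict

COMM_PHASES = ("REDUCE", "BROADCAST", "PUSH", "PULL")

def comm_time_ms(trace_path):
    """Sum complete-event durations per phase, skipping COORDINATE_* waiting events.

    Note: PCIE_REDUCE events (if present) would also be folded into REDUCE here.
    """
    with open(trace_path) as f:
        data = json.load(f)
    # Chrome traces are either a JSON array or an object with "traceEvents".
    events = data["traceEvents"] if isinstance(data, dict) else data

    totals = defaultdict(float)
    for ev in events:
        name = ev.get("name", "")
        if ev.get("ph") != "X" or name.startswith("COORDINATE"):
            continue
        for phase in COMM_PHASES:
            if phase in name:
                totals[phase] += ev.get("dur", 0) / 1e3  # trace "dur" is in microseconds
                break
    return totals

if __name__ == "__main__":
    totals = comm_time_ms("timeline.json")  # path is just an example
    print(dict(totals))
    print("total (ms):", round(sum(totals.values()), 3))
```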
@joapolarbear Okay, thank you for replying so quickly! I have a final question. In the BytePS timeline, there are blank gaps between these four ops (REDUCE, PUSH, PULL and BROADCAST); should I add these blank gaps to the final communication time? In addition, if a large tensor is partitioned into many slices, each slice incurs the cost of the four ops, and any two slices overlap. In this case, how can I get the communication time of the original large tensor?
@wuyujiji If you want pure communication time, then these gaps should not be counted. If you want to get the communication time of the original large tensor, maybe you can disable tensor partition by setting BYTEPS_PARTITION_BYTES to a large value.
@wuyujiji FYI, you can refer to https://github.com/bytedance/byteps/blob/master/byteps/common/global.cc to see how BYTEPS_PARTITION_BYTES works.
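As a concrete illustration (assuming the variable is read when BytePS initializes, and that byteps.torch is the frontend in use), raising it above the largest tensor size might look like this:

```python
import os

# Raise the partition threshold above the largest tensor's byte size so that
# no tensor is split into slices (1 GB here is just an arbitrary example).
os.environ["BYTEPS_PARTITION_BYTES"] = str(1024 * 1024 * 1024)

import byteps.torch as bps  # import/initialize only after the env var is set
bps.init()
```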
@joapolarbear If I disable tensor partition, is there a big performance loss for communication? In other words, does the measured pure communication time of a large tensor differ a lot between partition enabled and partition disabled?
@joapolarbear @ymjiang Excuse me! When a large tensor is partitioned into many slices, why does the push time of some sliced tensors take so long while that of others is so short? For example, the timeline shows that the push time of the first sliced tensor is 11.007 ms, while the push time of the fourth sliced tensor is 0.093 ms.
@wuyujiji Actually, the current profiling method can only capture correct PUSH start timestamps and PULL end timestamps. That is why the current gaps between PUSHes and PULLs are very small. We are digging into ps-lite/ZMQ to get correct PUSH end timestamps and PULL start timestamps.
@joapolarbear Thanks for your explanation. If we regard the push time and pull time as a whole, the timeline shows that the fourth sliced tensor takes significantly less time than the first sliced tensor. What is the reason for this phenomenon?
cc @joapolarbear
@wuyujiji Maybe the fourth slice is indeed aggregated earlier on the PS. I am not very sure about how ps-lite and the PS schedule these PUSH/PULL requests. Can you give any suggestions? @ymjiang
There are many possible reasons. Maybe the first tensor is delayed by other tensors at the server, or the network is congested when it is being sent by ps-lite.
Hello, according to https://github.com/bytedance/byteps/blob/master/docs/timeline.md, I got the timeline and found that the gradient tensors' start times are unordered, which seems to imply that the priority scheduler is enabled. However, I checked the code and found that the priority scheduler is disabled by default.