bytedance / byteps

A high performance and generic framework for distributed DNN training

A question about BytePS's timeline #349

Open wuyujiji opened 3 years ago

wuyujiji commented 3 years ago

Hello, following https://github.com/bytedance/byteps/blob/master/docs/timeline.md, I generated the timeline and found that the gradient tensors' start times are unordered, which seems to imply that the priority scheduler is enabled. However, I checked the code and found that the priority scheduler is disabled by default.

[screenshot: BytePS timeline showing unordered tensor start times]

ymjiang commented 3 years ago

The compute engines (e.g., TF/PyTorch) do not guarantee the order, so they can also lead to unordered start times for different tensors in BytePS.

wuyujiji commented 3 years ago

Thanks for your reply! I also generated the Horovod timeline with the same model. For Horovod, it looks like the tensor start times are ordered. Based on this, I have two questions: 1. Is the reason the Horovod timeline is ordered that Horovod enables the NEGOTIATE mechanism? 2. How is the NEGOTIATE mechanism of BytePS implemented?

[screenshot: Horovod timeline, 2021-01-06 12:56 PM]
ymjiang commented 3 years ago

For intra-machine GPU negotiation, BytePS maintains a ready table to record which tensors are ready (see byteps/common/ready_table.cc). The ready order can differ across GPUs, so you will find the tensors quite unordered.

I am not quite sure about how Horovod negotiates. Perhaps you can ask the Horovod community about that.
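
For intuition only, here is a minimal Python sketch of the ready-count idea described above. The class and method names are illustrative, not the actual BytePS API (the real implementation is the C++ code in byteps/common/ready_table.cc):

```python
import threading

class ReadyTable:
    """Illustrative ready-count table: a tensor moves to the next stage only
    after all local GPUs have reported it. Not the actual BytePS API."""

    def __init__(self, num_local_gpus):
        self.num_local_gpus = num_local_gpus
        self.counts = {}          # tensor key -> number of GPUs that reported it
        self.lock = threading.Lock()

    def add_ready_count(self, tensor_key):
        # Called by each local GPU worker once its copy of the tensor is ready.
        with self.lock:
            self.counts[tensor_key] = self.counts.get(tensor_key, 0) + 1
            return self.counts[tensor_key] == self.num_local_gpus

    def clear(self, tensor_key):
        # Reset the entry once the tensor has been handed to the next stage.
        with self.lock:
            self.counts.pop(tensor_key, None)

# Because GPUs finish backward computation for a given tensor at different
# times, the order in which keys reach the "all ready" state varies, which is
# why the start times in the timeline look unordered.
```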

wuyujiji commented 3 years ago

Thanks for your detailed explanation! If I want to get the pure backward computation time and the pure communication time, how can I extract them from the BytePS timeline? Or can these two times be obtained in some other way?

ymjiang commented 3 years ago

BytePS cannot capture the backward time. You should use the profilers of TF/PyTorch to get it.

What is your definition of pure communication time? I think you already have it in the timeline you dumped.
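
On the PyTorch side, a generic way to time the backward pass is the standard torch.profiler; the sketch below uses a placeholder model, inputs, and loss, and is not specific to BytePS:

```python
import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Placeholders for your own model and data.
model = torch.nn.Linear(1024, 1024).cuda()
inputs = torch.randn(64, 1024, device="cuda")
targets = torch.randn(64, 1024, device="cuda")
loss_fn = torch.nn.MSELoss()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    out = model(inputs)
    loss = loss_fn(out, targets)
    with record_function("backward"):   # label the backward pass explicitly
        loss.backward()

# The "backward" record (and the CUDA kernels under it) gives the backward time.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```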

wuyujiji commented 3 years ago

Okay, I will use the profilers of TF/PyTorch to get the backward time.

The pure communication time means the time spent executing the NCCL API. For example, a Horovod allreduce op is composed of wait_for_data, wait_for_other_data, Queue, ... and ncclAllreduce in the Horovod timeline. I want to get only the ncclAllreduce time.

ymjiang commented 3 years ago

> Okay, I will use the profilers of TF/PyTorch to get the backward time.
>
> The pure communication time means the time spent executing the NCCL API. For example, a Horovod allreduce op is composed of wait_for_data, wait_for_other_data, Queue, ... and ncclAllreduce in the Horovod timeline. I want to get only the ncclAllreduce time.

@joapolarbear: Is this supported in BytePS profiler?

joapolarbear commented 3 years ago

@wuyujiji In BytePS, tensor communication can be divided into the following steps

  "COORDINATE_REDUCE",
  "REDUCE",
  "COPYD2H",
  "PCIE_REDUCE",
  "COORDINATE_PUSH",
  "PUSH",
  "PULL",
  "COPYH2D",
  "COORDINATE_BROADCAST",
  "BROADCAST"

Here REDUCE and BROADCAST correspond to the intra-machine GPU synchronization. Does this meet your requirement? BTW, Horovod also cannot guarantee the tensor order. With the default settings of HOROVOD_CYCLE_TIME (5ms) and HOROVOD_FUSION_THRESHOLD (64MB), I found that tensors are not always fused in the same order as backpropagation.
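
To aggregate these steps from the dumped timeline, a rough parsing sketch could look like the following. It assumes a Chrome-trace-style JSON whose event names contain the step names listed above; the exact event layout and naming may differ across BytePS versions, so treat the field names as assumptions rather than the definitive trace format:

```python
import json
from collections import defaultdict

STEPS = ("COORDINATE_REDUCE", "REDUCE", "COPYD2H", "PCIE_REDUCE",
         "COORDINATE_PUSH", "PUSH", "PULL", "COPYH2D",
         "COORDINATE_BROADCAST", "BROADCAST")

def step_of(name):
    # Map an event name to the step it mentions, longest match first so that
    # e.g. "COORDINATE_PUSH" is not misclassified as "PUSH".
    for step in sorted(STEPS, key=len, reverse=True):
        if step in name:
            return step
    return None

def sum_step_durations(trace_path):
    """Sum per-step time (in trace time units, usually microseconds).

    Assumes either complete events ("ph" == "X" carrying "dur")
    or begin/end pairs ("B"/"E")."""
    with open(trace_path) as f:
        data = json.load(f)
    events = data.get("traceEvents", data) if isinstance(data, dict) else data
    totals = defaultdict(float)
    begin_ts = {}
    for ev in events:
        step = step_of(ev.get("name", ""))
        if step is None:
            continue
        key = (ev.get("pid"), ev.get("tid"), ev.get("name"))
        if ev.get("ph") == "X":
            totals[step] += ev.get("dur", 0)
        elif ev.get("ph") == "B":
            begin_ts[key] = ev["ts"]
        elif ev.get("ph") == "E" and key in begin_ts:
            totals[step] += ev["ts"] - begin_ts.pop(key)
    return dict(totals)

# totals = sum_step_durations("trace.json")
# comm = sum(totals.get(s, 0.0) for s in ("REDUCE", "BROADCAST", "PUSH", "PULL"))
```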

wuyujiji commented 3 years ago

@joapolarbear Thanks for your explanation. In my opinion, the pure communication time of BytePS should be the sum of the intra-machine and inter-machine communication times. The intra-machine communication time is REDUCE time + BROADCAST time, and the inter-machine communication time is PUSH time + PULL time. Therefore, the total communication time is REDUCE time + BROADCAST time + PUSH time + PULL time. Is my understanding correct?

joapolarbear commented 3 years ago

@wuyujiji Right. For the GPU responsible for synchronizing with the PS, the communication time is REDUCE time + BROADCAST time + PUSH time + PULL time; for the other GPUs, it is REDUCE time + BROADCAST time.
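
As a toy illustration with made-up per-step durations, just to make the bookkeeping explicit:

```python
# Hypothetical per-step durations (ms) for one tensor; not measured values.
reduce_ms, broadcast_ms = 1.2, 0.9
push_ms, pull_ms = 3.5, 2.8

# GPU responsible for synchronizing with the PS:
root_comm_ms = reduce_ms + broadcast_ms + push_ms + pull_ms   # 8.4 ms
# Other GPUs on the same machine:
other_comm_ms = reduce_ms + broadcast_ms                      # 2.1 ms
```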

wuyujiji commented 3 years ago

@joapolarbear Okay, thank you for replying so quickly! I have a final question. In the BytePS timeline, there are blank gaps between these four ops (REDUCE, PUSH, PULL and BROADCAST); should I add these blank gap times to the final communication time? In addition, if a large tensor is partitioned into many slices, each sliced tensor incurs the cost of the four ops, and any two slices can overlap. In this case, how can I get the communication time of the original large tensor?

joapolarbear commented 3 years ago

@wuyujiji If you want the pure communication time, then these gaps should not be counted. If you want to get the communication time of the original large tensor, maybe you can disable tensor partitioning by setting BYTEPS_PARTITION_BYTES to a large value.

joapolarbear commented 3 years ago

@wuyujiji FYI, you can refer to https://github.com/bytedance/byteps/blob/master/byteps/common/global.cc to see how BYTEPS_PARTITION_BYTES works.
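
As an illustration, the threshold can be raised so that it exceeds the largest gradient tensor. The value below is arbitrary, and setting the variable in the launch script's environment works just as well; the sketch assumes it only needs to be set before BytePS initializes:

```python
import os

# Arbitrary large threshold (1 GB) so no gradient tensor gets sliced.
# BytePS reads this at init time (see byteps/common/global.cc).
os.environ["BYTEPS_PARTITION_BYTES"] = str(1024 * 1024 * 1024)

import byteps.torch as bps  # import and init only after the variable is set
bps.init()
```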

wuyujiji commented 3 years ago

@joapolarbear If I disable tensor partitioning, is there a big performance loss for communication? In other words, does the measured pure communication time of a large tensor differ a lot between enabling and disabling tensor partitioning?

wuyujiji commented 3 years ago

@joapolarbear @ymjiang Excuse me! When a large tensor is partitioned into many slices, why does the push of some sliced tensors take so long while that of others is so short? For example, the timeline shows that the push time of the first sliced tensor is 11.007 ms, while the push time of the fourth sliced tensor is 0.093 ms.

[screenshot: BytePS timeline showing PUSH times of the sliced tensors]

joapolarbear commented 3 years ago

@wuyujiji Actually, the current profiling method can only capture correct PUSH start timestamps and PULL end timestamps. That's why the current gaps between PUSHes and PULLs are very small. We are digging into ps-lite/ZMQ to get correct PUSH end timestamps and PULL start timestamps.

wuyujiji commented 3 years ago

@joapolarbear Thanks for your explanation. If we regard the push time and pull time as a whole, the timeline shows that the time consumed by the fourth sliced tensor is significantly shorter than that of the first sliced tensor. What is the reason for this phenomenon?

wuyujiji commented 3 years ago

cc @joapolarbear

joapolarbear commented 3 years ago

> If we regard the push time and pull time as a whole, the timeline shows that the time consumed by the fourth sliced tensor is significantly shorter than that of the first sliced tensor. What is the reason for this phenomenon?

@wuyujiji Maybe the fourth slice is indeed aggregated earlier in the PS. I am not very sure how ps-lite and the PS schedule these PUSH/PULL requests. Can you give any suggestion? @ymjiang

ymjiang commented 3 years ago

> the timeline shows that the time consumed by the fourth sliced tensor is significantly shorter than that of the first sliced tensor. What is the reason for this phenomenon?

There are many possible reasons. Maybe the first tensor is delayed by other tensors at the server, or the network is congested when it is being sent by ps-lite.