Open · Rainlin007 opened this issue 2 months ago
I saw that the gemm time differs across GPUs, which then increases the synchronization time, but I could not find the reason why the gemm times differ.
You can check whether the frequencies of the different GPUs on the server are the same; some GPUs might have downclocked.
I have checked the frequency before, and it is indeed the same. I can also see that the times of the other kernels are similar; only this kernel shows a big difference. Have you encountered this before?
@zheng-ningxin
How many times did Flux loop in this profile?
About 500. @zheng-ningxin
Would you also observe this phenomenon when you use torch.profiler?
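A minimal sketch of such a check with torch.profiler, where `gemm_rs_op(inp)` is a placeholder for whatever Flux call the project actually uses:

```python
# Profile a few Flux gemm_rs iterations per rank and compare the kernel tables.
import torch
from torch.profiler import profile, ProfilerActivity

def run_profiled(gemm_rs_op, inp, iters=20):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            gemm_rs_op(inp)
        torch.cuda.synchronize()
    # Sort by CUDA time so the gemm and barrier kernels show up at the top.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```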
@Rainlin007 For a long run, the GPU might adjust its frequency. You can use a tool like nvidia-smi to monitor the frequency and sample it over time to check for changes.
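A minimal sampling sketch with pynvml (the Python bindings behind nvidia-smi); the one-second interval and one-minute duration are arbitrary choices:

```python
# Sample the SM clock of every GPU once per second and print them side by side,
# so a rank that downclocks during the run is easy to spot.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    for _ in range(60):  # sample for ~1 minute while the benchmark runs
        clocks = [pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
                  for h in handles]
        print(" ".join(f"GPU{i}:{c}MHz" for i, c in enumerate(clocks)))
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```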
The difference you showed in your profiling does look quite big. Out of your 500 runs, which iteration does this screenshot belong to? Maybe check the later ones to see whether this is stable or only shows up occasionally.
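One way to check that is to time every iteration separately with CUDA events; a minimal sketch, with `flux_gemm_rs(x)` again standing in for the actual call:

```python
# Time each iteration so a gradual slowdown (e.g. thermal throttling)
# is visible, rather than only the average over 500 runs.
import torch

def time_per_iteration(flux_gemm_rs, x, iters=500):
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        flux_gemm_rs(x)
        end.record()
        end.synchronize()
        times_ms.append(start.elapsed_time(end))
    # Compare early vs. late iterations.
    print("first 10:", ["%.3f" % t for t in times_ms[:10]])
    print("last  10:", ["%.3f" % t for t in times_ms[-10:]])
    return times_ms
```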
Your question
There is a torch.Size([5120, 1024]) x torch.Size([8192, 1024]) gemm_rs op (fp16) in my project. I ran a benchmark on A100:
torch.Size([5120, 1024]) x torch.Size([8192, 1024]):
torch #0: gemm 0.358 ms, comm 0.416 ms, total 0.774 ms
torch #1: gemm 0.357 ms, comm 0.416 ms, total 0.773 ms
torch #2: gemm 0.354 ms, comm 0.418 ms, total 0.772 ms
torch #3: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
torch #4: gemm 0.359 ms, comm 0.414 ms, total 0.773 ms
torch #5: gemm 0.355 ms, comm 0.418 ms, total 0.772 ms
torch #6: gemm 0.361 ms, comm 0.412 ms, total 0.773 ms
torch #7: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
flux #0: gemm 0.386 ms, comm 0.138 ms, total 0.524 ms
flux #1: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
flux #2: gemm 0.382 ms, comm 0.142 ms, total 0.523 ms
flux #3: gemm 0.384 ms, comm 0.139 ms, total 0.523 ms
flux #4: gemm 0.387 ms, comm 0.136 ms, total 0.523 ms
flux #5: gemm 0.383 ms, comm 0.140 ms, total 0.523 ms
flux #6: gemm 0.388 ms, comm 0.135 ms, total 0.523 ms
flux #7: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
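For context, the torch rows above correspond to the unfused path that gemm_rs fuses: a matmul followed by a reduce-scatter. A rough per-rank sketch of that baseline with torch.distributed, assuming 8 fp16 ranks and the shapes from this issue:

```python
# Unfused reference path: cuBLAS gemm, then NCCL reduce-scatter.
# Shapes follow the issue: input [5120, 1024], weight [8192, 1024], fp16.
import torch
import torch.distributed as dist

def gemm_then_reduce_scatter(x, w, world_size):
    # x: [5120, 1024], w: [8192, 1024] -> partial output [5120, 8192]
    full_out = torch.matmul(x, w.t())
    # Each rank keeps a [5120 / world_size, 8192] shard, summed across ranks.
    out = torch.empty(full_out.shape[0] // world_size, full_out.shape[1],
                      dtype=full_out.dtype, device=full_out.device)
    dist.reduce_scatter_tensor(out, full_out)
    return out
```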
But in my project, the Flux elapsed time is over 900 us, and my nsys results are:
(nsys screenshot: my proj)
(nsys screenshot: benchmark)
We can see that the bytedance::flux::CudaIpcBarrierAllKernel times are not the same. How can I solve this problem?