Open · Rainlin007 opened this issue 2 months ago
I saw that the gemm time differs across GPUs, which then increases the synchronization time, but I could not find the reason why the gemm times differ.
You can check whether the frequencies of the different GPUs on the server are the same; some GPUs might have downclocked.
I have checked the frequency before, and it is indeed the same. I can also see that the times of the other kernels are similar; only this kernel shows a big difference. Have you encountered this before?
@zheng-ningxin
How many times did Flux loop in this profile?
About 500. @zheng-ningxin
Would you also observe this phenomenon when you use torch.profiler?
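A minimal sketch of such a check with torch.profiler, where `gemm_rs_op(inp)` is a placeholder for whatever Flux call the project actually uses:

```python
# Profile a few Flux gemm_rs iterations per rank and compare the kernel tables.
import torch
from torch.profiler import profile, ProfilerActivity

def run_profiled(gemm_rs_op, inp, iters=20):
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        for _ in range(iters):
            gemm_rs_op(inp)
        torch.cuda.synchronize()
    # Sort by CUDA time so the gemm and barrier kernels show up at the top.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
```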
@Rainlin007 For a long run, the GPU might adjust its frequency. You can use a tool like nvidia-smi to monitor the frequency and sample it over time to check for changes.
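A minimal sampling sketch with pynvml (the Python bindings behind nvidia-smi); the one-second interval and one-minute duration are arbitrary choices:

```python
# Sample the SM clock of every GPU once per second and print them side by side,
# so a rank that downclocks during the run is easy to spot.
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]
try:
    for _ in range(60):  # sample for ~1 minute while the benchmark runs
        clocks = [pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM)
                  for h in handles]
        print(" ".join(f"GPU{i}:{c}MHz" for i, c in enumerate(clocks)))
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()
```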
The difference you showed in your profiling does look quite big. Out of your 500 runs, which iteration does this screenshot belong to? Maybe check the later ones to see whether this is stable or only shows up occasionally.
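One way to check that is to time every iteration separately with CUDA events; a minimal sketch, with `flux_gemm_rs(x)` again standing in for the actual call:

```python
# Time each iteration so a gradual slowdown (e.g. thermal throttling)
# is visible, rather than only the average over 500 runs.
import torch

def time_per_iteration(flux_gemm_rs, x, iters=500):
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        flux_gemm_rs(x)
        end.record()
        end.synchronize()
        times_ms.append(start.elapsed_time(end))
    # Compare early vs. late iterations.
    print("first 10:", ["%.3f" % t for t in times_ms[:10]])
    print("last  10:", ["%.3f" % t for t in times_ms[-10:]])
    return times_ms
```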
Your question
There is a torch.Size([5120, 1024]) x torch.Size([8192, 1024]) gemm_rs op (fp16) in my project. I ran a benchmark on A100:
torch.Size([5120, 1024]) x torch.Size([8192, 1024]):
torch #0: gemm 0.358 ms, comm 0.416 ms, total 0.774 ms
torch #1: gemm 0.357 ms, comm 0.416 ms, total 0.773 ms
torch #2: gemm 0.354 ms, comm 0.418 ms, total 0.772 ms
torch #3: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
torch #4: gemm 0.359 ms, comm 0.414 ms, total 0.773 ms
torch #5: gemm 0.355 ms, comm 0.418 ms, total 0.772 ms
torch #6: gemm 0.361 ms, comm 0.412 ms, total 0.773 ms
torch #7: gemm 0.356 ms, comm 0.417 ms, total 0.773 ms
flux #0: gemm 0.386 ms, comm 0.138 ms, total 0.524 ms
flux #1: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
flux #2: gemm 0.382 ms, comm 0.142 ms, total 0.523 ms
flux #3: gemm 0.384 ms, comm 0.139 ms, total 0.523 ms
flux #4: gemm 0.387 ms, comm 0.136 ms, total 0.523 ms
flux #5: gemm 0.383 ms, comm 0.140 ms, total 0.523 ms
flux #6: gemm 0.388 ms, comm 0.135 ms, total 0.523 ms
flux #7: gemm 0.386 ms, comm 0.138 ms, total 0.523 ms
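For context, the torch rows above correspond to the unfused path that gemm_rs fuses: a matmul followed by a reduce-scatter. A rough per-rank sketch of that baseline with torch.distributed, assuming 8 fp16 ranks and the shapes from this issue:

```python
# Unfused reference path: cuBLAS gemm, then NCCL reduce-scatter.
# Shapes follow the issue: input [5120, 1024], weight [8192, 1024], fp16.
import torch
import torch.distributed as dist

def gemm_then_reduce_scatter(x, w, world_size):
    # x: [5120, 1024], w: [8192, 1024] -> partial output [5120, 8192]
    full_out = torch.matmul(x, w.t())
    # Each rank keeps a [5120 / world_size, 8192] shard, summed across ranks.
    out = torch.empty(full_out.shape[0] // world_size, full_out.shape[1],
                      dtype=full_out.dtype, device=full_out.device)
    dist.reduce_scatter_tensor(out, full_out)
    return out
```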
But in my project, the Flux elapsed time is over 900 us, and my nsys results are:
(nsys screenshot: my proj)
(nsys screenshot: benchmark)
We can see that the bytedance::flux::CudaIpcBarrierAllKernel times are not the same. How can I solve this problem?