Closed: nnaron closed this issue 2 years ago.
It has nothing to do with using tensor cores or not.
It is hard to run two short kernels in parallel in two different streams. The kernel launch overhead is not small, and it is possible that by the time the 2nd kernel is launched, the 1st one has already finished. The A100 is much faster than the V100, which makes kernel runtimes much shorter, so it is even more difficult to run two short kernels in parallel.
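A toy way to see this effect (not a real scheduler model; the ~10 us launch overhead and the kernel runtimes below are illustrative assumptions, not measurements):

```python
def can_overlap(first_kernel_us, launch_overhead_us):
    """Toy model: the 2nd kernel can only overlap the 1st if the 1st is
    still running once the 2nd kernel's launch has completed."""
    return first_kernel_us > launch_overhead_us

# Illustrative (assumed) numbers: ~10 us launch overhead, and a kernel that
# takes 15 us on V100 but only 5 us on the faster A100.
print(can_overlap(15, 10))  # V100-like: True, some overlap is possible
print(can_overlap(5, 10))   # A100-like: False, the 1st finishes before the 2nd starts
```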
My first kernel's execution time is 3.2 ms and the other's is 650 us, and the second one runs sequentially.
A 3.2 ms kernel sounds big; it may be taking all the SMs.
Yes, and for that reason I want to keep some resources free.
I have a question: how can I predict the improvement that may come from running 2 kernels? E.g., if I take 5% of the GEMM's resources away and run the other, sequential kernel concurrently, can I expect an improvement, or will I destroy the current performance?
Which mathematical model can help me understand the expected improvement?
For example, suppose you have 210 threadblocks and every SM can only run 1 threadblock at a time. A100 has 108 SMs, so you need ceil(210/108) = 2 waves.
If you are restricted to 100 SMs, you will need ceil(210/100) = 3 waves: the runtime becomes 150% of the original, i.e. a 50% slowdown.
If you are restricted to 105 SMs, you need ceil(210/105) = 2 waves: no perf difference.
This model ignores that waves can overlap, locality, power, etc.
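The wave arithmetic above can be written down directly (a deliberately simplified model, assuming 1 threadblock per SM as stated):

```python
import math

def waves(num_blocks, num_sms, blocks_per_sm=1):
    """Wave quantization: full passes needed to run all threadblocks."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))

print(waves(210, 108))  # 2 waves on all 108 A100 SMs
print(waves(210, 100))  # 3 waves when restricted to 100 SMs -> 1.5x runtime
print(waves(210, 105))  # 2 waves when restricted to 105 SMs -> no difference
```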
Thanks, that was really useful. How can I distinguish the threadblocks of each GEMM kernel (e.g. CUTLASS kernels for SGEMM, DGEMM, and mixed half/single GEMM)?
What do you mean? Every type of kernel (dgemm, sgemm, etc.) has many different implementations using different tile sizes.
I'm closing this as concurrent kernels depend on available SM resources and should usually be managed by the HW scheduler.
I am writing again because I did not understand this sentence well:
I'm closing this as concurrent kernels depend on available SM resources and should usually be managed by the HW scheduler.
Do you mean that, by using CUTLASS for GEMM, the programmer is not able to keep some SMs free? For example, keep 2 of the A100's 108 SMs free?
You can have some control over the number of SMs used with the launch configuration. I'm suggesting it's more efficient to put two kernels in separate CUDA streams and let the hardware scheduler manage resources.
I just need one free SM for the rest of my work. With 2 streams, the GEMM and the other kernel only sometimes run partially in parallel. I want to hide my critical path by running it in parallel with the GEMM.
The critical-path kernel needs very limited resources, and the problem is that while the second kernel is running, most of the GPU is free.
Then try a launch configuration that uses only enough threadblocks to occupy (total SMs - 2), but there are other considerations like shared memory. You may want to look into cuBLASLt, which allows you to designate the number of SMs used. And you may want to use streams with different priorities to ensure the desired flow is achieved.
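A minimal sketch of that sizing idea (a simplification: it assumes you know how many blocks each SM can run concurrently, which depends on occupancy; 108 is the A100 SM count from this thread):

```python
def grid_for_sm_budget(total_sms, reserved_sms, blocks_per_sm=1):
    """Largest single-wave grid that leaves `reserved_sms` SMs free,
    assuming the scheduler spreads blocks one-per-SM before doubling up."""
    return (total_sms - reserved_sms) * blocks_per_sm

print(grid_for_sm_budget(108, 2))  # 106 blocks, leaving ~2 A100 SMs free
```

In practice the hardware scheduler gives no hard guarantee about block placement, so this is a sizing heuristic, not a carveout.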
Thanks for the suggestion.
Do you mean using cublasSetSmCountTarget() with cuBLASLt to designate the number of SMs?
Yes, but I wasn't implying mixing that with CUTLASS. I think you'll need to use cublasLt.
Thanks. I see that we have this function for cuBLAS as well. I applied it to the GEMM and reduced the number of SMs.
original GEMM: grid (1248, 10, 10), block (128, 1, 1), occupancy 12.5%
GEMM with SMs reduced to 10: grid (2496, 20, 10), block (128, 1, 1), occupancy 18.75%
But the GEMM's behavior does not change to allow the other kernel to run in parallel with it; it is only partially parallel, like before.
Might cublasLt be different?
Note that although an SM carveout is not possible for Ampere and Volta, CUTLASS 3.x Hopper persistent kernels do support an SM carveout. You can simply set the sm_count of KernelHardwareInfo to whatever number of SMs you want the grid to use, so if you are running on an H100, you can make use of this via CUTLASS directly.
I don't see cublasLt being different, as it's called under the hood of cublas. It's hard to say what's going on without more knowledge of the program. I suggest you use the profiling tools Nsight Systems and Nsight Compute to better understand what resources are being used during kernel execution.
Thanks. I am running on an A100. @thakkarV So based on what you mentioned, the conclusion is that I am not able to free some resources for another kernel to run in parallel with GEMM (with cuBLAS and CUTLASS)?
I have attached some parts of the trace. The second picture shows 2 kernels that are the same; just one of them runs in parallel with the GEMM. And the last one is related to 2 other kernels that use the same amount of resources.
Those execution times are so small, it's possible they are latency bound.
I agree that the execution times are very small (also, the first small kernel in the upper trace (green rectangle) is not compute bound or memory bound). Those kernels work on a 128x128 matrix with a block algorithm. So if it is latency bound, why is the first small block (64x64) running in parallel, while the next kernel in the same stream, which works on a (64x64) block, is not running in parallel (we can see that part at the end of the GEMM)?
I think that when the last kernels inside the green rectangle are running, the GPU is not well utilized.
What you're seeing is called the tail effect or tail wave. You can analyze the long-running kernel with Nsight Compute. It's possible it is hogging one resource, which doesn't let other kernels run in parallel. You might also increase the priority of the bottom stream and see what happens.
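The tail effect can be quantified with the same toy wave model used earlier in this thread (again assuming 1 block per SM): only `num_blocks % num_sms` SMs have work during the last wave.

```python
def tail_utilization(num_blocks, num_sms):
    """Fraction of SMs busy during the last (tail) wave, 1 block per SM."""
    remainder = num_blocks % num_sms
    last_wave_blocks = remainder if remainder else num_sms
    return last_wave_blocks / num_sms

print(tail_utilization(210, 108))  # last wave keeps only 102 of 108 SMs busy
```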
I am seeing that on an A100 GPU I cannot run CUDA kernels in another stream in parallel with a Tensor Core GEMM, but on a V100 it sometimes happens. How can I clarify this behavior?
Also, I sometimes see a CUDA core GEMM running in parallel with a Tensor Core GEMM on the A100 in some MAGMA algorithms. But I am not able to write the same thing in my program, and for me, with different streams, they are not running in parallel.