Closed: nnaron closed this issue 2 years ago.
It has nothing to do with using tensor cores or not.
It is hard to run two short kernels in parallel in two different streams. The kernel launch overhead is not small, and it is possible that by the time the 2nd kernel is launched, the 1st one has already finished. The A100 is much faster than the V100, which makes kernel runtimes much shorter, so it is even more difficult to run two short kernels in parallel.
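A toy way to see this effect (not a real scheduler model; the ~10 us launch overhead and the kernel runtimes below are illustrative assumptions, not measurements):

```python
def can_overlap(first_kernel_us, launch_overhead_us):
    """Toy model: the 2nd kernel can only overlap the 1st if the 1st is
    still running once the 2nd kernel's launch has completed."""
    return first_kernel_us > launch_overhead_us

# Illustrative (assumed) numbers: ~10 us launch overhead, and a kernel that
# takes 15 us on V100 but only 5 us on the faster A100.
print(can_overlap(15, 10))  # V100-like: True, some overlap is possible
print(can_overlap(5, 10))   # A100-like: False, the 1st finishes before the 2nd starts
```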
My first kernel's execution time is 3.2 ms and the other's is 650 us, and the second one runs sequentially.
A 3.2 ms kernel sounds big; it may be taking all the SMs.
Yes, and for that reason I want to keep some resources free.
I have a question: how can I predict the improvement that may come from running 2 kernels? E.g., if I take 5% of the GEMM's resources away and run the other, sequential kernel concurrently, can I expect an improvement, or will I destroy the current performance?
Which mathematical model can help me understand the expected improvement?
For example, suppose you have 210 threadblocks and every SM can only run 1 threadblock at a time. A100 has 108 SMs, so you need ceil(210/108) = 2 waves.
If you are restricted to 100 SMs, you will need ceil(210/100) = 3 waves: the runtime becomes 150% of the original, i.e. a 50% slowdown.
If you are restricted to 105 SMs, you need ceil(210/105) = 2 waves: no perf difference.
This model ignores that waves can overlap, locality, power, etc.
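The wave arithmetic above can be written down directly (a deliberately simplified model, assuming 1 threadblock per SM as stated):

```python
import math

def waves(num_blocks, num_sms, blocks_per_sm=1):
    """Wave quantization: full passes needed to run all threadblocks."""
    return math.ceil(num_blocks / (num_sms * blocks_per_sm))

print(waves(210, 108))  # 2 waves on all 108 A100 SMs
print(waves(210, 100))  # 3 waves when restricted to 100 SMs -> 1.5x runtime
print(waves(210, 105))  # 2 waves when restricted to 105 SMs -> no difference
```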
Thanks, that was really useful. How can I distinguish the threadblocks of each GEMM kernel (e.g. CUTLASS kernels for SGEMM, DGEMM, and mixed half/single GEMM)?
What do you mean? Every type of kernel (dgemm, sgemm, etc.) has many different implementations using different tile sizes.
I'm closing this as concurrent kernels depend on available SM resources and should usually be managed by the HW scheduler.
I am writing again because I did not understand this sentence well:
I'm closing this as concurrent kernels depend on available SM resources and should usually be managed by the HW scheduler.
Do you mean that, by using CUTLASS for GEMM, the programmer is not able to keep some SMs free? For example, keep 2 of the A100's 108 SMs free?
You can have some control over the number of SMs used with the launch configuration. I'm suggesting it's more efficient to put two kernels in separate CUDA streams and let the hardware scheduler manage resources.
I just need one free SM for the rest of my work. With 2 streams, the GEMM and the other kernel only sometimes run partially in parallel. I want to hide my critical path by running it in parallel with the GEMM.
The critical-path kernel needs very limited resources, and the problem is that while the second kernel is running, most of the GPU is free.
Then try a launch configuration that uses only enough threadblocks to occupy (total SMs - 2), but there are other considerations like shared memory. You may want to look into cuBLASLt, which allows you to designate the number of SMs used. And you may want to use streams with different priorities to ensure the desired flow is achieved.
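A minimal sketch of that sizing idea (a simplification: it assumes you know how many blocks each SM can run concurrently, which depends on occupancy; 108 is the A100 SM count from this thread):

```python
def grid_for_sm_budget(total_sms, reserved_sms, blocks_per_sm=1):
    """Largest single-wave grid that leaves `reserved_sms` SMs free,
    assuming the scheduler spreads blocks one-per-SM before doubling up."""
    return (total_sms - reserved_sms) * blocks_per_sm

print(grid_for_sm_budget(108, 2))  # 106 blocks, leaving ~2 A100 SMs free
```

In practice the hardware scheduler gives no hard guarantee about block placement, so this is a sizing heuristic, not a carveout.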
Thanks for the suggestion.
Do you mean using cublasSetSmCountTarget() with cuBLASLt to designate the number of SMs?
Yes, but I wasn't implying mixing that with CUTLASS. I think you'll need to use cublasLt.
Thanks. I see that we have this function for cuBLAS as well. I applied it to the GEMM and reduced the number of SMs.
original GEMM: grid (1248, 10, 10), block (128, 1, 1), occupancy 12.5%
GEMM with SMs reduced to 10: grid (2496, 20, 10), block (128, 1, 1), occupancy 18.75%
But the GEMM's behavior does not change to allow the other kernel to run in parallel with it; it is only partially parallel, like before.
Might cublasLt be different?
Note that although an SM carveout is not possible for Ampere and Volta, CUTLASS 3.x Hopper persistent kernels do support an SM carveout. You can simply set the sm_count of KernelHardwareInfo to whatever number of SMs you want the grid to use, so if you are running on an H100, you can make use of this via CUTLASS directly.
I don't see cublasLt being different, as it's called under the hood of cublas. It's hard to say what's going on without more knowledge of the program. I suggest you use the profiling tools Nsight Systems and Nsight Compute to better understand what resources are being used during kernel execution.
Thanks. I am running on an A100. @thakkarV So based on what you mentioned, the conclusion is that I am not able to free some resources for another kernel to run in parallel with GEMM (with cuBLAS and CUTLASS)?
I have attached some parts of the trace. The second picture shows 2 kernels that are the same; just one of them runs in parallel with the GEMM. And the last one is related to 2 other kernels that use the same amount of resources.
Those execution times are so small, it's possible they are latency bound.
I agree that the execution times are very small (also, the first small kernel in the upper trace (green rectangle) is not compute bound or memory bound). Those kernels work on a 128x128 matrix with a block algorithm. So if it is latency bound, why is the first small block (64x64) running in parallel, while the next kernel in the same stream, which works on a (64x64) block, is not running in parallel (we can see that part at the end of the GEMM)?
I think that when the last kernels inside the green rectangle are running, the GPU is not well utilized.
What you're seeing is called the tail effect or tail wave. You can analyze the long-running kernel with Nsight Compute. It's possible it is hogging one resource, which doesn't let other kernels run in parallel. You might also increase the priority of the bottom stream and see what happens.
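The tail effect can be quantified with the same toy wave model used earlier in this thread (again assuming 1 block per SM): only `num_blocks % num_sms` SMs have work during the last wave.

```python
def tail_utilization(num_blocks, num_sms):
    """Fraction of SMs busy during the last (tail) wave, 1 block per SM."""
    remainder = num_blocks % num_sms
    last_wave_blocks = remainder if remainder else num_sms
    return last_wave_blocks / num_sms

print(tail_utilization(210, 108))  # last wave keeps only 102 of 108 SMs busy
```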
I am seeing that on an A100 GPU I cannot run CUDA kernels in another stream in parallel with a Tensor Core GEMM, but on a V100 it sometimes happens. How can I clarify this behavior?
Also, I sometimes see a CUDA core GEMM running in parallel with a Tensor Core GEMM on the A100 in some MAGMA algorithms. But I am not able to write the same thing in my program, and for me, with different streams, they are not running in parallel.