irasin opened this issue 1 month ago
> you are likely thrashing the L2 locality by not doing any block ID remapping / swizzling
Hi @thakkarV, thanks for your reply. The performance is still bad when I remove the thread block swizzle.
I guess this abnormal data reading must be related to the cp.async global-to-shared (g2s) copy, but I don't know the specific cause of the problem.
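For context, the g2s path in my kernel follows the usual CuTe cp.async pattern, roughly like the simplified sketch below (the names `load_tile_g2s` and `kStages`, and the use of an `SM80_CP_ASYNC_CACHEGLOBAL` copy atom to build the `TiledCopy`, are illustrative simplifications, not my exact code):

```cpp
// Simplified sketch of a typical CuTe global->shared (g2s) path on SM8x:
// a cp.async TiledCopy, with fence/wait so the copies overlap with compute.
#include <cute/tensor.hpp>
using namespace cute;

template <class TiledCopyG2S, class GTensor, class STensor>
__device__ void load_tile_g2s(TiledCopyG2S const& g2s, GTensor gA, STensor sA) {
  auto thr_copy = g2s.get_slice(threadIdx.x);   // assumes a 1-D thread block
  auto tAgA = thr_copy.partition_S(gA);         // this thread's piece of the gmem tile
  auto tAsA = thr_copy.partition_D(sA);         // this thread's piece of the smem tile
  copy(g2s, tAgA, tAsA);                        // issues cp.async instructions
  cp_async_fence();                             // commit this batch of cp.async
}

// ...later, before consuming the tile (kStages = number of pipeline stages):
//   cp_async_wait<kStages - 2>();
//   __syncthreads();
```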
And I found that if I change `constexpr int BN = 128;` to `constexpr int BN = 256;`, the abnormal data reading disappears. Here is the ncu result.
The data read from gmem into the L2 cache is still more than for the cuBLAS version, but I think it is acceptable here.
And the throughput results for different large input sizes look good too.
However, I have no idea how the block tile shape affects the L2 cache locality here. Is there something wrong in my code, or is this a bug? Can anyone help?
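For reference, my rough back-of-the-envelope (an assumption on my part, not something ncu reports) is that every block streams a BM x K strip of A and a BN x K strip of B through L2 once, so the read traffic requested from L2 is about `M*N*K*(1/BM + 1/BN)` elements:

```cpp
// Back-of-the-envelope only: assumes every block re-reads its full A/B strips
// and ignores L2 hits across concurrently running blocks (which is exactly
// what block scheduling / swizzling influences).
#include <cstdio>

int main() {
  const double M = 16384, N = 16384, K = 16384, elem_bytes = 2;  // fp16
  const double BM = 128;
  const double BNs[] = {128, 256};
  for (double BN : BNs) {
    double bytes = M * N * K * (1.0 / BM + 1.0 / BN) * elem_bytes;
    std::printf("BM=%.0f BN=%.0f -> %.0f GiB of read requests into L2\n",
                BM, BN, bytes / (1024.0 * 1024.0 * 1024.0));
  }
  // Prints 128 GiB for BN=128 and 96 GiB for BN=256, before any L2 reuse.
  return 0;
}
```

So the tile shape directly changes the requested traffic; how much of it then reaches DRAM depends on L2 hits across concurrently resident blocks, which is the part I don't understand.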
By "block ID remapping / swizzling", Vijay meant the Tile Schedulers that are part of CUTLASS and cuBLAS, not the swizzling on the data or threadblocks. CUTLASS and cuBLAS have many Tile Schedulers including Split-K and Stream-K and other skews on the block ID assignment to work tile. Many of these strategies are targeted at increasing L2 locality. So cuBLAS uses a heuristic to choose the best Tile Scheduler for your problem, which certainly changes with problem size.
For more information, I recommend the (full version of our) Stream-K paper: https://arxiv.org/pdf/2301.03598
Hi @ccecka, can I ask a basic question? I skimmed through the Stream-K paper and my (rudimentary) understanding is that it beats cuBLAS by minimizing the tail effect better. In this situation, where the GEMM size is large, I assume there are lots of threadblocks, so would we expect the performance of cuBLAS to be more or less the same as Stream-K's?
In other words, does Stream-K outperform cuBLAS for GEMMs with a large number of blocks?
Thanks!
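To put rough numbers on that assumption (my own count, which may be off): with m = n = k = 16384 and, say, 128x256 output tiles, there are (16384/128) * (16384/256) = 8192 work tiles, while an A10 has 72 SMs. Even at one block per SM that is over 100 full waves, so the partial last wave is on the order of 1% of the work, which is why I would expect the tail effect to be small here.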
Any update? Should I take this as the expected behavior for this CuTe kernel?
What is your question?
I am learning to use CuTe to build an hgemm kernel. Tested on an A10 GPU, the kernel is good at small problem sizes such as m/n/k = 4096, but I found it is much slower than the cuBLAS kernel at problem size m/n/k = 16384/16384/16384, as shown below.
Here is the profile result from ncu for problem m/n/k = 16384/16384/16384.
![image](https://github.com/NVIDIA/cutlass/assets/25549893/557b0712-33ab-44e7-afc3-47ca4d28b762)
And I found the biggest difference between my CuTe kernel and the cuBLAS kernel is in the memory chart.
cuBLAS kernel:
![image](https://github.com/NVIDIA/cutlass/assets/25549893/ad1259ed-535a-4981-9d0a-8f206d4afca0)
my CuTe kernel:
![image](https://github.com/NVIDIA/cutlass/assets/25549893/f445e637-ed3f-45b2-8ddd-bd70c4750064)
I was wondering why my CuTe kernel has so much more gmem→L2 and L2→shared data movement than the cuBLAS kernel, and how I should modify the kernel to improve performance at large problem sizes.
Here is my CuTe kernel: