NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[QST] Universal convolution supports for sm70/80 using Cute? #1785

Open Zxzzzzz opened 1 month ago

Zxzzzzz commented 1 month ago

What is your question?

I've read example 59, and it seems there is an easy and elegant way to assemble a conv kernel using CuTe, but the conv params are assumed to be known at compile time. If I make these params runtime-determined, the tiling no longer works when Shape<n, p, q> is not divisible by the tiler (e.g., NPQ is Shape<128, 14, 14> while the tiler is Shape<_128>; actually it should be QPN Shape<14, 14, 128>, since the tiler always starts tiling from the first mode).

I would like to assemble a conv kernel using CuTe that handles arbitrary padding/stride, or in other words, where R/S/P/Q are determined at runtime. It seems only the sm90 conv features handle these general cases, but I can't use them on my workstation. I tried to migrate the code to sm70/80, but the TiledCopy for im2col is complicated; I can't understand the parts that linearize the NPQ shape and strides, and they seem tied to sm90 hardware intrinsics.

I would greatly appreciate any approach to writing a universal conv kernel with CuTe that avoids the sm90 im2col intrinsics.

Thank you so much!

thakkarV commented 1 month ago

Pretty hard to handle dynamic shapes for conv performantly. You'll have to use all the tricks from our GTC 2020 talk and the linearization performed for the SM90 TMA-based kernels. This is quite non-trivial. Good luck!

github-actions[bot] commented 2 weeks ago

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.