NVIDIA / cutlass

CUDA Templates for Linear Algebra Subroutines

[FEA] make_tma_copy does not support shared-to-shared copy #1657

Open ziyuhuang123 opened 1 month ago

ziyuhuang123 commented 1 month ago

Describe the bug

In PTX, I noticed that

cp.async.bulk.dst.src.completion_mechanism [dstMem], [srcMem], size, [mbar]

.dst =                  { .shared::cluster }
.src =                  { .shared::cta }
.completion_mechanism = { .mbarrier::complete_tx::bytes }

supports shared-to-shared copies. But in CuTe, make_tma_copy only supports global-to-shared copies. Why? Could you extend it to cover this case?
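For readers who want to try the instruction form quoted above without make_tma_copy, here is a minimal sketch that issues it through inline PTX. The helper names `map_to_cluster_rank` and `bulk_copy_smem_to_cluster_smem` are hypothetical and are not part of CuTe or CUTLASS. The sketch assumes an sm_90 target, a kernel launched with a thread block cluster, a 16-byte-aligned source and destination, a size that is a multiple of 16, and an mbarrier that the caller has already initialized and armed with the expected byte count. Waiting on the mbarrier is left to the caller.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: translate a pointer into this CTA's shared memory into
// the shared::cluster address of the same offset in the CTA with rank `cta_rank`.
__device__ inline uint32_t map_to_cluster_rank(const void* smem_ptr, uint32_t cta_rank) {
    uint32_t result = 0;
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    // 32-bit shared::cta address of the local object.
    uint32_t local = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
    asm volatile("mapa.shared::cluster.u32 %0, %1, %2;\n"
                 : "=r"(result)
                 : "r"(local), "r"(cta_rank));
#endif
    return result;
}

// Hypothetical helper: one bulk asynchronous copy from this CTA's shared memory
// to a shared::cluster destination. Completion is reported as transaction bytes
// on the mbarrier at `mbar_cluster_addr`.
// Assumptions: addresses are 16-byte aligned and `bytes` is a multiple of 16.
__device__ inline void bulk_copy_smem_to_cluster_smem(uint32_t dst_cluster_addr,
                                                      const void* src_smem,
                                                      uint32_t bytes,
                                                      uint32_t mbar_cluster_addr) {
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
    uint32_t src = static_cast<uint32_t>(__cvta_generic_to_shared(src_smem));
    asm volatile(
        "cp.async.bulk.shared::cluster.shared::cta.mbarrier::complete_tx::bytes "
        "[%0], [%1], %2, [%3];\n"
        :
        : "r"(dst_cluster_addr), "r"(src), "r"(bytes), "r"(mbar_cluster_addr)
        : "memory");
#endif
}
```

A single thread would typically call `map_to_cluster_rank` on both the destination buffer and the mbarrier with the same peer rank, then call `bulk_copy_smem_to_cluster_smem`. Note that this is a plain 1D byte copy with no tensor map, so it does not provide the multi-dimensional tiling that a TMA descriptor built by make_tma_copy carries.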

thakkarV commented 1 month ago

This is not a bug. We have not needed smem to smem TMA copies. Do you have a specific need for this?
