Closed mmigdal-nv closed 1 year ago
`ptxas` cannot do this optimization itself, because the new version has the side effect of writing zeroes to the destination shared memory when the load is predicated off. For matmuls that is fine, but we might need to make this into a separate function?
Reopened this PR to trigger a new CI run
This commit moves the predicate for the `cp.async` load into one of the instruction's arguments. The predicate argument is `ignore-src`, hence the switch from `ne` to `eq`.

Here is the SASS generated (sm_80):

Before
After
This change skips the branching.
`cp.async.cg` only works with 16-byte accesses, so I changed the assertion too.
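
For readers without the diff at hand, the two forms can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper names `cpAsyncCgGuarded` and `cpAsyncCgIgnoreSrc`, the `kBytes` template parameter, and the exact assertion message are hypothetical. It assumes an sm_80 target and a toolchain whose PTX ISA version accepts the `ignore-src` operand on `cp.async`.

```cuda
// Hypothetical helpers for illustration only; names are not from the PR.

// "Before" form: the instruction itself is guarded by @p, so when
// `predicate` is false nothing is issued and the shared-memory
// destination is left untouched.
template <int kBytes>
__device__ inline void cpAsyncCgGuarded(
    unsigned smem_addr, const void* gmem_ptr, bool predicate) {
  static_assert(kBytes == 16, "cp.async.cg only supports 16-byte copies");
  asm volatile(
      "{\n"
      "  .reg .pred p;\n"
      "  setp.ne.b32 p, %3, 0;\n"  // p = predicate
      "  @p cp.async.cg.shared.global [%0], [%1], %2;\n"
      "}\n"
      :
      : "r"(smem_addr), "l"(gmem_ptr), "n"(kBytes),
        "r"(static_cast<int>(predicate))
      : "memory");
}

// "After" form: the copy is always issued and the predicate is passed
// as the trailing ignore-src operand. ignore-src is true when the load
// should be skipped, hence setp.eq instead of setp.ne. When the source
// is ignored, zeroes are written to the destination.
template <int kBytes>
__device__ inline void cpAsyncCgIgnoreSrc(
    unsigned smem_addr, const void* gmem_ptr, bool predicate) {
  static_assert(kBytes == 16, "cp.async.cg only supports 16-byte copies");
  asm volatile(
      "{\n"
      "  .reg .pred p;\n"
      "  setp.eq.b32 p, %3, 0;\n"  // p = !predicate (ignore-src)
      "  cp.async.cg.shared.global [%0], [%1], %2, p;\n"
      "}\n"
      :
      : "r"(smem_addr), "l"(gmem_ptr), "n"(kBytes),
        "r"(static_cast<int>(predicate))
      : "memory");
}
```

In both sketches `smem_addr` is assumed to be a 32-bit shared-space address, for example obtained with `static_cast<unsigned>(__cvta_generic_to_shared(ptr))`. The zero-fill behaviour of the second form is the side effect mentioned earlier in the thread: callers that rely on the destination keeping its old contents when the predicate is false would need the guarded form.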