csarofeen / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration
http://pytorch.org
Other
26 stars 7 forks source link

Moving predicate to cp.async args #2484

Closed mmigdal-nv closed 1 year ago

mmigdal-nv commented 1 year ago

This commits moves the predicate for the cp.async load to one of the arguments. Predicate arg is ignore-src thus why the ne to eq switch.

Here is the SASS generated (sm_80): Before

 BSSY B0, `(.L_x_0) 
 SGXT R5, R5, 0x8 
 ISETP.NE.U32.AND P0, PT, R5, RZ, PT 
 @!P0 BRA `(.L_x_1) 
 ULDC.64 UR4, c[0x0][0x118] 
 @!PT LDS RZ, [RZ] 
 @!PT LDS RZ, [RZ] 
 @!PT LDS RZ, [RZ] 
 LDGSTS.E.128 [R4], [R6.64] 

After

 SGXT R5, R5, 0x8 
 ULDC.64 UR4, c[0x0][0x118] 
 ISETP.NE.U32.AND P0, PT, R5, RZ, PT 
 @!PT LDS RZ, [RZ] 
 @!PT LDS RZ, [RZ] 
 @!PT LDS RZ, [RZ] 
 LDGSTS.E.128 [R4], [R6.64], P0 

This changes skips the branching.

cp.async.cg only works with 16 bytes accesses, so I changed the assertion too.

mmigdal-nv commented 1 year ago

ptxas cannot do this optimization as the new version has the side effect of writing zeroes to the destination shared memory, when predicated. For matmuls it is fine, but we might need to make this into a separate function?

zasdfgbnm commented 1 year ago

Reopened this PR to trigger a new CI run