Closed: htyu closed this 7 months ago
Approved: @htyu I see you've done a lot of work regarding tt.trans(). Just out of curiosity, why can't you transpose the tensor before feeding it to the FA or GEMM block?
Thanks for reviewing the patch! It's a good question. Our model runs two attention kernels, one forward and one backward, and the transposition is only used in the backward kernel.
The change enables stream pipelining for dot operations that involve a transposition, in particular tt.dot(Q, tt.trans(K)). It also fixes a performance issue where incorrect swizzling code was emitted based on the pre-transpose layout.
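For context, here is a minimal sketch of the pattern this targets (the kernel, names, and shapes are illustrative assumptions, not the PR's actual kernel): a loop over K tiles whose dot consumes an in-kernel transpose, which lowers to tt.dot(Q, tt.trans(K)) in Triton IR.

```python
import triton
import triton.language as tl

@triton.jit
def qk_scores_kernel(q_ptr, k_ptr, out_ptr,
                     N_CTX: tl.constexpr,
                     BLOCK_M: tl.constexpr,
                     BLOCK_N: tl.constexpr,
                     BLOCK_D: tl.constexpr):
    offs_m = tl.arange(0, BLOCK_M)
    offs_d = tl.arange(0, BLOCK_D)
    # Q tile stays resident across the loop.
    q = tl.load(q_ptr + offs_m[:, None] * BLOCK_D + offs_d[None, :])
    for start_n in range(0, N_CTX, BLOCK_N):
        offs_n = start_n + tl.arange(0, BLOCK_N)
        # K is loaded row-major as (BLOCK_N, BLOCK_D) and transposed in-kernel:
        # this dot lowers to the tt.dot(Q, tt.trans(K)) pattern in question.
        k = tl.load(k_ptr + offs_n[:, None] * BLOCK_D + offs_d[None, :])
        scores = tl.dot(q, tl.trans(k))  # (BLOCK_M, BLOCK_N), fp32 accumulator
        tl.store(out_ptr + offs_m[:, None] * N_CTX + offs_n[None, :], scores)
```

The loads of k in each iteration are what the stream pipeliner prefetches ahead of the dot; the interposed transpose is why both the pipelining and the swizzling logic need to account for the post-transpose layout.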