Closed Oleg-Goncharov closed 3 months ago
/te-ci
/te-ci
Hi, I think templating DACT and DBIAS is a good idea. Great work!
I have a small concern: we now have `gated_act_cast_transpose.cu`, but `act_cast_transpose` is still hidden in `cast_transpose.cu`. I think it is better to either keep all related cast-transpose functions in one place or split both the `gated_act` and `act` related functions into two additional files.
Agreed. To keep it consistent with the other cast-transpose files, I reverted the split. There is a single file, `cast_transpose_fusion.cu`, as before.
/te-ci
Description
The existing code of the fused cast-transpose kernels is replicated for different scenarios (i.e., +dbias, +dactivation) with only small, scenario-specific modifications. Replacing it with a single function template makes the code easier to maintain and makes it simpler to add new features (e.g., scaling, JIT compilation).
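To illustrate the idea of folding the replicated variants into one function template, here is a minimal host-side C++ sketch. It is not the Transformer Engine kernel: the names `cast_transpose_fused`, `DReLU`, `IS_DBIAS`, and `IS_DACT` are hypothetical, the "cast" is modelled as a plain multiplication by `scale`, and the loop nest stands in for the tiled CUDA kernel. It only shows how compile-time flags can select the +dbias and +dactivation paths from a single body.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical activation-derivative functor (an assumption for this sketch,
// not a Transformer Engine API): derivative of ReLU.
struct DReLU {
  static float apply(float x) { return x > 0.0f ? 1.0f : 0.0f; }
};

// One template replaces the separate "plain", "+dbias" and "+dactivation"
// variants. IS_DBIAS and IS_DACT are compile-time flags, so the branches that
// a given instantiation does not need are discarded by `if constexpr`.
//   input, act_input : rows x cols, row-major
//   out_t            : cols x rows, row-major (the transposed, scaled result)
//   dbias            : cols entries, column-wise sum of the unscaled values
template <bool IS_DBIAS, bool IS_DACT, typename DAct = DReLU>
void cast_transpose_fused(const std::vector<float>& input,
                          [[maybe_unused]] const std::vector<float>& act_input,
                          std::size_t rows, std::size_t cols, float scale,
                          std::vector<float>& out_t,
                          [[maybe_unused]] std::vector<float>& dbias) {
  out_t.assign(rows * cols, 0.0f);
  if constexpr (IS_DBIAS) {
    dbias.assign(cols, 0.0f);
  }
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      float v = input[r * cols + c];
      if constexpr (IS_DACT) {
        // Fuse the activation backward pass into the same loop.
        v *= DAct::apply(act_input[r * cols + c]);
      }
      if constexpr (IS_DBIAS) {
        // Fuse the bias-gradient reduction over rows.
        dbias[c] += v;
      }
      // "Cast" (modelled here as scaling only) and write transposed.
      out_t[c * rows + r] = v * scale;
    }
  }
}
```

Under these assumptions, `cast_transpose_fused<false, false>` behaves as a plain cast-transpose and `cast_transpose_fused<true, true>` adds both fused steps, without duplicating the loop body.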
Type of change: code refactoring
Changes:
gated_act_cast_transpose.cu
Changed the `n_warps_per_tile` parameter from 4 to 8, which slightly improves performance on the H100 HBM3

The following table provides the runtime of the previous and the new versions of the fused cast-transpose kernels on the H100 HBM3. The new version is benchmarked for two values of the `n_warps_per_tile` parameter (4 and 8):