Don't have time to add it to your systems in place, but this 3.5x the FLOPs for a very skinny matmul (cached KV inference) and should 4x decrease the model checkpoint size. Need to change it a bit more to add better absmax calculation (probably vectorwise instead of the obviously unoptimal global) but the MAE is very reasonable for the setup shown.
Don't have time to add it to your systems in place, but this 3.5x the FLOPs for a very skinny matmul (cached KV inference) and should 4x decrease the model checkpoint size. Need to change it a bit more to add better absmax calculation (probably vectorwise instead of the obviously unoptimal global) but the MAE is very reasonable for the setup shown.