In order to fully emulate hardware, I am adding RMSNorm recompute, which defers the division by the RMS until after the multiplication by W. This means that for pre-LN, the division happens after the Q, K, and V linears in attention and after the up linear in the MLP.
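For reference, a minimal PyTorch sketch of what this reordering might look like for the attention path (the names `RMSNormRecompute` and `qkv_with_recompute` are hypothetical and only for illustration, not the actual implementation):

```python
import torch
import torch.nn as nn


class RMSNormRecompute(nn.Module):
    """Sketch of an RMSNorm whose division by the RMS is deferred
    until after the following linear layer (hardware ordering)."""

    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))  # RMSNorm gain g
        self.eps = eps

    def scale(self, x: torch.Tensor) -> torch.Tensor:
        # Apply only the gain; the division by the RMS is deferred.
        return x * self.weight

    def rms(self, x: torch.Tensor) -> torch.Tensor:
        # Per-token RMS, divided out later (after the linear).
        return torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)


def qkv_with_recompute(x, norm: RMSNormRecompute, wq, wk, wv):
    """Pre-LN input path with the deferred division: (x * g) goes
    through the Q/K/V linears first, then each result is divided by
    the RMS of the pre-norm activations."""
    h = norm.scale(x)   # gain applied, division deferred
    r = norm.rms(x)     # per-token scalar
    q = wq(h) / r       # division happens after the Q linear
    k = wk(h) / r       # ... and after the K linear
    v = wv(h) / r       # ... and after the V linear
    return q, k, v


# Example usage (shapes illustrative; bias-free linears keep the
# reordering mathematically equivalent to the standard ordering):
dim = 16
norm = RMSNormRecompute(dim)
wq = nn.Linear(dim, dim, bias=False)
wk = nn.Linear(dim, dim, bias=False)
wv = nn.Linear(dim, dim, bias=False)
x = torch.randn(2, 8, dim)
q, k, v = qkv_with_recompute(x, norm, wq, wk, wv)
```

Because the RMS is a per-token scalar, dividing after a bias-free linear is mathematically equivalent to dividing before it; what changes are the intermediate values the hardware (and any quantization) actually sees, which is the point of emulating this ordering.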
I also added recompute quantization, as per https://drive.google.com/drive/u/1/folders/1tOjBEBoXytUgU7R95aqWI6deCwyPAWjl