All tests pass.
LayerNorm has a 3% higher loss than L2-ScaleNorm, and L1-ScaleNorm has an 8% higher loss than L2-ScaleNorm.
Regardless of these results, I added an option to change the normalization's power and to optionally centralize (mean-subtract) the inputs; a sketch of the idea follows below.
ScaleNorm: `y = g * x / ||x||_2` (a single learned scalar `g`)
LayerNorm: `y = gamma * (x - mean) / std + beta` (per-feature gain and bias)
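A minimal sketch of the generalized version, assuming PyTorch; the parameter names `power` and `centralize` are illustrative, not necessarily this PR's actual API:

```python
import torch
from torch import nn


class GeneralScaleNorm(nn.Module):
    # Illustrative module, not the PR's implementation: normalizes by the
    # p-norm over the last dimension with a single learned scalar gain.
    def __init__(self, power: float = 2.0, centralize: bool = False, eps: float = 1e-6):
        super().__init__()
        self.power = power            # 2.0 -> classic L2-ScaleNorm, 1.0 -> L1 variant
        self.centralize = centralize  # True -> subtract the mean first, as LayerNorm does
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(()))  # one scalar, not a per-feature gain

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.centralize:
            x = x - x.mean(dim=-1, keepdim=True)
        norm = x.norm(p=self.power, dim=-1, keepdim=True)
        return self.scale * x / (norm + self.eps)
```

With `power=2` and `centralize=False` this reduces to plain L2-ScaleNorm; the single scalar gain and the lack of per-feature statistics are presumably where the speedup below comes from.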
Otherwise, both models have identical settings, so purely by swapping LayerNorm for ScaleNorm (both L2), the model becomes 25% faster while achieving the same (or better) convergence.
L1-ScaleNorm runs at the same speed but has a worse loss early in training.
Additionally, this PR makes the grad checks tougher to pass: every model now receives not only random parameters but also a random output gradient. This ensures we use the output gradient correctly, as our custom_grad functions could otherwise ignore it and still pass.
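A minimal sketch of the idea behind such a check, assuming PyTorch; `grads_match` and its tolerance are hypothetical names for illustration, not the PR's test code:

```python
import torch


def grads_match(custom: torch.nn.Module, reference: torch.nn.Module,
                x: torch.Tensor, atol: float = 1e-5) -> bool:
    # Hypothetical check: both modules see the same random input *and* the
    # same random output gradient, so a custom backward that ignores the
    # incoming gradient can no longer pass by accident.
    x0 = x.clone().requires_grad_(True)
    x1 = x.clone().requires_grad_(True)
    y0, y1 = custom(x0), reference(x1)
    grad_out = torch.randn_like(y0)  # random cotangent, not torch.ones_like(y0)
    y0.backward(grad_out)
    y1.backward(grad_out.clone())
    # Parameter gradients could be compared the same way.
    return torch.allclose(x0.grad, x1.grad, atol=atol)
```

With an all-ones cotangent (or a scalar loss), a backward that silently drops the incoming gradient can still look correct; a random cotangent catches that failure mode.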