HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU-Training
BSD 2-Clause "Simplified" License

Don't decay mixer #100

Closed ClashLuke closed 1 year ago

ClashLuke commented 1 year ago

The mixer is the only parameter group we can't decay. Skipping weight decay on the mixer alone performs the same as decaying only the input and output, which in turn performs better than decaying all non-norm parameters. This is likely why we couldn't reach the performance of the previous shampoo run again. However, I'm still in favor of merging adam-square, as it's simpler than shampoo and significantly faster.
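
For illustration only, here is a minimal sketch of what excluding the mixer from weight decay could look like. It is not taken from the Olmax code: the parameter names (`block0/mixer`, `block0/norm_scale`, and so on), the helpers `decay_mask` and `apply_weight_decay`, and the substring-based matching are all assumptions, and the decay step is the generic decoupled form `p <- p - lr * wd * p` rather than the repository's actual optimizer logic.

```python
# Hypothetical parameter names; the real Olmax naming scheme may differ.
def decay_mask(params, skip=("mixer", "norm")):
    """Return {name: bool}; True means weight decay is applied to that parameter."""
    return {name: not any(tag in name for tag in skip) for name in params}


def apply_weight_decay(params, mask, learning_rate, weight_decay):
    """Decoupled weight decay: p <- p - lr * wd * p, only where mask[name] is True."""
    return {
        name: (p - learning_rate * weight_decay * p) if mask[name] else p
        for name, p in params.items()
    }


# Scalars stand in for weight tensors; any type supporting * and - would work.
params = {
    "input_embedding": 1.0,
    "block0/mixer": 1.0,
    "block0/norm_scale": 1.0,
    "block0/bottleneck_conv": 1.0,
    "output_embedding": 1.0,
}

mask = decay_mask(params)  # mixer and norm parameters are left undecayed
decayed = apply_weight_decay(params, mask, learning_rate=0.1, weight_decay=0.5)
```

Under the same assumptions, passing `skip=("norm",)` would correspond to the "decay all non-norm parameters" configuration, which makes it easy to compare the variants discussed above.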