It's the only one we can't decay. Not decaying the mixer performs the same as decaying only the input and output, which in turn performs better than decaying all non-norm parameters. This is likely why we couldn't reproduce the performance of the previous shampoo run. However, I'm still in favor of merging adam-square, as it's simpler than shampoo and significantly faster.
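For reference, here is a rough sketch of how the three decay setups compared above could be expressed as parameter groups. Everything in it is an assumption for illustration: the policy names, the `split_decay_groups` helper, and the parameter names (`input`, `output`, `mixer`, `norm`) are hypothetical and not taken from the actual codebase, and it assumes the no-mixer setup still leaves norm parameters undecayed.

```python
from typing import Dict, Iterable, List

# Hypothetical policy names, chosen only for this sketch.
POLICIES = ("input_output_only", "all_non_norm", "no_mixer")


def split_decay_groups(names: Iterable[str], policy: str) -> Dict[str, List[str]]:
    """Partition parameter names into decayed and non-decayed groups.

    Matching is done on substrings of the parameter name, which is an
    assumption about the naming scheme, not the real model layout.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy!r}")

    groups: Dict[str, List[str]] = {"decay": [], "no_decay": []}
    for name in names:
        is_norm = "norm" in name
        is_mixer = "mixer" in name
        is_io = name.startswith(("input", "output"))

        if policy == "input_output_only":
            decay = is_io
        elif policy == "all_non_norm":
            decay = not is_norm
        else:  # "no_mixer": decay everything except norm and mixer parameters
            decay = not is_norm and not is_mixer

        groups["decay" if decay else "no_decay"].append(name)
    return groups


# Made-up parameter names for demonstration.
example_names = [
    "input.embedding",
    "block0.mixer.weight",
    "block0.feed_forward.weight",
    "block0.norm.scale",
    "output.projection",
]

for policy in POLICIES:
    print(policy, split_decay_groups(example_names, policy)["decay"])
```

The two resulting lists would then be handed to the optimizer as separate groups, one with the weight decay coefficient and one with it set to zero.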