HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU-Training
BSD 2-Clause "Simplified" License

Optimizer Grafting #35

Closed: ClashLuke closed this issue 2 years ago

ClashLuke commented 2 years ago

Currently, we graft the Shampoo update onto SGD, which does not work well for NLP models and transformers. Anecdotal evidence suggests that grafting onto RMSProp improves convergence significantly, but RMSProp requires much more memory. Grafting onto SM3 could be a memory-efficient alternative.
This issue is about exploring such grafting methods, benchmarking them, and ideally improving on the baseline's performance.
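For context, grafting keeps the update direction of one optimizer (here Shampoo) but rescales it to the per-tensor step size of another (SGD, RMSProp, or SM3). Below is a minimal, illustrative sketch in JAX of that rescaling; the function names `graft` and `rmsprop_update` are placeholders for this issue and not part of this repository's code.

```python
import jax.numpy as jnp


def graft(direction_update: jnp.ndarray, magnitude_update: jnp.ndarray,
          eps: float = 1e-16) -> jnp.ndarray:
    """Rescale `direction_update` (e.g. Shampoo) to the norm of
    `magnitude_update` (e.g. SGD / RMSProp / SM3)."""
    direction_norm = jnp.linalg.norm(direction_update)
    magnitude_norm = jnp.linalg.norm(magnitude_update)
    return direction_update * (magnitude_norm / (direction_norm + eps))


def rmsprop_update(grad: jnp.ndarray, ema_sq: jnp.ndarray,
                   decay: float = 0.99, eps: float = 1e-8):
    """One RMSProp step, used here only to supply the grafted magnitude."""
    ema_sq = decay * ema_sq + (1.0 - decay) * jnp.square(grad)
    return grad / (jnp.sqrt(ema_sq) + eps), ema_sq


# Example: graft RMSProp's step size onto a (placeholder) Shampoo direction.
grad = jnp.ones((4, 4))
shampoo_dir = grad * 0.1  # stand-in for the preconditioned Shampoo update
rms_step, ema_sq = rmsprop_update(grad, jnp.zeros_like(grad))
update = graft(shampoo_dir, rms_step)
```

In the `M#D` notation used in the comments below, the magnitude comes from `M` and the direction from `D`, so SM3#Shampoo means the Shampoo direction rescaled to SM3's step size.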

ClashLuke commented 2 years ago

Relevant: #47

I just started a hyperparameter sweep of SM3#Shampoo. RMSProp#Shampoo would be next.

ClashLuke commented 2 years ago

SM3#Shampoo outperforms both Shampoo and SM3.