Currently, we're grafting the Shampoo update onto SGD, which doesn't work well for NLP models and transformers. However, anecdotal evidence suggests that grafting onto RMSProp improves convergence significantly; unfortunately, RMSProp requires much more memory. Grafting onto SM3 could be a memory-efficient alternative.
This issue is about exploring such grafting methods, benchmarking them, and ideally improving on the baseline's performance.
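For context, grafting takes the *direction* of the Shampoo step and rescales it to the per-layer *magnitude* of the grafted optimizer's step (SGD, RMSProp, SM3, etc.). A minimal PyTorch sketch of that rescaling, with a hypothetical `graft` helper and an illustrative `eps` default (neither is part of our codebase):

```python
import torch

def graft(shampoo_step: torch.Tensor,
          graft_step: torch.Tensor,
          eps: float = 1e-16) -> torch.Tensor:
    """Return the Shampoo direction scaled to the grafted step's norm.

    shampoo_step: preconditioned update from Shampoo for one parameter.
    graft_step:   update the grafted optimizer (e.g. SGD/RMSProp/SM3)
                  would have taken for the same parameter.
    """
    shampoo_norm = shampoo_step.norm()
    graft_norm = graft_step.norm()
    # Keep Shampoo's direction, borrow the grafted optimizer's magnitude.
    return shampoo_step * (graft_norm / (shampoo_norm + eps))
```

In practice this would be applied per parameter tensor inside the optimizer's step, so swapping the grafted optimizer only changes how `graft_step` is computed (and how much state it needs, which is where SM3's memory advantage would show up).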