Maybe GaLore (#1192) should be changed from `GaloreArgs` to `OptimizerArgs` after all. Then we could also more easily consider other variants such as BAdam (BAdam: A Memory Efficient Full Parameter Training Method for Large Language Models, https://arxiv.org/abs/2404.02827).
The experiments in the paper look very compelling, and it only adds one hyperparameter.
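For concreteness, a minimal sketch of what a generalized `OptimizerArgs` could look like; all field names and defaults below are illustrative assumptions, not the actual API from #1192:

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class OptimizerArgs:
    """Hypothetical generalized optimizer config (replacing GaloreArgs).

    Field names and defaults are illustrative, not taken from #1192.
    """

    name: Literal["adamw", "galore", "badam"] = "adamw"

    # GaLore-specific settings, ignored by the other optimizers
    galore_rank: int = 128
    galore_update_proj_gap: int = 200
    galore_scale: float = 0.25

    # BAdam's single extra hyperparameter (assumed naming): how many
    # Adam steps to run on the active parameter block before switching
    # to the next block
    badam_switch_interval: Optional[int] = 100
```

A single dispatch point like this would keep the CLI surface stable while letting new optimizer variants slot in behind one `name` field instead of each adding its own args class.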