Closed RossM closed 5 months ago
Describe your changes
See https://arxiv.org/abs/2303.06296
This adds an option to reparametrize the model weights using the spectral norm, so that the overall norm of each weight matrix can't change. This helps stabilize training at high learning rates.
A possible alternate name for the option is "Sigma Reparametrization", which is closer to the term used in some papers but seems less informative to me.
My testing so far indicates that this slows training down a fair amount (~30%) but enables learning rates of 1e-5 or higher without losing details. In theory, even if a model has been trained at too high a learning rate, lowering the learning rate and continuing training should let the model recover. I've got a longer training run going to test whether that actually works.
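As a rough illustration of the idea (not the actual code in this PR), a spectral-norm reparametrization rescales the raw weight by its largest singular value so the effective weight's spectral norm stays pinned at a constant, no matter how far the raw weight drifts during training. A minimal NumPy sketch, with the spectral norm estimated by power iteration and a hypothetical `gamma` scale parameter:

```python
import numpy as np

def spectral_norm(w, n_iters=30):
    """Estimate the largest singular value of w via power iteration."""
    u = np.random.default_rng(0).standard_normal(w.shape[0])
    for _ in range(n_iters):
        v = w.T @ u
        v /= np.linalg.norm(v)
        u = w @ v
        u /= np.linalg.norm(u)
    # u^T W v converges to the top singular value.
    return float(u @ w @ v)

def sigma_reparam(w_hat, gamma=1.0):
    """Effective weight W = (gamma / sigma(W_hat)) * W_hat.

    The spectral norm of the returned matrix is gamma by construction,
    so gradient updates to w_hat can change its direction but not the
    overall scale of the effective weight.
    """
    return (gamma / spectral_norm(w_hat)) * w_hat
```

In a training loop, the optimizer would update `w_hat` (and optionally learn `gamma`), while the forward pass always uses `sigma_reparam(w_hat)`; this is what bounds the per-layer gain even at aggressive learning rates.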
Issue ticket number and link (if applicable)
Checklist before requesting a review