HesitantlyHuman / autoclip

Implementation of adaptive gradient clipping for base pytorch
MIT License

Interaction between AutoClip and learning rate schedule #9

Open Permafacture opened 7 months ago

Permafacture commented 7 months ago

Cross-posting this question from pseeth's repo because in your example you do use a one-cycle LR schedule.

Has there been any research on how this strategy interacts with a learning rate schedule, especially something extreme like the one-cycle policy (super-convergence)? It seems like the history of the scale of the gradient would be dominated by changes in the learning rate. I found this paper, which touches on the subject but doesn't propose a theory for, or a solution to, the interaction between the two.

(Figure from the linked paper.)

As expected, AutoClip doesn't interact well with cosine annealing
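
For concreteness, here is a minimal sketch of the kind of setup I mean: percentile-based AutoClip (clipping to a percentile of the observed gradient-norm history) alongside a one-cycle schedule. The helper and the toy model below are illustrative, not this repo's API.

```python
import torch
import numpy as np

# Minimal sketch: percentile-based AutoClip combined with a one-cycle LR schedule.
# The helper name and toy setup are illustrative, not this repo's API.

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
total_steps = 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=total_steps
)

grad_history = []  # gradient-norm history shared across all phases of the schedule

def autoclip_(parameters, history, percentile=10.0):
    """Clip gradients to the given percentile of the observed norm history."""
    params = [p for p in parameters if p.grad is not None]
    total_norm = torch.norm(
        torch.stack([p.grad.detach().norm(2) for p in params])
    ).item()
    history.append(total_norm)
    clip_value = np.percentile(history, percentile)
    torch.nn.utils.clip_grad_norm_(params, clip_value)

for step in range(total_steps):
    x, y = torch.randn(32, 16), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # The clip threshold comes from a history whose scale drifts as the
    # one-cycle schedule ramps the learning rate up and back down.
    autoclip_(list(model.parameters()), grad_history)
    optimizer.step()
    scheduler.step()
```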

HesitantlyHuman commented 4 months ago

That is a very interesting question. Unfortunately, since I am not one of the original authors of the technique or paper, I haven't done much thorough testing of autoclipping beyond my personal use.

That being said: I have seen autoclipping provide a small but significant increase in performance for the large models I have trained, most notably BERT-family models. The training setup uses a very aggressive learning rate with a one-cycle policy and a low number of training batches. I suspect that if you were to address the learning rate change issue, the benefit of clipping would increase, but for now I generally find that it allows me to increase my learning rate and lower the number of batches, which improves the generalization of the resulting model. (I have noticed, however, that clipping history lengths which are too long can be detrimental, and often perform even worse than the baseline. I suspect that may be a reflection of this concern, since I rarely train without a policy of some sort.)

It might be worthwhile to try normalizing the gradients by some function of the learning rate at that batch, and then clipping based on that normalized metric. This should hopefully eliminate the effects of the changing learning rate (for both one-cycle and annealing policies), especially if you want to use long clipping histories. A rough sketch of that idea is below.
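
A rough sketch of that suggestion, assuming a plain division by the current learning rate as the normalizing function (any other function of the LR could be swapped in); the helper name is hypothetical and not part of this repo:

```python
import torch
import numpy as np

# Sketch of the LR-normalized variant: record gradient norms divided by the
# current learning rate, take a percentile of that normalized history, then
# rescale by the current LR to get this step's clip threshold. Dividing by the
# LR is just one candidate normalization; the helper name is hypothetical.

def lr_normalized_autoclip_(parameters, optimizer, history, percentile=10.0):
    params = [p for p in parameters if p.grad is not None]
    total_norm = torch.norm(
        torch.stack([p.grad.detach().norm(2) for p in params])
    ).item()
    lr = optimizer.param_groups[0]["lr"]    # learning rate at this batch
    history.append(total_norm / lr)         # LR-normalized entry in the history
    clip_value = np.percentile(history, percentile) * lr  # back to gradient scale
    torch.nn.utils.clip_grad_norm_(params, clip_value)
```

It would drop in for the plain clipping helper in a standard training loop, with the optimizer passed in so the per-batch learning rate can be read from `param_groups`.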

Upshot? It is probably having a negative interaction, but I've found that for my use cases both clipping and one-cycle are beneficial enough to use together anyway. A better clipping policy would probably change this.

I'd love to hear the thoughts of anyone who has done some more substantial testing on this.