HesitantlyHuman / autoclip

Implementation of adaptive gradient clipping for base pytorch
MIT License

Interaction between AutoClip and learning rate schedule #9

Open Permafacture opened 7 months ago

Permafacture commented 7 months ago

Cross-posting this question from pseeth's repo because in your example you do use a one-cycle LR schedule.

Has there been any research on how this strategy interacts with a learning rate schedule, especially something extreme like the one-cycle policy (super-convergence)? It seems like the history of the scale of the gradient would be dominated by changes in the learning rate. I found this paper, which touches on the subject but doesn't propose a theory for, or a solution to, the interaction between the two.

(Figure from the linked paper.)

As expected, AutoClip doesn't interact well with cosine annealing
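
For concreteness, here is a minimal sketch of the kind of setup I mean: percentile-based AutoClip (clipping to a percentile of the observed gradient-norm history) alongside a one-cycle schedule. The helper and the toy model below are illustrative, not this repo's API.

```python
import torch
import numpy as np

# Minimal sketch: percentile-based AutoClip combined with a one-cycle LR schedule.
# The helper name and toy setup are illustrative, not this repo's API.

torch.manual_seed(0)
model = torch.nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
total_steps = 100
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=total_steps
)

grad_history = []  # gradient-norm history shared across all phases of the schedule

def autoclip_(parameters, history, percentile=10.0):
    """Clip gradients to the given percentile of the observed norm history."""
    params = [p for p in parameters if p.grad is not None]
    total_norm = torch.norm(
        torch.stack([p.grad.detach().norm(2) for p in params])
    ).item()
    history.append(total_norm)
    clip_value = np.percentile(history, percentile)
    torch.nn.utils.clip_grad_norm_(params, clip_value)

for step in range(total_steps):
    x, y = torch.randn(32, 16), torch.randn(32, 1)
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    # The clip threshold comes from a history whose scale drifts as the
    # one-cycle schedule ramps the learning rate up and back down.
    autoclip_(list(model.parameters()), grad_history)
    optimizer.step()
    scheduler.step()
```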

HesitantlyHuman commented 4 months ago

That is a very interesting question. Unfortunately, since I am not one of the original authors of the technique or paper, I haven't done much thorough testing of autoclipping beyond my personal use.

That being said: I have seen autoclipping provide a small but significant increase in performance for the large models I have trained, most notably BERT-family models. The training setup uses a very aggressive learning rate with a one-cycle policy and a low number of training batches. I suspect that if you were to address the learning rate change issue, the benefit of clipping would increase, but for now I generally find that it allows me to increase my learning rate and lower the number of batches, which improves the generalization of the resulting model. (I have noticed, however, that clipping history lengths which are too long can be detrimental, and often perform even worse than the baseline. I suspect that may be a reflection of this concern, since I rarely train without a policy of some sort.)

It might be worthwhile to try normalizing the gradients by some function of the learning rate at that batch, and then clipping based on that normalized metric. This should hopefully eliminate the effects of the changing learning rate (for both one-cycle and annealing policies), especially if you want to use long clipping histories. A rough sketch of that idea is below.
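
A rough sketch of that suggestion, assuming a plain division by the current learning rate as the normalizing function (any other function of the LR could be swapped in); the helper name is hypothetical and not part of this repo:

```python
import torch
import numpy as np

# Sketch of the LR-normalized variant: record gradient norms divided by the
# current learning rate, take a percentile of that normalized history, then
# rescale by the current LR to get this step's clip threshold. Dividing by the
# LR is just one candidate normalization; the helper name is hypothetical.

def lr_normalized_autoclip_(parameters, optimizer, history, percentile=10.0):
    params = [p for p in parameters if p.grad is not None]
    total_norm = torch.norm(
        torch.stack([p.grad.detach().norm(2) for p in params])
    ).item()
    lr = optimizer.param_groups[0]["lr"]    # learning rate at this batch
    history.append(total_norm / lr)         # LR-normalized entry in the history
    clip_value = np.percentile(history, percentile) * lr  # back to gradient scale
    torch.nn.utils.clip_grad_norm_(params, clip_value)
```

It would drop in for the plain clipping helper in a standard training loop, with the optimizer passed in so the per-batch learning rate can be read from `param_groups`.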

Upshot? It is probably having a negative interaction, but I've found that for my use cases both clipping and one-cycle are beneficial enough to use together anyway. A better clipping policy would probably change this.

I'd love to hear the thoughts of anyone who has done some more substantial testing on this.