emanjavacas / pie

A fully-fledged PyTorch package for Morphological Analysis, tailored to morphologically rich and historical languages.

Allow for using other Learning Rate Schedulers and Optimizers #76

Open PonteIneptique opened 3 years ago

PonteIneptique commented 3 years ago

Hey! I started reading about some other optimizers as things came through my news feed (stuff like this or that).

I ended up trying to implement them in pie, but wanted to see first what the results would be. The tests were done as follows: same training set (\~500k words), same learning rate, same test set (\~63k tokens), CUDA, 10 runs per configuration. No hyperparameter optimization was done.

For optimizers, Ranger and Adam were tested; I did not try anything else. For learning rate schedulers, ReduceLROnPlateau, CosineAnnealing, and Delayed(CosineAnnealing) were tested. Patience overall is 15 steps without improvement. The CosineAnnealing T0 is 40, the delay is 10. A sketch of how such a delayed cosine schedule could be composed is shown below.
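For reference, here is a minimal sketch of how a "Delayed(CosineAnnealing)" schedule of this shape could be composed from stock PyTorch schedulers. The dummy model, the `lr` value, and the use of `SequentialLR` (available in PyTorch >= 1.11) are assumptions for illustration, not pie's actual wiring:

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR, CosineAnnealingLR, SequentialLR

model = torch.nn.Linear(10, 10)               # placeholder model for illustration
optimizer = Adam(model.parameters(), lr=1e-3)

# Keep the LR flat for the first 10 epochs (the "delay"), then anneal it
# with a cosine curve parameterized by T_max=40.
flat = LambdaLR(optimizer, lambda epoch: 1.0)
cosine = CosineAnnealingLR(optimizer, T_max=40)
scheduler = SequentialLR(optimizer, schedulers=[flat, cosine], milestones=[10])

for epoch in range(50):
    # training / evaluation would go here
    optimizer.step()
    scheduler.step()
```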

Basically, Ranger does not outperform Adam (maybe with other parameters, who knows, as its betas differ from Adam's), but Delayed(CosineAnnealing) reaches the same results in 40% less time.

If you are okay with it, a PR will be under way.

Results:

[six result plots attached]

emanjavacas commented 3 years ago

We could include an option to select the LR scheduler. That's easy, since it's just swapping the PyTorch LR scheduler and adapting the step call. If you have the code around, feel free to push a PR and we can see how to include it!
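For what it's worth, the scheduler-specific part is mostly that step call; a hedged sketch (the helper name is illustrative, not pie's actual API):

```python
from torch.optim.lr_scheduler import ReduceLROnPlateau

def step_scheduler(scheduler, dev_score=None):
    # ReduceLROnPlateau expects the monitored metric, while most other
    # PyTorch schedulers (CosineAnnealingLR, etc.) are stepped without arguments.
    if isinstance(scheduler, ReduceLROnPlateau):
        scheduler.step(dev_score)
    else:
        scheduler.step()
```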

PonteIneptique commented 3 years ago

So, a small update with my old branch, regarding Flat(Cosine) (delay=10, cosine T_max=40, patience=11): I can definitely recommend it. On a corpus of 1.5M tokens (3 times the previous one), it's not only faster, it also scores higher with less deviation:

[five result plots attached]

PonteIneptique commented 3 years ago

Hey @emanjavacas :) I was very bugged by the Ranger results in the first batch, because I remembered running small trainings and getting better results than with Adam. Then I remembered reading that Ranger needs a higher learning rate to start with, and that I did use a higher one in my preliminary tests. So I did the same with the LASLA corpus, and I got better results with a 10x higher LR than my Adam one (note that my Adam LR is fine-tuned, after close to 100 runs to find the best hyperparameters):

[two result plots attached]
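For illustration, swapping the optimizer while scaling the LR could look like the sketch below; the thread does not say which Ranger implementation was used, so the third-party `torch_optimizer` package and the numeric values are assumptions:

```python
import torch
import torch_optimizer  # pip install torch-optimizer; one of several Ranger implementations

model = torch.nn.Linear(10, 10)  # placeholder model
adam_lr = 1e-3                   # stands in for the fine-tuned Adam LR, not the actual value

# Ranger tends to want a higher starting LR than a tuned Adam setup.
optimizer = torch_optimizer.Ranger(model.parameters(), lr=10 * adam_lr)
```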

PonteIneptique commented 3 years ago

I also found out I have been using CosineAnnealing the wrong way, but it still performs better than Adam: instead of treating T_max as the period of a full cosine cycle of the LR, I have been using it as a slope (the LR curve below is badly offset; it should be shifted 10 epochs to the right): [LR schedule plot attached]
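A quick way to see the shape of the schedule is to print it out; a minimal sketch with a dummy optimizer (values are illustrative):

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR

param = torch.nn.Parameter(torch.zeros(1))  # dummy parameter just to build an optimizer
opt = SGD([param], lr=1e-3)
sched = CosineAnnealingLR(opt, T_max=40)

for epoch in range(80):
    print(epoch, sched.get_last_lr()[0])
    opt.step()
    sched.step()
# Up to epoch 40 the LR only follows the descending half of the cosine
# (the "slope"); kept running past T_max it climbs back up, i.e. T_max
# is half of the full cosine period, not the whole cycle.
```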

PonteIneptique commented 3 years ago

Coming back with new experiments regarding Ranger vs Adam.

I have been playing with single-task models (which indeed improve when fine-tuned correctly), and Ranger clearly yields more stable results:

[results attached]

The second-to-last and the second entries are the same config; only the optimizer changes (without fine-tuning the optimizer hyperparameters).