HomebrewNLP / Olmax

HomebrewNLP in JAX flavour for maintainable TPU training

Square LR-Schedule #70

Open ClashLuke opened 1 year ago

ClashLuke commented 1 year ago

Our learning-rate scheduler currently uses a linear ramp-up followed by an exponential drop-off, so the learning-rate curve looks like the figure below, where the duration of the initial ramp-up and the decay rate are tunable hyperparameters.

[figure: current learning-rate curve with linear ramp-up and exponential decay]
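For reference, a minimal sketch of such a schedule (not the repository's actual implementation) could look like this; `warmup_steps` and `decay_rate` are hypothetical stand-ins for the tunable hyperparameters mentioned above:

```python
import jax.numpy as jnp


def current_schedule(step, base_lr=1e-3, warmup_steps=1000, decay_rate=1e-4):
    # Linear ramp-up from 0 to 1 over `warmup_steps`.
    warmup = jnp.minimum(step / warmup_steps, 1.0)
    # Exponential drop-off after the ramp-up finishes.
    decay = jnp.exp(-decay_rate * jnp.maximum(step - warmup_steps, 0.0))
    return base_lr * warmup * decay
```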

However, others have pointed out that a square ramp-up and square decay can perform significantly better, so we might also want to use them. The modified curve (orange) would look like the figure below:

[figure: proposed square learning-rate curve (orange) overlaid on the current schedule]
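One possible reading of "square ramp-up and square decay" is a quadratic warmup and a quadratic decay to zero; the exact formula is an assumption, and `total_steps` is a hypothetical parameter, but a sketch under that interpretation would be:

```python
import jax.numpy as jnp


def square_schedule(step, base_lr=1e-3, warmup_steps=1000, total_steps=100_000):
    # Square ramp-up: quadratic growth from 0 to 1 over `warmup_steps`.
    warmup = jnp.minimum(step / warmup_steps, 1.0) ** 2
    # Square decay: quadratic drop from 1 to 0 over the remaining steps.
    progress = jnp.clip((step - warmup_steps) / (total_steps - warmup_steps), 0.0, 1.0)
    decay = (1.0 - progress) ** 2
    return base_lr * warmup * decay
```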