Warming up the optimizer states with learning rate = 0 for a few steps

awaelchli commented 8 months ago

We could consider doing this trick for finetuning, as it is quite inexpensive. Intuitively it makes sense to me.

https://x.com/StasBekman/status/1762197664454848693?s=20

When finetuning a model has anyone experimented with first running with LR=0 for some 100-1000 iterations to get the optimizer states tuned up and only then restarting again with a normal LR scheduler?

I'm thinking that this would be more efficient since when optim states are random it'll surely first mess up the pretrained weights even with a tiny LR and then will need time to re-correct itself. Starting the weight update with good optim states should save time I think, despite the initial non-stepping steps. And also this is likely to allow for a more aggressive LR scheduler which doesn't need a long warm up.
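To make the idea concrete, here is a minimal sketch of a scalar Adam update in plain Python, not litgpt code. The helper names (`adam_step`, `grad`) and the toy loss are assumptions made for illustration. Adam's moment estimates usually start at zero rather than at random values, so during the LR=0 phase they fill up with gradient statistics while the weight itself stays exactly where it was.

```python
import math

def adam_step(w, g, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar weight; returns the new weight."""
    state["t"] += 1
    # Exponential moving averages of the gradient and the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * g * g
    # Bias correction for the zero-initialized moments.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def grad(w):
    # Gradient of the toy loss (w - 3)^2 (hypothetical stand-in for a real model).
    return 2.0 * (w - 3.0)

w = 1.0  # stands in for a pretrained weight
state = {"t": 0, "m": 0.0, "v": 0.0}

# Phase 1: LR = 0. The optimizer states accumulate, the weight does not move.
for _ in range(100):
    w = adam_step(w, grad(w), state, lr=0.0)
w_after_warmup = w

# Phase 2: switch to a normal learning rate, reusing the warmed-up states.
for _ in range(10):
    w = adam_step(w, grad(w), state, lr=1e-2)
```

One caveat: the step counter `t` also advances during the LR=0 phase, so the bias correction is already close to 1 by the time real updates begin. Whether this actually speeds up finetuning is the open question raised in the quote; the sketch only shows the mechanics.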

cc @rasbt

rasbt commented 8 months ago

This sounds interesting, but I would prefer not to make it the default, because that would make comparisons with other LLM frameworks difficult. I do like the current warmup/decay schedule we have implemented, which also matches what others are doing (for example Llama and OLMo, except that OLMo uses linear instead of cosine decay).

But regarding this idea, this could potentially be an additional option.
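As a rough illustration of what such an option might look like, the sketch below puts an optional zero-LR phase in front of a linear warmup and cosine decay. This is not the schedule litgpt implements; the function name `lr_at` and all of its parameters are hypothetical.

```python
import math

def lr_at(step, max_lr, zero_steps, warmup_steps, total_steps, min_lr=0.0):
    """Learning rate at a given step (all names here are illustrative)."""
    # Optional phase: keep LR at zero so only the optimizer states change.
    if step < zero_steps:
        return 0.0
    s = step - zero_steps
    # Linear warmup up to max_lr.
    if s < warmup_steps:
        return max_lr * (s + 1) / warmup_steps
    # Cosine decay from max_lr down to min_lr over the remaining steps.
    decay_steps = total_steps - zero_steps - warmup_steps
    progress = min(1.0, (s - warmup_steps) / max(1, decay_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Setting `zero_steps=0` gives a plain warmup and cosine decay schedule, so the extra phase would stay opt-in and the default would remain comparable with other frameworks.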

fzyzcjy commented 1 week ago

Hi, are there any updates? Thanks!

rasbt commented 1 week ago

Sorry, but unfortunately this would be out of scope for now.

fzyzcjy commented 1 week ago

I see. Thank you all the same!

Lightning-AI / litgpt

Warming up the optimizer states with learning rate = 0 for a few steps #956