Open awaelchli opened 8 months ago
This sounds interesting, but I would say let's not make it the default, because that would make it difficult to compare with other LLM frameworks. I do like the warmup/decay schedule we currently have implemented, which also matches what others are doing (like Llama and OLMo, except that OLMo uses linear instead of cosine decay).
But regarding this idea, this could potentially be an additional option.
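For readers unfamiliar with the schedule mentioned above, here is a minimal sketch of linear warmup followed by either cosine or linear decay. It is not the framework's actual implementation; the function name `lr_at_step` and all parameter names are hypothetical and chosen only for illustration.

```python
import math


def lr_at_step(step, max_lr, min_lr, warmup_steps, max_steps, decay="cosine"):
    """Illustrative learning-rate schedule (hypothetical helper, not a library API)."""
    # Linear warmup: ramp from max_lr / warmup_steps up to max_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    # After the schedule ends, hold the learning rate at its floor.
    if step >= max_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    if decay == "cosine":
        coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    elif decay == "linear":
        coeff = 1.0 - progress
    else:
        raise ValueError(f"unknown decay type: {decay!r}")
    # Interpolate between max_lr (coeff = 1) and min_lr (coeff = 0).
    return min_lr + coeff * (max_lr - min_lr)
```

Both variants start the decay phase at `max_lr` and end at `min_lr`; they differ only in the shape of the curve in between, which is the cosine-versus-linear difference noted in the comment.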
Hi, are there any updates? Thanks!
Sorry, but unfortunately this would be out of scope for now.
I see. Thank you all the same!
We could consider doing this trick for finetuning, as it is quite inexpensive. Intuitively it makes sense to me.
https://x.com/StasBekman/status/1762197664454848693?s=20
cc @rasbt