Closed by brando90 3 years ago
It is because the untuned linear warmup works well and is easy to implement.
But RAdam requires no tuning. Doesn't that make it better than warmup?
The untuned warmup, whose schedule depends only on beta2, requires no tuning either.
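For concreteness, here is a minimal sketch of what "untuned" means here, assuming the rules of thumb as I understand them from the related paper: linear warmup over 2 / (1 - beta2) steps, or exponential warmup with time constant 1 / (1 - beta2). The function names are illustrative only and are not this repo's API.

```python
import math


def untuned_linear_warmup_factor(step: int, beta2: float = 0.999) -> float:
    """Learning-rate multiplier at a 1-indexed optimizer step.

    Assumed rule: ramp linearly from 0 to 1 over 2 / (1 - beta2) steps,
    then stay at 1. The only input is Adam's beta2, so nothing is tuned.
    """
    warmup_period = 2.0 / (1.0 - beta2)
    return min(1.0, step / warmup_period)


def untuned_exponential_warmup_factor(step: int, beta2: float = 0.999) -> float:
    """Assumed exponential variant: 1 - exp(-(1 - beta2) * step)."""
    return 1.0 - math.exp(-(1.0 - beta2) * step)


if __name__ == "__main__":
    base_lr = 1e-3  # hypothetical base learning rate
    for step in (1, 500, 1000, 2000, 5000):
        lr = base_lr * untuned_linear_warmup_factor(step)
        print(f"step {step:5d}: lr = {lr:.6f}")
```

With the default beta2 = 0.999 the linear ramp lasts about 2000 steps, and the multiplier is simply applied to the base learning rate before each Adam update.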
I've argued in https://github.com/LiyuanLucasLiu/RAdam/issues/62 that, if warmup and RAdam are equivalent, using RAdam might be simpler. However, I'd be curious about the arguments in favour of warmup presented in this repo and the related paper.
What are the reasons to choose warmup instead of RAdam?