Are there plans to support AdamW Optimizer?
Description
Are there plans to support the AdamW optimizer? Today only Adam is available, but several papers report that it performs worse than AdamW. PyTorch supports AdamW; I don't know whether other engines do.
NLP tasks would certainly benefit from it, and the literature suggests that other fields could as well.
For example, I'm trying to reproduce a result reported in https://aclanthology.org/2021.naacl-industry.38.pdf with DistilBERT on CLINC150, and I fall about 4% short of the reported 96.3%. I tried changing other hyperparameters and configuration settings, but the only remaining difference seems to be the optimizer. Admittedly, the paper does not describe the softmax classifier exactly, but I'm fairly convinced that the optimizer is the issue. In any case, AdamW has been widely used for years, so it would be great to be able to use it. There may also be a separate issue in minibatch handling for models trained on variable-length inputs; I'll report that separately, but it seems to have only a minor impact on this task (a fraction of a percentage point).
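To make the request concrete, here is a minimal sketch of how the two optimizers differ, following the decoupled weight decay formulation in the original paper linked under References. It is a plain-Python illustration for a single scalar parameter, not the implementation of any engine; the function name `adam_step` and the `decoupled` flag are hypothetical names chosen for this example.

```python
import math


def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One update step for a single scalar parameter (illustrative only).

    decoupled=False: Adam with an L2 penalty folded into the gradient.
    decoupled=True:  AdamW, where weight decay is applied directly to the
                     weight and bypasses the adaptive scaling.
    """
    if not decoupled:
        # L2 regularization: the decay term passes through m and v below.
        grad = grad + weight_decay * w

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)

    if decoupled:
        # Decoupled weight decay: shrink the weight independently of the
        # gradient statistics.
        w_new -= lr * weight_decay * w

    return w_new, m, v


# Same starting point, zero loss gradient, only the decay term acts.
w_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1,
                          decoupled=True)
print(w_l2, w_adamw)
```

With the L2 variant the decay term is rescaled by the adaptive denominator, so its effective strength depends on the gradient history; in the decoupled variant it depends only on `lr * weight_decay`. That is why simply adding an L2 penalty to the existing Adam is not equivalent to AdamW.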
References
Original paper: https://arxiv.org/abs/1711.05101
PyTorch implementation: https://pytorch.org/docs/stable/optim.html
Some results showing in practice why it should be used: https://github.com/egg-west/AdamW-pytorch