PiLab-CAU / ComputerVision-2401

Computer Vision Course 2024-01

[Lecture10][0617] Benefits of weight decay in AdamW #53

Closed: kimjyan20221186 closed this 1 day ago

kimjyan20221186 commented 2 weeks ago

AdamW adds (decoupled) weight decay to the Adam optimizer. Weight decay is generally effective at preventing overfitting, but what additional benefits does AdamW offer?

-Kim JiHyeon

yjyoo3312 commented 2 weeks ago

@kimjyan20221186 Thank you for the comment!

Besides the theoretical explanation, there is an empirical story. For a long time, classification models (and also detection and segmentation models) were trained with SGD with momentum and weight decay, because Adam could not outperform SGD on those tasks, even though Adam had been very successful in many other domains thanks to its stable convergence. We believe the reason lies in how weight decay is handled (Adam already provides momentum).

As expected, AdamW, which combines Adam with decoupled weight decay, has demonstrated comparable or superior performance to SGD with momentum and weight decay, while retaining Adam's training stability and its performance in other deep learning domains. This is why many computer vision practitioners recommend AdamW as a first choice. We also believe that both weight decay and momentum are empirically useful for escaping poor local minima on the loss surface.
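To make the "decoupled weight decay" point concrete, here is a minimal NumPy sketch of a single AdamW update; the helper name, loop, and hyperparameter values are purely illustrative, not part of any course code. The key detail is that the decay term multiplies the weights directly and never passes through the adaptive denominator, whereas Adam with an L2 penalty folds weight_decay * w into the gradient, so the effective decay is scaled down exactly for parameters with a large gradient history.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW parameter update (illustrative sketch)."""
    # Decoupled weight decay: shrink the weights directly,
    # independent of the adaptive gradient scaling applied below.
    w = w * (1.0 - lr * weight_decay)
    m = beta1 * m + (1 - beta1) * grad           # first moment (momentum)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second moment (per-parameter scale)
    m_hat = m / (1 - beta1 ** t)                 # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive gradient step
    return w, m, v

# Adam with a plain L2 penalty would instead fold the decay into the gradient:
#   grad = grad + weight_decay * w
# so the decay term gets divided by sqrt(v_hat) and is weakened for
# parameters with large past gradients, which is what AdamW avoids.

# Toy usage on a single weight vector (quadratic loss with minimum at target):
target = np.array([1.0, -2.0, 0.5])
w = np.zeros(3); m = np.zeros(3); v = np.zeros(3)
for t in range(1, 101):
    grad = 2 * (w - target)
    w, m, v = adamw_step(w, grad, m, v, t)
```

In PyTorch this corresponds to using torch.optim.AdamW rather than torch.optim.Adam with its weight_decay argument, since the latter implements the coupled L2 form.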

kimjyan20221186 commented 2 weeks ago

@yjyoo3312 Thanks for your reply;)