Decoupled Weight Decay?

facebookresearch / schedule_free

Schedule-Free Optimization in PyTorch

Apache License 2.0


Forgive me if this is an obvious question, but is there a reason that schedule-free Adam uses L2 regularization rather than decoupled weight decay? I was getting noisy outputs on text-to-image diffusion models when using a weight decay value like the one I would normally use with AdamW, and I recall decoupled weight decay performing better with adaptive methods, so I was curious.

If it is possible to implement, how would I do it? I suspect that decaying only the param values wouldn't work, but would it suffice to decay the params, ckp1, and z? Or would all of the buffers have to be decayed? Any insight would be appreciated.
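
To make the distinction concrete, here is a minimal sketch of a single textbook Adam step on one scalar parameter, with the weight decay applied either as L2 regularization or in decoupled form. This is not the schedule_free code: the function `adam_step` and its `decoupled` flag are hypothetical names for illustration, and the averaging buffers (`z`, `ckp1`) are deliberately left out, so it does not answer which buffers would need decaying.

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One plain Adam step on a scalar parameter (illustrative only)."""
    if weight_decay and not decoupled:
        # L2 regularization: the decay term is folded into the gradient,
        # so it is later rescaled by the adaptive denominator.
        g = g + weight_decay * p

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    if weight_decay and decoupled:
        # Decoupled decay (AdamW-style): shrink the parameter directly,
        # bypassing the adaptive scaling.
        p = p - lr * weight_decay * p

    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# Same parameter, gradient, and decay value; only the decay style differs.
p_l2, _, _ = adam_step(1.0, 100.0, 0.0, 0.0, t=1, weight_decay=0.1)
p_dec, _, _ = adam_step(1.0, 100.0, 0.0, 0.0, t=1, weight_decay=0.1,
                        decoupled=True)
print(p_l2, p_dec)
```

With a large gradient, the L2 term is almost entirely absorbed by the normalization, while the decoupled term still shrinks the parameter by `lr * weight_decay * p`. That difference is why a decay value tuned for AdamW may behave differently under L2 regularization.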

Decoupled Weight Decay? #45