facebookresearch / schedule_free

Schedule-Free Optimization in PyTorch

Decoupled Weight Decay? #45

Closed · madman404 closed 3 months ago

madman404 commented 3 months ago

Forgive me if this is an obvious question, but is there a reason that schedule-free Adam uses L2 regularization rather than decoupled weight decay? I was getting noisy outputs on text-to-image diffusion models when using the weight decay value I would normally use with AdamW, and I recall decoupled weight decay performing better with adaptive methods, so I was curious.

If it is possible to implement, how would I do it? I suspect just decaying the param values wouldn't work, but would it suffice to decay the params, ckp1, and z? Would all the buffers have to be decayed? Any insight would be appreciated.

adefazio commented 3 months ago

The current implementation does use decoupled weight decay; the way we write the update just makes it look like coupled L2 regularization. The reference implementation, AdamWScheduleFreeReference, is written more plainly and makes the decoupling more apparent.
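
For anyone landing here with the same question, here is a minimal sketch of the coupled-vs-decoupled distinction in a single generic Adam-style step. This is not the schedule_free source code; the names (`param`, `exp_avg`, `exp_avg_sq`, `lr`, `weight_decay`, `decoupled`) are illustrative placeholders. The key point: coupled L2 regularization folds the decay term into the gradient *before* the adaptive denominator is applied, so the decay gets rescaled per-coordinate; decoupled weight decay (as in AdamW) shrinks the weights by a term scaled only by the learning rate.

```python
import torch

# Illustrative sketch only, not the schedule_free implementation.
# All names and hyperparameter values here are hypothetical.

def adam_step(param, grad, exp_avg, exp_avg_sq, step, *,
              lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=1e-2, decoupled=True):
    beta1, beta2 = betas

    if not decoupled:
        # Coupled L2 regularization: the decay term enters the
        # gradient and is rescaled by the adaptive denominator below.
        grad = grad + weight_decay * param

    # Standard Adam moment updates.
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
    exp_avg_sq.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    bias_c1 = 1 - beta1 ** step
    bias_c2 = 1 - beta2 ** step
    denom = (exp_avg_sq / bias_c2).sqrt_().add_(eps)

    # Preconditioned gradient step.
    param.addcdiv_(exp_avg, denom, value=-lr / bias_c1)

    if decoupled:
        # Decoupled weight decay: the shrinkage bypasses the
        # preconditioner entirely and is scaled only by lr.
        param.mul_(1 - lr * weight_decay)
    return param
```

If I'm reading the schedule-free code right, the decay term there is computed at the current evaluation point (the y sequence) and applied after the gradient is normalized, which is why it is decoupled in the AdamW sense even though it is written as an addition to the gradient buffer; whether the decay passes through the Adam denominator is what separates the two behaviors.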