Closed (madman404 closed this 3 months ago)
The current implementation does use decoupled weight decay; the way we write the update just makes it look like L2 regularization. The reference implementation, AdamWScheduleFreeReference, is written more clearly, which makes the decoupling more apparent.
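To illustrate the distinction in general terms, here is a minimal sketch on a single scalar parameter using plain Adam. It is not the schedule-free code; the function name `adam_step` and the `mode` argument are hypothetical and exist only for this example. It assumes that "looking like L2" means the decay term is added to the update vector. The point is that adding `wd * p` after the gradient has been divided by the second-moment estimate is algebraically the same as decaying the parameter separately, whereas adding it before the moment estimates is true L2 regularization and gets rescaled by the adaptive denominator.

```python
import math

def adam_step(p, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, wd=0.0, mode="decoupled"):
    """One Adam step on a scalar parameter (illustrative only).

    mode="l2":        decay is added to the raw gradient (coupled).
    mode="decoupled": parameter is shrunk separately, AdamW style.
    mode="folded":    decay is added to the already-normalized update.
    """
    if mode == "l2":
        # Coupled: the decay term passes through the moment estimates.
        g = g + wd * p
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    update = m_hat / (math.sqrt(v_hat) + eps)
    if mode == "decoupled":
        # Shrink the parameter first, then apply the adaptive update.
        p = p * (1 - lr * wd) - lr * update
    elif mode == "folded":
        # Same arithmetic as "decoupled", just written as one update vector.
        p = p - lr * (update + wd * p)
    else:
        p = p - lr * update
    return p, m, v

p0, g0 = 1.0, 0.5
p_l2, _, _ = adam_step(p0, g0, 0.0, 0.0, 1, wd=0.1, mode="l2")
p_dec, _, _ = adam_step(p0, g0, 0.0, 0.0, 1, wd=0.1, mode="decoupled")
p_fold, _, _ = adam_step(p0, g0, 0.0, 0.0, 1, wd=0.1, mode="folded")
print(p_l2, p_dec, p_fold)
```

In this sketch the "folded" and "decoupled" results agree up to floating-point rounding, while the "l2" result differs because the decay term is normalized along with the gradient.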
Forgive me if this is an obvious question, but is there a reason that schedule-free Adam uses L2 regularization and not decoupled weight decay? I was having some issues with noisy outputs on text-to-image diffusion models when using a weight decay value like the one I would normally use with AdamW, and I recall decoupled weight decay performing better with adaptive methods, so I was just curious.
If it is possible to implement, how would I do it? I suspect just decaying the param values wouldn't work, but would it suffice to decay the params, ckp1, and z? Would all the buffers have to be decayed? Any insight would be appreciated.