Adds support for the AdEMAMix optimizer described here: https://arxiv.org/abs/2409.03137
Includes blockwise 8bit and 32bit versions, each supporting paged operation.
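As a usage illustration, assuming the new classes follow the naming convention of the existing AdamW variants (`bnb.optim.AdamW8bit`, `bnb.optim.PagedAdamW8bit`), the names and keyword arguments below are assumptions rather than confirmed API; check the diff for the actual signatures:

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(64, 64)

# Hypothetical class/argument names, assumed to mirror the existing
# AdamW8bit / PagedAdamW8bit optimizers in bitsandbytes.
optimizer = bnb.optim.PagedAdEMAMix8bit(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999, 0.9999),  # (beta1, beta2, beta3); beta3 drives the slow EMA
    alpha=5.0,                   # weight of the slow EMA in the update
)
```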
AdEMAMix is a modification of Adam that adds a second, slow-moving EMA of the gradient alongside the usual first moment. The paper reports that AdEMAMix forgets training data more slowly and reaches a loss comparable to AdamW's with significantly less training data.
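For readers who don't want to dig into the paper, here is a minimal reference sketch of the update rule in the paper's notation (beta3 for the slow EMA, alpha for its mixing weight). It is only an illustration of the algorithm, not the blockwise/paged kernels this PR adds:

```python
import torch

def ademamix_step(p, grad, m1, m2, nu, step, lr=1e-3,
                  beta1=0.9, beta2=0.999, beta3=0.9999,
                  alpha=5.0, eps=1e-8, weight_decay=0.0):
    """One AdEMAMix update for a single parameter tensor (in place)."""
    # Fast EMA of the gradient (same as Adam's first moment).
    m1.mul_(beta1).add_(grad, alpha=1 - beta1)
    # Slow EMA of the gradient -- the extra component AdEMAMix introduces.
    m2.mul_(beta3).add_(grad, alpha=1 - beta3)
    # Second moment, as in Adam.
    nu.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)

    # Bias correction applies only to the Adam-style moments.
    m1_hat = m1 / (1 - beta1 ** step)
    nu_hat = nu / (1 - beta2 ** step)

    # Mix the fast and slow EMAs; alpha scales the slow component.
    update = (m1_hat + alpha * m2) / (nu_hat.sqrt() + eps)
    if weight_decay != 0.0:
        update = update + weight_decay * p  # decoupled, AdamW-style decay
    p.add_(update, alpha=-lr)
```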
TODO: Implement scheduler for alpha/beta3