Andron00e opened this issue 1 month ago
Add schedules also (link).
Update: this has been added via this commit.
There were some problems installing the latest version of schedulefree, so I added it manually.
See: https://github.com/epfml/llm-baselines/blob/soap/src/optim/schedulefree.py
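For anyone wiring it into their own loop, here is a minimal sketch of the usage, assuming the vendored file exposes the same `AdamWScheduleFree` class (and its `train()`/`eval()` switching) as the upstream `schedulefree` package; arguments like `warmup_steps` are that package's, not something specific to this repo:

```python
import torch
from schedulefree import AdamWScheduleFree  # or the vendored src/optim/schedulefree.py

model = torch.nn.Linear(16, 16)
opt = AdamWScheduleFree(model.parameters(), lr=1e-3, warmup_steps=100)

# Schedule-free optimizers keep two weight sequences, so the optimizer itself
# has to be toggled between train/eval alongside the model.
model.train()
opt.train()
for _ in range(100):
    x = torch.randn(8, 16)
    loss = (model(x) - x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()
opt.eval()  # switch the weights to the evaluation point before validation
```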
Is there a pull request for this? It would be nice to collaborate.
Hi, we are deploying it to the soap branch together with @mpagli.
Some useful settings:

Adam-mini note:
I use model.named_parameters() for Adam-mini instead of group_specs, so in main.py it looks like:
```python
elif args.opt == "adam-mini":
    opt = Adam_mini(
        device=args.device,
        world_size=args.world_size,
        named_parameters=model.named_parameters(),  # check
        lr=args.lr,
        betas=(args.beta1, args.beta2),
        weight_decay=args.weight_decay,
        model_sharding=args.model_sharding,
        dim=args.n_embd,
        n_heads=args.n_head,
        n_kv_heads=args.n_kv_head,
        verbose=args.adam_mini_verbose,
    )
```
TODO: update partition names.
Hi, I'll add Sophia and Adafactor.
Hello! Super, just develop this in your branch and then open a PR to soap. I am a bit overloaded these days, but I also wanted to try Sophia.
Note: in the official repository they only show SophiaG, not SophiaH (with Hutchinson's preconditioner). We want to have both methods here. SophiaH is nicely implemented in optax for now, but it's not so hard to write in PyTorch, see: this link
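For reference, the Hutchinson estimate itself is just a Hessian-vector product against a Rademacher probe; a minimal PyTorch sketch (function and variable names are mine, not from any of the repos above):

```python
import torch

def hutchinson_hessian_diag(loss, params):
    """Unbiased estimate of diag(H): E[u * (H u)] with u a Rademacher vector."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Rademacher probes: entries are +1 or -1 with equal probability
    probes = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
    # Hessian-vector product via the gradient of <g, u>
    g_dot_u = sum((g * u).sum() for g, u in zip(grads, probes))
    hvps = torch.autograd.grad(g_dot_u, params)
    return [u * hv for u, hv in zip(probes, hvps)]
```

SophiaH then keeps an EMA of this estimate as its preconditioner, refreshing it only every k steps.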
Thx)
The Muon optimizer should also be a good one to add. I think @doikov might be interested in that one too: https://x.com/Yuchenj_UW/status/1846964136204173318
Once we have a handful, we'll have a nice benchmark collection for LLM optimizers, probably worth a small write-up soon.
Yes, I am working on that. I already have some test runs of Muon, but, again, it is hard to draw conclusions when the batch size is less than 0.5M tokens.
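For context, the core of Muon is momentum SGD whose 2-D weight updates get approximately orthogonalized with a few Newton-Schulz iterations before being applied; a rough sketch of that step (the iteration coefficients are copied from the commonly shared implementation and should be treated as an assumption):

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately replace G with the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients (assumption)
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                      # work with the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Inside the optimizer step (sketch):
#   buf = momentum * buf + grad
#   p  -= lr * newton_schulz_orthogonalize(buf)
```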
Btw, an interesting exercise: try this new Muon/SOAP/whatever on the banana function :)
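In case anyone actually tries it: the banana function is the 2-D Rosenbrock function, and a toy harness is only a few lines (torch.optim.Adam is used here purely as a stand-in, since Muon/SOAP are not in torch.optim):

```python
import torch

def rosenbrock(xy, a=1.0, b=100.0):
    """The 'banana' function; global minimum at (a, a**2)."""
    x, y = xy[0], xy[1]
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

xy = torch.tensor([-1.5, 2.0], requires_grad=True)
opt = torch.optim.Adam([xy], lr=1e-2)   # swap in Muon / SOAP / whatever here

for _ in range(5000):
    opt.zero_grad()
    rosenbrock(xy).backward()
    opt.step()

print(xy.detach())  # should approach (1.0, 1.0)
```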
Hi, Bristen is back early, so I'll get back to that. I did some research on Sophia, though; main findings:
- The official implementation of SophiaG makes some weird choices, described here https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/final-projects/CaiaMaiCostelloJasonDanielLazar.pdf.
- The levanter implementation does not have SophiaG, only SophiaH.
- There's a quite readable Julia implementation of SophiaH here https://github.com/SciML/Optimization.jl/blob/master/src/sophia.jl.
Adafactor is simple; it's already close to being released officially, see pytorch/pytorch#129905.
When I next get some time, I'll return to this if you haven't.
I mean, for the official version of SophiaG, you can just look at the paper's repo: https://github.com/Liuhong99/Sophia
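For quick reference when reading that repo: the SophiaG step from the paper is an EMA of the gradient divided by a scaled EMA of a Hessian-diagonal estimate, clipped element-wise to [-1, 1]. A paraphrased sketch (not the repo's code; it omits how m and h are maintained):

```python
import torch

def sophiag_update(p, m, h, lr=1e-4, rho=0.04, eps=1e-15, weight_decay=0.0):
    """One SophiaG parameter update, following the paper's formulation."""
    if weight_decay != 0.0:
        p.mul_(1.0 - lr * weight_decay)             # decoupled weight decay
    # clip(m / max(rho * h, eps), -1, 1)
    update = torch.clamp(m / torch.clamp(rho * h, min=eps), -1.0, 1.0)
    p.add_(update, alpha=-lr)
```

If I recall correctly, one of the choices discussed in the report linked above is that the official code also folds a batch-size constant into that denominator, which is worth keeping in mind when comparing hyperparameters.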
- SOAP
- Muon
- Adam-mini
- Lion
- Sophia
- AdEMAMix
- Schedule-Free
- Adafactor
- Signum, signSGD
- Prodigy
- SGDF
- LAMB
- MARS (with grad calculation in a different stochasticity)