Andron00e opened this issue 1 month ago
Add schedules also (link).
Update: this has been added via this commit.
There were some problems installing the latest version of schedulefree, so I added it manually.
See: https://github.com/epfml/llm-baselines/blob/soap/src/optim/schedulefree.py
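For anyone wiring it into their own loop, here is a minimal sketch of the usage, assuming the vendored file exposes the same `AdamWScheduleFree` class (and its `train()`/`eval()` switching) as the upstream `schedulefree` package; arguments like `warmup_steps` are that package's, not something specific to this repo:

```python
import torch
from schedulefree import AdamWScheduleFree  # or the vendored src/optim/schedulefree.py

model = torch.nn.Linear(16, 16)
opt = AdamWScheduleFree(model.parameters(), lr=1e-3, warmup_steps=100)

# Schedule-free optimizers keep two weight sequences, so the optimizer itself
# has to be toggled between train/eval alongside the model.
model.train()
opt.train()
for _ in range(100):
    x = torch.randn(8, 16)
    loss = (model(x) - x).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

model.eval()
opt.eval()  # switch the weights to the evaluation point before validation
```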
Is there a pull request for this? It would be nice to collaborate.
Hi, we are deploying it to the soap branch together with @mpagli.
Some useful settings:

Adam-mini note:
I use model.named_parameters() for Adam-mini instead of group_specs, so in main.py it looks like:
```python
elif args.opt == "adam-mini":
    opt = Adam_mini(
        device=args.device,
        world_size=args.world_size,
        named_parameters=model.named_parameters(),  # check
        lr=args.lr,
        betas=(args.beta1, args.beta2),
        weight_decay=args.weight_decay,
        model_sharding=args.model_sharding,
        dim=args.n_embd,
        n_heads=args.n_head,
        n_kv_heads=args.n_kv_head,
        verbose=args.adam_mini_verbose,
    )
```
TODO: update partition names.
Hi, I'll add Sophia and Adafactor.
Hello! Super, just develop this in your branch and then open a PR to soap. I am a bit overloaded these days, but I also wanted to try Sophia.
Note: in the official repository they only show SophiaG, not SophiaH (with Hutchinson's preconditioner). We want to have both methods here. SophiaH is nicely implemented in optax for now, but it's not so hard to write in PyTorch, see: this link
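For reference, the Hutchinson estimate itself is just a Hessian-vector product against a Rademacher probe; a minimal PyTorch sketch (function and variable names are mine, not from any of the repos above):

```python
import torch

def hutchinson_hessian_diag(loss, params):
    """Unbiased estimate of diag(H): E[u * (H u)] with u a Rademacher vector."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # Rademacher probes: entries are +1 or -1 with equal probability
    probes = [torch.randint_like(p, high=2) * 2.0 - 1.0 for p in params]
    # Hessian-vector product via the gradient of <g, u>
    g_dot_u = sum((g * u).sum() for g, u in zip(grads, probes))
    hvps = torch.autograd.grad(g_dot_u, params)
    return [u * hv for u, hv in zip(probes, hvps)]
```

SophiaH then keeps an EMA of this estimate as its preconditioner, refreshing it only every k steps.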
Thx)
The Muon optimizer should also be a good one to add. I think @doikov might be interested in that one too: https://x.com/Yuchenj_UW/status/1846964136204173318
Once we have a handful, we'll have a nice benchmark collection for LLM optimizers, probably worth a small write-up soon.
Yes, I am working on that. I already have some test runs of Muon, but, again, it is hard to draw conclusions when the batch size is less than 0.5M tokens.
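For context, the core of Muon is momentum SGD whose 2-D weight updates get approximately orthogonalized with a few Newton-Schulz iterations before being applied; a rough sketch of that step (the iteration coefficients are copied from the commonly shared implementation and should be treated as an assumption):

```python
import torch

@torch.no_grad()
def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately replace G with the nearest semi-orthogonal matrix."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic iteration coefficients (assumption)
    X = G / (G.norm() + eps)            # normalize so the iteration converges
    transposed = X.size(0) > X.size(1)
    if transposed:                      # work with the "wide" orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Inside the optimizer step (sketch):
#   buf = momentum * buf + grad
#   p  -= lr * newton_schulz_orthogonalize(buf)
```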
Btw, an interesting exercise: try this new Muon/SOAP/whatever on the banana function :)
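In case anyone actually tries it: the banana function is the 2-D Rosenbrock function, and a toy harness is only a few lines (torch.optim.Adam is used here purely as a stand-in, since Muon/SOAP are not in torch.optim):

```python
import torch

def rosenbrock(xy, a=1.0, b=100.0):
    """The 'banana' function; global minimum at (a, a**2)."""
    x, y = xy[0], xy[1]
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

xy = torch.tensor([-1.5, 2.0], requires_grad=True)
opt = torch.optim.Adam([xy], lr=1e-2)   # swap in Muon / SOAP / whatever here

for _ in range(5000):
    opt.zero_grad()
    rosenbrock(xy).backward()
    opt.step()

print(xy.detach())  # should approach (1.0, 1.0)
```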
Hi, Bristen is back early, so I'll get back to that. I did some research on Sophia, though; main findings:
- The official implementation of SophiaG makes some weird choices, described here https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1244/final-projects/CaiaMaiCostelloJasonDanielLazar.pdf.
- The levanter implementation does not have SophiaG, only SophiaH.
- There's a quite readable Julia implementation of SophiaH here https://github.com/SciML/Optimization.jl/blob/master/src/sophia.jl.
Adafactor is simple; it's already close to being released officially, see pytorch/pytorch#129905.
When I next get some time, I'll return to this if you haven't.
I mean, for the official version of SophiaG, you can just look at the paper's repo: https://github.com/Liuhong99/Sophia
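For quick reference when reading that repo: the SophiaG step from the paper is an EMA of the gradient divided by a scaled EMA of a Hessian-diagonal estimate, clipped element-wise to [-1, 1]. A paraphrased sketch (not the repo's code; it omits how m and h are maintained):

```python
import torch

def sophiag_update(p, m, h, lr=1e-4, rho=0.04, eps=1e-15, weight_decay=0.0):
    """One SophiaG parameter update, following the paper's formulation."""
    if weight_decay != 0.0:
        p.mul_(1.0 - lr * weight_decay)             # decoupled weight decay
    # clip(m / max(rho * h, eps), -1, 1)
    update = torch.clamp(m / torch.clamp(rho * h, min=eps), -1.0, 1.0)
    p.add_(update, alpha=-lr)
```

If I recall correctly, one of the choices discussed in the report linked above is that the official code also folds a batch-size constant into that denominator, which is worth keeping in mind when comparing hyperparameters.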
- SOAP
- Muon
- Adam-mini
- Lion
- Sophia
- AdEMAMix
- Schedule-Free
- Adafactor
- Signum, signSGD
- Prodigy
- SGDF
- LAMB
- MARS (with grad calculation in a different stochasticity)