facebookresearch / optimizers

For optimization algorithm research and development.

Fails from DeepSpeed #19

Open catid-saronic opened 2 weeks ago

catid-saronic commented 2 weeks ago

Using the latest main branch to train a YoloV9e object detector, I hit the following error:

```
[rank0]:     train_one_epoch(train_loader, model, args, model_dtype)
[rank0]:   File "/mnt/dingus_drive/catid/train_detector/train.py", line 90, in train_one_epoch
[rank0]:     model.step()
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2213, in step
[rank0]:     self._take_model_step(lr_kwargs)
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 2119, in _take_model_step
[rank0]:     self.optimizer.step()
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/deepspeed/runtime/bf16_optimizer.py", line 303, in step
[rank0]:     self.optimizer.step()
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 130, in wrapper
[rank0]:     return func.__get__(opt, opt.__class__)(*args, **kwargs)
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/optim/optimizer.py", line 484, in wrapper
[rank0]:     out = func(*args, **kwargs)
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/distributed_shampoo/distributed_shampoo.py", line 1165, in step
[rank0]:     ].merge_and_block_gradients()
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/distributed_shampoo/utils/shampoo_distributor.py", line 300, in merge_and_block_gradients
[rank0]:     local_masked_blocked_grads = self._merge_and_block_gradients()
[rank0]:   File "/home/saronic/miniconda3/envs/train/lib/python3.10/site-packages/distributed_shampoo/utils/shampoo_distributor.py", line 211, in _merge_and_block_gradients
[rank0]:     grad.view(merged_dims), self._param_group[MAX_PRECONDITIONER_DIM]
[rank0]: RuntimeError: shape '[1728]' is invalid for input of size 7268980
```

It looks like there's an issue with this code when it's called from DeepSpeed?
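To illustrate the failure, here is a minimal sketch using the sizes from the trace. My guess (not confirmed) is that DeepSpeed's BF16 optimizer hands Shampoo a flattened gradient buffer rather than the per-parameter gradient whose merged dimensions Shampoo computed:

```python
import torch

# Sizes taken from the traceback above: Shampoo computed merged dims of
# [1728] for one parameter block, but the gradient tensor it received
# holds 7268980 elements -- presumably a flattened DeepSpeed buffer
# (an assumption on my part, not verified).
grad = torch.zeros(7268980)
merged_dims = [1728]

grad.view(merged_dims)
# RuntimeError: shape '[1728]' is invalid for input of size 7268980
```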

hjmshi commented 1 week ago

Hi @catid-saronic, thanks for your interest in our code! We have not tested our Shampoo implementation with DeepSpeed. For scaling up models, we have preliminary support for FSDP; however, this does require passing some model information to the optimizer.
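For reference, the FSDP path looks roughly like the sketch below, along the lines of our README example. Treat the exact import paths and argument values as approximate for whatever version you have installed, and note that `build_model()` is a hypothetical stand-in for your own model construction:

```python
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

from distributed_shampoo.distributed_shampoo import DistributedShampoo
from distributed_shampoo.shampoo_types import AdamGraftingConfig, FSDPShampooConfig
from distributed_shampoo.utils.shampoo_fsdp_utils import compile_fsdp_parameter_metadata

# build_model() is a placeholder for your own code.
model = FSDP(build_model(), use_orig_params=True)

optimizer = DistributedShampoo(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    epsilon=1e-12,
    max_preconditioner_dim=8192,
    precondition_frequency=100,
    grafting_config=AdamGraftingConfig(beta2=0.999, epsilon=1e-12),
    # This is the "model information" mentioned above: FSDP flattens and
    # shards parameters, so Shampoo needs per-parameter metadata to
    # recover the original tensor shapes for blocking/preconditioning.
    distributed_config=FSDPShampooConfig(
        param_to_metadata=compile_fsdp_parameter_metadata(model),
    ),
)
```

There is no analogous metadata path for DeepSpeed today, which is consistent with the shape mismatch you are seeing.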

If you're interested in getting things working with DeepSpeed, though, we'd be happy to help. Let me know if you have any other questions.