Sure, this is concurrent work with the official implementation.
- I think there is no trade-off in memory usage, but we did not only add hook functions; we also changed the optimization process, for instance computing the backward pass twice to support gradient normalization (a sketch of the hook-based update follows after this list). We are contacting the Hugging Face team to see if we can contribute to their Trainer.
- LOMO is compatible with pipeline parallelism and ZeRO as well, though I think ZeRO's benefit may be reduced when using SGD. In general, any model parallelism written in native PyTorch that uses the default backward pass is compatible with LOMO. Otherwise, some extra effort is needed; for instance, we show the integration with DeepSpeed in this repo. For more support, you can check another project.
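For reference, here is a minimal sketch of the hook-based fused update mentioned in the first point. It is illustrative only and is not the code of the LOMO class in this repo: it uses `Tensor.register_post_accumulate_grad_hook` (PyTorch >= 2.1), applies a plain SGD step inside the hook, and omits the extra backward pass used for gradient normalization. The function name `attach_fused_sgd_hooks` is made up for this example.

```python
import torch

def attach_fused_sgd_hooks(model: torch.nn.Module, lr: float = 1e-3) -> None:
    """Apply a plain SGD step inside each parameter's gradient hook so the
    full set of gradients never has to be held in memory at once."""
    def hook(param: torch.Tensor) -> None:
        # Runs right after param.grad has been accumulated during backward.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place SGD update
        param.grad = None                      # free this gradient immediately

    for p in model.parameters():
        if p.requires_grad:
            # Tensor.register_post_accumulate_grad_hook requires PyTorch >= 2.1.
            p.register_post_accumulate_grad_hook(hook)


# Usage: attach the hooks once, then just run forward + backward; there is no
# separate optimizer.step() call because the update happens inside backward.
model = torch.nn.Linear(16, 4)
attach_fused_sgd_hooks(model, lr=1e-2)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
```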
DeepSpeed itself has a config param called `gradient_clipping`, which seems to be a clip-by-norm setting. Does this config param have any effect on the gradients when using LOMO under the current implementation in this repo?
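For concreteness, this is the kind of DeepSpeed config entry being asked about; the value is just an example.

```python
# Illustrative fragment of a DeepSpeed config; only the key in question is shown.
ds_config = {
    "gradient_clipping": 1.0,  # DeepSpeed's clip-by-norm threshold (example value)
    # ... other DeepSpeed settings ...
}
```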
Hi, this param has no effect; the gradient-related params are set via `clip_grad_norm` and `clip_grad_value` in the Lomo class. `gradient_clipping` is a param for DeepSpeed's optimizer, and since we don't initialize the optimizer with DeepSpeed, it has no effect.
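As a hedged sketch of where clipping is configured instead, assuming the Lomo class exposes `clip_grad_norm` and `clip_grad_value` roughly as constructor arguments (the exact import path and signature in this repo may differ):

```python
import torch
from lomo import Lomo  # assumed import path for this repo's optimizer class

model = torch.nn.Linear(1024, 1024)  # stand-in for the real model

# Clipping is configured on the Lomo side; DeepSpeed's "gradient_clipping"
# key is ignored because DeepSpeed never creates the optimizer here.
optimizer = Lomo(model, lr=1e-3, clip_grad_norm=1.0)     # clip by global norm
# optimizer = Lomo(model, lr=1e-3, clip_grad_value=1.0)  # or clip by value
```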
Thanks.
Dear authors, I notice that in the official implementation of PyTorch, there is an internal implementation of an `optimizer_hook` registry merged into the `main` branch in March 2023. It would be very kind of you to shed light on the following two points:

1. the `opt.step()` style in the official one vs. the implementation of LOMO?
2. `torch.distributed.optim` vs. LOMO in terms of model parallelism?

Thanks!
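For reference, the conventional `opt.step()` style mentioned in the first point looks like the standard loop below; it is only a baseline for the comparison, not code from either project. With the hook-based style sketched earlier in this thread, the update instead happens inside `backward()` and each `.grad` can be freed as soon as it is used.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(16, 4)                            # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

def train_step(batch: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = F.mse_loss(model(batch), targets)
    loss.backward()      # all gradients are materialized and kept in .grad
    optimizer.step()     # the update only happens after backward finishes
    return loss.item()

train_step(torch.randn(8, 16), torch.randn(8, 4))
```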