OpenLMLab / LOMO

LOMO: LOw-Memory Optimization
MIT License

Is LOMO a concurrent work of the official implementation? #43

Closed DesperateExplorer closed 1 year ago

DesperateExplorer commented 1 year ago

Dear authors, I notice that the official PyTorch implementation has an internal optimizer_hook registry, merged into the main branch in March 2023. It would be very kind of you to shed light on the following two points:

  1. The current implementation of LOMO does not support nontrivial learning rate schedulers, while the official implementation seems to. Is there a compatibility vs. memory-efficiency trade-off between the opt.step() style of the official implementation and the implementation of LOMO?
  2. According to my understanding, LOMO is compatible with model parallelism only via tensor parallelism. Is this understanding reasonable, and does LOMO have an edge over the official implementation in torch.distributed.optim in terms of model parallelism?

Thanks!

QipengGuo commented 1 year ago

Sure, this is concurrent work with the official implementation.

  1. I think there is no trade-off in memory usage, but we not only added hook functions, we also changed the optimization process, for instance computing the backward pass twice to support gradient normalization (a minimal sketch of the hook-based fused update is shown after this list). We are contacting the Hugging Face team to see if we can contribute to their trainer.
  2. LOMO is compatible with pipeline parallelism and ZeRO as well. However, I think ZeRO may be degraded when using SGD. In general, any model parallelism written in native PyTorch that uses the default backward pass is compatible with LOMO. Otherwise, some extra effort is needed; for instance, we show the integration of DeepSpeed in this repo. For more support, you can check another project.
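
For illustration, here is a minimal sketch of the hook-based fused update that both LOMO and the official optimizer-in-backward approach rely on: each parameter gets a hook that applies the update and frees the gradient as soon as that gradient is produced, so full gradients for the whole model are never stored at once. This is not the LOMO implementation itself (it omits the second backward pass used for gradient normalization) and assumes PyTorch >= 2.1 for Tensor.register_post_accumulate_grad_hook:

```python
import torch

def attach_fused_sgd_hooks(model: torch.nn.Module, lr: float = 1e-2) -> None:
    """Register per-parameter hooks that apply a plain SGD step during backward."""
    def hook(param: torch.Tensor) -> None:
        # Fires right after param.grad has been accumulated for this parameter.
        with torch.no_grad():
            param.add_(param.grad, alpha=-lr)  # in-place SGD update
        param.grad = None  # free the gradient immediately, keeping peak memory low

    for p in model.parameters():
        if p.requires_grad:
            p.register_post_accumulate_grad_hook(hook)

model = torch.nn.Linear(16, 4)
attach_fused_sgd_hooks(model, lr=1e-2)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()  # parameters are updated and gradients freed during the backward pass
```
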
DesperateExplorer commented 1 year ago

DeepSpeed itself has a config param called "gradient_clipping", which seems to be a clip-by-norm setting. Does this config param have any effect on the gradients when using LOMO under the current implementation in this repo?

KaiLv69 commented 1 year ago

Hi, this param has no effect; the gradient-related params are set via clip_grad_norm and clip_grad_value in the Lomo class. gradient_clipping is a param for DeepSpeed's optimizer, and since we don't initialize the optimizer with DeepSpeed, it has no effect.
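
As an illustration, a minimal sketch of where clipping is configured, assuming the Lomo optimizer takes clip_grad_norm and clip_grad_value as constructor arguments as described above (the exact import path and signature may differ in this repo):

```python
import torch
from lomo import Lomo  # import path assumed; adjust to this repo's layout

model = torch.nn.Linear(16, 4)

# Clipping is handled inside the Lomo optimizer itself, not by DeepSpeed.
optimizer = Lomo(
    model,
    lr=1e-3,
    clip_grad_norm=1.0,    # clip by global gradient norm (uses the extra backward pass)
    clip_grad_value=None,  # or clip gradients elementwise by value
)

# DeepSpeed's "gradient_clipping" applies to an optimizer initialized by DeepSpeed;
# since the optimizer here is Lomo, that config entry has no effect.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_clipping": 1.0,  # ignored when training with Lomo
}
```
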

DesperateExplorer commented 1 year ago

Thanks.