For the code:
https://github.com/THUDM/CogVideo/blob/2fdc59c3ce48aee1ba7572a1c241e5b3090abffa/sat/configs/sft.yaml#L39 , `contiguous_gradients` is a DeepSpeed memory optimization that defaults to True. I am curious why it is set to False in the CogVideoX SFT procedure. We also accidentally discovered that when it is set to True, the loss becomes abnormally large when training on more than 128 GPUs. What was your motivation for disabling it? Could you please share it?
Reference: https://www.deepspeed.ai/docs/config-json/#bfloat16-training-options
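For context, `contiguous_gradients` sits under the `zero_optimization` section of a DeepSpeed config. A minimal sketch of the setting in question (values other than `contiguous_gradients` are illustrative, not taken from the CogVideoX config):

```json
{
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": false,
    "overlap_comm": true
  }
}
```

When `contiguous_gradients` is true, DeepSpeed copies gradients into a single contiguous buffer as they are produced, which reduces memory fragmentation during the backward pass.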