THUDM / CogVideo

text and image to video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Apache License 2.0

Question about one deepspeed-config: contiguous_gradients #531

Open williechai opened 5 days ago

williechai commented 5 days ago

Regarding this line: https://github.com/THUDM/CogVideo/blob/2fdc59c3ce48aee1ba7572a1c241e5b3090abffa/sat/configs/sft.yaml#L39. contiguous_gradients is a DeepSpeed memory optimization that defaults to True. I am curious why it is set to False in the CogVideoX SFT procedure. We also accidentally discovered that when it is set to True, the loss becomes abnormally large when training on more than 128 GPUs. What was your motivation for disabling it? Could you please share it?

Reference: screenshot of the DeepSpeed configuration docs (https://www.deepspeed.ai/docs/config-json/#bfloat16-training-options)
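To compare the two settings in isolation, one option is to generate two configs that differ only in that flag. The snippet below is a minimal sketch, not the repository's actual configuration: `BASE_CONFIG` and the helper `with_contiguous_gradients` are hypothetical, and every value except the `contiguous_gradients` key under `zero_optimization` is a placeholder assumption.

```python
import copy
import json

# Hypothetical baseline config. Only the `contiguous_gradients` key is taken
# from the discussion; the other values are illustrative placeholders.
BASE_CONFIG = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": False,
    },
}


def with_contiguous_gradients(config, enabled):
    """Return a deep copy of `config` with the flag set; the input is not mutated."""
    variant = copy.deepcopy(config)
    zero = variant.setdefault("zero_optimization", {})
    zero["contiguous_gradients"] = bool(enabled)
    return variant


# Two variants that differ in exactly one setting, for an A/B comparison.
config_off = with_contiguous_gradients(BASE_CONFIG, False)
config_on = with_contiguous_gradients(BASE_CONFIG, True)

if __name__ == "__main__":
    print(json.dumps(config_on["zero_optimization"], indent=2))
```

Running both variants with the same seed, data order, and GPU count would help show whether the loss divergence follows the flag itself or some other factor that changes with scale.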

yzy-thu commented 5 days ago

Sorry, we didn't pay attention to this setting before. We've always used it this way. Hope someone can provide an explanation.