Luodian / Otter

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.
https://otter-ntu.github.io/
MIT License

Why does training with DEEPSPEED continuously increase the GPU memory usage? #243

Open — kanshichao opened this issue 1 year ago

kanshichao commented 1 year ago

I set 'distributed_type' in 'accelerate_config_fsdp' to DEEPSPEED. However, during training, GPU memory usage keeps growing until it eventually hits an Out-Of-Memory (OOM) error. What could cause this?

Luodian commented 1 year ago

I guess you should use this file as the DeepSpeed training config:

https://github.com/Luodian/Otter/blob/main/pipeline/accelerate_configs/accelerate_config_zero2.yaml
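For context, an Accelerate config for DeepSpeed ZeRO-2 typically looks roughly like the sketch below. This is a generic illustration of the file format, not the exact contents of the linked `accelerate_config_zero2.yaml` — the actual values (number of processes, precision, gradient accumulation) in that file may differ:

```yaml
# Sketch of a typical Accelerate + DeepSpeed ZeRO-2 config.
# Values here are illustrative assumptions, not copied from the repo.
compute_environment: LOCAL_MACHINE
distributed_type: DEEPSPEED          # use the DeepSpeed launcher, not FSDP
deepspeed_config:
  zero_stage: 2                      # ZeRO-2: shard optimizer states and gradients
  gradient_accumulation_steps: 1
  gradient_clipping: 1.0
  offload_optimizer_device: none     # set to 'cpu' to trade speed for GPU memory
  offload_param_device: none
  zero3_init_flag: false
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```

The practical point of the suggestion is that the FSDP-oriented config file may carry FSDP-specific keys that don't apply under `distributed_type: DEEPSPEED`; launching with a config written for DeepSpeed (e.g. `accelerate launch --config_file <config>.yaml ...`) keeps the launcher and the ZeRO settings consistent.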