Open seongjunyun opened 2 weeks ago
Hi there.
That is a good point. Flora itself is compatible with distributed training like DDP. However, in our implementation, we used LOMO to avoid storing the gradients for all layers, which is not naturally compatible with PyTorch's DDP.
The general idea for using LOMO with distributed training is to synchronize gradients layer by layer (rather than communicating in buckets). We do not have an implementation yet, but we plan to integrate this feature in the future.
Stale due to inactivity. Closing in 3 days if no further activities.
Got it, thanks for the clarification, and looking forward to the integration of that feature!
Hi, first of all, thank you for this cool work! It's impressive and I appreciate the effort you've put into it. I have a question about using Flora with DDP. Have you tried using Flora to train a 7B model with DDP (multi-GPU)? As far as I know, DDP roughly doubles the memory footprint because of the buffers it keeps for gradient synchronization, so a 7B model in bf16 needs around 28GB. That makes it hard to train 7B on 32GB GPUs, even when I tried it with GaLore. I was wondering if you've encountered the same issue; any suggestions would be appreciated. Thanks!
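For reference, a back-of-the-envelope estimate of where the 28GB figure comes from (assuming bf16 weights plus same-size DDP gradient buffers, ignoring activations and optimizer state; the function name is made up):

```python
def ddp_bf16_memory_gb(n_params: float) -> float:
    """Rough per-GPU memory for weights + DDP gradient buffers in bf16."""
    bytes_per_param = 2  # bf16 is 2 bytes per parameter
    weights = n_params * bytes_per_param
    # DDP holds a full set of gradients for its bucketed all-reduce,
    # which is what roughly doubles the footprint.
    grad_buffers = n_params * bytes_per_param
    return (weights + grad_buffers) / 1e9

print(ddp_bf16_memory_gb(7e9))  # 28.0 (GB)
```

This is exactly why a fused, layer-by-layer update (as in LOMO) helps: the full-size gradient buffer never has to exist.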