Open seongjunyun opened 2 weeks ago
Hi there.
That is a good point. Flora itself is compatible with distributed training like DDP. However, in our implementation, we used LOMO to avoid storing the gradients for all layers, which is not naturally compatible with PyTorch's DDP.
The general idea for using LOMO with distributed training is to synchronize gradients layer by layer (rather than communicating in buckets). We do not have an implementation yet, but we plan to integrate this feature in the future.
Stale due to inactivity. Closing in 3 days if no further activities.
Got it, thanks for the clarification, and looking forward to the integration of that feature!
Hi, first of all, thank you for this cool work! It's impressive and I appreciate the effort you've put into it. I have a question about using Flora with DDP. Have you tried using Flora to train a 7B model with DDP (multi-GPU)? As far as I know, DDP roughly doubles the memory footprint because of the buffers it keeps for gradient synchronization, so a 7B model in bf16 needs around 28GB. That makes it hard to train 7B on 32GB GPUs, even when I tried it with GaLore. I was wondering if you've encountered the same issue; any suggestions would be appreciated. Thanks!
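For reference, a back-of-the-envelope estimate of where the 28GB figure comes from (assuming bf16 weights plus same-size DDP gradient buffers, ignoring activations and optimizer state; the function name is made up):

```python
def ddp_bf16_memory_gb(n_params: float) -> float:
    """Rough per-GPU memory for weights + DDP gradient buffers in bf16."""
    bytes_per_param = 2  # bf16 is 2 bytes per parameter
    weights = n_params * bytes_per_param
    # DDP holds a full set of gradients for its bucketed all-reduce,
    # which is what roughly doubles the footprint.
    grad_buffers = n_params * bytes_per_param
    return (weights + grad_buffers) / 1e9

print(ddp_bf16_memory_gb(7e9))  # 28.0 (GB)
```

This is exactly why a fused, layer-by-layer update (as in LOMO) helps: the full-size gradient buffer never has to exist.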