OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0

[Feature] performance problem #193

Open Xiang-cd opened 5 months ago

Xiang-cd commented 5 months ago

Is your feature request related to a problem? Please describe.

Great work! I have a small question: I noticed the README includes a table of throughput and memory usage in which BMTrain significantly outperforms Megatron-DeepSpeed, and I am curious where this optimization mainly comes from. By the same logic, why can BMTrain support a larger batch size and achieve higher throughput? Could GPU configuration also play a role (would an SXM machine's higher bandwidth erase this gap)? I think an optimization of this magnitude is absolutely work at the level of a top systems conference; would the authors be interested in analyzing these optimization points and writing them up as a paper? I would really like to use the BMTrain framework, but only seeing that it performs well without knowing why leaves me uneasy.

Describe the solution you'd like

Same as above.

Describe alternatives you've considered

No response

Additional context

No response

Pegessi commented 3 months ago

I believe this work is remarkable for its combination of memory optimization and parallelism, and it delivers notably higher throughput. However, one shortcoming is that the experiments in this paper that use Megatron-DeepSpeed as the baseline do not enable PTD-P or the memory optimizations from Megatron-LM. Furthermore, the default ZeRO-3 configuration is inefficient because it enables CPU offloading when running on few GPUs, which has been discussed in several blog posts.
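For reference, here is a minimal sketch of what a ZeRO-3 setup with CPU offloading explicitly disabled might look like. The `zero_optimization`, `offload_optimizer`, and `offload_param` keys are standard DeepSpeed config options; the stand-in model and the batch/fp16 settings are hypothetical values chosen only for illustration, not the configuration used in the paper:

```python
# Sketch: DeepSpeed ZeRO-3 with parameters and optimizer state kept on GPU.
# On a node with only a few GPUs, CPU offloading can make host<->device
# copies dominate, so it is disabled here.
# Launch with: deepspeed --num_gpus=8 this_script.py
import deepspeed
import torch

ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # illustrative value
    "zero_optimization": {
        "stage": 3,
        # Many default/tutorial configs set these to "cpu"; "none" keeps
        # everything on the GPU and avoids the PCIe round trips.
        "offload_optimizer": {"device": "none"},
        "offload_param": {"device": "none"},
    },
    "fp16": {"enabled": True},
}

model = torch.nn.Linear(1024, 1024)  # stand-in for a real transformer
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```

With offloading disabled like this, a ZeRO-3 baseline should be a fairer point of comparison against BMTrain on small GPU counts.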