[Feature] performance problem

Is your feature request related to a problem? Please describe.

非常赞赏学长们的工作！我有一个小小的问题注意到readme里有一个吞吐和显存占用的表格。BMtrain显著优于Deepspeed- megaton，我好奇这其中的优化主要来源于什么地方呢。同样的逻辑，为什么我们能够支持更多的bach size，吞吐更高？是否也有显卡配置的原因呢（sxm的机器是不是会因为高带宽抹去这样的差距）。我觉得做到这样的优化绝对是系统顶会级别的工作，学长们有兴趣分析这其中的优化点并总结成文章投稿吗。其实我非常希望使用BMtrain的框架，但是只看到其中的好，不知道为什么好，心里就不踏实。

Describe the solution you'd like

同上

Describe alternatives you've considered

No response

Additional context

No response

OpenBMB / BMTrain

[Feature] performance problem #193

Is your feature request related to a problem? Please describe.

Describe the solution you'd like

Describe alternatives you've considered

Additional context