OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0
Tensor Parallel #153

Closed: zkh2016 closed this 1 year ago

zkh2016 commented 1 year ago

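Main changes in this PR:

  1. Add a tensor parallel (TP) mode: https://github.com/OpenBMB/BMTrain/issues/149
  2. Modify the topology to support combinations of PP, TP, and ZeRO
  3. Modify the parameter-related code to work in TP mode
  4. PP: remove the separate parameter-partitioning logic and reuse CheckpointBlock's parameter partitioning
  5. Adapt save/load to TP mode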
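
To illustrate what combining PP, TP, and ZeRO in one topology involves, here is a minimal sketch of mapping a global rank to a coordinate along each parallel axis. It is not BMTrain's actual layout: the names `Coord`, `rank_to_coord`, and `group_of`, and the ordering (TP varies fastest, then ZeRO, then PP) are assumptions made for illustration only.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    pp: int    # pipeline-parallel stage index
    zero: int  # index inside the ZeRO (data-parallel) group
    tp: int    # tensor-parallel index


def rank_to_coord(rank: int, pp_size: int, tp_size: int, world_size: int) -> Coord:
    """Map a global rank to (pp, zero, tp) under the ASSUMED layout.

    Assumed layout (hypothetical, not taken from the PR):
    tp varies fastest, then zero, then pp.
    """
    if world_size % (pp_size * tp_size) != 0:
        raise ValueError("world_size must be divisible by pp_size * tp_size")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    # Whatever is left after PP and TP is used as the ZeRO group size.
    zero_size = world_size // (pp_size * tp_size)
    tp = rank % tp_size
    zero = (rank // tp_size) % zero_size
    pp = rank // (tp_size * zero_size)
    return Coord(pp=pp, zero=zero, tp=tp)


def group_of(rank: int, axis: str, pp_size: int, tp_size: int, world_size: int) -> List[int]:
    """Return the ranks that differ from `rank` only along `axis`.

    These are the ranks that would share one communicator for that axis.
    """
    if axis not in Coord._fields:
        raise ValueError("axis must be one of %s" % (Coord._fields,))
    me = rank_to_coord(rank, pp_size, tp_size, world_size)
    others = [f for f in Coord._fields if f != axis]
    group = []
    for r in range(world_size):
        c = rank_to_coord(r, pp_size, tp_size, world_size)
        if all(getattr(c, f) == getattr(me, f) for f in others):
            group.append(r)
    return group
```

With 8 ranks, `pp_size=2`, and `tp_size=2`, the remaining factor of 2 becomes the ZeRO group size, so every rank belongs to exactly one group per axis. A real implementation would create one communicator per group rather than recomputing the membership on demand.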

TODO:

  1. Optimize linear: the communication in the backward pass can be overlapped with computation
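
The following single-process sketch shows why such an overlap is possible. It assumes a column-parallel linear layer (the weight split along the output dimension, as in Megatron-style tensor parallelism); the PR does not say which split BMTrain uses, so treat the layout and all function names as hypothetical. The ranks are simulated with plain lists, and the all-reduce is simulated with a sum.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def column_parallel_backward(x, w_shards, grad_y_shards):
    """Simulated backward pass of y_i = x @ W_i^T on each TP rank i.

    x:                [batch, in]     (replicated on every rank)
    w_shards[i]:      [out_i, in]     (rank i's slice of the weight)
    grad_y_shards[i]: [batch, out_i]  (rank i's slice of the output gradient)
    """
    # Step 1 (local): each rank computes its partial input gradient.
    partials = [matmul(gy, w) for gy, w in zip(grad_y_shards, w_shards)]

    # Step 2 (communication): the partials are summed across ranks.
    # In a real run this is the all-reduce that could be launched asynchronously.
    grad_x = partials[0]
    for p in partials[1:]:
        grad_x = add(grad_x, p)

    # Step 3 (local): the weight gradient uses only grad_y_i and x.
    # It does not read grad_x, so it can run while step 2 is in flight.
    grad_w_shards = [matmul(transpose(gy), x) for gy in grad_y_shards]
    return grad_x, grad_w_shards
```

Because step 3 has no data dependency on the result of step 2, a distributed implementation could start the all-reduce in non-blocking mode (for example, `torch.distributed.all_reduce` with `async_op=True`), compute the weight gradient, and wait on the communication handle only afterwards.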