Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0

Testing the parallel framework: tensor-parallel results do not match the figures published on the official site #477

Closed: lisuq closed this issue 1 year ago

lisuq commented 1 year ago

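I am testing the parallel framework on a server with 8 × V100-32GB GPUs. The pipeline-parallel results are largely consistent with the figures published on the official site, but the tensor-parallel results are far off. What could be the reason? I used the configuration from https://github.com/Oneflow-Inc/OneAutoTest/tree/main/libai and ran the script as follows:

bash tools/args_libai_gpt2.sh configs/gpt2_nl24_nah16_hs1024.py 1 8 0 127.0.0.1 8 1 true false 8 8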
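When comparing against published numbers, it can help to first confirm what parallel layout the command actually requests. The following is a minimal sketch, not part of LiBai or OneAutoTest. It assumes the positional arguments after the config path are, in order: number of nodes, GPUs per node, node rank, master address, tensor-parallel size, pipeline-parallel size, fp16 flag, activation-checkpointing flag, micro batch size, global batch size. That ordering is an assumption and should be checked against tools/args_libai_gpt2.sh. `ParallelLayout` and `parse_layout` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ParallelLayout:
    """Hypothetical helper; not part of LiBai or OneAutoTest."""
    num_nodes: int
    gpus_per_node: int
    tensor_parallel: int
    pipeline_parallel: int
    micro_batch_size: int
    global_batch_size: int

    @property
    def world_size(self) -> int:
        # Total number of GPUs taking part in the run.
        return self.num_nodes * self.gpus_per_node

    @property
    def data_parallel(self) -> int:
        # GPUs left over after tensor and pipeline parallelism form the
        # data-parallel dimension.
        model_parallel = self.tensor_parallel * self.pipeline_parallel
        if self.world_size % model_parallel != 0:
            raise ValueError("world size is not divisible by TP * PP")
        return self.world_size // model_parallel

    @property
    def accumulation_steps(self) -> int:
        # Micro-batches each data-parallel replica processes per step.
        per_step = self.micro_batch_size * self.data_parallel
        if self.global_batch_size % per_step != 0:
            raise ValueError("global batch is not divisible by micro batch * DP")
        return self.global_batch_size // per_step


def parse_layout(args: List[str]) -> ParallelLayout:
    # ASSUMED argument order (verify against the script):
    # nnodes, gpus_per_node, node_rank, master_addr,
    # tensor_parallel, pipeline_parallel, use_fp16,
    # activation_checkpoint, micro_batch_size, global_batch_size
    return ParallelLayout(
        num_nodes=int(args[0]),
        gpus_per_node=int(args[1]),
        tensor_parallel=int(args[4]),
        pipeline_parallel=int(args[5]),
        micro_batch_size=int(args[8]),
        global_batch_size=int(args[9]),
    )


layout = parse_layout("1 8 0 127.0.0.1 8 1 true false 8 8".split())
print(layout.world_size, layout.data_parallel, layout.accumulation_steps)
```

Under the assumed ordering, the command above requests 8-way tensor parallelism with no pipeline or data parallelism, so a mismatch with a reference run may come from a different batch size or precision setting rather than from the parallel layout itself.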

Experimental results: 011937b09e2cbabad2892d96c61fd53