Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0
389 stars, 55 forks

MT5 with pure model parallelism on 8 GPUs fails when running in graph mode #409

Closed. Ldpe2G closed this issue 1 year ago

Ldpe2G commented 1 year ago

Problem description

oneflow version: '0.8.1.dev20221023+cu112'; libai branch to reproduce the problem: dev_optimize_MT5

Command:

bash tools/train.sh tools/train_net.py projects/MT5/configs/mt5_pretrain.py 8

Error message:

F20221024 16:00:04.695173 2661978 exec_graph.cpp:126] ((16,512,64) vs (16,512,96)) 
Error message from /home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp:100
    CheckPhysicalBlobDesc(*JUST(GetLogicalBlobDesc(bn)), nd_sbp_signature->bn_in_op2nd_sbp().at(bn), *op_parallel_desc, parallel_ctx, *physical_blob_desc):  check physical shape failed, op name Python Stack[-2]: 'forward' at '/home/ldp/libai/projects/MT5/layers/transformer_layer.py': line 178; Python Stack[-1]: 'forward' at '/home/ldp/libai/projects/MT5/layers/attention_layer.py': line 237;  ... more

  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp", line 126, in InferBlobDescs
    CheckPhysicalBlobDesc( *op(), op()->output_bns(), std ... nd_sbp_signature, parallel_ctx, GetBlobDesc4BnInOp)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp", line 100, in CheckPhysicalBlobDesc
    CheckPhysicalBlobDesc(*JUST(GetLogicalBlobDesc(bn)), nd_sbp_signature->bn_in_op2nd_sbp().at(bn), *op_parallel_desc, parallel_ctx, *physical_blob_desc)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/graph/exec_graph.cpp", line 80, in CheckPhysicalBlobDesc
strint commented 1 year ago

Related issue: https://github.com/Oneflow-Inc/libai/issues/405

strint commented 1 year ago

What does the config file look like?

Can it be reproduced on 2 GPUs?

Ldpe2G commented 1 year ago

What does the config file look like?

Can it be reproduced on 2 GPUs?

It runs fine on 2 GPUs and on 4 GPUs; the error only occurs on 8 GPUs.

leaves-zwx commented 1 year ago
context shape and sbp: oneflow.Size([16, 128, 12, 64]), (oneflow.sbp.split(dim=3),)
context = context.flatten(2)
context shape and sbp: oneflow.Size([16, 128, 768]), (oneflow.sbp.split(dim=0),)
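
To see why the SBP reported after `flatten` matters, it can help to look at which elements each rank would hold once the last two axes are merged. The following is a plain-Python sketch, not OneFlow's `GetSbp` logic; the helper names are hypothetical, and it assumes the split axis is evenly divisible by the number of devices. It maps one rank's shard of the `[12, 64]` (heads, head size) axes to indices on the merged 768-wide axis and checks whether they form a single contiguous block.

```python
def flat_indices_of_shard(num_heads, head_size, split_axis, num_devices, rank):
    """Indices on the merged (heads * head_size) axis held by one rank.

    split_axis is "heads" or "head_size" (hypothetical labels for this sketch).
    Assumes the split axis is evenly divisible by num_devices.
    """
    if split_axis == "heads":
        per_rank = num_heads // num_devices
        heads = range(rank * per_rank, (rank + 1) * per_rank)
        cols = range(head_size)
    else:
        per_rank = head_size // num_devices
        heads = range(num_heads)
        cols = range(rank * per_rank, (rank + 1) * per_rank)
    # Row-major flatten: element (h, d) lands at index h * head_size + d.
    return [h * head_size + d for h in heads for d in cols]


def is_contiguous(indices):
    """True if the indices form one unbroken ascending range."""
    return indices == list(range(indices[0], indices[0] + len(indices)))


# Split over heads (4 devices): rank 1 holds heads 3..5.
by_heads = flat_indices_of_shard(12, 64, "heads", 4, 1)
# Split over head size (8 devices), like split(dim=3) above: rank 0 holds columns 0..7 of every head.
by_head_size = flat_indices_of_shard(12, 64, "head_size", 8, 0)

print(is_contiguous(by_heads))      # True
print(is_contiguous(by_head_size))  # False
```

A shard taken over the heads axis maps to one contiguous slice of the merged axis, so it can carry over as a split on that axis. A shard taken over the head-size axis is scattered across the merged axis, so it cannot be described as a simple split of the flattened dimension.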
leaves-zwx commented 1 year ago

The GetSbp of flatten is probably returning a wrong result; I'll fix it.

leaves-zwx commented 1 year ago

Could you check whether this fix resolves the problem? https://github.com/Oneflow-Inc/oneflow/pull/9322

Yipeng1994 commented 1 year ago

I commented on the PR: the problem is that the number of attention heads is not evenly divisible. 12 is not divisible by 8.
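
The divisibility point can be illustrated with a small check. This is only a sketch: `heads_per_rank` is a hypothetical helper, not part of LiBai or OneFlow, and it assumes the heads are meant to be distributed evenly across the model-parallel ranks. With 12 heads it is consistent with the report that 2 and 4 GPUs run while 8 GPUs fail.

```python
def heads_per_rank(num_heads, model_parallel_size):
    """Return how many attention heads each rank gets, or raise if uneven.

    Hypothetical helper for illustration; not a LiBai or OneFlow API.
    """
    if num_heads % model_parallel_size != 0:
        raise ValueError(
            f"{num_heads} heads cannot be split evenly across "
            f"{model_parallel_size} ranks"
        )
    return num_heads // model_parallel_size


for size in (2, 4, 8):
    try:
        print(size, "ranks:", heads_per_rank(12, size), "heads each")
    except ValueError as err:
        print(size, "ranks:", err)
```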

Yipeng1994 commented 1 year ago

https://github.com/Oneflow-Inc/oneflow/pull/9323 This should solve your problem.

Ldpe2G commented 1 year ago

Oneflow-Inc/oneflow#9323 This should solve your problem.

Pure model parallelism on 8 GPUs runs now.

leaves-zwx commented 1 year ago

https://github.com/Oneflow-Inc/oneflow/pull/9322 Could you pull the latest commit and try again? Does it run now?

Ldpe2G commented 1 year ago

Oneflow-Inc/oneflow#9322 Could you pull the latest commit and try again? Does it run now?

It fails with an error:

  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/home/ldp/libai/projects/MT5/mt5_model.py", line 345, in forward
    logits = self.mt5_model(
  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 248, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/home/ldp/libai/projects/MT5/mt5_model.py", line 209, in forward
    enc_hidden_states, position_bias = layer(
  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 248, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/home/ldp/libai/projects/MT5/layers/transformer_layer.py", line 178, in forward
    attention_output, position_bias = self.self_attention(
  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 248, in __call__
    result = self.__block_forward(*args, **kwargs)
  File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
  File "/home/ldp/libai/projects/MT5/layers/attention_layer.py", line 236, in forward
    context = context.flatten(2)
RuntimeError: shape '(16,32768)' is invalid for input of size 6291456
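
The numbers in this RuntimeError can be checked by hand. The sketch below is only arithmetic, under the assumption that the tensor being flattened has batch size 16, sequence length 512 (taken from the first error message), 12 heads and head size 64 (taken from the shape printed earlier in the thread). Under that assumption the input size matches exactly, and the rejected target shape is short by a factor equal to the number of heads.

```python
import math

# Assumed dimensions: batch and sequence length from the first error message,
# heads and head size from the shape printed earlier in the thread.
batch, seq_len, num_heads, head_size = 16, 512, 12, 64

input_size = math.prod((batch, seq_len, num_heads, head_size))
target_size = math.prod((16, 32768))  # the shape rejected in the RuntimeError

print(input_size)                 # 6291456
print(target_size)                # 524288
print(input_size // target_size)  # 12
print(seq_len * head_size)        # 32768
```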
leaves-zwx commented 1 year ago

https://github.com/Oneflow-Inc/oneflow/pull/9322 The problem has been fixed, and the test now runs successfully.