Closed: Ldpe2G closed this issue 2 years ago
Related issue: https://github.com/Oneflow-Inc/libai/issues/405
What does the config file look like?
Can you reproduce it on 2 GPUs?
It runs fine on both 2 and 4 GPUs; the error only appears with 8 GPUs.
context shape and sbp: oneflow.Size([16, 128, 12, 64]), (oneflow.sbp.split(dim=3),)
context = context.flatten(2)
context shape and sbp: oneflow.Size([16, 128, 768]), (oneflow.sbp.split(dim=0),)
It looks like flatten's GetSbp inference is giving the wrong result; I'll fix it.
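To see why SBP inference for flatten is subtle here, this is a minimal pure-Python sketch (not OneFlow code; all names are illustrative) of why a tensor split on dim=3 cannot naively keep a "split" signature after `flatten(2)` merges dims 2 and 3:

```python
# Sketch: global tensor [batch, seq, heads, head_dim] split on dim=3
# across 4 ranks, then flatten(2) applied rank-locally.
world_size = 4
B, S, H, D = 16, 128, 12, 64          # matches the logged shape

# Identify each global element by its flat index.
def global_elem(b, s, h, d):
    return ((b * S + s) * H + h) * D + d

# split(dim=3): rank r owns the head_dim slice [r*chunk, (r+1)*chunk).
chunk = D // world_size

# If each rank flattens its local tensor and the results are stitched
# back along the merged dim=2 (i.e. treated as split(dim=2)), then
# element (b, s, k) of the stitched tensor comes from:
def stitched_elem(b, s, k):
    r, j = divmod(k, H * chunk)       # owning rank, offset within its block
    h, d = divmod(j, chunk)
    return global_elem(b, s, h, r * chunk + d)

# The true flatten(2) of the global tensor is just contiguous:
def flat_elem(b, s, k):
    h, d = divmod(k, D)
    return global_elem(b, s, h, d)

# They disagree: head_dim chunks from different ranks interleave inside
# each head, so the output is NOT a plain split of any dim -- GetSbp
# must not report a split signature it cannot actually honor.
mismatch = any(stitched_elem(0, 0, k) != flat_elem(0, 0, k)
               for k in range(H * D))
assert mismatch
```

This is only a model of the layout arithmetic; the actual fix lives in the op's SBP signature inference in the linked PR.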
Can you check whether this fix resolves the issue? https://github.com/Oneflow-Inc/oneflow/pull/9322
I commented in the PR: it's the head-partition divisibility problem, since 12 heads cannot be evenly divided across 8 GPUs.
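The divisibility constraint can be made explicit with a small sanity check. This is a hypothetical helper (`check_head_partition` is not a LibAI API), assuming heads are partitioned across the tensor-parallel group:

```python
# Under tensor model parallelism, attention heads are sharded across
# devices, so num_heads must be divisible by the tensor-parallel size.
def check_head_partition(num_heads: int, tensor_parallel_size: int) -> None:
    if num_heads % tensor_parallel_size != 0:
        raise ValueError(
            f"num_heads ({num_heads}) must be divisible by the "
            f"tensor-parallel size ({tensor_parallel_size})"
        )

check_head_partition(12, 2)       # ok: matches the working 2-GPU run
check_head_partition(12, 4)       # ok: matches the working 4-GPU run
try:
    check_head_partition(12, 8)   # 12 % 8 != 0: the failing 8-GPU case
except ValueError as e:
    print(e)
```

Failing fast at config time like this would surface the problem before it reappears as a confusing reshape error deep inside the attention layer.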
https://github.com/Oneflow-Inc/oneflow/pull/9323 should solve your problem.
Pure model parallelism on 8 GPUs runs now.
Can you pull the latest commit of https://github.com/Oneflow-Inc/oneflow/pull/9322 and try again? Does it run now?
It still errors:
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
File "/home/ldp/libai/projects/MT5/mt5_model.py", line 345, in forward
    logits = self.mt5_model(
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 248, in __call__
    result = self.__block_forward(*args, **kwargs)
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
File "/home/ldp/libai/projects/MT5/mt5_model.py", line 209, in forward
    enc_hidden_states, position_bias = layer(
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 248, in __call__
    result = self.__block_forward(*args, **kwargs)
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
File "/home/ldp/libai/projects/MT5/layers/transformer_layer.py", line 178, in forward
    attention_output, position_bias = self.self_attention(
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 248, in __call__
    result = self.__block_forward(*args, **kwargs)
File "/home/ldp/oneflow/python/oneflow/nn/graph/block.py", line 280, in __block_forward
    result = self._origin.__class__.forward(self, *args, **kwargs)
File "/home/ldp/libai/projects/MT5/layers/attention_layer.py", line 236, in forward
    context = context.flatten(2)
RuntimeError: shape '(16,32768)' is invalid for input of size 6291456
The problem has been fixed by https://github.com/Oneflow-Inc/oneflow/pull/9322; the test now passes.
Problem description
oneflow version:
'0.8.1.dev20221023+cu112'
libai branch to reproduce the issue: dev_optimize_MT5
Run command:
Error message: