Open ShawnXuan opened 3 months ago
Graph test script: train_graph_distributed_fp32.sh
Running it fails with the following error:
[ERROR](GRAPH:TrainGraph_0:TrainGraph) building plan got error.
Traceback (most recent call last):
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 390, in <module>
    trainer()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 223, in __call__
    self.train()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 228, in train
    self.train_one_epoch()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 248, in train_one_epoch
    loss, pred, label = self.train_graph()
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 284, in __call__
    self._compile(*args, **kwargs)
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 852, in _compile
    return self._compile_new(*args, **kwargs)
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 876, in _compile_new
    self.finish_compile_and_init_runtime()
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 1427, in finish_compile_and_init_runtime
    self._c_nn_graph.compile_plan_for_runtime()
oneflow._oneflow_internal.exception.RuntimeError: Error: TaskType: 1, DeviceType: 6 has not been registered
After merging the oneflow-npu PR that adds graph support (https://github.com/Oneflow-Inc/oneflow-npu/pull/217), the error is as follows:
Stack trace (most recent call last) in thread 3030764:
Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed5bbca17, in
Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed5bbac2b, in Thread::PollMsgChannel()
Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed57eb8d7, in
Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed549b0c7, in Kernel::Launch(KernelContext*) const
Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed549aa63, in Kernel::Forward(KernelContext*) const
Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed54e08b3, in UserKernel::ForwardDataContent(KernelContext*) const
Object "/data1/home/zengqunhong/oneflow-npu/build/temp.linux-aarch64-cpython-39/oneflow_npu/liboneflow_npu.so", at 0xfffe6c5d0678, in
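Comparing the training output of 4x 3090 against 4x 910B shows that the two behave differently: with 4 NPU cards the same data is printed 4 times, whereas the 3090 cards each print their own separate output.
910B output:
3090 output: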
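As a possible way to check the duplicated-output observation on captured logs, here is a minimal sketch in plain Python. It assumes the stdout of all ranks has been collected into one list of lines. The helper name `lines_repeated_by_every_rank` and the sample strings are hypothetical and not taken from the actual training output.

```python
from collections import Counter


def lines_repeated_by_every_rank(lines, world_size):
    """Return the distinct lines that appear at least `world_size` times.

    Hypothetical helper: if every rank prints the same data, each line
    shows up once per rank in the combined log.
    """
    # Ignore blank lines and surrounding whitespace before counting.
    counts = Counter(line.strip() for line in lines if line.strip())
    return [line for line, n in counts.items() if n >= world_size]


if __name__ == "__main__":
    # Made-up placeholder lines, only to illustrate the two patterns.
    replicated = ["step 1 sample-a"] * 4
    sharded = ["step 1 sample-a", "step 1 sample-b",
               "step 1 sample-c", "step 1 sample-d"]
    print(lines_repeated_by_every_rank(replicated, world_size=4))
    print(lines_repeated_by_every_rank(sharded, world_size=4))
```

A non-empty result for the 910B log and an empty one for the 3090 log would match the behavior described above. It would not by itself show whether the data is actually replicated or only the printing is.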