Oneflow-Inc / models

Models and examples built with OneFlow
Apache License 2.0

Resnet device #410

Open ShawnXuan opened 3 months ago

xiaohoua commented 2 months ago

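Comparing the training output of a 4-card 3090 run with a 4-card 910B run shows that the two log differently: with four NPU cards the same data is printed four times, whereas the 3090 run prints its output separately. 910B output: [image] 3090 output: [image]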
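One way to check whether the four identical lines come from four separate ranks is to tag every log line with the rank that produced it. The following is a minimal sketch, not code from this repository: `rank_tag` and `log` are hypothetical helpers, and it assumes the launcher exports `RANK` and `WORLD_SIZE` environment variables to each worker process.

```python
import os


def rank_tag(env=None):
    """Build a prefix such as "[rank 2/4]" from launcher environment variables.

    Assumption: the distributed launcher sets RANK and WORLD_SIZE for each
    worker. If they are missing, fall back to a single-process view.
    """
    env = os.environ if env is None else env
    rank = env.get("RANK", "0")
    world_size = env.get("WORLD_SIZE", "1")
    return f"[rank {rank}/{world_size}]"


def log(msg, env=None):
    """Print msg prefixed with the rank tag and return the printed line."""
    line = f"{rank_tag(env)} {msg}"
    print(line, flush=True)
    return line
```

If the duplicated lines then show four different rank tags, each process is printing its own copy. If all four carry the same tag, the duplication happens inside a single process.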

0x404 commented 1 month ago

Graph test script: train_graph_distributed_fp32.sh

Running it produces the following error:

[ERROR](GRAPH:TrainGraph_0:TrainGraph) building plan got error.
Traceback (most recent call last):
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 390, in <module>
    trainer()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 223, in __call__
    self.train()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 228, in train
    self.train_one_epoch()
  File "/data1/home/zengqunhong/models/Vision/classification/image/resnet50/train.py", line 248, in train_one_epoch
    loss, pred, label = self.train_graph()
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 284, in __call__
    self._compile(*args, **kwargs)
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 852, in _compile
    return self._compile_new(*args, **kwargs)
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 876, in _compile_new
    self.finish_compile_and_init_runtime()
  File "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/nn/graph/graph.py", line 1427, in finish_compile_and_init_runtime
    self._c_nn_graph.compile_plan_for_runtime()
oneflow._oneflow_internal.exception.RuntimeError: Error: TaskType: 1, DeviceType: 6 has not been registered

With the oneflow-npu PR that adds graph support merged (https://github.com/Oneflow-Inc/oneflow-npu/pull/217), the error is as follows:

Stack trace (most recent call last) in thread 3030764:
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed5bbca17, in 
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed5bbac2b, in Thread::PollMsgChannel()
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed57eb8d7, in 
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed549b0c7, in Kernel::Launch(KernelContext*) const
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed549aa63, in Kernel::Forward(KernelContext*) const
   Object "/data1/home/zengqunhong/miniconda3/envs/torchnpu/lib/python3.9/site-packages/oneflow/../oneflow.libs/liboneflow-1561515c.so", at 0xfffed54e08b3, in UserKernel::ForwardDataContent(KernelContext*) const
   Object "/data1/home/zengqunhong/oneflow-npu/build/temp.linux-aarch64-cpython-39/oneflow_npu/liboneflow_npu.so", at 0xfffe6c5d0678, in