Oneflow-Inc / libai

LiBai(李白): A Toolbox for Large-Scale Distributed Parallel Training
https://libai.readthedocs.io
Apache License 2.0

Notes on DETR result-alignment experiments #288

Open HiHippie opened 2 years ago

HiHippie commented 2 years ago

Eager global model parallelism

Parameter alignment against: https://github.com/facebookresearch/detr

Troubleshooting TODO list:

HiHippie commented 2 years ago

Reproduction results for DETR (https://github.com/facebookresearch/detr)

| name | backbone | box AP |
|------|----------|--------|
| DETR | ResNet50 | 42.0 |
| libai DETR | ResNet50 | 25.9 |
| libai DETR, after fixing the resnet50 weight-loading bug | ResNet50 | 29.7 |
| libai DETR, after fixing the MultiHeadAttention implementation | ResNet50 | 42.0 |

DETR

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.420
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.624
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.442
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.205
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.611
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.574
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.312
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.629
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.805 

libai DETR

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.259
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.487
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.242
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.083
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.247
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.464
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.248
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.380
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.412
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.163
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.421
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.681

libai DETR, after fixing the resnet50 bug

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.297
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.525
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.283
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.107
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.294
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.268
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.418
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.453
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.200
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.474
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713

libai DETR after the bug fixes, loaded with the torch weights; the results match the original paper

IoU metric: bbox
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.420
 Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.624
 Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.442
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.205
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.611
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.333
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.533
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.574
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.312
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.629
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.805

@rentainhe @Ldpe2G Inference is aligned now. I didn't hit any oneflow or libai bugs; it mostly came down to fixing implementation details.

To load the torch weights correctly, my attention implementation borrows heavily from torch.nn.MultiheadAttention, which feels a bit removed from libai's style; I'll clean it up this week.
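
For reference, the weight layout the loading code has to match is torch.nn.MultiheadAttention's fused projection. A minimal sketch of splitting it into separate q/k/v pieces (the split follows torch's documented layout; the attention module that would consume the slices is left out on purpose, since the libai-side layout is still being reworked):

import numpy as np

def split_in_proj(in_proj_weight, in_proj_bias):
    """Split torch.nn.MultiheadAttention's fused q/k/v projection.

    torch stacks the three projections along dim 0:
      in_proj_weight: [3*E, E], in_proj_bias: [3*E]
    (out_proj.weight / out_proj.bias are already separate).
    """
    E = in_proj_weight.shape[1]
    q_w, k_w, v_w = (in_proj_weight[i * E:(i + 1) * E] for i in range(3))
    q_b, k_b, v_b = (in_proj_bias[i * E:(i + 1) * E] for i in range(3))
    return (q_w, q_b), (k_w, k_b), (v_w, v_b)

# toy check with random "weights"
E = 4
(q_w, q_b), _, _ = split_in_proj(np.random.randn(3 * E, E), np.random.randn(3 * E))
assert q_w.shape == (E, E) and q_b.shape == (E,)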

Ldpe2G commented 2 years ago

Which backbone are these results for? Could you list them in a table, like https://github.com/facebookresearch/detr#model-zoo?

HiHippie commented 2 years ago

Which backbone are these results for? Could you list them in a table, like https://github.com/facebookresearch/detr#model-zoo?

OK

rentainhe commented 2 years ago

Are these inference results?

HiHippie commented 2 years ago

Are these inference results?

Yes. Today I found that my multi-head attention implementation doesn't match torch.nn.MultiheadAttention (which the DETR source code uses); that is likely the problem, and I'm revising the code now.

rentainhe commented 2 years ago

Are these inference results?

Yes. Today I found that my multi-head attention implementation doesn't match torch.nn.MultiheadAttention (which the DETR source code uses); that is likely the problem, and I'm revising the code now.

OKOK~

HiHippie commented 2 years ago

Investigating why, for certain input shapes, loss.backward fails with "F20220602 14:17:25.050042 15603 shape.cpp:187] Check failed: !broadcast_axis_vec.empty()".

The problem has been traced to an oneflow bug in min/max, hit in projects/DETR/utils/box_ops.py:

# projects/DETR/utils/box_ops.py (box_iou is defined earlier in the same file)
import oneflow as flow

def generalized_box_iou(boxes1, boxes2):
    """
    Generalized IoU from https://giou.stanford.edu/

    The boxes should be in [x0, y0, x1, y1] format

    Returns a [N, M] pairwise matrix, where N = len(boxes1)
    and M = len(boxes2)
    """
    # degenerate boxes gives inf / nan results
    # so do an early check
    assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
    assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
    iou, union = box_iou(boxes1, boxes2)
    lt = flow.min(boxes1[:, None, :2], boxes2[:, :2])
    rb = flow.max(boxes1[:, None, 2:], boxes2[:, 2:])

    wh = (rb - lt).clamp(min=0)  # [N,M,2]
    area = wh[:, :, 0] * wh[:, :, 1]

    return iou - (area - union) / area

Minimal reproduction (using flow.max as the example; flow.min behaves the same).

Versions:

>>> flow.__version__
'0.8.0.dev20220606+cu112'

>>> torch.__version__
'1.11.0+cu113'
# import torch
import oneflow as torch

class Net(torch.nn.Module):

    def __init__(self):
        super().__init__()

        self.linear = torch.nn.Linear(10,10)

    def forward(self, x, z):

        x = self.linear(x)
        '''
        My scenario:
        the input shapes are dynamic -- sometimes x and z have the same shape, sometimes not.
        Following the torch code, I insert an extra axis to unify the dimensions,
        so the two tensors can be compared even when their shapes differ.
        The oneflow problem: if the shapes are x -> [1, d] and z -> [1, d] there is a bug,
        but if the first dimension is not 1 there is no bug.
        The test cases below cover these three situations.
        '''
        # when x and z have different shapes, x needs the extra axis before max/min
        h = torch.max(x[:, None, :], z)
        # h = torch.min(x[:, None, :], z)

        return h.mean()

net = Net()

# case: shapes differ
# x = torch.randn(15, 10)
# z = torch.randn(5, 10)

# case: shapes match
# x = torch.randn(15, 10)
# z = torch.randn(15, 10)

# ! the case where oneflow breaks:
# shapes match but are [1, d]
x = torch.randn(1, 10)
z = torch.randn(1, 10)

optimizer = torch.optim.SGD(net.parameters(), lr=0.1)

y = torch.ones([1])

criterion = torch.nn.MSELoss()

output = net(x, z)

optimizer.zero_grad()
loss = criterion(output, y)

# ! the error is raised here, during backward
loss.backward()

optimizer.step()

With inputs:

x = torch.randn(1, 10)
z = torch.randn(1, 10)

Error output:

F20220607 10:56:17.827364 39752 shape.cpp:184] Check failed: !broadcast_axis_vec.empty() 
*** Check failure stack trace: ***
    @     0x7f4f74f2ff9a  google::LogMessage::Fail()
    @     0x7f4f74f30282  google::LogMessage::SendToLog()
    @     0x7f4f74f2fb07  google::LogMessage::Flush()
    @     0x7f4f74f32679  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f4f6be5ac9d  oneflow::Shape::Axes4BroadcastTo()
    @     0x7f4f6bc1c4ef  oneflow::one::BroadcastMinMax::Apply()
    @     0x7f4f6bc1d5d1  oneflow::one::OpExprGradFunction<>::ApplyIf()
    @     0x7f4f6d61c609  _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEERKNS0_3one11TensorTupleEPS4_bEZNKS3_19AutogradInterpreter5ApplyERKNS3_6OpExprES6_S7_RKNS3_19OpExprInterpContextEEUlS6_S7_bE0_E9_M_invokeERKSt9_Any_dataS6_OS7_Ob
    @     0x7f4f6bbd5407  oneflow::one::FunctionNode::Apply()
    @     0x7f4f6bbd9158  oneflow::one::GraphTask::Apply()
    @     0x7f4f6bbd9fb8  oneflow::one::GraphAutogradEngine::RunBackwardAndSaveGrads4LeafTensor()
    @     0x7f4f6bbd3ef5  oneflow::one::AutogradEngine::RunBackwardAndSaveGrads4LeafTensorIf()
    @     0x7f50283c48e9  oneflow::autograd::Backward()
    @     0x7f50283bc21f  (unknown)
    @     0x7f50285ddc79  (unknown)
    @     0x55ade7f25348  PyCFunction_Call
    @     0x55ade7f14dbc  _PyObject_MakeTpCall.localalias.6
    @     0x55ade7f9c545  _PyEval_EvalFrameDefault
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7f6b0a3  _PyFunction_Vectorcall.localalias.352
    @     0x55ade7ed4a61  _PyEval_EvalFrameDefault.cold.2825
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7f6b0a3  _PyFunction_Vectorcall.localalias.352
    @     0x55ade7ed4a40  _PyEval_EvalFrameDefault.cold.2825
    @     0x55ade7f6a270  _PyEval_EvalCodeWithName.localalias.4
    @     0x55ade7fff543  PyEval_EvalCode
    @     0x55ade7fff5e4  run_eval_code_obj
    @     0x55ade8025854  run_mod
    @     0x55ade7ee6390  pyrun_file
    @     0x55ade7ee90d2  PyRun_SimpleFileExFlags.localalias.16
    @     0x55ade7ee9bf0  Py_RunMain.cold.2953
    @     0x55ade8028a09  Py_BytesMain
Aborted

The code above runs without errors under torch (i.e. with `import torch` instead of `import oneflow as torch`).

@clackhan

HiHippie commented 2 years ago

In libai/utils/distributed.py:

def convert_to_distributed_default_setting(module):
    """
    Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to
    global tensor with data parallelism as default.
    """
    for param in module.parameters():
        if not param.is_global:
            module.to_global(
                sbp=get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
                placement=get_layer_placement(0),
            )
            return

This is what converts the model to_global at build_model time.

However, if the model registers tensors via register_buffer, module.parameters() does not include them, so buffers never get converted to_global.

Should this be implemented with state_dict instead:

def convert_to_distributed_default_setting(module):
    """
    Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to
    global tensor with data parallelism as default.
    """
    for _, v in module.state_dict().items():
        if not v.is_global:
            module.to_global(
                sbp=get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
                placement=get_layer_placement(0),
            )
            return

Only then do the buffer tensors get converted to_global as well.
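
As a side note, a quick standalone check (my own sketch, not code from libai) showing that a buffer added with register_buffer appears in state_dict() but not in parameters():

import oneflow as flow
import oneflow.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.register_buffer("pe", flow.randn(4))  # e.g. a positional-embedding buffer

m = M()
print("pe" in {name for name, _ in m.named_parameters()})  # False -> skipped by the parameters() loop
print("pe" in m.state_dict())                              # True  -> covered by the state_dict() version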

@rentainhe @CPFLAME could you take a look and see whether this change is worth making?~

CPFLAME commented 2 years ago

I think this change should be fine. After making it, please also run the other cases,
e.g. bash dev/model_test.sh, to check that none of the other models break.

HiHippie commented 2 years ago

OK, I'll give it a try.

HiHippie commented 2 years ago

With global eager DDP data parallelism, 4 GPUs hit the OOM error below almost immediately; with 2 GPUs it shows up somewhat later.

F20220713 07:57:17.976464 1348305 virtual_machine_engine.cpp:332] 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/virtual_machine_engine.cpp", line 332, in DispatchInstruction
    ret
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 49, in Prepare
    AllocateOutputBlobsMemory(operand, device_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 103, in AllocateOutputBlobsMemory
    blob_object->TryAllocateBlobBodyMemory(device_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 66, in TryAllocateBlobBodyMemory
    allocator->Allocate(&dptr, required_body_bytes)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
    AllocateBlockToExtendTotalMem(aligned_size)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
    backend_->Allocate(&mem_ptr, final_allocate_bytes)
out of memory
Error Type: oneflow.ErrorProto.runtime_error
*** Check failure stack trace: ***

I used pynvml to monitor the memory usage of GPU 0:

import pynvml

pynvml.nvmlInit()
NUM_EXPAND = 1024 * 1024
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
print(meminfo.used / NUM_EXPAND)  # used memory on GPU 0, in MiB

Memory usage over iterations on 4 GPUs (image)

Memory usage over iterations on 2 GPUs (image)

Memory usage over iterations for the vae (image)

Not sure yet where the problem is.

CPFLAME commented 2 years ago

Could some variables not be getting freed in time?

HiHippie commented 2 years ago

Could some variables not be getting freed in time?

I'll look into it.

HiHippie commented 2 years ago

I've located the problem above: it happens when computing hidden_state + position_embedding with mismatched sbp (hidden_state is split(0), position_embedding is broadcast), which leads to the OOM. If the two are kept consistent (both split(0)), there is no problem.

I'll write up the detailed minimal reproduction tomorrow.

This may be a latent bug?
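
Until the full minimal reproduction is written up, here is a rough sketch of the workaround described above (shapes are made up, and it assumes a 4-rank launch, e.g. via python3 -m oneflow.distributed.launch):

import oneflow as flow

placement = flow.placement("cuda", ranks=[0, 1, 2, 3])
hidden_state = flow.randn(8, 100, 256, placement=placement, sbp=flow.sbp.split(0))
position_embedding = flow.randn(8, 100, 256, placement=placement, sbp=flow.sbp.broadcast)

# mixing split(0) and broadcast in the elementwise add is what blew up memory for me;
# converting the embedding to the same sbp as the activations first avoided the OOM
position_embedding = position_embedding.to_global(sbp=flow.sbp.split(0))
out = hidden_state + position_embedding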

HiHippie commented 2 years ago

Recording an issue left over from earlier.

First, the following code: the transformer returns two outputs, and the second one is unused.

        hs, _ = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])

Inside the transformer the logic is:

        memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
        hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                          pos=pos_embed, query_pos=query_embed)

        return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
        # return hs.transpose(1, 2)

As you can see, the second output is the encoder output memory.

The problem: if the transformer returns memory.permute(1, 2, 0).view(bs, c, h, w), training errors out; if that value is not returned, it runs fine.

After some digging, the .view op is what triggers it.
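
For context, the failing check only asserts that the product bs*c*h*w passed to .view equals the element count of the tensor being viewed. Here is a standalone sketch with made-up shapes (the mismatched counts in the log factor like differently padded feature maps, e.g. 903168 = 84*42*256 versus 860160 = 84*40*256, so that is my working guess, not a confirmed cause):

import oneflow as flow

bs, c, h, w = 2, 256, 42, 84                      # assumed padded feature-map shape
memory = flow.randn(h * w, bs, c)                 # encoder output: [H*W, bs, c]
out = memory.permute(1, 2, 0).view(bs, c, h, w)   # ok: bs * c * h * w == memory.numel()
# if bs/c/h/w were captured from a differently padded input than the tensor actually
# being viewed, the element counts disagree and the elem_cnt check fires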

The full error output:

F20220726 03:19:20.448858 993233 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (860160 vs. 903168) 
F20220726 03:19:20.448522 993266 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (946176 vs. 903168) 
F20220726 03:19:20.448385 993267 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (860160 vs. 903168) 
*** Check failure stack trace: ***
*** Check failure stack trace: ***
*** Check failure stack trace: ***
F20220726 03:19:20.448640 993270 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (946176 vs. 903168) 
*** Check failure stack trace: ***
    @     0x7f81e7480efa  google::LogMessage::Fail()
    @     0x7fa1499a7efa  google::LogMessage::Fail()
    @     0x7f3474521efa  google::LogMessage::Fail()
    @     0x7fd168531efa  google::LogMessage::Fail()
    @     0x7f81e74811e2  google::LogMessage::SendToLog()
    @     0x7fa1499a81e2  google::LogMessage::SendToLog()
    @     0x7f34745221e2  google::LogMessage::SendToLog()
    @     0x7fd1685321e2  google::LogMessage::SendToLog()
    @     0x7f3474521a67  google::LogMessage::Flush()
    @     0x7fa1499a7a67  google::LogMessage::Flush()
    @     0x7f81e7480a67  google::LogMessage::Flush()
    @     0x7fd168531a67  google::LogMessage::Flush()
    @     0x7fa1499aa5d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f34745245d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fd1685345d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f81e74835d9  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fa1432b6020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7fd161e40020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7f346de30020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7f81e0d8f020  oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
    @     0x7fd1630594ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7f346f0494ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7fa1444cf4ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7f81e1fa84ba  oneflow::one::StatefulOpKernel::Compute()
    @     0x7fd15e437e1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7f346a427e1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7fa13f8ade1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7f81dd386e1a  oneflow::vm::OpCallInstructionType::Compute()
    @     0x7fd161b4ad10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7f346db3ad10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7fa142fc0d10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7f81e0a99d10  oneflow::vm::FuseInstructionPolicy::Compute()
    @     0x7fd161b2a6a1  oneflow::vm::EpStreamType::Run()
    @     0x7f346db1a6a1  oneflow::vm::EpStreamType::Run()
    @     0x7fa142fa06a1  oneflow::vm::EpStreamType::Run()
    @     0x7f81e0a796a1  oneflow::vm::EpStreamType::Run()
    @     0x7fd161b3104f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7f346db2104f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7fa142fa704f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7fd161b333c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7f346db233c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7fa142fa93c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7f81e0a8004f  oneflow::vm::ThreadCtx::TryReceiveAndRun()
    @     0x7fd161b3351d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7f346db2351d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7fa142fa951d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7f81e0a823c0  oneflow::(anonymous namespace)::WorkerLoop()
    @     0x7f81e0a8251d  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
    @     0x7fd16854693f  execute_native_thread_routine
    @     0x7fd238398609  start_thread
    @     0x7f347453693f  execute_native_thread_routine
    @     0x7fa1499bc93f  execute_native_thread_routine
    @     0x7fd2382bd163  clone
    @     0x7f3544388609  start_thread
    @     0x7f81e749593f  execute_native_thread_routine
    @     0x7fa21980e609  start_thread
    @     0x7f35442ad163  clone
    @     0x7fa219733163  clone
    @     0x7f82b72e7609  start_thread
    @     0x7f82b720c163  clone
Killing subprocess 992337
Killing subprocess 992338
Killing subprocess 992339
Killing subprocess 992340
Traceback (most recent call last):
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 231, in <module>
    main()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 219, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 188, in sigkill_handler
    returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/dataset/czq_home/anaconda3/envs/libai/bin/python3', '-u', 'projects/DETR/train_net.py', '--config-file', 'projects/DETR/configs/detr_training.py']' died with <Signals.SIGABRT: 6>.
HiHippie commented 2 years ago

Recording a bug still to be reproduced / tracked down.

During training I hit RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false, and haven't located the cause yet. After asking guo ran, I learned that "the handling of is_dynamic in the system isn't very complete; many ops assume the static case". DETR involves a lot of padding and dynamically sized tensors, and uses many reshape/permute-style ops, which may be the underlying reason.

I'll update here once I manage to reproduce it / track it down.

File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 472, in train
    super().train(self.start_iter, self.max_iter)
  File "/dataset/czq_home/projects/libai/libai/engine/trainer.py", line 146, in train
    self.run_step()
  File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 476, in run_step
    self._trainer.run_step(self.get_batch, self.cfg.train.input_placement_device)
  File "/dataset/czq_home/projects/libai/projects/DETR/trainer/detr_trainer.py", line 55, in run_step
    data = next(self._data_loader_iter)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1129, in _next_data
    return self._process_data(data)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1175, in _process_data
    data.reraise()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/_utils.py", line 55, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "projects/DETR/datasets/detection.py", line 116, in __getitem__
    img, target = self.prepare(img, target)
  File "projects/DETR/datasets/detection.py", line 78, in __call__
    boxes = boxes[keep]
RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 1993, in operator()
    PrepareSliceIndices(index, *(x->shape()), &slice_indices, &tensor_indices, &expand_dims, &target_dims)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 281, in PrepareSliceIndices
    ExpandMaskIndex(tensor)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 80, in ExpandMaskIndex
    functional::Reshape(item, {size})
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 140, in Dispatch<oneflow::one::Tensor>
    Dispatch<TensorTuple>(op_expr, inputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 131, in Dispatch<oneflow::one::TensorTuple>
    Dispatch(op_expr, inputs, outputs.get(), ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 111, in Apply
    internal_->Apply(op_expr, *inputs_ptr, outputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... Data_YouAreNotAllowedToCallThisFuncOutsideThisFile(); }()
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 198, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 177, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_expr.cpp", line 530, in InferPhysicalTensorDesc
    tensor_desc_infer_fn_(&infer_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/user/ops/reshape_op.cpp", line 41, in InferLogicalTensorDesc

Error Type: oneflow.ErrorProto.check_failed_error
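
For reference, the dataset line the worker traceback bottoms out in (projects/DETR/datasets/detection.py:78) is boolean-mask indexing. A rough standalone form of that pattern, with the mask expression approximated from the original DETR prepare step (not a confirmed reproduction, since the failure has been intermittent):

import oneflow as flow

boxes = flow.tensor([[10., 10., 50., 60.],
                     [30., 30., 30., 30.]])                        # [x0, y0, x1, y1]
keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])   # drop degenerate boxes
boxes = boxes[keep]  # boolean indexing -> the ExpandMaskIndex + Reshape calls in the trace
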
yuanms2 commented 2 years ago

@Ldpe2G @BBuf please take a look; this also seems to be an operator-level issue with handling dynamic shapes. YOLO should run into it as well.

BBuf commented 2 years ago

@Ldpe2G @BBuf please take a look; this also seems to be an operator-level issue with handling dynamic shapes. YOLO should run into it as well.

It would help to put together a minimal reproduction here; the stack trace alone is messy and hard to pin down.

HiHippie commented 2 years ago

@Ldpe2G @BBuf please take a look; this also seems to be an operator-level issue with handling dynamic shapes. YOLO should run into it as well.

It would help to put together a minimal reproduction here; the stack trace alone is messy and hard to pin down.

OK, I'm looking into it, but haven't figured it out yet. I'll post reproduction code once I have it.

Ldpe2G commented 2 years ago

@Ldpe2G @BBuf please take a look; this also seems to be an operator-level issue with handling dynamic shapes. YOLO should run into it as well.

It would help to put together a minimal reproduction here; the stack trace alone is messy and hard to pin down.

OK, I'm looking into it, but haven't figured it out yet. I'll post reproduction code once I have it.

From the trace, the error happens during data loading; is some data augmentation being applied there?

yuanms2 commented 2 years ago

Zhang Xiaoyu: eager is basically fine; on the Graph side, Xiaoyu and Cijie tried to push this forward before and have more detailed notes.

Xu Xiaoyu: currently working on inplace for graph. Did some initial research; it ties into dynamic shape inference, register planning, and dynamic memory allocation.

Related issue: https://github.com/Oneflow-Inc/OneTeam/issues/1076

HiHippie commented 2 years ago

Zhang Xiaoyu: eager is basically fine; on the Graph side, Xiaoyu and Cijie tried to push this forward before and have more detailed notes.

Xu Xiaoyu: currently working on inplace for graph. Did some initial research; it ties into dynamic shape inference, register planning, and dynamic memory allocation.

Related issue: Oneflow-Inc/OneTeam#1076

OK, thanks, Prof. Yuan. I'm running eager here, so it's more likely a problem in my own implementation; I'm trying to reproduce it.

HiHippie commented 2 years ago

Zhang Xiaoyu: eager is basically fine; on the Graph side, Xiaoyu and Cijie tried to push this forward before and have more detailed notes. Xu Xiaoyu: currently working on inplace for graph. Did some initial research; it ties into dynamic shape inference, register planning, and dynamic memory allocation. Related issue: Oneflow-Inc/OneTeam#1076

OK, thanks, Prof. Yuan. I'm running eager here, so it's more likely a problem in my own implementation; I'm trying to reproduce it.

Recording a bug still to be reproduced / tracked down.

During training I hit RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false, and haven't located the cause yet. After asking guo ran, I learned that "the handling of is_dynamic in the system isn't very complete; many ops assume the static case". DETR involves a lot of padding and dynamically sized tensors, and uses many reshape/permute-style ops, which may be the underlying reason.

I'll update here once I manage to reproduce it / track it down.

File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 472, in train
    super().train(self.start_iter, self.max_iter)
  File "/dataset/czq_home/projects/libai/libai/engine/trainer.py", line 146, in train
    self.run_step()
  File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 476, in run_step
    self._trainer.run_step(self.get_batch, self.cfg.train.input_placement_device)
  File "/dataset/czq_home/projects/libai/projects/DETR/trainer/detr_trainer.py", line 55, in run_step
    data = next(self._data_loader_iter)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1129, in _next_data
    return self._process_data(data)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1175, in _process_data
    data.reraise()
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/_utils.py", line 55, in reraise
    raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/worker.py", line 349, in _worker_loop
    data = fetcher.fetch(index)
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "projects/DETR/datasets/detection.py", line 116, in __getitem__
    img, target = self.prepare(img, target)
  File "projects/DETR/datasets/detection.py", line 78, in __call__
    boxes = boxes[keep]
RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 1993, in operator()
    PrepareSliceIndices(index, *(x->shape()), &slice_indices, &tensor_indices, &expand_dims, &target_dims)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 281, in PrepareSliceIndices
    ExpandMaskIndex(tensor)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 80, in ExpandMaskIndex
    functional::Reshape(item, {size})
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 140, in Dispatch<oneflow::one::Tensor>
    Dispatch<TensorTuple>(op_expr, inputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 131, in Dispatch<oneflow::one::TensorTuple>
    Dispatch(op_expr, inputs, outputs.get(), ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 111, in Apply
    internal_->Apply(op_expr, *inputs_ptr, outputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in NaiveInterpret
    [&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... Data_YouAreNotAllowedToCallThisFuncOutsideThisFile(); }()
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in operator()
    user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 198, in GetOrInfer
    Infer(*user_op_expr, infer_args)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 177, in Infer
    user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_expr.cpp", line 530, in InferPhysicalTensorDesc
    tensor_desc_infer_fn_(&infer_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/user/ops/reshape_op.cpp", line 41, in InferLogicalTensorDesc

Error Type: oneflow.ErrorProto.check_failed_error

This problem went away after I updated oneflow; I trained for a number of iterations and it hasn't reappeared.

HiHippie commented 2 years ago

Notes on model loss alignment, following https://github.com/Oneflow-Inc/OneTeam/issues/779.

no_aux_loss, AdamW, single GPU, pretrained weights loaded (image)

aux_loss, AdamW, single GPU, pretrained weights loaded (image)

aux_loss, AdamW, 4 GPUs, pretrained weights loaded. Because the torch version's DistributedSampler samples in a different order from libai's sampler, I train on a single sample for now; the loss curve is shown below. (image)
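
A minimal sketch of that single-sample setup, using oneflow's torch-style DataLoader (the toy dataset stands in for the real detection dataset; the same fixed batch would be fed to both the torch and the libai training step):

import oneflow as flow
from oneflow.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):  # stand-in for the real detection dataset
    def __len__(self):
        return 4

    def __getitem__(self, idx):
        return flow.randn(3, 32, 32), flow.tensor([idx])

loader = DataLoader(ToyDataset(), batch_size=1, shuffle=False, num_workers=0)
fixed_batch = next(iter(loader))  # reuse this one batch for every iteration on both sides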

xiezipeng-ML commented 2 years ago

Notes on model loss alignment, following Oneflow-Inc/OneTeam#779

  • [x] Check that the network structure (model.py) is aligned
  • [x] Make sure the dataloader's shuffle is turned off
  • [x] Make sure the network's dropout is turned off
  • [x] Make sure the lr_scheduler and optimizer are the same
  • [x] For double insurance, set every dropout_prob argument to 0 and put the model in .eval() mode, so that dropout, BN and similar ops are fixed during training and carry no randomness (see the small sketch at the end of this thread)

no_aux_loss, AdamW, single GPU (image)

@CPFLAME @xiezipeng-ML Is this degree of loss alignment acceptable? Looking for some pointers from experience~ The loaded weights are from an already converged model, so a downward trend may not be very visible.

The loss curves look basically fine. You could also train from the initialization weights and check that the loss decreases.
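
A small sketch of the "double insurance" item from the checklist above (generic; the exact dropout config fields in the DETR project may differ):

import oneflow.nn as nn

model = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.1), nn.BatchNorm1d(8))  # toy stand-in

# zero every dropout probability and switch to eval mode, so dropout and BN
# behave deterministically during the alignment run
for m in model.modules():
    if isinstance(m, nn.Dropout):
        m.p = 0.0
model.eval()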