HiHippie opened this issue 2 years ago
Reproduction results for DETR (https://github.com/facebookresearch/detr)
name | backbone | box AP |
---|---|---|
DETR | ResNet50 | 42.0 |
libai DETR | ResNet50 | 25.9 |
libai DETR, after fixing the ResNet50 weight-loading bug | ResNet50 | 29.7 |
libai DETR, after fixing the MultiHeadAttention implementation | ResNet50 | 42.0 |
DETR
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.420
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.624
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.442
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.205
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.611
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.533
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.574
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.312
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.629
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.805
libai DETR
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.259
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.487
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.242
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.083
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.247
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.464
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.248
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.380
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.412
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.163
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.421
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.681
libai DETR, after fixing the ResNet50 weight-loading bug
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.297
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.525
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.283
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.107
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.294
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.506
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.268
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.418
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.453
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.200
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.474
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.713
libai DETR after the fix, loading the torch weights: results match the original paper
IoU metric: bbox
Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.420
Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.624
Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.442
Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.205
Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.458
Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.611
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.333
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.533
Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.574
Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.312
Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.629
Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.805
@rentainhe @Ldpe2G Inference is now aligned. I did not run into any oneflow or libai bugs; the changes were mainly revisions to implementation details.
To load the torch weights correctly, my attention implementation borrows heavily from torch.nn.MultiheadAttention, which feels like a bit of a departure from libai's style; I'll clean it up this week.
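For context on the weight-loading point: torch.nn.MultiheadAttention keeps the q/k/v projections packed in a single in_proj_weight / in_proj_bias, so an implementation with separate q/k/v projection layers (which is how I assume a libai-style attention would be organized) has to split the packed tensors when importing the torch checkpoint. A minimal sketch of that mapping:

```python
import torch

embed_dim, num_heads = 256, 8
mha = torch.nn.MultiheadAttention(embed_dim, num_heads)
sd = mha.state_dict()

# torch packs the q/k/v projections into a single [3*embed_dim, embed_dim] matrix;
# split them before copying into separate q/k/v Linear layers
q_w, k_w, v_w = sd["in_proj_weight"].chunk(3, dim=0)
q_b, k_b, v_b = sd["in_proj_bias"].chunk(3, dim=0)
out_w, out_b = sd["out_proj.weight"], sd["out_proj.bias"]

print(q_w.shape, k_w.shape, v_w.shape)  # each is [embed_dim, embed_dim]
```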
Which backbone are these results for? Could you list them in a table like https://github.com/facebookresearch/detr#model-zoo?

OK
Are these inference results?

Yes. Today I traced the issue to my multi-head attention implementation being inconsistent with torch.nn.MultiheadAttention (which the DETR source code uses). That is probably where the problem is, and I'm revising the code now.

OKOK~
Investigating why certain input shapes make loss.backward fail with "F20220602 14:17:25.050042 15603 shape.cpp:187] Check failed: !broadcast_axis_vec.empty()".
The problem was traced to a oneflow bug in the min/max ops used in projects/DETR/utils/box_ops.py:
    def generalized_box_iou(boxes1, boxes2):
        """
        Generalized IoU from https://giou.stanford.edu/
        The boxes should be in [x0, y0, x1, y1] format
        Returns a [N, M] pairwise matrix, where N = len(boxes1)
        and M = len(boxes2)
        """
        # degenerate boxes gives inf / nan results
        # so do an early check
        assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
        assert (boxes2[:, 2:] >= boxes2[:, :2]).all()
        iou, union = box_iou(boxes1, boxes2)

        lt = flow.min(boxes1[:, None, :2], boxes2[:, :2])
        rb = flow.max(boxes1[:, None, 2:], boxes2[:, 2:])

        wh = (rb - lt).clamp(min=0)  # [N,M,2]
        area = wh[:, :, 0] * wh[:, :, 1]

        return iou - (area - union) / area
Minimal reproduction (shown with flow.max; flow.min behaves the same).
Versions:

    >>> flow.__version__
    '0.8.0.dev20220606+cu112'
    >>> torch.__version__
    '1.11.0+cu113'
    # import torch
    import oneflow as torch

    class Net(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.linear = torch.nn.Linear(10, 10)

        def forward(self, x, z):
            x = self.linear(x)
            '''
            My scenario:
            the input shapes are dynamic -- sometimes x and z have the same shape, sometimes not.
            Following the torch code, the dimensions are expanded uniformly so that tensors with
            different numbers of dims can still be compared.
            The oneflow problem: with shapes x -> [1, d], z -> [1, d] there is a bug,
            but if the first dimension is not 1, there is no bug.
            The cases below cover these three situations.
            '''
            # when x and z have different shapes, x must gain a dimension before max/min
            h = torch.max(x[:, None, :], z)
            # h = torch.min(x[:, None, :], z)
            return h.mean()

    net = Net()
    # case: different shapes
    # x = torch.randn(15, 10)
    # z = torch.randn(5, 10)
    # case: same shapes
    # x = torch.randn(15, 10)
    # z = torch.randn(15, 10)
    # ! the case where oneflow breaks:
    # same shape, but [1, x]
    x = torch.randn(1, 10)
    z = torch.randn(1, 10)

    optimizer = torch.optim.SGD(net.parameters(), lr=0.1)
    y = torch.ones([1])
    criterion = torch.nn.MSELoss()

    output = net(x, z)
    optimizer.zero_grad()
    loss = criterion(output, y)
    # ! fails here during backward
    loss.backward()
    optimizer.step()
With inputs:

    x = torch.randn(1, 10)
    z = torch.randn(1, 10)

the error is:
F20220607 10:56:17.827364 39752 shape.cpp:184] Check failed: !broadcast_axis_vec.empty()
*** Check failure stack trace: ***
@ 0x7f4f74f2ff9a google::LogMessage::Fail()
@ 0x7f4f74f30282 google::LogMessage::SendToLog()
@ 0x7f4f74f2fb07 google::LogMessage::Flush()
@ 0x7f4f74f32679 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f4f6be5ac9d oneflow::Shape::Axes4BroadcastTo()
@ 0x7f4f6bc1c4ef oneflow::one::BroadcastMinMax::Apply()
@ 0x7f4f6bc1d5d1 oneflow::one::OpExprGradFunction<>::ApplyIf()
@ 0x7f4f6d61c609 _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEERKNS0_3one11TensorTupleEPS4_bEZNKS3_19AutogradInterpreter5ApplyERKNS3_6OpExprES6_S7_RKNS3_19OpExprInterpContextEEUlS6_S7_bE0_E9_M_invokeERKSt9_Any_dataS6_OS7_Ob
@ 0x7f4f6bbd5407 oneflow::one::FunctionNode::Apply()
@ 0x7f4f6bbd9158 oneflow::one::GraphTask::Apply()
@ 0x7f4f6bbd9fb8 oneflow::one::GraphAutogradEngine::RunBackwardAndSaveGrads4LeafTensor()
@ 0x7f4f6bbd3ef5 oneflow::one::AutogradEngine::RunBackwardAndSaveGrads4LeafTensorIf()
@ 0x7f50283c48e9 oneflow::autograd::Backward()
@ 0x7f50283bc21f (unknown)
@ 0x7f50285ddc79 (unknown)
@ 0x55ade7f25348 PyCFunction_Call
@ 0x55ade7f14dbc _PyObject_MakeTpCall.localalias.6
@ 0x55ade7f9c545 _PyEval_EvalFrameDefault
@ 0x55ade7f6a270 _PyEval_EvalCodeWithName.localalias.4
@ 0x55ade7f6b0a3 _PyFunction_Vectorcall.localalias.352
@ 0x55ade7ed4a61 _PyEval_EvalFrameDefault.cold.2825
@ 0x55ade7f6a270 _PyEval_EvalCodeWithName.localalias.4
@ 0x55ade7f6b0a3 _PyFunction_Vectorcall.localalias.352
@ 0x55ade7ed4a40 _PyEval_EvalFrameDefault.cold.2825
@ 0x55ade7f6a270 _PyEval_EvalCodeWithName.localalias.4
@ 0x55ade7fff543 PyEval_EvalCode
@ 0x55ade7fff5e4 run_eval_code_obj
@ 0x55ade8025854 run_mod
@ 0x55ade7ee6390 pyrun_file
@ 0x55ade7ee90d2 PyRun_SimpleFileExFlags.localalias.16
@ 0x55ade7ee9bf0 Py_RunMain.cold.2953
@ 0x55ade8028a09 Py_BytesMain
Aborted
The same code runs without any error under torch.
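A possible workaround I considered (purely an assumption, not verified against this exact bug): materialize the broadcast explicitly with expand so that flow.max/flow.min see operands of identical shape and the backward pass does not have to infer broadcast axes:

```python
import oneflow as flow

x = flow.randn(1, 10, requires_grad=True)
z = flow.randn(1, 10, requires_grad=True)

# expand both operands to the common [N, M, d] shape up front instead of relying
# on implicit broadcasting inside flow.max
N, M, d = x.shape[0], z.shape[0], x.shape[1]
xe = x[:, None, :].expand(N, M, d)
ze = z[None, :, :].expand(N, M, d)
h = flow.max(xe, ze).mean()
h.backward()
```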
@clackhan
In libai/utils/distributed.py:
    def convert_to_distributed_default_setting(module):
        """
        Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to
        global tensor with data parallelism as default.
        """
        for param in module.parameters():
            if not param.is_global:
                module.to_global(
                    sbp=get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
                    placement=get_layer_placement(0),
                )
                return
Its purpose is to convert the model to_global when build_model is called.
However, if the model registers tensors with register_buffer, module.parameters() does not include those buffers, so they never get converted to_global.
Should this be implemented via state_dict instead:
    def convert_to_distributed_default_setting(module):
        """
        Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to
        global tensor with data parallelism as default.
        """
        for _, v in module.state_dict().items():
            if not v.is_global:
                module.to_global(
                    sbp=dist.get_nd_sbp([flow.sbp.broadcast, flow.sbp.broadcast]),
                    placement=dist.get_layer_placement(0),
                )
                return
Only this way do the buffer parameters get converted to_global as well.
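A quick sanity check of the reasoning above (a toy module, purely illustrative, not code from the project): a tensor registered with register_buffer shows up in state_dict() but not in parameters().

```python
import oneflow as flow
import oneflow.nn as nn

class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(4, 4)
        self.register_buffer("pos_embed", flow.zeros(4))

m = Toy()
print([name for name, _ in m.named_parameters()])  # ['linear.weight', 'linear.bias']
print(list(m.state_dict().keys()))                 # includes 'pos_embed' as well as the linear weights
```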
@rentainhe @CPFLAME could you take a look and see whether this change is worth making?
I think the change should be fine. After making it, please run the other cases as well, e.g. bash dev/model_test.sh, to check that none of the other models break.

OK, I'll give it a try.
Global eager DDP
With 4-GPU data parallelism, the following OOM error appears almost immediately; with 2 GPUs it shows up somewhat later.
F20220713 07:57:17.976464 1348305 virtual_machine_engine.cpp:332]
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/virtual_machine_engine.cpp", line 332, in DispatchInstruction
ret
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 49, in Prepare
AllocateOutputBlobsMemory(operand, device_ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 103, in AllocateOutputBlobsMemory
blob_object->TryAllocateBlobBodyMemory(device_ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/eager_blob_object.cpp", line 66, in TryAllocateBlobBodyMemory
allocator->Allocate(&dptr, required_body_bytes)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 392, in Allocate
AllocateBlockToExtendTotalMem(aligned_size)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/bin_allocator.h", line 305, in AllocateBlockToExtendTotalMem
backend_->Allocate(&mem_ptr, final_allocate_bytes)
out of memory
Error Type: oneflow.ErrorProto.runtime_error
*** Check failure stack trace: ***
I used pynvml to monitor the memory usage of GPU 0:
    import pynvml

    pynvml.nvmlInit()
    NUM_EXPAND = 1024 * 1024
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(meminfo.used / NUM_EXPAND)
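The snippet above is a one-shot query; the curves below come from repeating the measurement during training. A small sketch of how the polling can be wrapped (the placement inside the training loop is illustrative):

```python
import pynvml

pynvml.nvmlInit()
NUM_EXPAND = 1024 * 1024
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

def log_gpu0_memory(step):
    """Print GPU-0 memory usage (MiB) at a given training step."""
    meminfo = pynvml.nvmlDeviceGetMemoryInfo(handle)
    print(f"step {step}: {meminfo.used / NUM_EXPAND:.1f} MiB used")

# hypothetical usage inside the trainer's run_step loop:
# for step in range(max_iter):
#     run_step()
#     log_gpu0_memory(step)
```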
Memory usage over iterations with 4 GPUs (plot):
Memory usage over iterations with 2 GPUs (plot):
Memory usage over VAE training iterations (plot):
I'm not sure where the problem is yet. Could it be that some variables are not being released in time?

Let me look into it.
The problem above has been located: it occurs when computing hidden_state + position_embedding with mismatched sbp (hidden_state is split(0) while position_embedding is broadcast), which leads to the OOM. If both are split(0), everything is fine.
I'll put together a detailed minimal reproduction tomorrow.
Could this be a latent bug?
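A minimal sketch of the pattern that avoids the mismatch (illustrative shapes; relies on libai's dist helpers shown earlier in this thread and assumes a multi-GPU data-parallel launch): convert position_embedding to the same sbp as hidden_state before the add.

```python
import oneflow as flow
from libai.utils import distributed as dist

# illustrative shapes; in the real model these come from the backbone and the
# position-embedding module
placement = dist.get_layer_placement(0)
hidden_state = flow.randn(8, 256, placement=placement, sbp=flow.sbp.split(0))
position_embedding = flow.randn(8, 256, placement=placement, sbp=flow.sbp.broadcast)

# align position_embedding's sbp with hidden_state before the add, so both operands
# are split(0) and the mismatched-sbp add that triggered the OOM is avoided
position_embedding = position_embedding.to_global(sbp=hidden_state.sbp)
out = hidden_state + position_embedding
```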
Recording an issue left over from earlier.
First, there is the following code; the transformer returns two outputs and the second one is unused:
hs, _ = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])
Inside the transformer, the logic is:
    memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
    hs = self.decoder(tgt, memory, memory_key_padding_mask=mask,
                      pos=pos_embed, query_pos=query_embed)
    return hs.transpose(1, 2), memory.permute(1, 2, 0).view(bs, c, h, w)
    # return hs.transpose(1, 2)
As you can see, the second output is the encoder output memory.
The problem: if the transformer returns memory.permute(1, 2, 0).view(bs, c, h, w), an error is raised; without returning it, everything runs fine.
After investigation, the .view op is the cause.
The full error message is:
F20220726 03:19:20.448858 993233 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (860160 vs. 903168)
F20220726 03:19:20.448522 993266 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (946176 vs. 903168)
F20220726 03:19:20.448385 993267 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (860160 vs. 903168)
*** Check failure stack trace: ***
*** Check failure stack trace: ***
*** Check failure stack trace: ***
F20220726 03:19:20.448640 993270 copy_data_content_kernel.cpp:49] Check failed: out->shape_view().elem_cnt() == elem_cnt (946176 vs. 903168)
*** Check failure stack trace: ***
@ 0x7f81e7480efa google::LogMessage::Fail()
@ 0x7fa1499a7efa google::LogMessage::Fail()
@ 0x7f3474521efa google::LogMessage::Fail()
@ 0x7fd168531efa google::LogMessage::Fail()
@ 0x7f81e74811e2 google::LogMessage::SendToLog()
@ 0x7fa1499a81e2 google::LogMessage::SendToLog()
@ 0x7f34745221e2 google::LogMessage::SendToLog()
@ 0x7fd1685321e2 google::LogMessage::SendToLog()
@ 0x7f3474521a67 google::LogMessage::Flush()
@ 0x7fa1499a7a67 google::LogMessage::Flush()
@ 0x7f81e7480a67 google::LogMessage::Flush()
@ 0x7fd168531a67 google::LogMessage::Flush()
@ 0x7fa1499aa5d9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f34745245d9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fd1685345d9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7f81e74835d9 google::LogMessageFatal::~LogMessageFatal()
@ 0x7fa1432b6020 oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
@ 0x7fd161e40020 oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
@ 0x7f346de30020 oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
@ 0x7f81e0d8f020 oneflow::(anonymous namespace)::CopyDataContentKernel::Compute()
@ 0x7fd1630594ba oneflow::one::StatefulOpKernel::Compute()
@ 0x7f346f0494ba oneflow::one::StatefulOpKernel::Compute()
@ 0x7fa1444cf4ba oneflow::one::StatefulOpKernel::Compute()
@ 0x7f81e1fa84ba oneflow::one::StatefulOpKernel::Compute()
@ 0x7fd15e437e1a oneflow::vm::OpCallInstructionType::Compute()
@ 0x7f346a427e1a oneflow::vm::OpCallInstructionType::Compute()
@ 0x7fa13f8ade1a oneflow::vm::OpCallInstructionType::Compute()
@ 0x7f81dd386e1a oneflow::vm::OpCallInstructionType::Compute()
@ 0x7fd161b4ad10 oneflow::vm::FuseInstructionPolicy::Compute()
@ 0x7f346db3ad10 oneflow::vm::FuseInstructionPolicy::Compute()
@ 0x7fa142fc0d10 oneflow::vm::FuseInstructionPolicy::Compute()
@ 0x7f81e0a99d10 oneflow::vm::FuseInstructionPolicy::Compute()
@ 0x7fd161b2a6a1 oneflow::vm::EpStreamType::Run()
@ 0x7f346db1a6a1 oneflow::vm::EpStreamType::Run()
@ 0x7fa142fa06a1 oneflow::vm::EpStreamType::Run()
@ 0x7f81e0a796a1 oneflow::vm::EpStreamType::Run()
@ 0x7fd161b3104f oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7f346db2104f oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fa142fa704f oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fd161b333c0 oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7f346db233c0 oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7fa142fa93c0 oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7f81e0a8004f oneflow::vm::ThreadCtx::TryReceiveAndRun()
@ 0x7fd161b3351d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
@ 0x7f346db2351d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
@ 0x7fa142fa951d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
@ 0x7f81e0a823c0 oneflow::(anonymous namespace)::WorkerLoop()
@ 0x7f81e0a8251d _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJPFvPN7oneflow2vm9ThreadCtxERKSt8functionIFvS6_EEES6_ZNS3_14VirtualMachine15CreateThreadCtxENS3_6SymbolINS3_6DeviceEEENS3_10StreamRoleEEUlS6_E2_EEEEE6_M_runEv
@ 0x7fd16854693f execute_native_thread_routine
@ 0x7fd238398609 start_thread
@ 0x7f347453693f execute_native_thread_routine
@ 0x7fa1499bc93f execute_native_thread_routine
@ 0x7fd2382bd163 clone
@ 0x7f3544388609 start_thread
@ 0x7f81e749593f execute_native_thread_routine
@ 0x7fa21980e609 start_thread
@ 0x7f35442ad163 clone
@ 0x7fa219733163 clone
@ 0x7f82b72e7609 start_thread
@ 0x7f82b720c163 clone
Killing subprocess 992337
Killing subprocess 992338
Killing subprocess 992339
Killing subprocess 992340
Traceback (most recent call last):
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 231, in <module>
main()
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 219, in main
sigkill_handler(signal.SIGTERM, None)
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/distributed/launch.py", line 188, in sigkill_handler
returncode=last_return_code, cmd=cmd
subprocess.CalledProcessError: Command '['/dataset/czq_home/anaconda3/envs/libai/bin/python3', '-u', 'projects/DETR/train_net.py', '--config-file', 'projects/DETR/configs/detr_training.py']' died with <Signals.SIGABRT: 6>.
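To isolate the .view issue, the failing pattern boils down to reshaping the permuted encoder output back to [bs, c, h, w]. A minimal sketch with made-up shapes (whether this passes in a single-process eager run I haven't confirmed; the mismatched elem_cnt values in the error above, e.g. 860160 vs. 903168, suggest the failure shows up when different ranks see different h and w):

```python
import oneflow as flow

# made-up values; in DETR these vary with the padded input resolution
bs, c, h, w = 2, 256, 28, 42
memory = flow.randn(h * w, bs, c)                 # encoder output: [HW, bs, c]
out = memory.permute(1, 2, 0).view(bs, c, h, w)   # the op chain the error points at
print(out.shape)
```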
Recording a bug that still needs to be reproduced/investigated.
During training I sometimes hit RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false; the cause has not been located yet.
After asking guo ran, I learned that "the handling of is_dynamic in the system is not very complete; many ops assume the static case".
DETR involves a lot of padding and dynamically-sized tensors, and uses many ops like reshape and permute, which may be the underlying reason.
I'll post an update once it is reproduced/tracked down.
File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 472, in train
super().train(self.start_iter, self.max_iter)
File "/dataset/czq_home/projects/libai/libai/engine/trainer.py", line 146, in train
self.run_step()
File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 476, in run_step
self._trainer.run_step(self.get_batch, self.cfg.train.input_placement_device)
File "/dataset/czq_home/projects/libai/projects/DETR/trainer/detr_trainer.py", line 55, in run_step
data = next(self._data_loader_iter)
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 530, in __next__
data = self._next_data()
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1129, in _next_data
return self._process_data(data)
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/dataloader.py", line 1175, in _process_data
data.reraise()
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/_utils.py", line 55, in reraise
raise self.exc_type(msg)
RuntimeError: Caught RuntimeError in DataLoader worker process 2.
Original Traceback (most recent call last):
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/worker.py", line 349, in _worker_loop
data = fetcher.fetch(index)
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/dataset/czq_home/anaconda3/envs/libai/lib/python3.7/site-packages/oneflow/utils/data/_utils/fetch.py", line 65, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "projects/DETR/datasets/detection.py", line 116, in __getitem__
img, target = self.prepare(img, target)
File "projects/DETR/datasets/detection.py", line 78, in __call__
boxes = boxes[keep]
RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 1993, in operator()
PrepareSliceIndices(index, *(x->shape()), &slice_indices, &tensor_indices, &expand_dims, &target_dims)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 281, in PrepareSliceIndices
ExpandMaskIndex(tensor)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/tensor_index.cpp", line 80, in ExpandMaskIndex
functional::Reshape(item, {size})
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 140, in Dispatch<oneflow::one::Tensor>
Dispatch<TensorTuple>(op_expr, inputs, ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 131, in Dispatch<oneflow::one::TensorTuple>
Dispatch(op_expr, inputs, outputs.get(), ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 111, in Apply
internal_->Apply(op_expr, *inputs_ptr, outputs, ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in NaiveInterpret
[&]() -> Maybe<const LocalTensorInferResult> { LocalTensorMetaInferArgs ... Data_YouAreNotAllowedToCallThisFuncOutsideThisFile(); }()
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_local_op_interpreter.cpp", line 83, in operator()
user_op_expr.mut_local_tensor_infer_cache()->GetOrInfer(infer_args)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 198, in GetOrInfer
Infer(*user_op_expr, infer_args)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/local_tensor_infer_cache.cpp", line 177, in Infer
user_op_expr.InferPhysicalTensorDesc( infer_args.attrs ... ) -> TensorMeta* { return &output_mut_metas.at(i); })
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_expr.cpp", line 530, in InferPhysicalTensorDesc
tensor_desc_infer_fn_(&infer_ctx)
File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/user/ops/reshape_op.cpp", line 41, in InferLogicalTensorDesc
Error Type: oneflow.ErrorProto.check_failed_error
@Ldpe2G @BBuf please take a look; this also seems to be an operator-level issue with dynamic-shape handling, and YOLO will probably run into it too.

It would help to distill this into a minimal reproduction; the stack trace alone is messy and hard to localize.

OK, I'm looking into it; I just haven't figured it out yet. I'll post reproduction code here once I have it.
From the trace this looks like it fails during data loading; is some data augmentation being applied there?
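Since the failure happens at boxes = boxes[keep] in the dataset's prepare step, one workaround I considered (an assumption, not verified against this bug) is to avoid boolean-mask indexing on the oneflow tensor and select rows by explicit integer indices instead:

```python
import oneflow as flow

# illustrative shapes: boxes [N, 4] in [x0, y0, x1, y1] format, keep a boolean mask over N
boxes = flow.randn(7, 4)
keep = (boxes[:, 3] > boxes[:, 1]) & (boxes[:, 2] > boxes[:, 0])

# instead of boxes = boxes[keep], gather with explicit integer indices
idx = flow.nonzero(keep, as_tuple=False).squeeze(1)
boxes = flow.index_select(boxes, 0, idx)
```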
张晓雨: Eager has no real problem here; for Graph, 啸宇 and 慈杰 tried to push this forward earlier and have more detailed notes.
许啸宇: I'm currently working on inplace for Graph. I started with some research; it touches dynamic shape inference, register planning, and dynamic memory allocation.
Related issue: Oneflow-Inc/OneTeam#1076

Got it, thanks 袁老师. I'm running in eager mode here, so it is more likely a problem in my own implementation; I'm trying to reproduce it.
> Recording a bug that still needs to be reproduced/investigated: RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false (full traceback above).
This problem went away after I updated oneflow; I trained for a number of iterations and it has not reappeared.
Notes on aligning the model loss, following https://github.com/Oneflow-Inc/OneTeam/issues/779.
no_aux_loss, AdamW, single GPU, pretrained weights loaded (loss curve)
aux_loss, AdamW, single GPU, pretrained weights loaded (loss curve)
aux_loss, AdamW, 4 GPUs, pretrained weights loaded. Because the torch version's DistributedSampler samples in a different order from libai's sampler, training currently uses a single sample; the loss curve is below.
Notes on aligning the model loss, following Oneflow-Inc/OneTeam#779:
- [x] Check that the network structure in model.py is aligned
- [x] Make sure the dataloader's shuffle is turned off
- [x] Make sure the network's dropout is turned off
- [x] Make sure the lr_scheduler and optimizer are identical
- [x] For double insurance, set every dropout_prob argument to 0 and put the model into .eval() mode, so that dropout, BN, and similar ops are fully deterministic during the run and contain no randomness (see the sketch below)
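A generic sketch of the last checklist item (the model object here is whatever the project's build step returns; the helper name is mine):

```python
import oneflow.nn as nn

def freeze_randomness(model: nn.Module) -> nn.Module:
    """Zero out all dropout layers and switch to eval mode for loss alignment."""
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.p = 0.0
    return model.eval()
```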
no_aux_loss, AdamW, single GPU (loss curve)

@CPFLAME @xiezipeng-ML Is this degree of loss alignment considered OK? Looking for some guidance. The loaded weights are from an already-converged model, so there may not be much of a visible downward trend.

Judging from the loss curves this looks basically fine. You could also train from the initial weights and check that the loss goes down.
Eager global model parallelism
Parameter alignment: https://github.com/facebookresearch/detr
TODO list for issue triage: