Segmentation fault occurs when compiling a custom ONNX model

min0628 commented 2 years ago

Hi, TI.

With a lot of help, I compiled the ONNX OD model, and it worked successfully on SK-TDA4VM. I tried to compile our custom ONNX OD model using tutorial_detection.ipynb.

1) I modified runtime_options and pipeline_configs to use our model.

runtime_options.update({'object_detection:meta_layers_names_list': '/models/model.prototxt'})

pipeline_configs = {
    'od-ssd_mobilenet_v2_model_onnx': dict(
        task_type='detection',
        calibration_dataset=calib_dataset,
        input_dataset=val_dataset,
        preprocess=preproc_transforms.get_transform_onnx((352,640), (352,640), backend='cv2', mean=(123.675, 116.28, 103.53)),
        session=session_type(
            work_dir=work_dir, target_device=settings.target_device, runtime_options=runtime_options,
            model_path='/models/model.onnx'),
        postprocess=postproc_transforms.get_transform_detection_mmdet_onnx(squeeze_axis=None, normalized_detections=False, formatter=postprocess.DetectionBoxSL2BoxLS()),
        metric=dict(label_offset_pred=datasets.coco_det_label_offset_80to90(label_offset=1)),
        model_info=dict(metric_reference={'accuracy_ap[.5:.95]%':27.2})
    )
}

2) When I tried to compile, a segmentation fault occurred.

tidl_tools_path                                 = /workspace/tidl_tools
artifacts_folder                                = /tmp/tmp3cv8e_h9/modelartifacts/8bits/od-ssd_mobilenet_v2_model_onnx/artifacts
tidl_tensor_bits                                = 8
debug_level                                     = 3
num_tidl_subgraphs                              = 16
tidl_denylist                                   = 
tidl_calibration_accuracy_level                 = 7
tidl_calibration_options:num_frames_calibration = 10
tidl_calibration_options:bias_calibration_iterations = 10
power_of_2_quantization                         = 2
enable_high_resolution_optimization             = 0
pre_batchnorm_fold                              = 1
add_data_convert_ops                          = 3
output_feature_16bit_names_list                 =
m_params_16bit_names_list                       =
reserved_compile_constraints_flag               = 1601
ti_internal_reserved_1                          =
Parsing ONNX Model
model_proto 0x7fff145faba0

 ****** WARNING : Network not identified as Object Detection network - Ignore if network is not OD *****

Supported TIDL layer type ---            Conv -- Conv_0
Supported TIDL layer type ---            Clip -- Clip_1
Supported TIDL layer type ---            Conv -- Conv_2
Supported TIDL layer type ---            Clip -- Clip_3
Supported TIDL layer type ---            Conv -- Conv_4
Supported TIDL layer type ---            Conv -- Conv_5
Supported TIDL layer type ---            Clip -- Clip_6
Supported TIDL layer type ---            Conv -- Conv_7
Supported TIDL layer type ---            Clip -- Clip_8
Supported TIDL layer type ---            Conv -- Conv_9
Supported TIDL layer type ---            Conv -- Conv_10
Supported TIDL layer type ---            Clip -- Clip_11
Supported TIDL layer type ---            Conv -- Conv_12
Supported TIDL layer type ---            Clip -- Clip_13
Supported TIDL layer type ---            Conv -- Conv_14

Segmentation fault: 11

Is object_detection:meta_arch_type necessary in runtime_options? I couldn't find any documentation about meta_arch_type, including in EdgeAI TIDL Tools.

mathmanu commented 2 years ago

Hi,

object_detection:meta_arch_type is necessary for OD models.

Please go through the examples in: https://github.com/TexasInstruments/edgeai-benchmark/blob/master/configs/detection.py and see if you are able to correct your configuration.
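As a rough illustration of the point above, the key can be added next to the existing prototxt entry. This is only a sketch: the value 3 is the one reported in this thread for SSD-based models in detection.py and may not apply to other architectures, and the empty starting dictionary is a stand-in for the runtime_options object that the notebook already provides.

```python
# Stand-in for the runtime_options dict that the notebook already defines.
runtime_options = {}

runtime_options.update({
    # Path to the meta-architecture prototxt, as in the original post.
    'object_detection:meta_layers_names_list': '/models/model.prototxt',
    # Assumption: 3 is the value reported in this thread for SSD-based models.
    'object_detection:meta_arch_type': 3,
})

print(runtime_options)
```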

min0628 commented 2 years ago

Hi,

I checked detection.py and found that SSD-based models use meta_arch_type = 3, so I tried compiling with meta_arch_type = 3. An error now occurs in ONNX Runtime. I understand that TIDL supports Concat. Why does this problem occur?

Traceback (most recent call last):
  File "/workspace/jai_benchmark/pipelines/pipeline_runner.py", line 147, in _run_pipeline
    accuracy_result = accuracy_pipeline(description)
  File "/workspace/jai_benchmark/pipelines/accuracy_pipeline.py", line 104, in __call__
    param_result = self._run(description=description)
  File "/workspace/jai_benchmark/pipelines/accuracy_pipeline.py", line 130, in _run
    self._import_model(description)
  File "/workspace/jai_benchmark/pipelines/accuracy_pipeline.py", line 182, in _import_model
    self._run_with_log(session.import_model, calib_data)
  File "/workspace/jai_benchmark/pipelines/accuracy_pipeline.py", line 282, in _run_with_log
    return func(*args, **kwargs)
  File "/workspace/jai_benchmark/sessions/onnxrt_session.py", line 69, in import_model
    outputs = self.interpreter.run(output_keys, calib_dict)
  File "/root/anaconda3/envs/benchmark/lib/python3.6/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 188, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.RuntimeException: [ONNXRuntimeError] : 6 : RUNTIME_EXCEPTION : Non-zero status code returned while running Concat node. Name:'Concat_194' Status Message: /root/onnxruntime/onnxruntime/core/providers/cpu/tensor/concat.cc:72 onnxruntime::common::Status onnxruntime::ConcatBase::PrepareForCompute(onnxruntime::OpKernelContext*, const std::vector<const onnxruntime::Tensor*>&, onnxruntime::Prepare&) const inputs_n_rank == inputs_0_rank was false. Ranks of input data are different, cannot concatenate them. expected rank: 3 got: 4

Log:

TIDL Meta PipeLine (Proto) File  : /tmp/tmp9y2etbac/modelartifacts/8bits/od-ssd_mobilenet_v2_onnxrt_models/model/model.prototxt
ssd

Number of OD backbone nodes = 0
Size of odBackboneNodeIds = 0

Preliminary subgraphs created = 3
Final number of subgraphs created are : 3, - Offloaded Nodes - 23, Total Nodes - 184

 ************** Frame index 1 : Running float import *************
WARNING: [TIDL_E_DATAFLOW_INFO_NULL] ti_cnnperfsim.out fails to allocate memory in MSMC. Please look into perfsim log. This model can only be used on PC emulation, it will get fault on target.
****************************************************
**          1 WARNINGS          0 ERRORS          **
****************************************************
 0.0s:  VX_ZONE_INIT:Enabled
 0.26s:  VX_ZONE_ERROR:Enabled
 0.42s:  VX_ZONE_WARNING:Enabled
 0.1774s:  VX_ZONE_INIT:[tivxInit:178] Initialization Done !!!

**********  Frame Index 1 : Running float inference **********

 ************** Frame index 1 : Running float import *************
WARNING: [TIDL_E_DATAFLOW_INFO_NULL] ti_cnnperfsim.out fails to allocate memory in MSMC. Please look into perfsim log. This model can only be used on PC emulation, it will get fault on target.
****************************************************
**          1 WARNINGS          0 ERRORS          **
****************************************************
 0.488703s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:344] Calling ialg.algAlloc failed with status = -1110
 0.488730s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:656] tivxAlgiVisionCreate returned NULL
 0.489104s:  VX_ZONE_ERROR:[ownContextSendCmd:817] Command ack message returned failure cmd_status: -1
 0.489145s:  VX_ZONE_ERROR:[ownContextSendCmd:851] tivxEventWait() failed.
 0.489164s:  VX_ZONE_ERROR:[ownNodeKernelInit:538] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode
 0.489177s:  VX_ZONE_ERROR:[ownNodeKernelInit:539] Please be sure the target callbacks have been registered for this core
 0.489216s:  VX_ZONE_ERROR:[ownNodeKernelInit:540] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel
 0.489279s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:583] kernel init for node 0, kernel com.ti.tidl ... failed !!!
 0.489306s:  VX_ZONE_ERROR:[vxVerifyGraph:2055] Node kernel init failed
 0.489338s:  VX_ZONE_ERROR:[vxVerifyGraph:2109] Graph verify failed
TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!
TIDL_RT_OVX: ERROR: Verify OpenVX graph failed
 0.490296s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:344] Calling ialg.algAlloc failed with status = -1110
 0.490319s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:656] tivxAlgiVisionCreate returned NULL
 0.490390s:  VX_ZONE_ERROR:[ownContextSendCmd:817] Command ack message returned failure cmd_status: -1
 0.490398s:  VX_ZONE_ERROR:[ownContextSendCmd:851] tivxEventWait() failed.
 0.490414s:  VX_ZONE_ERROR:[ownNodeKernelInit:538] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode
 0.490417s:  VX_ZONE_ERROR:[ownNodeKernelInit:539] Please be sure the target callbacks have been registered for this core
 0.490419s:  VX_ZONE_ERROR:[ownNodeKernelInit:540] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel
 0.490437s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:583] kernel init for node 0, kernel com.ti.tidl ... failed !!!
 0.490458s:  VX_ZONE_ERROR:[vxVerifyGraph:2055] Node kernel init failed
 0.490477s:  VX_ZONE_ERROR:[vxVerifyGraph:2109] Graph verify failed
 0.490505s:  VX_ZONE_ERROR:[ownGraphScheduleGraphWrapper:820] graph is not in a state required to be scheduled
 0.490536s:  VX_ZONE_ERROR:[vxProcessGraph:755] schedule graph failed
 0.490539s:  VX_ZONE_ERROR:[vxProcessGraph:760] wait graph failed
ERROR: Running TIDL graph ... Failed !!!

**********  Frame Index 1 : Running float inference **********

 ************** Frame index 1 : Running float import *************
WARNING: [TIDL_E_DATAFLOW_INFO_NULL] ti_cnnperfsim.out fails to allocate memory in MSMC. Please look into perfsim log. This model can only be used on PC emulation, it will get fault on target.
****************************************************
**          1 WARNINGS          0 ERRORS          **
****************************************************
 0.865370s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:344] Calling ialg.algAlloc failed with status = -1110
 0.865422s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:656] tivxAlgiVisionCreate returned NULL
 0.865478s:  VX_ZONE_ERROR:[ownContextSendCmd:817] Command ack message returned failure cmd_status: -1
 0.865523s:  VX_ZONE_ERROR:[ownContextSendCmd:851] tivxEventWait() failed.
 0.865550s:  VX_ZONE_ERROR:[ownNodeKernelInit:538] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode
 0.865552s:  VX_ZONE_ERROR:[ownNodeKernelInit:539] Please be sure the target callbacks have been registered for this core
 0.865555s:  VX_ZONE_ERROR:[ownNodeKernelInit:540] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel
 0.865558s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:583] kernel init for node 0, kernel com.ti.tidl ... failed !!!
 0.865562s:  VX_ZONE_ERROR:[vxVerifyGraph:2055] Node kernel init failed
 0.865564s:  VX_ZONE_ERROR:[vxVerifyGraph:2109] Graph verify failed
TIDL_RT_OVX: ERROR: Verifying TIDL graph ... Failed !!!
TIDL_RT_OVX: ERROR: Verify OpenVX graph failed
 0.866505s:  VX_ZONE_ERROR:[tivxAlgiVisionCreate:344] Calling ialg.algAlloc failed with status = -1110
 0.866532s:  VX_ZONE_ERROR:[tivxKernelTIDLCreate:656] tivxAlgiVisionCreate returned NULL
 0.866597s:  VX_ZONE_ERROR:[ownContextSendCmd:817] Command ack message returned failure cmd_status: -1
 0.866637s:  VX_ZONE_ERROR:[ownContextSendCmd:851] tivxEventWait() failed.
 0.866641s:  VX_ZONE_ERROR:[ownNodeKernelInit:538] Target kernel, TIVX_CMD_NODE_CREATE failed for node TIDLNode
 0.866643s:  VX_ZONE_ERROR:[ownNodeKernelInit:539] Please be sure the target callbacks have been registered for this core
 0.866645s:  VX_ZONE_ERROR:[ownNodeKernelInit:540] If the target callbacks have been registered, please ensure no errors are occurring within the create callback of this kernel
 0.866649s:  VX_ZONE_ERROR:[ownGraphNodeKernelInit:583] kernel init for node 0, kernel com.ti.tidl ... failed !!!
 0.866658s:  VX_ZONE_ERROR:[vxVerifyGraph:2055] Node kernel init failed
 0.866661s:  VX_ZONE_ERROR:[vxVerifyGraph:2109] Graph verify failed
 0.866713s:  VX_ZONE_ERROR:[ownGraphScheduleGraphWrapper:820] graph is not in a state required to be scheduled
 0.866719s:  VX_ZONE_ERROR:[vxProcessGraph:755] schedule graph failed
 0.866724s:  VX_ZONE_ERROR:[vxProcessGraph:760] wait graph failed
ERROR: Running TIDL graph ... Failed !!!

**********  Frame Index 1 : Running float inference **********
2022-03-30 07:01:20.059405900 [E:onnxruntime:, sequential_executor.cc:339 Execute] Non-zero status code returned while running Concat node. Name:'Concat_194' Status Message: /root/onnxruntime/onnxruntime/core/providers/cpu/tensor/concat.cc:72 onnxruntime::common::Status onnxruntime::ConcatBase::PrepareForCompute(onnxruntime::OpKernelContext*, const std::vector<const onnxruntime::Tensor*>&, onnxruntime::Prepare&) const inputs_n_rank == inputs_0_rank was false. Ranks of input data are different, cannot concatenate them. expected rank: 3 got: 4

This image shows the Concat_194 node.

mathmanu commented 2 years ago

It seems your model has several layers that the underlying TIDL does not support, and ONNXRuntime has created subgraphs to offload whatever can be supported in TIDL. This is not an issue in itself; it only indicates that your model may not be fully optimal.

Preliminary subgraphs created = 3
Final number of subgraphs created are : 3, - Offloaded Nodes - 23, Total Nodes - 184
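To put a number on how much of the model is offloaded, the summary line from the log can be parsed. The following is a minimal sketch: the helper name offload_stats is hypothetical, and the regular expression assumes the log line keeps exactly the format shown in this thread.

```python
import re

# Summary line copied from the compilation log in this thread.
log_line = ("Final number of subgraphs created are : 3, "
            "- Offloaded Nodes - 23, Total Nodes - 184")


def offload_stats(line):
    """Extract subgraph and node counts from a TIDL summary line (hypothetical helper)."""
    match = re.search(
        r"subgraphs created are\s*:\s*(\d+).*?"
        r"Offloaded Nodes\s*-\s*(\d+).*?"
        r"Total Nodes\s*-\s*(\d+)",
        line,
    )
    if match is None:
        return None
    subgraphs, offloaded, total = (int(g) for g in match.groups())
    return {
        "subgraphs": subgraphs,
        "offloaded": offloaded,
        "total": total,
        "offloaded_fraction": offloaded / total,
    }


stats = offload_stats(log_line)
print(f"{stats['offloaded']} of {stats['total']} nodes offloaded "
      f"({stats['offloaded_fraction']:.1%})")
```

For the line above, 23 of 184 nodes works out to 12.5%, which is why most of the graph still runs in ONNXRuntime on the CPU.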

Please compare your prototxt file to the other prototxt files used in the SSD examples in detection.py.

If you still have the issue, please share your ONNX file and prototxt file and we can take a look.

min0628 commented 2 years ago

Hi, I checked the SSD examples in detection.py, but I still have the issue. I have attached my model files (ONNX, prototxt): https://drive.google.com/file/d/13L96RRJ5oOw59J9-B0zRspEENsaSxM12/view?usp=sharing

anandp09 commented 2 years ago

@min0628 Can you confirm if you have run shape inference on this model? onnx.shape_inference.infer_shapes_path(input_model_path, output_model_path)
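A minimal sketch of running that call is shown below. The helper names and the "_shape" output naming are hypothetical; only onnx.shape_inference.infer_shapes_path comes from the comment above, and the sketch assumes the onnx package is installed.

```python
from pathlib import Path


def shape_inferred_path(model_path):
    """Return a sibling path such as model_shape.onnx (hypothetical naming)."""
    p = Path(model_path)
    return str(p.with_name(p.stem + "_shape" + p.suffix))


def run_shape_inference(model_path):
    """Write a copy of the model with inferred tensor shapes and return its path."""
    # Imported here so the path helper above works even without onnx installed.
    from onnx import shape_inference

    out_path = shape_inferred_path(model_path)
    shape_inference.infer_shapes_path(model_path, out_path)
    return out_path


# Example usage (the path is the one from the original post):
# new_model = run_shape_inference('/models/model.onnx')
```

The resulting file would then be passed as model_path in the pipeline configuration in place of the original model.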

min0628 commented 2 years ago

Hi, sorry for the late update. Unfortunately, I don't use that model anymore.

I got a different model from my colleague, and it works on SK-TDA4VM. There are a few issues, but I won't go into them here because they are a different topic. Thank you for your help.
