Half model error - Githubissues

NingNanXin commented 1 year ago

Thanks to the maintainers for open-sourcing this work. When I use TPAT to generate a plugin for ScatterElements, I get the following error:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (Concat_52) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=1 Target=3 Dimension=0

The command I ran was:

python onnx_to_plugin.py -i CodeFormer.onnx -o plan.onnx -n ScatterElements_1022 -dynamic=true -min=1 -max=6 -opt=3

The error is raised in compute_tensor_shape() in python/cuda_kernel.py, when half_model.onnx is loaded again. The traceback is:

  File "/data//TPAT/python/onnx_to_plugin.py", line 287, in <module>
    onnx2plugin(
  File "/data//TPAT/python/onnx_to_plugin.py", line 190, in onnx2plugin
    onnx_name_mapping_trt_plugin = generate_plugin_library(
  File "/data//TPAT/python/onnx_to_plugin.py", line 85, in generate_plugin_library
    cuda_kernel.run()
  File "/data//TPAT/python/cuda_kernel.py", line 54, in run
    graph_def = self.extract_target_onnx_node(self._onnx_model)
  File "/data//TPAT/python/cuda_kernel.py", line 211, in extract_target_onnx_node
    computed_tensor_shapes = self.compute_tensor_shape(
  File "/data//TPAT/python/cuda_kernel.py", line 163, in compute_tensor_shape
    session = ort.InferenceSession(half_model_path, providers=EP_list)
  File "/home/ningnx/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 360, in __init__
    self._create_inference_session(providers, provider_options, disabled_optimizers)
  File "/home/ningnx/anaconda3/lib/python3.9/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 408, in _create_inference_session
    sess.initialize_session(providers, provider_options, disabled_optimizers)
onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (Concat_52) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=1 Target=2 Dimension=0

Links to the ONNX files are attached: original onnx, half model

I hope someone can help me with this problem. Thank you!

buptqq commented 1 year ago

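While generating a plugin, TPAT runs shape inference several times to work out the input and output shapes of the target operator (where shape inference cannot determine a shape, it actually runs the model with ONNX Runtime). However, CodeFormer.onnx does not seem to run under ONNX Runtime. Could you first confirm that this file works with both shape inference and ONNX Runtime?
Shape inference: http://www.xavierdupre.fr/app/onnxcustom/helpsphinx/api/onnx_python/shape_inference.html
ONNX Runtime: https://onnxruntime.ai/docs/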

buptqq commented 1 year ago

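Also: for a large ONNX model, we recommend hand-writing a small ONNX model that contains only the op you need a plugin for, with the same shapes. Once the plugin has been generated, change the op type of that node in the large ONNX model to the plugin's class name, so that onnx-parser can recognize the plugin as well.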

NingNanXin commented 1 year ago

Thanks for the reply. The original ONNX model runs fine under onnxruntime. I will try the shape inference test tomorrow and post the results and code tomorrow morning.

(Sent as an email reply to QianQiu's comment of Wednesday, 1 March 2023, 17:19: https://github.com/Tencent/TPAT/issues/28#issuecomment-1449662571)

NingNanXin commented 1 year ago

This is my ONNX test code. Both shape inference and onnxruntime run without problems. The model is a super-resolution model.

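import cv2
import numpy as np
import onnx
import onnxruntime
from onnx import shape_inference


def onnx_infer():
    # Static checks: run ONNX shape inference and validate the model.
    onnx_model = onnx.load("CodeFormer.onnx")
    inferred_model = shape_inference.infer_shapes(onnx_model)
    onnx.checker.check_model(onnx_model)

    sess = onnxruntime.InferenceSession("CodeFormer.onnx")

    # Preprocess: 512x512 RGB, normalized to [-1, 1], NCHW float32.
    image = cv2.imread("00_00.png")
    if image is None:
        raise FileNotFoundError("00_00.png")
    image = cv2.resize(image, (512, 512), interpolation=cv2.INTER_LINEAR)
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    image = image / 255
    mean = np.array([0.5, 0.5, 0.5])
    std = np.array([0.5, 0.5, 0.5])
    image = (image - mean) / std
    image = np.transpose(image, [2, 0, 1])
    image = np.expand_dims(image, 0).astype(np.float32)

    # Passing None as output_names returns every model output.
    output = sess.run(None, {"input": image})[0]

    # Postprocess: map [-1, 1] back to [0, 255], HWC, BGR, 8-bit.
    output = np.squeeze(output, 0)
    output = np.clip(output, -1, 1)
    output = ((output + 1) / 2) * 255
    output = np.transpose(output, [1, 2, 0])
    output = np.round(output).astype(np.uint8)
    output = cv2.cvtColor(output, cv2.COLOR_RGB2BGR)
    cv2.imwrite("test.png", output)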

Test image: 00_00

buptqq commented 1 year ago

ScatterElement

I don't have a matching environment here. Could you run half_model.onnx with shape_inference and onnxruntime? half_model.onnx is the subgraph of CodeFormer.onnx cut from the input to the ScatterElements op.

NingNanXin commented 1 year ago

Loading half_model gives the same error:

onnxruntime.capi.onnxruntime_pybind11_state.Fail: [ONNXRuntimeError] : 1 : FAIL : Node (Concat_52) Op (Concat) [ShapeInferenceError] Can't merge shape info. Both source and target dimension have values but they differ. Source=1 Target=2 Dimension=0

For now I suspect dynamic batch is the cause, but I have not yet located where in the model the problem occurs. After I forced dynamic_batch to false in the code and exported the ONNX model with batch=1, the plugin was generated successfully.
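For readers unfamiliar with this message: Concat requires all inputs to agree on every dimension except the concatenation axis. The toy function below is a simplified pure-Python model of that rule, not the ONNX or ONNX Runtime implementation. Which input counts as "source" and which as "target" here is an assumption made to mirror the wording of the message.

```python
def concat_output_shape(shapes, axis):
    """Return the Concat output shape, or raise if the inputs disagree.

    Every dimension except `axis` must match across all inputs; the size
    along `axis` is the sum of the input sizes along that axis.
    """
    merged = list(shapes[0])
    axis = axis % len(merged)  # allow negative axes
    for shape in shapes[1:]:
        if len(shape) != len(merged):
            raise ValueError("rank mismatch")
        for dim, (target, source) in enumerate(zip(merged, shape)):
            if dim == axis:
                merged[dim] = target + source
            elif target != source:
                raise ValueError(
                    f"Source={source} Target={target} Dimension={dim}"
                )
    return merged
```

With one input whose batch dimension is 2 and another whose batch dimension stayed at 1, `concat_output_shape([[2, 3], [1, 5]], axis=1)` raises an error for dimension 0, which is the same kind of disagreement the message reports for Concat_52.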

buptqq commented 1 year ago

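TPAT's dynamic-batch support is actually implemented by padding. The core idea is to fill concrete values into the batch dimension of the dynamic-batch ONNX model, generate a plugin for each of them, and then stitch these together with a single unified plugin. The code that assigns real values to the batch dimension of the ONNX model is the add_explicit_bs function in python/onnx_to_plugin.py. So for a fairly large model, where for example the dimension that holds the batch size may change across the graph, running shape inference and ONNX Runtime on the subgraph cut from the input to the target node can fail. The simpler approach, though, is to create a ScatterElements ONNX op with the same input and output shapes as in CodeFormer.onnx, generate the plugin from it, and then use that plugin for CodeFormer.onnx.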

NingNanXin commented 1 year ago

Understood 🫡

Tencent / TPAT

Half model error #28