Deep Learning Model Conversion and Deployment (with a Detailed Analysis of the ONNX Format)

bindog commented 4 years ago

https://bindog.github.io/blog/2020/03/13/deep-learning-model-convert-and-depoly/

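Background: Once a deep learning model has been trained, deploying it and putting it to work in production is a crucial step. A trained model cannot be judged only on public datasets and leaderboards; it also has to create value in real business scenarios rather than sit on an experiment machine for PR purposes. Under current conditions, deploying a model usually means converting it, and the conversion process differs from platform to platform. Most engineers encounter...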

waittim commented 4 years ago

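Hello, when I tried this I found that onnx_model.graph.input contains only the input of the whole model, i.e. the image information. I can't find the intermediate weights. Where can I find information such as the intermediate weights? Thanks!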

bindog commented 4 years ago

The weights are in model.graph.initializer.

bindog commented 4 years ago

This is a constant (which is in fact a node as well). Since you already know how to find a node through node.input, you can search node.output in the same way to find 1146.

waittim commented 4 years ago

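Actually, I can find the corresponding node with this method. But check_model still reports that the node is missing. Could it be that the topological relationship alone is not enough, and the nodes also have to be stored in the order in which they are used?

This is how I look up the node: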

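import onnx
import onnx.numpy_helper

# onnx_model is assumed to have been loaded earlier with onnx.load(...)
name = '731'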

print('#'*50,'\nFound in initializer: ')
for initializer in onnx_model.graph.initializer:
    if initializer.name == name:
        print(onnx.numpy_helper.to_array(initializer))

print('#'*50,'\nFound in output: ')
for node in onnx_model.graph.node:
    if name in node.output:
        print(node)

print('#'*50,'\nFound in input: ')
for node in onnx_model.graph.node:
    if name in node.input:
        print(node)
        for att in node.attribute:
            print(att.type)

The result is:

################################################## 
Found in initializer: 
################################################## 
Found in output: 
output: "731"
name: "Constant_228"
op_type: "Constant"
attribute {
  name: "value"
  t {
    data_type: 6
    raw_data: "\000\000\000\000"
  }
  type: TENSOR
}

################################################## 
Found in input: 
input: "730"
input: "731"
output: "732"
name: "Gather_229"
op_type: "Gather"
attribute {
  name: "axis"
  i: 0
  type: INT
}

2

You can see that there is a constant node with 731 as its output. But when I call onnx.checker.check_model(onnx_model), it still raises an error saying that nothing outputs 731:

---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
<ipython-input-275-259ab7a176f6> in <module>
----> 1 onnx.checker.check_model(onnx_model)

~/anaconda3/envs/mask/lib/python3.6/site-packages/onnx/checker.py in check_model(model, full_check)
    100         if sys.getsizeof(protobuf_string) > MAXIMUM_PROTOBUF:
    101             raise ValueError('This protobuf of onnx model is too large (>2GB). Call check_model with model path instead.')
--> 102         C.check_model(protobuf_string)
    103         m = model
    104     if full_check:

ValidationError: Nodes in a graph must be topologically sorted, however input '731' of node: 
input: "730" input: "731" output: "732" name: "Gather_229" op_type: "Gather" attribute { name: "axis" i: 0 type: INT }
 is not output of any previous nodes.

bindog commented 4 years ago

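Take a look at this commit of mine: https://github.com/bindog/onnx-surgery/commit/e5e137173f5e9f7dbf8d8dc75fe5ae44a12dead0

It fixes exactly this check_model problem. There is indeed an ordering requirement: in model.graph.node, every node that produces an input of the current node must appear before the current node.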
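The ordering requirement can be illustrated with a small reordering routine. This is a sketch under stated assumptions, not the code from the commit: topologically_sort is a hypothetical helper, the nodes are plain stand-in objects that expose .input and .output like an ONNX NodeProto, and tensor "730" is treated as already available purely for the demonstration.

```python
from types import SimpleNamespace


def topologically_sort(nodes, available):
    """Reorder nodes so that every input is produced before it is consumed.

    nodes: objects with .input and .output lists of tensor names.
    available: names that need no producer (graph inputs, initializers).
    """
    known = set(available)
    pending = list(nodes)
    ordered = []
    while pending:
        still_waiting = []
        progressed = False
        for node in pending:
            # An empty input name marks an omitted optional input in ONNX.
            if all(name == "" or name in known for name in node.input):
                ordered.append(node)
                known.update(node.output)
                progressed = True
            else:
                still_waiting.append(node)
        if not progressed:
            raise ValueError("graph has a cycle or an input with no producer")
        pending = still_waiting
    return ordered


# Stand-ins for the two nodes from the thread, deliberately in the wrong order.
gather = SimpleNamespace(name="Gather_229", input=["730", "731"], output=["732"])
constant = SimpleNamespace(name="Constant_228", input=[], output=["731"])

ordered = topologically_sort([gather, constant], available={"730"})
print([node.name for node in ordered])  # ['Constant_228', 'Gather_229']
```

Inserting a new node directly before its first consumer, rather than appending it at the end, keeps the list sorted without a full reordering pass.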

waittim commented 4 years ago

Got it, thanks! After changing append to insert, the model also passed check_model(). But when I tested it with onnxruntime, I got this error: InvalidGraph: [ONNXRuntimeError] : 10 : INVALID_GRAPH : Load model from weights/yolo-fastest-transfer.onnx failed:This is an invalid model. Type Error: Type 'tensor(int32)' of input parameter (1540) of operator (ConstantOfShape) in node (ConstantOfShape_245) is invalid. When I went back and checked, I found:

################################################## 
Found in initializer: 
dims: 1
data_type: 6
name: "1540"
raw_data: "\005\000\000\000"

[5]
################################################## 
Found in output: 
################################################## 
Found in input: 
input: "1540"
output: "757"
name: "ConstantOfShape_245"
op_type: "ConstantOfShape"
attribute {
  name: "value"
  t {
    dims: 1
    data_type: 6
    raw_data: "\001\000\000\000"
  }
  type: TENSOR
}

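Apart from data_type and raw_data, this looks identical to the INT64 version before the modification. After reading the ConstantOfShape documentation, my guess is that this type of node cannot take tensor(int32) as input and only accepts tensor(int64). (I am converting INT64 to INT32 because the model needs to be deployed in a JS environment.)

Do you know of any node type that accepts tensor(int32) as input and provides the same functionality? I am reading the original documentation now, but I have only just started, so progress is slow...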

Elonaever commented 3 years ago

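Hello, I think your article is very well written. However, I noticed that onnxruntime also seems to support TensorRT, ARM platforms, and others. Is there any difference between using that approach and using TensorRT and similar tools directly?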

young169 commented 3 years ago

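Hello, I'd like to ask: what should I do about layers that TensorRT does not support, such as Instance Norm? Do I have to wait for TensorRT to support them, or can I solve this by implementing them myself?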

leesendy commented 2 years ago

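Hello, I noticed that your article says: "Note that TensorFlow model inputs are generally quite flexible: the batch_size can be left unspecified, and data with different batch sizes can be passed in at runtime. In frameworks such as ONNX and TensorRT, however, we usually specify a fixed batch_size. How do you change it? See the small tool I wrote in the previous article, which includes an example showing how to modify the batch_size of an ONNX model." How can I get the tool described there for modifying an ONNX model's batch_size?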