PINTO0309 / onnx2tf

Self-Created Tools to convert ONNX files (NCHW) to TensorFlow/TFLite/Keras format (NHWC). The purpose of this tool is to solve the massive Transpose extrapolation problem in onnx-tensorflow (onnx-tf). I don't need a Star, but give me a pull request.
MIT License

Yolov7-tiny to TensorflowLite conversion results in a dynamic output model incompatible with TfLite Java API #159

Closed ingura closed 1 year ago

ingura commented 1 year ago

Issue Type

Others

onnx2tf version number

1.5.36

onnx version number

1.12.0

tensorflow version number

2.10.1

Download URL for ONNX

pip install onnx==1.12.0

Parameter Replacement JSON

none

Description

Hi, your library is awesome!

I converted the Yolov7-tiny from PyTorch to TfLite using: onnx2tf -i yolov7-tiny.onnx -o models-NHWC-final/ -osd -oh5 -cotof

I am trying to use it on an Android device. The model works when tested on a PC; however, according to the documentation (https://www.tensorflow.org/lite/guide/inference), the TensorFlow Lite Java API for Android does not support models with dynamic outputs, while the resulting yolo tflite model has a dynamic output size (the number of outputs changes with the number of detections).

On the other hand, if I follow the conversion path PyTorch -> ONNX -> TensorFlow, I do get a yolov7 with a fixed output size, so I suspect it is possible to achieve this with onnx2tf as well, while also doing the NCHW to NHWC conversion in the process.

Is there a way to have onnx2tf output a fixed/static output .tflite model for yolov7-tiny?

Thank you

PINTO0309 commented 1 year ago

I have already implemented almost all the features that engineers around the world need. Please read the README seriously first.

https://github.com/PINTO0309/onnx2tf#cli-parameter

-ois OVERWRITE_INPUT_SHAPE [OVERWRITE_INPUT_SHAPE ...], \
    --overwrite_input_shape OVERWRITE_INPUT_SHAPE [OVERWRITE_INPUT_SHAPE ...]
  Overwrite the input shape.
  The format is
  "i1:dim0,...,dimN" "i2:dim0,...,dimN" "i3:dim0,...,dimN"
  When there is only one input, for example,
  "data:1,3,224,224"
  When there are multiple inputs, for example,
  "data1:1,3,224,224" "data2:1,3,112" "data3:5"
  A value of 1 or more must be specified.
  Numerical values other than dynamic dimensions are ignored.
  If specified at the same time as --batch_size, --batch_size is ignored.

If the model includes an NMS, then padding must be added after the NMS to keep the output size fixed. You have not shared the relevant ONNX files with me, so I am unable to determine exactly what the problem is.
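
For illustration, a minimal sketch of the kind of padding I mean, assuming a dynamic [n, 7] detection tensor coming out of the NMS (the layout and the fixed size of 100 are assumptions):

    import tensorflow as tf

    def pad_to_fixed(detections, max_det=100):
        # detections: [n, 7] with dynamic n; returns a fixed [max_det, 7] tensor,
        # zero-filling the (max_det - n) rows after the real detections.
        return tf.pad(detections, [[0, max_det - tf.shape(detections)[0]], [0, 0]])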

ingura commented 1 year ago

Thanks for your prompt reply.

I went through all the paragraphs in the README that mention the "output" of the network, and I could not find anything related to fixing the output size or choosing a static output size. As I understand it, the "-ois" parameter mentioned above overwrites the input shape but not the output.

To generate the ONNX model I use the following line: python export.py --weights yolov7-tiny.pt --grid --end2end --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640 --max-wh 640 where "export.py" comes from the standard yolov7 distribution: https://github.com/WongKinYiu/yolov7/blob/main/export.py

export.py generates my yolov7-tiny.onnx model. The following line takes the resulting yolov7-tiny.onnx and generates a fixed-output TensorFlow model which runs well on a PC (but the input is NCHW): onnx-tf convert -i yolov7-tiny.onnx -o model_tf/

So the yolov7-tiny.onnx model I use with onnx2tf should be functional. The parameters exposed by onnx2tf are indeed many; however, I could not find anything related to fixing the output size of the model.

Have I missed anything important?

PINTO0309 commented 1 year ago

Understood. I will add a parameter to convert to a fixed-size NMS in the next version of the tool. Would you be satisfied with just being able to specify the number of boxes to output? e.g. --fixed_nms_output_boxes=100 (onnx: yolov7-tiny.onnx.zip)

image

However, since the mission of this tool is to faithfully convert the input ONNX file, I feel that if the structure of the model is to be rewritten significantly, it would be cleaner to rewrite the export logic on the YOLOv7 side instead. This is because there are more complex models that carry a lot of post-processing behind the NMS; YOLOvN post-processing is comparatively simple.

ingura commented 1 year ago

The TensorFlow API for Android basically requires a fixed-size output tensor of shape [num_boxes, 7], so fixing the number of output boxes should solve the problem.

Specifically, the output tensor will always have the shape [100, 7] regardless of how many objects are in the input image.

It's true that adding a static output to the .tflite version of yolov7 will not represent the yolov7 structure faithfully. However, it is the only structure that works on an Android platform, so without a fixed output the .tflite conversion is not very useful on exactly the platform for which it was made.

PINTO0309 commented 1 year ago

Yes, I am. I have been using TensorFlow quite a bit for a long time, so I understand the specification you gave me.

This is because there are more complex models that come with a lot of post-processing behind the NMS. YOLOvN post-processing is too simple.

This is a very big problem with the proposed specifications.

ingura commented 1 year ago

Well, if all these models have a dynamic output, they cannot be used as .tflite models on an Android platform, so the conversion to .tflite would not be very useful without a static output.

Knowing that there are models whose output is very difficult to make static, what if this conversion option were offered for the many models that do not have a lot of processing behind the NMS block, but not necessarily for the other models?

Basically, we would have a parameter that allows fixing the output size for the majority of simpler models (with regard to their NMS) so they can be used as .tflite models on Android. And to let users know what to expect when they use this parameter, a warning would be displayed explaining that not all models can reasonably have their output fixed. This way the library keeps most of its utility while avoiding the problems of misuse.

PINTO0309 commented 1 year ago

OK. Then I will look into how to add the feature as a special option.

ingura commented 1 year ago

Thanks

PINTO0309 commented 1 year ago

Idea notes for implementation, taking a break from my day job.

  1. Delete all post-processing added on the YOLOvN side. This is because the post-processing given by the official paper implementation is quite redundant.
    • e.g. YOLOv7 image
  2. Allow the output OP name of the model body to be specified.
  3. Delete all OPs after the output OP name specified by the user.
  4. Allow users to specify the maximum output bounding box, NMS threshold, number of classes, and score threshold.
  5. Generate a clean post-processing of my proprietary implementation.
  6. Allow users to choose which of the repertoire of post-processes to use for YOLO, SSD, and other object detection.
  7. For mobile devices, or low-spec edge devices, having the number of output bounding boxes fixed at 100 is very detrimental to performance. Therefore, the post-processing is rewritten according to the maximum number of detection bounding boxes specified by the user.
  8. If the final output is less than 100, zero-pad or -1 pad all portions of (100 - number of detections).
  9. I am not sure if the 7 in [num_boxes, 7] really needs to be 7. If possible, I do not want to output useless tensors.
  10. The shape of the input tensor to post-processing is limited to the shape of [batch_size, prior_box_numbers, (background_class + classes)].
  11. selective --post_process_type = [None | 'yolo' | 'ssd' | 'xxx' | 'yyy']
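
To make items 4, 5, and 8 above concrete, a rough skeleton (all names, defaults, and the zero-padding choice here are hypothetical, written only to illustrate the idea):

    import tensorflow as tf

    def clean_postprocess(boxes, scores, max_det=100, iou_thres=0.65, score_thres=0.35):
        # Item 4: user-specified max boxes, NMS threshold, and score threshold.
        # boxes: [num_priors, 4] as [y1, x1, y2, x2], scores: [num_priors].
        idx = tf.image.non_max_suppression(boxes, scores, max_det, iou_thres, score_thres)
        selected = tf.gather(boxes, idx)  # [n, 4] with dynamic n <= max_det
        # Item 8: zero-pad the missing (max_det - n) rows so the output is always [max_det, 4].
        return tf.pad(selected, [[0, max_det - tf.shape(selected)[0]], [0, 0]])
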
PINTO0309 commented 1 year ago

The contents of the standard post-processing of TFLite that I analyzed from the binary files two years ago. https://github.com/PINTO0309/tflite2tensorflow/blob/c13504df2f82dc234f1009e34dbab9c8b65c7ce4/tflite2tensorflow/tflite2tensorflow.py#L5278-L5509

  elif custom_op_type == 'TFLite_Detection_PostProcess':
      TFLite_Detection_PostProcess_flg = True
      ################################################################### Extraction of boxes, scores, anchors, options
      boxes = None
      try:
          boxes_detail = interpreter._get_tensor_details(op['inputs'][0])
      except:
          pass
      try:
          boxes = tensors[op['inputs'][0]]
      except:
          boxes = interpreter.get_tensor(boxes_detail['index'])
      scores = None
      try:
          scores_detail = interpreter._get_tensor_details(op['inputs'][1])
      except:
          pass
      try:
          scores = tensors[op['inputs'][1]]
      except:
          scores = interpreter.get_tensor(scores_detail['index'])
      anchors = None
      try:
          anchors_detail = interpreter._get_tensor_details(op['inputs'][2])
      except:
          pass
      try:
          anchors = tensors[op['inputs'][2]]
      except:
          anchors = interpreter.get_tensor(anchors_detail['index'])
      anchors = backward_quantization(anchors_detail, anchors)
      """
      custom_options = [
          120,95,115,99,97,108,101,0, #x_scale
          100,101,116,101,99,116,105,111,110,115, 95,112,101,114,95,99,108,97,115,115,0, #detections_per_class
          110,117,109,95,99,108,97,115,115,101,115,0, #num_classes
          121,95,115,99,97,108,101,0, #y_scale
          110,109,115,95,115,99,111,114,101,95,116,104,114,101,115,104,111,108,100,0, #nms_score_threshold
          119,95,115,99,97,108,101,0, #w_scale
          109,97,120,95,100,101,116,101,99,116,105,111,110,115,0, #max_detections
          104,95,115,99,97,108,101,0, #h_scale
          117,115,101,95,114,101,103,117,108,97,114,95,110,109,115,0, #use_regular_nms
          109,97,120,95,99,108,97,115,115,101,115,95,112,101,114,95,100,101,116,101,99,116,105,111,110,0, #max_classes_per_detection
          110,109,115,95,105,111,117,95,116,104,114,101,115,104,111,108,100,0, #nms_iou_threshold
          11,153,70,47,87,23,117,138,68,100,170,130,11,0,0,0,1,0,0,0,11,0,0,0, #24
          0,0,0,0, #detections_per_class 0
          0,0,160,64, #h_scale 5.0
          1,0,0,0, #max_classes_per_detection 1
          100,0,0,0, #max_detections 100
          154,153,25,63, #nms_iou_threshold 0.6000000238418579
          119,204,43,50, #nms_score_threshold 0.00000000999999993922529
          90,0,0,0, #num_classes 90
          0,0,0,0, #use_regular_nms false 0:false 1:true
          0,0,160,64, #w_scale 5.0
          0,0,32,65, #x_scale 10.0
          0,0,32,65, #y_scale 10.0
          6,14,6,6,14,14,6,106,14,14,14,55,38,1]

          print(struct.pack('<i', 0), '@', 0) #detections_per_class b'\x00\x00\x00\x00' @ 0,0,0,0
          print(struct.pack('<f', 5.0), '@', 5.0) #h_scale b'\x00\x00\xa0\x40' @ 5 -> 0,0,160,64
          print(struct.pack('<i', 1), '@', 1) #max_classes_per_detection b'\x01\x00\x00\x00' @ 1,0,0,0
          print(struct.pack('<i', 100), '@', 100) #max_detections b'\x64\x00\x00\x00' @ 100,0,0,0
          print(struct.pack('<f', 0.6000000238418579), '@', 0.6000000238418579) #nms_iou_threshold -> b'\x9a\x99\x19\x3F' @ 0.6000000238418579 -> 154,153,25,63
          print(struct.pack('<f', 0.00000000999999993922529), '@', 0.00000000999999993922529) #nms_score_threshold -> b'\x77\xcc\x2b\x32' @  -> 119,204,43,50
          print(struct.pack('<i', 90), '@', 90) #num_classes b'\x5a\x00\x00\x00' @ 90,0,0,0
          print(struct.pack('<?', False), '@', False) #use_regular_nms b'\x00\x00\x00\x00' @ 0,0,0,0
          print(struct.pack('<f', 5.0), '@', 5.0) #w_scale b'\x00\x00\xa0\x40' @ 5 -> 0,0,160,64
          print(struct.pack('<f', 10.0), '@', 10.0) #x_scale b'\x00\x00\x20\x41' @ 10.0 -> 0,0,32,65
          print(struct.pack('<f', 10.0), '@', 10.0) #y_scale b'\x00\x00\x20\x41' @ 10.0 -> 0,0,32,65

          print(struct.unpack_from('<i', bytes([0,0,0,0]))[0]) #detections_per_class
          print(struct.unpack_from('<f', bytes([0,0,160,64]))[0]) #h_scale
          print(struct.unpack_from('<i', bytes([1,0,0,0]))[0]) #max_classes_per_detection
          print(struct.unpack_from('<i', bytes([100,0,0,0]))[0]) #max_detections
          print(struct.unpack_from('<f', bytes([154,153,25,63]))[0]) #nms_iou_threshold
          print(struct.unpack_from('<f', bytes([119,204,43,50]))[0]) #nms_score_threshold
          print(struct.unpack_from('<i', bytes([90,0,0,0]))[0]) #num_classes
          print(struct.unpack_from('<?', bytes([1,0,0,0]))[0]) #use_regular_nms
          print(struct.unpack_from('<f', bytes([0,0,160,64]))[0]) #w_scale
          print(struct.unpack_from('<f', bytes([0,0,32,65]))[0]) #x_scale
          print(struct.unpack_from('<f', bytes([0,0,32,65]))[0]) #y_scale
      """
      options = op['custom_options']
      custom_options = read_flexbuffer(np.array(options, dtype=np.uint8).tobytes())
      print('custom_options:')
      pprint.pprint(custom_options)

      h_scale = custom_options['h_scale']
      w_scale = custom_options['w_scale']
      y_scale = custom_options['y_scale']
      x_scale = custom_options['x_scale']
      nms_score_threshold = 0.0 if (custom_options['nms_score_threshold'] == -float('inf') or custom_options['nms_score_threshold'] == float('inf')) else custom_options['nms_score_threshold']
      nms_iou_threshold = 0.0 if (custom_options['nms_iou_threshold'] == -float('inf') or custom_options['nms_iou_threshold'] == float('inf')) else custom_options['nms_iou_threshold']
      num_classes = custom_options['num_classes']
      max_classes_per_detection = custom_options['max_classes_per_detection']
      max_detections = custom_options['max_detections']
      use_regular_nms = custom_options['use_regular_nms']
      detections_per_class = 0
      try:
          detections_per_class = custom_options['detections_per_class']
      except:
          pass

      output_detail1 = interpreter._get_tensor_details(op['outputs'][0])
      output_detail2 = interpreter._get_tensor_details(op['outputs'][1])
      output_detail3 = interpreter._get_tensor_details(op['outputs'][2])
      output_detail4 = interpreter._get_tensor_details(op['outputs'][3])

      ################################################################### Calculation of anchors
      anchors_yx = anchors[:, 0:2]
      anchors_hw = anchors[:, 2:4]

      ################################################################### Calculation of boxes
      boxes_div = tf.divide(boxes, [y_scale, x_scale, h_scale, w_scale])
      # ycenter, xcenter
      ycenter_xcenter = boxes_div[:, :, 0:2]
      ycenter_xcenter_calc = ycenter_xcenter * anchors_hw + anchors_yx
      # half_h, half_w
      half_h_half_w = boxes_div[:, :, 2:4]
      half_h_half_w_calc = tf.math.exp(half_h_half_w) * anchors_hw * 0.5
      # scale conversion
      """
      box.ymin = ycenter - half_h;
      box.xmin = xcenter - half_w;
      box.ymax = ycenter + half_h;
      box.xmax = xcenter + half_w;
      """
      box_ymin_box_xmin = ycenter_xcenter_calc - half_h_half_w_calc
      box_ymax_box_xmax = ycenter_xcenter_calc + half_h_half_w_calc
      boxes_concat = tf.concat([box_ymin_box_xmin, box_ymax_box_xmax], axis=-1)[0]

      ################################################################### Calculation of scores
      if scores.shape[2] > 1:
          scores_slice = scores[:,:,1:num_classes+1][0]
      else:
          scores_slice = scores[0]
      scores_argmax = tf.math.argmax(scores_slice, axis=-1, output_type=tf.int32)
      scores_reduce_max = tf.math.reduce_max(scores_slice, axis=-1)

      ################################################################### Calculation of NMS
      def NonMaxSuppressionV5_(boxes, scores, max_output_size: int, iou_threshold, score_threshold, soft_nms_sigma, pad_to_max_output_size):
          selected_indices, selected_scores, valid_outputs = \
              tf.raw_ops.NonMaxSuppressionV5(
                  boxes=boxes,
                  scores=scores,
                  max_output_size=max_output_size,
                  iou_threshold=iou_threshold,
                  score_threshold=score_threshold,
                  soft_nms_sigma=soft_nms_sigma,
                  pad_to_max_output_size=pad_to_max_output_size
              )
          return selected_indices, selected_scores, valid_outputs

      def NonMaxSuppressionV3_(boxes, scores, max_output_size: int, iou_threshold, score_threshold):
          selected_indices = \
              tf.raw_ops.NonMaxSuppressionV3(
                  boxes=boxes,
                  scores=scores,
                  max_output_size=max_output_size,
                  iou_threshold=iou_threshold,
                  score_threshold=score_threshold
              )
          return selected_indices

      selected_indices = None
      selected_scores = None
      valid_outputs = None
      if not optimizing_for_openvino_and_myriad:
          selected_indices, selected_scores, valid_outputs = \
              tf.keras.layers.Lambda(
                  NonMaxSuppressionV5_,
                  arguments={'scores': scores_reduce_max,
                              'max_output_size': max_detections,
                              'iou_threshold': nms_iou_threshold,
                              'score_threshold': nms_score_threshold,
                              'soft_nms_sigma': 0.0,
                              'pad_to_max_output_size': True}
              )(boxes_concat)

      else:
          selected_indices = \
              tf.keras.layers.Lambda(
                  NonMaxSuppressionV3_,
                  arguments={'scores': scores_reduce_max,
                              'max_output_size': max_detections,
                              'iou_threshold': nms_iou_threshold,
                              'score_threshold': nms_score_threshold}
              )(boxes_concat)
          selected_scores = tf.gather(
              scores_reduce_max,
              selected_indices
          )
          valid_outputs = max_detections
      ################################################################### Calculation of outputs
      bounding_boxes = tf.identity(
          tf.expand_dims(
              tf.gather(
                  boxes_concat,
                  selected_indices
              ),
              axis=0),
          name='TFLite_Detection_PostProcess0'
      )
      class_labels = tf.identity(
          tf.expand_dims(
              tf.gather(
                  scores_argmax,
                  selected_indices
              ),
              axis=0
          ),
          name='TFLite_Detection_PostProcess1'
      )
      class_confidences = tf.identity(
          tf.expand_dims(
              selected_scores,
              axis=0
          ),
          name='TFLite_Detection_PostProcess2'
      )
      num_of_boxes = tf.identity(
          tf.expand_dims(
              valid_outputs,
              axis=0
          ),
          name='TFLite_Detection_PostProcess3'
      )

      tensors[output_detail1['index']] = bounding_boxes
      tensors[output_detail2['index']] = class_labels
      tensors[output_detail3['index']] = class_confidences
      tensors[output_detail4['index']] = num_of_boxes

PINTO0309 commented 1 year ago

@ingura What exactly is "the TensorFlow API for Android" that you presented, and which document specifically is best to refer to for understanding it? Actually, rather than the API spec, I would like to know the breakdown of the 7 in [num_boxes, 7] that you are expecting. https://developer.android.com/ndk/guides/neuralnetworks

[X1, Y1, X2, Y2, ClassID, ClassScore, ???] <- valid_box or invalid_box or batch_number?

I have lots of knowledge of analyzing FlatBuffers and modifying models, but no development experience using the Android API.

ingura commented 1 year ago

Your implementation details are awesome.

About the API: I am using the Java version of tensorflow-lite-gpu. Here is my TensorFlow dependency list:

    implementation "org.tensorflow:tensorflow-lite:${tflite_version}"
    implementation "org.tensorflow:tensorflow-lite-gpu:${tflite_version}"
    implementation "org.tensorflow:tensorflow-lite-gpu-api:${tflite_version}"

Here is a good guide to TFLite for Android: https://www.tensorflow.org/lite/guide/inference

In the yolov7 case our output tensor has the shape [num_boxes, 7]. The result for each box is an array of size 7 that contains:

[batch_number, boxLeftLimitX, boxTopLimitY, boxRightLimitX, boxBottomLimitY, ClassID, ClassScore]

The model outputs the top "num_boxes" results ordered in terms of the "ClassScore".
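
As a sketch of the consuming side, assuming padded rows are zero-filled (the padding convention is an assumption on my part):

    import numpy as np

    def valid_detections(output, score_thres=0.0):
        # output: [num_boxes, 7] rows of [batch, x1, y1, x2, y2, class_id, score];
        # padded rows carry a zero score, so keep only rows above the threshold.
        return output[output[:, 6] > score_thres]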

Thank you for your effort!

PINTO0309 commented 1 year ago

Thank you. I understood.

I will get around to the implementation, but major concerns remain. I recently tried to support an implementation of a model meant to use Android GPUs, but it did not work. For more information, please see the issue below.

I cannot run a transformer model with token-level output on accelerated hardware in TF Lite #59232

Currently, the GPU Delegate only supports the following very limited set of basic OPs. Also, Gather cannot be used, and there are significant restrictions on the use of strided_slice. Therefore, post-processing implemented along the lines of this idea will either yield a model that falls back to the CPU, or the GPU Delegate may raise an error and abort.

https://www.tensorflow.org/lite/performance/gpu

https://www.tensorflow.org/lite/android/delegates/gpu

ADD
AVERAGE_POOL_2D
CONCATENATION
CONV_2D
DEPTHWISE_CONV_2D v1-2
EXP
FULLY_CONNECTED
LOGISTIC
LSTM v2 (Basic LSTM only)
MAX_POOL_2D
MAXIMUM
MINIMUM
MUL
PAD
PRELU
RELU
RELU6
RESHAPE
RESIZE_BILINEAR v1-3
SOFTMAX
STRIDED_SLICE
SUB
TRANSPOSE_CONV

Just my guess before doing the implementation, but I think you may eventually need to re-implement the post-processing part as a custom GPU Delegate or a custom operation.

ingura commented 1 year ago

If the GPU support is not there, let's have it run on the CPU and see if it is comparable with yolov4 in terms of computational performance. And yolov4 did pretty well.

The paper mentions that there is a significant computational reduction in yolov7 compared to v4.

ingura commented 1 year ago

Not always efficiently, but TFLite can split inference execution between the GPU and CPU: https://www.tensorflow.org/lite/performance/gpu

If some of the ops are not supported by the GPU delegate, the framework will only run a part of the graph on the GPU and the remaining part on the CPU.
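
For reference, the same fallback behavior can be exercised from Python, though the delegate library name below is platform specific and just an assumption:

    import tensorflow as tf

    # Ops the GPU delegate cannot handle are left to the default CPU kernels.
    gpu = tf.lite.experimental.load_delegate('libtensorflowlite_gpu_delegate.so')
    interpreter = tf.lite.Interpreter(
        model_path='yolov7-tiny_float32.tflite',
        experimental_delegates=[gpu],
    )
    interpreter.allocate_tensors()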

PINTO0309 commented 1 year ago

I see. I'll try to implement it before worrying about this and that.

I don't have an environment for testing on Android, so I would be happy to have you help me test the tool once I have finished my experimental customization of the tool.

ingura commented 1 year ago

I look forward to it

PINTO0309 commented 1 year ago

1. Input pattern

  1. [batch, boxes, [x1, y1, x2, y2, boxscore, classscores]] <- DAMO-YOLO
  2. [batch, boxes, [y1, x1, y2, x2, boxscore, classscores]]
  3. [batch, boxes, [x, y, w, h, boxscore, classscores]] <- YOLOv7, FreeYOLO (see the conversion sketch at the end of this comment)
  4. [batch, boxes, [y, x, h, w, boxscore, classscores]]
  5. [batch, [x, y, w, h, boxscore, classscores], boxes] <- YOLOv8

2. Padding pattern

Somehow I don't think the GPU Delegate is compatible with NMS.
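
As an aside, pattern 3 (the YOLOv7 layout) has to be converted to the [y1, x1, y2, x2] corner format that TensorFlow's NMS ops expect; a minimal sketch of that conversion:

    import tensorflow as tf

    def xywh_to_corners(xywh):
        # xywh: [..., 4] as [x_center, y_center, width, height].
        x, y, w, h = tf.unstack(xywh, axis=-1)
        return tf.stack([y - h / 2, x - w / 2, y + h / 2, x + w / 2], axis=-1)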

ingura commented 1 year ago

I'll give it a try with GPU support and without to see the performance difference.

PINTO0309 commented 1 year ago

I am trying to apply partial optimization first and then improve the whole thing at once. Several more pull requests will be issued, but please wait a little longer.

ingura commented 1 year ago

Thanks

PINTO0309 commented 1 year ago

ONNX exported with --end2end. I just fixed the output of YOLOv7 to [100, 7]. I do not expect it to work properly; if this experimental implementation does not work as expected, the aforementioned implementation idea is likely invalid. Experimental: [100, 7] yolov7-tiny_float16.tflite.zip yolov7-tiny_float32.tflite.zip image

ingura commented 1 year ago

TensorFlow throws:

    RuntimeError: tensorflow/lite/kernels/scatter_nd.cc:65 updates.DimensionsCount() - outer_dims != shape_shape.Dims(0) - ix (2 != -14) Node number 237 (SCATTER_ND) failed to prepare.

To reproduce, run this Python script:

import cv2
import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def scaleAndFill(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)

    # Scale ratio (new / old)
    scale = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better val mAP)
        scale = min(scale, 1.0)

    # Compute padding
    new_unpad = int(round(shape[1] * scale)), int(round(shape[0] * scale))
    fillW, fillH = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  #  padding

    if auto:  # minimum rectangle
        fillW, fillH = np.mod(fillW, stride), np.mod(fillH, stride)  #  padding

    fillW /= 2  # divide padding into 2 sides
    fillH /= 2

    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(fillH - 0.1)), int(round(fillH + 0.1))
    left, right = int(round(fillW - 0.1)), int(round(fillW + 0.1))
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # fill border
    return im, scale, (fillW, fillH)

#Name of the classes according to class indices.
names = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light', 
         'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow', 
         'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee', 
         'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 
         'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple', 
         'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch', 
         'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone', 
         'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear', 
         'hair drier', 'toothbrush']

#Creating random colors for bounding box visualization.
colors = {name:[random.randint(0, 255) for _ in range(3)] for i,name in enumerate(names)}

#Load and preprocess the image.
img = cv2.imread("D:\\..\\image1.jpg")
print("oring ImgShape:")
print(img.shape)

img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

image = img.copy()
image, ratio, dwdh = scaleAndFill(image,(640,640), auto=False)

image = np.expand_dims(image, 0)
print("expanded ImageShape:")
print(image.shape)
image = np.ascontiguousarray(image)

im = image.astype(np.float32)
im /= 255

# Load the TFLite model and allocate tensors.
# interpreter = tf.lite.Interpreter(model_path="./Models\\yolov7-tiny-NHWc_fp32.tflite")    
interpreter = tf.lite.Interpreter(model_path="./Models\\yolov7-tiny_org_post_float32.tflite")

#Allocate tensors.
interpreter.allocate_tensors()
# Get input and output tensors.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Test the model on random input data.
input_shape = input_details[0]['shape']
print("required input shape:")
print(input_shape)

output_shape = output_details[0]['shape']
print("output shape:")
print(output_shape)

interpreter.set_tensor(input_details[0]['index'], im)
interpreter.invoke()

# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])

## Visualize results

ori_images = [img.copy()]
testOutputSize=0
for i,(batch_id,x0,y0,x1,y1,cls_id,score) in enumerate(output_data):
    testOutputSize= testOutputSize+1
    print('batch_id: {}  clsID: {}  score: {}'.format(batch_id,cls_id,score ))
    image = ori_images[int(batch_id)]
    box = np.array([x0,y0,x1,y1])
    box -= np.array(dwdh*2)
    box /= ratio
    box = box.round().astype(np.int32).tolist()
    cls_id = int(cls_id)
    score = round(float(score),3)
    name = names[cls_id]
    color = colors[name]
    name += ' '+str(score)
    cv2.rectangle(image,box[:2],box[2:],color,2)
    cv2.putText(image,name,(box[0], box[1] - 2),cv2.FONT_HERSHEY_SIMPLEX,0.75,[225, 255, 255],thickness=2)  
plt.imshow(ori_images[0])
plt.title('TfLite Indications',  fontweight ="bold")
plt.show()

print("output size:")
print(testOutputSize)

The result is:


Traceback (most recent call last):
  File "D:\..\testOnnx2tf.py", line 89, in <module>
    interpreter.invoke()
  File "C:\..\site-packages\tensorflow\lite\python\interpreter.py", line 917, in invoke
    self._interpreter.Invoke()
RuntimeError: tensorflow/lite/kernels/scatter_nd.cc:65 updates.DimensionsCount() - outer_dims != shape_shape.Dims(0) - ix (2 != -14)Node number 237 (SCATTER_ND) failed to prepare.

I wonder if we can preserve the output tensor of onnx2tf v1.5.36 and copy it into a fixed-size output tensor which becomes the final output of the model.

Specifically, would it be practical to add another operation at the end of the model that copies the dynamic-size output tensor of onnx2tf v1.5.36, initial_output_tensor[num_boxes, 7], into a fixed-size tensor tflite_friendly_output[100, 7]? In that case the result would be a tflite_friendly_output[100, 7] tensor whose content is

tflite_friendly_output[:num_boxes, :] = initial_output_tensor[num_boxes, 7], while tflite_friendly_output[num_boxes:, :] would be filled with zeros.

PINTO0309 commented 1 year ago

tflite_friendly_output[:num_boxes, :] = initial_output_tensor[num_boxes, 7], while tflite_friendly_output[num_boxes:, :] would be filled with zeros.

In fact, that is what the last model I posted implements. I see that it results in an error... :thinking:

When the output of NonMaxSuppressionV4 is variable, the runtime seems to raise an error at the padding step because the number of outputs can be anywhere from zero to 100 depending on the input image. It seems necessary to give up on using NonMaxSuppressionV4.

        # YOLOv7 Special fixed outputs
        max_output_boxes_per_class = 100
        final_output = outputs[0]
        output_paddings = tf.zeros(shape=[max_output_boxes_per_class, 7], dtype=tf.float32)

        indices = tf.range(0, tf_shape(input_tensor=final_output)[0])
        outputs = [
            tf.tensor_scatter_nd_update(
                tensor=output_paddings,
                indices=indices,
                updates=final_output,
            )
        ]

image

For example, I feel one solution would be to borrow mmdetection's NMS and incorporate it into PyTorch's ONNX export logic in advance, then slightly modify the borrowed NMS logic and pad at the end. After two days of various attempts, it seems that a major rewrite of the model structure on this conversion tool's side would take a very long time to implement.

https://github.com/open-mmlab/mmdetection/blob/master/mmdet/core/post_processing/bbox_nms.py

Hyunseok-Kim0 commented 1 year ago

https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu/gl/kernels

The TensorFlow GPU delegate supports only limited operations. I guess it is almost impossible to implement NMS using those. It looks like even tf.range is not supported, which makes it almost impossible to implement the top-k or sort steps necessary for NMS.

Screenshot from 2023-02-02 17-26-34

There are two options for now.

  1. If the final goal is only replacing NonMaxSuppression with other operations, it is possible by using tf.image.non_max_suppression_padded. Unlike non_max_suppression_v4, its implementation is composed of several sub-operations, as shown above.
  2. If you only want a static output shape, non_max_suppression_v4 has a pad_to_max_output_size option. For now, NonMaxSuppression.py passes False and uses a slice to remove excess indices. After adding an option for static NMS output, it would be possible to let the user determine the maximum number of boxes.
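
For option 1, a minimal usage sketch of tf.image.non_max_suppression_padded (the dummy inputs and thresholds are placeholders):

    import tensorflow as tf

    boxes = tf.random.uniform([1000, 4])   # [num_boxes, 4] as [y1, x1, y2, x2]
    scores = tf.random.uniform([1000])     # [num_boxes]

    selected_indices, num_valid = tf.image.non_max_suppression_padded(
        boxes, scores,
        max_output_size=100,
        iou_threshold=0.65,
        score_threshold=0.35,
        pad_to_max_output_size=True,       # indices come back with a fixed length of 100
    )
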
PINTO0309 commented 1 year ago

Thanks @Hyunseok-Kim0. I'll give it a try. I didn't know there was such an option as pad_to_max_output_size.

(function)
non_max_suppression_v4(
  boxes: Any,
  scores: Any,
  max_output_size: Any,
  iou_threshold: Any,
  score_threshold: Any,
  pad_to_max_output_size: bool = False,
  name: Any | None = None
) -> NonMaxSuppressionV4
Greedily selects a subset of bounding boxes in descending order of score,

pruning away boxes that have high intersection-over-union (IOU) overlap with previously selected boxes. Bounding boxes with score less than score_threshold are removed. Bounding boxes are supplied as [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any diagonal pair of box corners and the coordinates can be provided as normalized (i.e., lying in the interval [0, 1]) or absolute. Note that this algorithm is agnostic to where the origin is in the coordinate system and more generally is invariant to orthogonal transformations and translations of the coordinate system; thus translating or reflections of the coordinate system result in the same boxes being selected by the algorithm. The output of this operation is a set of integers indexing into the input collection of bounding boxes representing the selected boxes. The bounding box coordinates corresponding to the selected indices can then be obtained using the tf.gather operation. For example:
  selected_indices = tf.image.non_max_suppression_v2(
      boxes, scores, max_output_size, iou_threshold, score_threshold)
  selected_boxes = tf.gather(boxes, selected_indices)

Args:
  boxes: A Tensor. Must be one of the following types: half, float32.
    A 2-D float tensor of shape [num_boxes, 4].
  scores: A Tensor. Must have the same type as boxes.
    A 1-D float tensor of shape [num_boxes] representing a single score corresponding to each box (each row of boxes).
  max_output_size: A Tensor of type int32.
    A scalar integer tensor representing the maximum number of boxes to be selected by non max suppression.
  iou_threshold: A Tensor. Must be one of the following types: half, float32.
    A 0-D float tensor representing the threshold for deciding whether boxes overlap too much with respect to IOU.
  score_threshold: A Tensor. Must have the same type as iou_threshold.
    A 0-D float tensor representing the threshold for deciding when to remove boxes based on score.
  pad_to_max_output_size: An optional bool. Defaults to False.
    If true, the output selected_indices is padded to be of length max_output_size. Defaults to false.
  name: A name for the operation (optional).

Returns:
  A tuple of Tensor objects (selected_indices, valid_outputs).

  selected_indices: A Tensor of type int32.
  valid_outputs: A Tensor of type int32.
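
A short usage sketch of the padded mode, with the valid count used to zero out the padded rows (the thresholds are placeholders, and the zero-fill masking at the end is my assumption about the desired padding):

    import tensorflow as tf

    boxes = tf.random.uniform([1000, 4])
    scores = tf.random.uniform([1000])

    selected_indices, valid_outputs = tf.raw_ops.NonMaxSuppressionV4(
        boxes=boxes, scores=scores, max_output_size=100,
        iou_threshold=0.65, score_threshold=0.35,
        pad_to_max_output_size=True,
    )
    gathered = tf.gather(boxes, selected_indices)              # always [100, 4]
    mask = tf.sequence_mask(valid_outputs, 100, tf.float32)    # 1.0 for the valid rows
    fixed = gathered * mask[:, None]                           # zero out the padded rows
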
PINTO0309 commented 1 year ago

Excellent.

Hyunseok-Kim0 commented 1 year ago

You have to disable the slice when using that option; otherwise, as in the image, a dynamic output is still generated because of num_valid.

if pad_to_max_output_size:
    return selected_indices

else:
    return selected_indices[:num_valid]
PINTO0309 commented 1 year ago

OK. Thanks. image image

ingura commented 1 year ago

When the output of NonMaxSuppressionV4 is variable, the runtime seems to raise an error at the padding step because the number of outputs can be anywhere from zero to 100 depending on the input image. It seems necessary to give up on using NonMaxSuppressionV4.

It seems like converting a dynamic-output network to a static output by padding its dynamic output is quite a general approach, so it is worth looking into the cause of this error.

If the error happens at the padding step, could it be because it can't handle padding a meaningless zero-size array? Or maybe the issue happens at the other extreme, padding when there are 100+ detections?

PINTO0309 commented 1 year ago

Maybe my use of tf.tensor_scatter_nd_update is just wrong.
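
(If it helps narrow it down, a sketch of what I suspect the correct call looks like: tensor_scatter_nd_update wants indices of shape [N, 1] rather than [N], so the tf.range likely needs an extra axis. Untested.)

    import tensorflow as tf

    final_output = tf.random.uniform([37, 7])        # stand-in for the dynamic NMS output
    output_paddings = tf.zeros([100, 7], tf.float32)

    # indices must be [N, 1]: one (row,) index per updated row.
    indices = tf.range(tf.shape(final_output)[0])[:, tf.newaxis]
    padded = tf.tensor_scatter_nd_update(output_paddings, indices, final_output)  # [100, 7]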

In any case, your most recent objective of generating a fixed-size output is achieved by merging my pull request above.

We have completed the first step in addressing the issues in a detailed, step-by-step process. I will look into the issue of padding errors due to my own post-processing later.

ingura commented 1 year ago

Thanks!

PINTO0309 commented 1 year ago

I made a mistake in releasing the package, so the PyPI version and the Docker version were mismatched, but I have just released the latest version 1.5.39. https://github.com/PINTO0309/onnx2tf/releases/tag/1.5.39

1.5.38 is synonymous with 1.5.39 for Docker.

yolov7-tiny_org_post_float16_fixed.tflite.zip yolov7-tiny_org_post_float32_fixed.tflite.zip image

In order to support the GPU Delegate, a rather challenging set of tasks must be digested. There are quite a few other issues besides the fixed-size output of the NMS. As Hyunseok-Kim0 also pointed out, this is because quite a few OPs are not supported on the GPU.

Realistically, I believe that only the post-processing part should be implemented on the Android Java side.

ingura commented 1 year ago

You're amazing, you fixed it! Maybe in the future the GPU Delegate will be better supported as well.

Thank you very much!