I have already implemented almost all the features that engineers around the world need. Please read the README carefully first.
https://github.com/PINTO0309/onnx2tf#cli-parameter
-ois OVERWRITE_INPUT_SHAPE [OVERWRITE_INPUT_SHAPE ...], \
--overwrite_input_shape OVERWRITE_INPUT_SHAPE [OVERWRITE_INPUT_SHAPE ...]
Overwrite the input shape.
The format is
"i1:dim0,...,dimN" "i2:dim0,...,dimN" "i3:dim0,...,dimN"
When there is only one input, for example,
"data:1,3,224,224"
When there are multiple inputs, for example,
"data1:1,3,224,224" "data2:1,3,112" "data3:5"
A value of 1 or more must be specified.
Numerical values other than dynamic dimensions are ignored.
If specified at the same time as --batch_size, --batch_size is ignored.
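For instance, a hypothetical invocation (the input name "images" is what the YOLOv7 exports use, but check your model's actual input name first):
onnx2tf -i yolov7-tiny.onnx -o saved_model/ -ois "images:1,3,640,640"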
If the model includes NMS, then padding must be added after the NMS to keep the output size fixed. You have not shared the relevant ONNX file with me, so I cannot determine exactly what the problem is.
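As an illustrative sketch of that padding idea (not onnx2tf's actual implementation; the [num_detections, 7] layout and the 100-box cap are assumptions taken from later in this thread):

import tensorflow as tf

def pad_nms_output(nms_out, max_boxes=100):
    # nms_out: dynamic [num_detections, 7] tensor coming out of the NMS.
    # Zero-fill the remaining (max_boxes - num_detections) rows so the
    # model output shape is always [max_boxes, 7]. Assumes num_detections
    # never exceeds max_boxes.
    num = tf.shape(nms_out)[0]
    return tf.pad(nms_out, paddings=[[0, max_boxes - num], [0, 0]])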
Thanks for your prompt reply.
I went through all the paragraphs in the README that mention the "output" of the network, and I could not find anything related to fixing the output size or choosing a static output size. As I understand it, the "-ois" parameter mentioned above overrides the input shape, but not the output.
To generate the ONNX model I use the following line:
python export.py --weights yolov7-tiny.pt --grid --end2end --simplify --topk-all 100 --iou-thres 0.65 --conf-thres 0.35 --img-size 640 640 --max-wh 640
where "export.py" comes from the standard yolov7 distribution: https://github.com/WongKinYiu/yolov7/blob/main/export.py
export.py generates my yolov7-tiny.onnx model. The following line takes the resulting yolov7-tiny.onnx and generates a fixed-output TensorFlow model which runs well on a PC (but the input is NCHW):
onnx-tf convert -i yolov7-tiny.onnx -o model_tf/
So the yolov7-tiny.onnx model I use with onnx2tf should be functional. The parameters you exposed for onnx2tf are indeed many; however, I could not find anything related to fixing the output size of the model.
Have I missed anything important?
I understood. I will add a parameter to convert to fixed size NMS in the next version of the tool. Are you satisfied with just being able to specify the number of boxes to output? e.g. --fixed_nms_output_boxes=100
onnx: yolov7-tiny.onnx.zip
However, since the mission of this tool is to faithfully convert the input ONNX files, I feel that if the structure of the model is to be rewritten significantly, it would be cleaner to rewrite the export logic on the YOLOv7 side instead. This is because there are more complex models that come with a lot of post-processing behind the NMS; YOLOvN post-processing is too simple.
The TensorFlow API for Android basically requires a fixed-size output tensor of shape [num_boxes, 7], so fixing the number of output boxes should solve the problem.
Specifically the output tensor will always have the shape of [100, 7] regardless of how many objects are in the input image.
It's true that adding a static output to the .tflite version of YOLOv7 will not represent the YOLOv7 structure faithfully; however, it is the only structure that works on the Android platform. By not fixing the output, the .tflite conversion is not very useful on exactly the platform for which it was made.
Yes, I am. I have been utilizing TensorFlow quite a bit for a long time, so I understand the specifications you gave me.
> This is because there are more complex models that come with a lot of post-processing behind the NMS; YOLOvN post-processing is too simple.
This is a very big problem with the proposed specifications.
Well, if all these models have a dynamic output, they cannot be used as .tflite models on the Android platform, so the conversion to .tflite would not be very useful without a static output.
Knowing that there are models whose output is very difficult to make static, what if this conversion option were offered for the many models that do not have a lot of processing behind the NMS block, but not necessarily for the others?
Basically, we would have a parameter that fixes the output size for the majority of simpler models (in regard to their NMS) so they can be used as .tflite models on Android. And to let users know what to expect when they use this parameter, a warning is displayed telling them that not all models can reasonably have their output fixed. This way most of the utility is embedded in your library while the problems of misuse are avoided.
OK. Then I will look into how to add the feature as a special option.
Thanks
Idea notes for implementation, jotted down during a break from my day job:
1. Pad all portions of (100 - number of detections).
2. Whether the 7 in [num_boxes, 7] really needs to be 7. If possible, I do not want to output useless tensors. [batch_size, prior_box_numbers, (background_class + classes)]
3. --post_process_type = [None | 'yolo' | 'ssd' | 'xxx' | 'yyy']
The standard TFLite post-processing, which I analyzed from the binary files two years ago: https://github.com/PINTO0309/tflite2tensorflow/blob/c13504df2f82dc234f1009e34dbab9c8b65c7ce4/tflite2tensorflow/tflite2tensorflow.py#L5278-L5509
elif custom_op_type == 'TFLite_Detection_PostProcess':
    TFLite_Detection_PostProcess_flg = True
    ################################################################### Extraction of boxes, scores, anchors, options
    boxes = None
    try:
        boxes_detail = interpreter._get_tensor_details(op['inputs'][0])
    except:
        pass
    try:
        boxes = tensors[op['inputs'][0]]
    except:
        boxes = interpreter.get_tensor(boxes_detail['index'])
    scores = None
    try:
        scores_detail = interpreter._get_tensor_details(op['inputs'][1])
    except:
        pass
    try:
        scores = tensors[op['inputs'][1]]
    except:
        scores = interpreter.get_tensor(scores_detail['index'])
    anchors = None
    try:
        anchors_detail = interpreter._get_tensor_details(op['inputs'][2])
    except:
        pass
    try:
        anchors = tensors[op['inputs'][2]]
    except:
        anchors = interpreter.get_tensor(anchors_detail['index'])
    anchors = backward_quantization(anchors_detail, anchors)
    """
    custom_options = [
        120,95,115,99,97,108,101,0, #x_scale
        100,101,116,101,99,116,105,111,110,115, 95,112,101,114,95,99,108,97,115,115,0, #detections_per_class
        110,117,109,95,99,108,97,115,115,101,115,0, #num_classes
        121,95,115,99,97,108,101,0, #y_scale
        110,109,115,95,115,99,111,114,101,95,116,104,114,101,115,104,111,108,100,0, #nms_score_threshold
        119,95,115,99,97,108,101,0, #w_scale
        109,97,120,95,100,101,116,101,99,116,105,111,110,115,0, #max_detections
        104,95,115,99,97,108,101,0, #h_scale
        117,115,101,95,114,101,103,117,108,97,114,95,110,109,115,0, #use_regular_nms
        109,97,120,95,99,108,97,115,115,101,115,95,112,101,114,95,100,101,116,101,99,116,105,111,110,0, #max_classes_per_detection
        110,109,115,95,105,111,117,95,116,104,114,101,115,104,111,108,100,0, #nms_iou_threshold
        11,153,70,47,87,23,117,138,68,100,170,130,11,0,0,0,1,0,0,0,11,0,0,0, #24
        0,0,0,0, #detections_per_class 0
        0,0,160,64, #h_scale 5.0
        1,0,0,0, #max_classes_per_detection 1
        100,0,0,0, #max_detections 100
        154,153,25,63, #nms_iou_threshold 0.6000000238418579
        119,204,43,50, #nms_score_threshold 0.00000000999999993922529
        90,0,0,0, #num_classes 90
        0,0,0,0, #use_regular_nms false 0:false 1:true
        0,0,160,64, #w_scale 5.0
        0,0,32,65, #x_scale 10.0
        0,0,32,65, #y_scale 10.0
        6,14,6,6,14,14,6,106,14,14,14,55,38,1]
    print(struct.pack('<i', 0), '@', 0) #detections_per_class b'\x00\x00\x00\x00' @ 0,0,0,0
    print(struct.pack('<f', 5.0), '@', 5.0) #h_scale b'\x00\x00\xa0\x40' @ 5 -> 0,0,160,64
    print(struct.pack('<i', 1), '@', 1) #max_classes_per_detection b'\x01\x00\x00\x00' @ 1,0,0,0
    print(struct.pack('<i', 100), '@', 100) #max_detections b'\x64\x00\x00\x00' @ 100,0,0,0
    print(struct.pack('<f', 0.6000000238418579), '@', 0.6000000238418579) #nms_iou_threshold -> b'\x9a\x99\x19\x3F' @ 0.6000000238418579 -> 154,153,25,63
    print(struct.pack('<f', 0.00000000999999993922529), '@', 0.00000000999999993922529) #nms_score_threshold -> b'\x77\xcc\x2b\x32' @ -> 119,204,43,50
    print(struct.pack('<i', 90), '@', 90) #num_classes b'\x5a\x00\x00\x00' @ 90,0,0,0
    print(struct.pack('<?', False), '@', False) #use_regular_nms b'\x00\x00\x00\x00' @ 0,0,0,0
    print(struct.pack('<f', 5.0), '@', 5.0) #w_scale b'\x00\x00\xa0\x40' @ 5 -> 0,0,160,64
    print(struct.pack('<f', 10.0), '@', 10.0) #x_scale b'\x00\x00\x20\x41' @ 10.0 -> 0,0,32,65
    print(struct.pack('<f', 10.0), '@', 10.0) #y_scale b'\x00\x00\x20\x41' @ 10.0 -> 0,0,32,65
    print(struct.unpack_from('<i', bytes([0,0,0,0]))[0]) #detections_per_class
    print(struct.unpack_from('<f', bytes([0,0,160,64]))[0]) #h_scale
    print(struct.unpack_from('<i', bytes([1,0,0,0]))[0]) #max_classes_per_detection
    print(struct.unpack_from('<i', bytes([100,0,0,0]))[0]) #max_detections
    print(struct.unpack_from('<f', bytes([154,153,25,63]))[0]) #nms_iou_threshold
    print(struct.unpack_from('<f', bytes([119,204,43,50]))[0]) #nms_score_threshold
    print(struct.unpack_from('<i', bytes([90,0,0,0]))[0]) #num_classes
    print(struct.unpack_from('<?', bytes([1,0,0,0]))[0]) #use_regular_nms
    print(struct.unpack_from('<f', bytes([0,0,160,64]))[0]) #w_scale
    print(struct.unpack_from('<f', bytes([0,0,32,65]))[0]) #x_scale
    print(struct.unpack_from('<f', bytes([0,0,32,65]))[0]) #y_scale
    """
    options = op['custom_options']
    custom_options = read_flexbuffer(np.array(options, dtype=np.uint8).tobytes())
    print('custom_options:')
    pprint.pprint(custom_options)
    h_scale = custom_options['h_scale']
    w_scale = custom_options['w_scale']
    y_scale = custom_options['y_scale']
    x_scale = custom_options['x_scale']
    nms_score_threshold = 0.0 if (custom_options['nms_score_threshold'] == -float('inf') or custom_options['nms_score_threshold'] == float('inf')) else custom_options['nms_score_threshold']
    nms_iou_threshold = 0.0 if (custom_options['nms_iou_threshold'] == -float('inf') or custom_options['nms_iou_threshold'] == float('inf')) else custom_options['nms_iou_threshold']
    num_classes = custom_options['num_classes']
    max_classes_per_detection = custom_options['max_classes_per_detection']
    max_detections = custom_options['max_detections']
    use_regular_nms = custom_options['use_regular_nms']
    detections_per_class = 0
    try:
        detections_per_class = custom_options['detections_per_class']
    except:
        pass
    output_detail1 = interpreter._get_tensor_details(op['outputs'][0])
    output_detail2 = interpreter._get_tensor_details(op['outputs'][1])
    output_detail3 = interpreter._get_tensor_details(op['outputs'][2])
    output_detail4 = interpreter._get_tensor_details(op['outputs'][3])
    ################################################################### Calculation of anchors
    anchors_yx = anchors[:, 0:2]
    anchors_hw = anchors[:, 2:4]
    ################################################################### Calculation of boxes
    boxes_div = tf.divide(boxes, [y_scale, x_scale, h_scale, w_scale])
    # ycenter, xcenter
    ycenter_xcenter = boxes_div[:, :, 0:2]
    ycenter_xcenter_calc = ycenter_xcenter * anchors_hw + anchors_yx
    # half_h, half_w
    half_h_half_w = boxes_div[:, :, 2:4]
    half_h_half_w_calc = tf.math.exp(half_h_half_w) * anchors_hw * 0.5
    # scale conversion
    """
    box.ymin = ycenter - half_h;
    box.xmin = xcenter - half_w;
    box.ymax = ycenter + half_h;
    box.xmax = xcenter + half_w;
    """
    box_ymin_box_xmin = ycenter_xcenter_calc - half_h_half_w_calc
    box_ymax_box_xmax = ycenter_xcenter_calc + half_h_half_w_calc
    boxes_concat = tf.concat([box_ymin_box_xmin, box_ymax_box_xmax], axis=-1)[0]
    ################################################################### Calculation of scores
    if scores.shape[2] > 1:
        scores_slice = scores[:, :, 1:num_classes+1][0]
    else:
        scores_slice = scores[0]
    scores_argmax = tf.math.argmax(scores_slice, axis=-1, output_type=tf.int32)
    scores_reduce_max = tf.math.reduce_max(scores_slice, axis=-1)
    ################################################################### Calculation of NMS
    def NonMaxSuppressionV5_(boxes, scores, max_output_size: int, iou_threshold, score_threshold, soft_nms_sigma, pad_to_max_output_size):
        selected_indices, selected_scores, valid_outputs = \
            tf.raw_ops.NonMaxSuppressionV5(
                boxes=boxes,
                scores=scores,
                max_output_size=max_output_size,
                iou_threshold=iou_threshold,
                score_threshold=score_threshold,
                soft_nms_sigma=soft_nms_sigma,
                pad_to_max_output_size=pad_to_max_output_size
            )
        return selected_indices, selected_scores, valid_outputs

    def NonMaxSuppressionV3_(boxes, scores, max_output_size: int, iou_threshold, score_threshold):
        selected_indices = \
            tf.raw_ops.NonMaxSuppressionV3(
                boxes=boxes,
                scores=scores,
                max_output_size=max_output_size,
                iou_threshold=iou_threshold,
                score_threshold=score_threshold
            )
        return selected_indices

    selected_indices = None
    selected_scores = None
    valid_outputs = None
    if not optimizing_for_openvino_and_myriad:
        selected_indices, selected_scores, valid_outputs = \
            tf.keras.layers.Lambda(
                NonMaxSuppressionV5_,
                arguments={'scores': scores_reduce_max,
                           'max_output_size': max_detections,
                           'iou_threshold': nms_iou_threshold,
                           'score_threshold': nms_score_threshold,
                           'soft_nms_sigma': 0.0,
                           'pad_to_max_output_size': True}
            )(boxes_concat)
    else:
        selected_indices = \
            tf.keras.layers.Lambda(
                NonMaxSuppressionV3_,
                arguments={'scores': scores_reduce_max,
                           'max_output_size': max_detections,
                           'iou_threshold': nms_iou_threshold,
                           'score_threshold': nms_score_threshold}
            )(boxes_concat)
        selected_scores = tf.gather(
            scores_reduce_max,
            selected_indices
        )
        valid_outputs = max_detections
    ################################################################### Calculation of outputs
    bounding_boxes = tf.identity(
        tf.expand_dims(
            tf.gather(
                boxes_concat,
                selected_indices
            ),
            axis=0),
        name='TFLite_Detection_PostProcess0'
    )
    class_labels = tf.identity(
        tf.expand_dims(
            tf.gather(
                scores_argmax,
                selected_indices
            ),
            axis=0
        ),
        name='TFLite_Detection_PostProcess1'
    )
    class_confidences = tf.identity(
        tf.expand_dims(
            selected_scores,
            axis=0
        ),
        name='TFLite_Detection_PostProcess2'
    )
    num_of_boxes = tf.identity(
        tf.expand_dims(
            valid_outputs,
            axis=0
        ),
        name='TFLite_Detection_PostProcess3'
    )
    tensors[output_detail1['index']] = bounding_boxes
    tensors[output_detail2['index']] = class_labels
    tensors[output_detail3['index']] = class_confidences
    tensors[output_detail4['index']] = num_of_boxes
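For reference, a minimal sketch (not from the thread) of how the four TFLite_Detection_PostProcess outputs are typically read back on the Python side; the model path is hypothetical, and the output order follows the convention above:

import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path='ssd_detect.tflite')  # hypothetical file
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
interpreter.set_tensor(inp['index'], np.zeros(inp['shape'], dtype=inp['dtype']))
interpreter.invoke()
outs = interpreter.get_output_details()
boxes = interpreter.get_tensor(outs[0]['index'])    # [1, max_detections, 4] (ymin, xmin, ymax, xmax)
classes = interpreter.get_tensor(outs[1]['index'])  # [1, max_detections]
scores = interpreter.get_tensor(outs[2]['index'])   # [1, max_detections]
count = int(interpreter.get_tensor(outs[3]['index'])[0])  # number of valid boxes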
@ingura
What exactly is "The TensorFlow API for Android" that you presented, and which document specifically is the best to refer to for understanding? Actually, I would like to know the breakdown of the 7 in [num_boxes, 7] that you are expecting, rather than the API spec.
https://developer.android.com/ndk/guides/neuralnetworks
[X1, Y1, X2, Y2, ClassID, ClassScore, ???] <- valid_box or invalid_box or batch_number?
I have a lot of experience analyzing FlatBuffers and modifying models, but no development experience with the Android API.
Your implementation details are awesome.
About the API: I am using the Java version of tensorflow-lite-gpu. Here is my TensorFlow dependency list:
implementation "org.tensorflow:tensorflow-lite:${tflite_version}"
implementation "org.tensorflow:tensorflow-lite-gpu:${tflite_version}"
implementation "org.tensorflow:tensorflow-lite-gpu-api:${tflite_version}"
Here is a good guide to tfLite for Android: https://www.tensorflow.org/lite/guide/inference
In the YOLOv7 case our output tensor has the shape [num_boxes, 7]. The result for each box is an array of size 7 that contains:
[batch_number, boxLeftLimitX, boxTopLimitY, boxRightLimitX, boxBottomLimitY, ClassID, ClassScore]
The model outputs the top "num_boxes" results, ordered by "ClassScore".
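As a sketch of consuming that layout (illustrative only; it assumes padded rows are zero-filled, so a zero ClassScore marks the end of the real detections):

import numpy as np

output_data = np.zeros((100, 7), dtype=np.float32)  # stand-in for the model output
for batch_id, x0, y0, x1, y1, cls_id, score in output_data:
    if score <= 0.0:  # zero-padded row: no further real detections
        break
    print(int(batch_id), int(cls_id), float(score))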
Thank you for your effort!
Thank you. I understood.
I will get around to the implementation, but major concerns remain. I recently worked on supporting a model that can use Android GPUs, but it did not work. For more information, please see the issue below.
I cannot run a transformer model with token-level output on accelerated hardware in TF Lite #59232
Currently, GPU Delegate only supports the following very limited basic OPs. Also, Gather cannot be used, and there are significant restrictions on the use of strided_slice. Therefore, the post-processing implemented based on this idea will either become a model that falls back to the CPU, or GPU Delegate may generate an error and abort.
https://www.tensorflow.org/lite/performance/gpu
https://www.tensorflow.org/lite/android/delegates/gpu
ADD
AVERAGE_POOL_2D
CONCATENATION
CONV_2D
DEPTHWISE_CONV_2D v1-2
EXP
FULLY_CONNECTED
LOGISTIC
LSTM v2 (Basic LSTM only)
MAX_POOL_2D
MAXIMUM
MINIMUM
MUL
PAD
PRELU
RELU
RELU6
RESHAPE
RESIZE_BILINEAR v1-3
SOFTMAX
STRIDED_SLICE
SUB
TRANSPOSE_CONV
Just my guess before doing the implementation, but I am thinking that eventually you may need to re-implement the post-processing part with a custom GPU Delegate or custom operation.
If the GPU support is not there, let's have it running on the CPU and see if it is comparable with YOLOv4 in terms of computational performance; YOLOv4 did pretty well.
The paper mentions that there is a significant computational reduction in YOLOv7 compared to v4.
Not always efficient, but TFLite can split the inference execution between the GPU and CPU: https://www.tensorflow.org/lite/performance/gpu
> If some of the ops are not supported by the GPU delegate, the framework will only run a part of the graph on the GPU and the remaining part on the CPU.
I see. I'll try to implement it before worrying about this and that.
I don't have an environment for testing on Android, so I would be happy if you could help me test the tool once I have finished my experimental customization of it.
I look forward to it
Output layout notes:
[batch, boxes, [x1, y1, x2, y2, boxscore, classscores]] <- DAMO-YOLO
[batch, boxes, [y1, x1, y2, x2, boxscore, classscores]]
[batch, boxes, [x, y, w, h, boxscore, classscores]] <- YOLOv7, FreeYOLO
[batch, boxes, [y, x, h, w, boxscore, classscores]]
[batch, [x, y, w, h, boxscore, classscores], boxes] <- YOLOv8

merged = tf.concat(
    values=[
        detections,  # [N, 6]
        class_ids,   # [N, 1]
    ],
    axis=1,  # concatenate along the last axis -> [N, 7]
)
result = tf.pad(
    tensor=merged,
    paddings=[
        [0, (100 - N)],
        [0, 0]
    ],
    mode='CONSTANT',
    constant_values=0,
)
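A quick runnable check of the concat-and-pad note above (N, the column split, and the 100-box cap are illustrative placeholders):

import tensorflow as tf

N = 3
detections = tf.random.uniform([N, 6])  # e.g. x1, y1, x2, y2, score, class
batch_ids = tf.zeros([N, 1])            # batch index column
merged = tf.concat([batch_ids, detections], axis=1)      # [N, 7]
fixed = tf.pad(merged, paddings=[[0, 100 - N], [0, 0]])  # [100, 7], zero-filled tail
print(fixed.shape)  # (100, 7)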
Somehow I don't think GPU Delegate is compatible with NMS.
I'll give it a try with GPU support and without, to see the performance difference.
I am trying to apply partial optimization first and then improve the whole thing at once. Several more pull requests will be issued, but please wait a little longer.
Thanks
ONNX exported with --end2end. Just fixed the output of YOLOv7 to [100, 7]. I do not expect it to work properly. If this experimental implementation does not work as expected, the aforementioned implementation idea would likely be invalid.
Experimental: [100, 7]
yolov7-tiny_float16.tflite.zip
yolov7-tiny_float32.tflite.zip
TensorFlow throws:
RuntimeError: tensorflow/lite/kernels/scatter_nd.cc:65 updates.DimensionsCount() - outer_dims != shape_shape.Dims(0) - ix (2 != -14)Node number 237 (SCATTER_ND) failed to prepare.
To reproduce, run this Python script:
import cv2
import random
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

def scaleAndFill(im, new_shape=(640, 640), color=(114, 114, 114), auto=True, scaleup=True, stride=32):
    # Resize and pad image while meeting stride-multiple constraints
    shape = im.shape[:2]  # current shape [height, width]
    if isinstance(new_shape, int):
        new_shape = (new_shape, new_shape)
    # Scale ratio (new / old)
    scale = min(new_shape[0] / shape[0], new_shape[1] / shape[1])
    if not scaleup:  # only scale down, do not scale up (for better val mAP)
        scale = min(scale, 1.0)
    # Compute padding
    new_unpad = int(round(shape[1] * scale)), int(round(shape[0] * scale))
    fillW, fillH = new_shape[1] - new_unpad[0], new_shape[0] - new_unpad[1]  # padding
    if auto:  # minimum rectangle
        fillW, fillH = np.mod(fillW, stride), np.mod(fillH, stride)  # padding
    fillW /= 2  # divide padding into 2 sides
    fillH /= 2
    if shape[::-1] != new_unpad:  # resize
        im = cv2.resize(im, new_unpad, interpolation=cv2.INTER_LINEAR)
    top, bottom = int(round(fillH - 0.1)), int(round(fillH + 0.1))
    left, right = int(round(fillW - 0.1)), int(round(fillW + 0.1))
    im = cv2.copyMakeBorder(im, top, bottom, left, right, cv2.BORDER_CONSTANT, value=color)  # fill border
    return im, scale, (fillW, fillH)

# Names of the classes according to class indices.
names = ['person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus', 'train', 'truck', 'boat', 'traffic light',
         'fire hydrant', 'stop sign', 'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
         'elephant', 'bear', 'zebra', 'giraffe', 'backpack', 'umbrella', 'handbag', 'tie', 'suitcase', 'frisbee',
         'skis', 'snowboard', 'sports ball', 'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard',
         'tennis racket', 'bottle', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl', 'banana', 'apple',
         'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza', 'donut', 'cake', 'chair', 'couch',
         'potted plant', 'bed', 'dining table', 'toilet', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
         'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'book', 'clock', 'vase', 'scissors', 'teddy bear',
         'hair drier', 'toothbrush']

# Creating random colors for bounding box visualization.
colors = {name: [random.randint(0, 255) for _ in range(3)] for i, name in enumerate(names)}

# Load and preprocess the image.
img = cv2.imread("D:\\..\\image1.jpg")
print("orig ImgShape:")
print(img.shape)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
image = img.copy()
image, ratio, dwdh = scaleAndFill(image, (640, 640), auto=False)
image = np.expand_dims(image, 0)
print("expanded ImageShape:")
print(image.shape)
image = np.ascontiguousarray(image)
im = image.astype(np.float32)
im /= 255

# Load the TFLite model and allocate tensors.
# interpreter = tf.lite.Interpreter(model_path="./Models\\yolov7-tiny-NHWc_fp32.tflite")
interpreter = tf.lite.Interpreter(model_path="./Models\\yolov7-tiny_org_post_float32.tflite")
interpreter.allocate_tensors()

# Get input and output tensors and check their shapes.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
input_shape = input_details[0]['shape']
print("required input shape:")
print(input_shape)
output_shape = output_details[0]['shape']
print("output shape:")
print(output_shape)
interpreter.set_tensor(input_details[0]['index'], im)
interpreter.invoke()

# The function `get_tensor()` returns a copy of the tensor data.
# Use `tensor()` in order to get a pointer to the tensor.
output_data = interpreter.get_tensor(output_details[0]['index'])

# Visualize results
ori_images = [img.copy()]
testOutputSize = 0
for i, (batch_id, x0, y0, x1, y1, cls_id, score) in enumerate(output_data):
    testOutputSize = testOutputSize + 1
    print('batch_id: {} clsID: {} score: {}'.format(batch_id, cls_id, score))
    image = ori_images[int(batch_id)]
    box = np.array([x0, y0, x1, y1])
    box -= np.array(dwdh * 2)
    box /= ratio
    box = box.round().astype(np.int32).tolist()
    cls_id = int(cls_id)
    score = round(float(score), 3)
    name = names[cls_id]
    color = colors[name]
    name += ' ' + str(score)
    cv2.rectangle(image, box[:2], box[2:], color, 2)
    cv2.putText(image, name, (box[0], box[1] - 2), cv2.FONT_HERSHEY_SIMPLEX, 0.75, [225, 255, 255], thickness=2)

plt.imshow(ori_images[0])
plt.title('TfLite Indications', fontweight="bold")
plt.show()
print("output size:")
print(testOutputSize)
The result is:
Traceback (most recent call last):
File "D:\..\testOnnx2tf.py", line 89, in <module>
interpreter.invoke()
File "C:\..\site-packages\tensorflow\lite\python\interpreter.py", line 917, in invoke
self._interpreter.Invoke()
RuntimeError: tensorflow/lite/kernels/scatter_nd.cc:65 updates.DimensionsCount() - outer_dims != shape_shape.Dims(0) - ix (2 != -14)Node number 237 (SCATTER_ND) failed to prepare.
I wonder if we can preserve the output tensor of onnx2tf v1.5.36 and copy it into a fixed-size output tensor, which then becomes the final output of the model.
Specifically, would it be practical to add one more operation at the end of the model that copies the dynamic-size output tensor of onnx2tf v1.5.36, initial_output_tensor of shape [num_boxes, 7], into a fixed-size tensor tflite_friendly_output of shape [100, 7]? The result would be a tflite_friendly_output[100, 7] tensor whose content is
tflite_friendly_output[:num_boxes, :] = initial_output_tensor[num_boxes, 7], while
tflite_friendly_output[num_boxes:, :] would be filled with zeros.
In fact, the last model I posted implements exactly that. I see that it results in an error... :thinking:
When the output of NonMaxSuppressionV4 is variable, the runtime seems to throw an error at the padding step because of the variable number of outputs, which can be anywhere from zero to 100 depending on the input image. It seems necessary to give up using NonMaxSuppressionV4.
# YOLOv7 Special fixed outputs
max_output_boxes_per_class = 100
final_output = outputs[0]
output_paddings = tf.zeros(shape=[max_output_boxes_per_class, 7], dtype=tf.float32)
indices = tf.range(0, tf_shape(input_tensor=final_output)[0])
outputs = [
    tf.tensor_scatter_nd_update(
        tensor=output_paddings,
        indices=indices,
        updates=final_output,
    )
]
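One detail worth noting about tf.tensor_scatter_nd_update: to update whole rows of a [100, 7] tensor, the indices need an inner dimension, i.e. shape [N, 1] rather than [N]. A standalone toy version of the same padding (values are placeholders, not onnx2tf internals):

import tensorflow as tf

final_output = tf.random.uniform([3, 7])                # pretend 3 detections
output_paddings = tf.zeros([100, 7], dtype=tf.float32)
indices = tf.range(tf.shape(final_output)[0])[:, None]  # [[0], [1], [2]]
fixed = tf.tensor_scatter_nd_update(output_paddings, indices, final_output)
print(fixed.shape)  # (100, 7)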
For example, I feel one solution would be to borrow mmdetection's NMS and incorporate it into the PyTorch ONNX export logic in advance, then slightly modify the borrowed NMS logic and pad at the end. After two days of various attempts, it seems that a major rewrite of the model structure on this conversion tool's side would take a very long time to implement.
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/core/post_processing/bbox_nms.py
https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/delegates/gpu/gl/kernels
TensorFlow GPU delegates support only limited operations. I guess it is almost impossible to implement NMS using those. It looks like even tf.range is not supported, which makes it almost impossible to implement the necessary top-k or sort action for NMS.
There are two options for now.
1. Converting NonMaxSuppression to other operations: this is possible by using tf.image.non_max_suppression_padded, which is implemented with several sub-operations, unlike non_max_suppression_v4 shown above.
2. non_max_suppression_v4 has a pad_to_max_output_size option. For now, NonMaxSuppression.py is passing False and using a slice to remove excess indices. After adding an option for static NMS output, it would be possible to let the user determine the maximum box number.

Thanks @Hyunseok-Kim0.
I'll give it a try. I didn't know there was such an option as pad_to_max_output_size.
non_max_suppression_v4(
    boxes: Any,
    scores: Any,
    max_output_size: Any,
    iou_threshold: Any,
    score_threshold: Any,
    pad_to_max_output_size: bool = False,
    name: Any | None = None
) -> NonMaxSuppressionV4
Greedily selects a subset of bounding boxes in descending order of score, pruning away boxes that have high intersection-over-union (IOU) overlap with previously selected boxes. Bounding boxes with score less than score_threshold are removed. Bounding boxes are supplied as [y1, x1, y2, x2], where (y1, x1) and (y2, x2) are the coordinates of any diagonal pair of box corners and the coordinates can be provided as normalized (i.e., lying in the interval [0, 1]) or absolute. Note that this algorithm is agnostic to where the origin is in the coordinate system and more generally is invariant to orthogonal transformations and translations of the coordinate system; thus translating or reflections of the coordinate system result in the same boxes being selected by the algorithm. The output of this operation is a set of integers indexing into the input collection of bounding boxes representing the selected boxes. The bounding box coordinates corresponding to the selected indices can then be obtained using the tf.gather operation. For example:
selected_indices = tf.image.non_max_suppression_v2(
    boxes, scores, max_output_size, iou_threshold, score_threshold)
selected_boxes = tf.gather(boxes, selected_indices)
Args:
    boxes: A Tensor. Must be one of the following types: half, float32. A 2-D float tensor of shape [num_boxes, 4].
    scores: A Tensor. Must have the same type as boxes. A 1-D float tensor of shape [num_boxes] representing a single score corresponding to each box (each row of boxes).
    max_output_size: A Tensor of type int32. A scalar integer tensor representing the maximum number of boxes to be selected by non max suppression.
    iou_threshold: A Tensor. Must be one of the following types: half, float32. A 0-D float tensor representing the threshold for deciding whether boxes overlap too much with respect to IOU.
    score_threshold: A Tensor. Must have the same type as iou_threshold. A 0-D float tensor representing the threshold for deciding when to remove boxes based on score.
    pad_to_max_output_size: An optional bool. Defaults to False. If true, the output selected_indices is padded to be of length max_output_size.
    name: A name for the operation (optional).
Returns:
    A tuple of Tensor objects (selected_indices, valid_outputs).
    selected_indices: A Tensor of type int32.
    valid_outputs: A Tensor of type int32.
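A minimal sketch of calling it with padding enabled (toy boxes; the thresholds are borrowed from the export command earlier in the thread):

import tensorflow as tf

boxes = tf.constant([
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.9, 0.9],  # heavy overlap with box 0 -> suppressed
    [0.5, 0.5, 1.0, 1.0],
], dtype=tf.float32)
scores = tf.constant([0.9, 0.8, 0.7], dtype=tf.float32)
selected_indices, valid_outputs = tf.raw_ops.NonMaxSuppressionV4(
    boxes=boxes,
    scores=scores,
    max_output_size=100,
    iou_threshold=0.65,
    score_threshold=0.35,
    pad_to_max_output_size=True,  # static [100] output, zero-padded
)
print(selected_indices.shape)  # (100,) regardless of how many boxes survive
print(int(valid_outputs))      # number of real (non-padding) indices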
Excellent.
[screenshots: pad_to_max_output_size = False vs. pad_to_max_output_size = True]
You have to disable the slice when using that option; still, the dynamic output is generated in the image because of num_valid:
if pad_to_max_output_size:
    return selected_indices
else:
    return selected_indices[:num_valid]
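One hedged way around that, sketched below: keep the padded indices and zero out the padding rows with a mask built from num_valid, instead of slicing (which reintroduces a dynamic dimension). This is an illustration, not the NonMaxSuppression.py implementation:

import tensorflow as tf

boxes = tf.constant([[0.0, 0.0, 1.0, 1.0],
                     [0.0, 0.0, 0.9, 0.9],
                     [0.5, 0.5, 1.0, 1.0]])
scores = tf.constant([0.9, 0.8, 0.7])
max_out = 100
selected_indices, num_valid = tf.raw_ops.NonMaxSuppressionV4(
    boxes=boxes, scores=scores, max_output_size=max_out,
    iou_threshold=0.65, score_threshold=0.1,
    pad_to_max_output_size=True)
picked = tf.gather(boxes, selected_indices)         # [100, 4]; tail rows are junk
mask = tf.sequence_mask(num_valid, maxlen=max_out)  # [100] booleans
picked = tf.where(mask[:, None], picked, tf.zeros_like(picked))  # zero the tail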
OK. Thanks.
> When the output of NonMaxSuppressionV4 is variable, the runtime seems to throw an error at the padding step because of the variable number of outputs, which can be anywhere from zero to 100 depending on the input image. It seems necessary to give up using NonMaxSuppressionV4.
It seems like converting a dynamic-output network to a static output by padding its dynamic output is quite a general approach, so it is worth looking into the cause of this error.
If the error happens at the padding step, could it be because it can't handle padding with a meaningless 0-size array? Or maybe the issue happens at the other extreme, padding with (100+)?
Maybe my use of tf.tensor_scatter_nd_update is just wrong.
In any case, your most recent objective of generating a fixed size output is achieved by merging my pull request above.
We have completed the first step in order to address the issues in a detailed, step-by-step process. I will look into the issue of padding errors due to my own pre-processing later.
Thanks!
I made a mistake in releasing the package, so the PyPI version and the Docker version were mismatched, but I have just released the latest version, 1.5.39. https://github.com/PINTO0309/onnx2tf/releases/tag/1.5.39
For Docker, 1.5.38 is synonymous with 1.5.39.
yolov7-tiny_org_post_float16_fixed.tflite.zip yolov7-tiny_org_post_float32_fixed.tflite.zip
In order to support GPU Delegate, a rather challenging task has to be digested. There are quite a few other issues besides the fixed-size output of the NMS. As Hyunseok-Kim0 also pointed out, this is because there are quite a few OPs that do not support the GPU.
Realistically, I believe that only the post-processing part should be implemented on the Android Java side.
You're amazing, you fixed it! Maybe in the future the GPU Delegate will be better supported as well.
Thank you very much!
Issue Type: Others
onnx2tf version number: 1.5.36
onnx version number: 1.12.0
tensorflow version number: 2.10.1
Download URL for ONNX: pip install onnx==1.12.0
Parameter Replacement JSON:
Description:
Hi, your library is awesome!
I converted YOLOv7-tiny from PyTorch to TFLite using:
onnx2tf -i yolov7-tiny.onnx -o models-NHWC-final/ -osd -oh5 -cotof
I am trying to use it on an Android device. The model works when tested on a PC; however, the TensorFlow Java API for Android does not support dynamic-output models according to their documentation: https://www.tensorflow.org/lite/guide/inference, while the resulting YOLO TFLite model has a dynamic number of outputs (the number of outputs changes with the number of detections).
On the other hand, if I follow the conversion path PyTorch -> ONNX -> TensorFlow, I do get a YOLOv7 with a fixed output size, so I suspect it is possible to achieve this with onnx2tf as well, while also doing the NCHW-to-NHWC conversion in the process.
Is there a way to have onnx2tf output a fixed/static-output .tflite model for yolov7-tiny?
Thank you