Is there an easy way to convert ONNX or PB from (NCHW) to (NHWC)? #15

Closed AlexeyAB closed 3 years ago

AlexeyAB commented 4 years ago

@PINTO0309 Hi, Nice work with YOLOv4 / tiny!

As I see you use:

I have several questions:

PINTO0309 commented 4 years ago

Thank you for commenting on this for someone like me, who does deep learning as a hobby. I am not an engineer or a researcher.

Is there an easy way to convert ONNX or PB from (NCHW) to (NHWC)?

No. By the way, I have already converted NCHW to NHWC successfully, but only in a very primitive way. Since Tensorflow's Conv2D and several other OPs do not support NCHW, I did it by inserting Transpose OPs before and after each OP. This method infers correctly, but the inserted Transpose OPs add unnecessary overhead and cause a significant loss of the original performance. I used a combination of Keras and OpenVINO's model_optimizer to achieve the NHWC to NCHW conversion. (Converting backwards is easy.)
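The Transpose-wrapping approach described above can be illustrated with a minimal NumPy sketch. `conv2d_nhwc` is a hypothetical stand-in for an NHWC-only operator (it is not TensorFlow's implementation), and the wrapper shows the two extra Transpose steps that every wrapped operator pays for.

```python
import numpy as np


def conv2d_nhwc(x, kernel_hwio):
    """Naive stride-1, 'valid' convolution that only accepts NHWC input.

    Hypothetical stand-in for an operator that has no NCHW kernel.
    """
    n, h, w, _ = x.shape
    kh, kw, _, out_ch = kernel_hwio.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    out = np.zeros((n, out_h, out_w, out_ch), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            patch = x[:, i:i + out_h, j:j + out_w, :]
            out += np.einsum("nhwc,co->nhwo", patch, kernel_hwio[i, j])
    return out


def conv2d_nchw_via_transposes(x_nchw, kernel_hwio):
    """Keep the graph in NCHW by wrapping the NHWC-only op in two transposes."""
    x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))   # extra Transpose before the op
    y_nhwc = conv2d_nhwc(x_nhwc, kernel_hwio)
    return np.transpose(y_nhwc, (0, 3, 1, 2))     # extra Transpose after the op


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 3, 8, 8))   # NCHW input
k = rng.standard_normal((3, 3, 3, 4))   # HWIO kernel
y = conv2d_nchw_via_transposes(x, k)    # NCHW output
print(y.shape)
```

The result is numerically correct, but a network with many such operators carries two layout changes per operator, which is the overhead mentioned in the comment.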

Is there an easy way to convert TF1-pb to TF2-saved_models.pb ?

Yes. I've described how to do the conversion in some of my blog posts below. [English ver.] [Tensorflow Lite] Various Neural Network Model quantization methods for Tensorflow Lite (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization, EdgeTPU). As of May 05, 2020. Alternatively, you can find a repository of perfect tutorials below.

Is NHWC slowing down execution on the GPU?

I'm not sure. I'm not really interested in using a high-performance GPU; I only benchmark with low-performance CPUs and edge accelerators. However, I have seen blogs in the past where Japanese engineers have done comparative benchmarking between NCHW and NHWC. The article below does not cover inference performance, but rather shows an increase in training speed. The improvement in training speed appears to range from a few percent to a few dozen percent. Japanese article: TensorFlow/Kerasでchannels_firstにするとGPUの訓練が少し速くなる話 ("Setting channels_first in TensorFlow/Keras makes GPU training a little faster") - @koshian2

How many FPS do you get on Google Coral TPU-Edge and RaspberryPi4 for yolov4-tiny (int8)?

RaspberryPi4 + CPU only + INT8 + Tensorflow Lite (4 threads) + 256x256 with 88ms/inference Performance. Twitter: Ec-oBTBU8AAdERO

RaspberryPi4 + CPU only + INT8 + Tensorflow Lite (4 threads) + 416x416 with 243ms/inference Performance. Twitter: Ecu2ocHVcAEsTK_

Unfortunately, the conversion to models for Coral TPU-Edge was not successful due to a bug in the Tensorflow Lite Converter.

What script did you use to get yolov4_tiny_voc.json?

  1. First, clone the following repositories.
  2. All you have to do is modify the script, change it to the following and run it.
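    ### tensorflow-gpu==1.15.2
    from keras.layers import Input
    from nets.yolo4_tiny import yolo_body  # module from the cloned repository

    # 416x416 RGB input; 3 and 20 are passed to yolo_body as in the original script
    image_input = Input(shape=(416, 416, 3))
    model = yolo_body(image_input, 3, 20)

    # Write the model architecture to a JSON file
    with open('yolov4_tiny_voc.json', 'w') as f:
        f.write(model.to_json())

AlexeyAB commented 4 years ago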

@PINTO0309 Thank you for your huge work as a hobby!

RaspberryPi4 + CPU only + INT8 + Tensorflow Lite (4 threads) + 416x416 with 243ms/inference Performance.

It seems it doesn't work fast on RPi4.

Unfortunately, the conversion to models for Coral TPU-Edge was not successful due to a bug in the Tensorflow Lite Converter.

Do you know if there is a plan to fix this?

Yes. I've described how to do the conversion in some of my blog posts below. [English ver.] [Tensorflow Lite] Various Neural Network Model quantization methods for Tensorflow Lite (Weight Quantization, Integer Quantization, Full Integer Quantization, Float16 Quantization, EdgeTPU). As of May 05, 2020. Alternatively, you can find a repository of perfect tutorials below.

Thanks, it helps a lot.

PINTO0309 commented 4 years ago

@AlexeyAB Thank you for your reply.

Do you know if there is a plan to fix this?

No. I have posted similar issues, but so far I haven't received a definitive answer.

I don't know if Keras' implementation of YoloV4-tiny correctly replicates the original implementation, but I sympathize with you. I'm going to try to build OpenCV / NCNN myself for the first time in a long time. And I'm going to try it in Pi4.

PINTO0309 commented 4 years ago

@AlexeyAB RaspberryPi4 + Ubuntu 19.10 aarch64 + ncnn + CPU only + 4 threads + YoloV4-tiny 416x416 300ms/pred Screenshot 2020-07-25 10:48:28

AlexeyAB commented 4 years ago

@PINTO0309 Thanks!

RaspberryPi4 + Ubuntu 19.10 aarch64 + ncnn + CPU only + 4 threads + YoloV4-tiny 416x416 300ms/pred

It seems yolov4-tiny speed is the same as mobilenet_yolo on RPi4.

Can you try to quantize yolov4-tiny to int8 and test it on RPi4?

If that does not help much, it seems we should try to implement yolov4-tiny with Depthwise/Grouped-convolution.

PINTO0309 commented 4 years ago

@AlexeyAB After optimization and INT8 quantization, performance was mysteriously degraded.

RaspberryPi4 + Ubuntu 19.10 aarch64 + ncnn + CPU only + 4 threads + YoloV4-tiny int8 416x416 326ms/pred Screenshot 2020-07-26 02:47:22

AlexeyAB commented 4 years ago

@PINTO0309 Thanks!

RaspberryPi4 + Ubuntu 19.10 aarch64 + ncnn + CPU only + 4 threads + YoloV4-tiny int8 416x416 326ms/pred

So int8 isn't faster on RPi4 + NCNN. We should try to implement yolov4-tiny with Depthwise/Grouped convolutions.

RaspberryPi4 + CPU only + INT8 + Tensorflow Lite (4 threads) + 416x416 with 243ms/inference Performance.

RaspberryPi4 + Ubuntu 19.10 aarch64 + ncnn + CPU only + 4 threads + YoloV4-tiny 416x416 300ms/pred

So TensorFlow-Lite is about 1.23x faster than NCNN (300 ms vs. 243 ms per inference).
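For reference, the latencies quoted in this thread can be turned into frames per second and relative speedups with a few lines of Python. This is only arithmetic on the numbers reported above; the helper names `fps` and `speedup` are made up for this sketch.

```python
def fps(latency_ms):
    """Frames per second for a given per-inference latency in milliseconds."""
    return 1000.0 / latency_ms


def speedup(slow_ms, fast_ms):
    """How many times faster the 'fast' configuration is than the 'slow' one."""
    return slow_ms / fast_ms


# Latencies reported in this thread (RaspberryPi4, 4 threads, 416x416)
tflite_int8_ms = 243.0
ncnn_fp32_ms = 300.0
ncnn_int8_ms = 326.0

for name, ms in [("TFLite INT8", tflite_int8_ms),
                 ("ncnn FP32", ncnn_fp32_ms),
                 ("ncnn INT8", ncnn_int8_ms)]:
    print(f"{name}: {fps(ms):.2f} FPS")

print(f"TFLite INT8 vs ncnn FP32: {speedup(ncnn_fp32_ms, tflite_int8_ms):.2f}x")
```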

nihui commented 4 years ago

Thanks ! As far as I know, the efficiency of ncnn int8 implementation is very poor, and it is normal that the speed is even slower than ncnn fp32. I am currently working on fp16 and gpu acceleration, and I have no plan to optimize the efficiency of int8 in the coming weeks.

Maybe one day, when I can't stand the speed of int8 anymore, I will try optimizing it :smiley:

AlexeyAB commented 4 years ago

@nihui Yes, GPU optimization is more important, especially on smartphones.

Are you using Vulkan or self-written functions for int8 inference?

PINTO0309 commented 4 years ago

RaspberryPi4 (2.0GHz overclock) + Ubuntu 20.04 aarch64 + CPU only + INT8 + Tensorflow Lite (4 threads) + 416x416 with 224ms/inference Performance. Screenshot 2020-08-01 20:16:27

PINTO0309 commented 4 years ago

@AlexeyAB I've created a script that automatically converts NCHW to NHWC. I will be adding more layers of support gradually.

AlexeyAB commented 4 years ago

@PINTO0309 Great! So now we can convert any model: PyTorch (NCHW) -> ONNX (NCHW) -> OpenVINO (NCHW) -> TF(pb) (NHWC) -> (NHWC) TFLite/TFJS/TF-TRT ... -> CoreML (NHWC)

PINTO0309 commented 4 years ago

@AlexeyAB Yes. It's hard to understand without reading all the logic, but for most of the layers the weights are converted to NHWC at the moment they are set as Bias and Kernel. I just convert each weight to a Const or np.ndarray temporarily and keep it in a dict.
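A minimal sketch of that idea, assuming convolution weights arrive in the OIHW layout used on the NCHW side and are needed in the HWIO layout that TensorFlow's Conv2D expects. The dict name `converted` and its keys are hypothetical and are not taken from the actual script.

```python
import numpy as np


def to_tf_kernel(weight_oihw):
    """OIHW (out, in, height, width) -> HWIO (height, width, in, out)."""
    return np.transpose(weight_oihw, (2, 3, 1, 0))


# Hypothetical store: layer name -> constant already in the target layout
converted = {}

# O=4, I=3, H=2, W=5, with distinct sizes so a wrong permutation is easy to spot
w = np.arange(4 * 3 * 2 * 5, dtype=np.float32).reshape(4, 3, 2, 5)
b = np.zeros(4, dtype=np.float32)

converted["conv1/kernel"] = to_tf_kernel(w)   # transposed once, at conversion time
converted["conv1/bias"] = b                   # biases are 1-D, no transpose needed

print(converted["conv1/kernel"].shape)
```

Because the transpose happens once on the constants, no Transpose operator has to run at inference time for these weights.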

AlexeyAB commented 4 years ago

@PINTO0309 It can be very useful!

    import torch

    # Load the pretrained, export-friendly EfficientNet-Lite3 from torch.hub
    model = torch.hub.load(
        "rwightman/gen-efficientnet-pytorch",
        "tf_efficientnet_lite3",
        pretrained=True,
        exportable=True,
    )

    rand_example = torch.rand(1, 3, 256, 256)
    output1 = model(rand_example)

    traced_model = torch.jit.trace(model, rand_example)
    scripted_model = torch.jit.script(model)
    torch.onnx.export(model, rand_example, 'model.onnx', opset_version=10)


When I tried such a conversion of the `tf_efficientnet_lite3` model (PT->ONNX->TF->TFLite), the result wasn't optimal and could run only on Mobile-CPU, not on Mobile-GPU/NPU:

PINTO0309 commented 4 years ago

@AlexeyAB Thank you for providing useful information. I was just about to attempt the EfficientNet-B0-PyTorch conversion. However, I know there is still a bug in the conversion of OPs that manipulate the axis, which prevents the conversion from finishing correctly.

I'm debugging a few things, so please be patient for a moment.

PINTO0309 commented 4 years ago

The Midasnet GroupConvolution will probably need to be split with tf.keras.layers.SeparableConv2D or tf.nn.separable_conv2d. It looks like it needs a slightly tricky implementation.

AlexeyAB commented 4 years ago

Is keras.layers.Conv2D(filters=out_shape, kernel_size=3, data_format=None, groups=groups) with groups > 1 suitable for this?

There may just be a different layout: [ky][kx][c][n] or [ky][kx][n][c] or something else.

PINTO0309 commented 4 years ago

Oh... I'll try it when I get home!😄

PINTO0309 commented 4 years ago

@AlexeyAB As it turns out, it worked. Unfortunately, though, the protocol buffer size limit was exceeded, so saving the model failed with an error. Midasnet seems to be too large for my inefficient conversion program. :confounded:

WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/ Model.state_updates (from is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
2020-10-20 22:16:47.075953: W tensorflow/python/util/] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/training/tracking/ Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
Traceback (most recent call last):
  File "", line 788, in <module>
  File "", line 785, in main

  File "", line 704, in convert

  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/", line 1006, in save
    path, saved_model.SerializeToString(deterministic=True))
ValueError: Message tensorflow.SavedModel exceeds maximum protobuf size of 2GB: 6958463348

The conversion to TFLite was successful. This is Float32, so it is a huge size of 416MB. I haven't checked the operation, so I don't know if I can infer correctly.
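The size in the error above can be put in perspective with a small check against the 2 GB cap on a single serialized protobuf message. This is only a sketch: the `weights` dict holds hypothetical toy arrays, and counting raw array bytes ignores whatever other data the serialized model contains.

```python
import numpy as np

# A single protobuf message cannot be serialized beyond 2**31 - 1 bytes
PROTOBUF_LIMIT_BYTES = 2**31 - 1


def total_weight_bytes(weights):
    """Sum the raw byte size of every array in a name -> ndarray dict."""
    return sum(w.nbytes for w in weights.values())


def fits_in_single_protobuf(num_bytes):
    """True if a payload of this size is below the protobuf cap."""
    return num_bytes <= PROTOBUF_LIMIT_BYTES


# Hypothetical toy weights, just to exercise the helpers
weights = {
    "conv1/kernel": np.zeros((3, 3, 3, 32), dtype=np.float32),
    "conv1/bias": np.zeros(32, dtype=np.float32),
}
print(total_weight_bytes(weights), fits_in_single_protobuf(total_weight_bytes(weights)))

# Size reported in the traceback above
reported_bytes = 6958463348
print(fits_in_single_protobuf(reported_bytes), reported_bytes / PROTOBUF_LIMIT_BYTES)
```

The reported SavedModel is a bit more than three times the cap, while the 416MB TFLite file is well below it, which is consistent with the SavedModel carrying far more than one copy of the weights.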

AlexeyAB commented 4 years ago


But, unfortunately, the protocol buffer size limit was exceeded and the timing of saving the model resulted in an error. The conversion to TFLite was successful.

Do you mean that Keras-h5 model can't be saved, but Tflite was saved successfully?

Seems to be there is something wrong:

    /usr/local/lib/python3.6/dist-packages/tensorflow/lite/python/ in allocate_tensors(self)
        241 self._delegates = []
        242 if experimental_delegates:
    --> 243 self._delegates = experimental_delegates
        244 for delegate in self._delegates:
        245 self._interpreter.ModifyGraphWithDelegate(

    RuntimeError: tensorflow/lite/kernels/ input->dims->data[3] != filter->dims->data[3] (256 != 8)
    Node number 5 (CONV_2D) failed to prepare.

PINTO0309 commented 4 years ago

@AlexeyAB Thank you.

Do you mean that Keras-h5 model can't be saved, but Tflite was saved successfully?

Yes. It fails to save saved_model and h5.

Seems to be there is something wrong:

Hmmm. It's not easy.

Btw, I also tried converting EfficientNet-lite3, but it seems that the process after the last ReLU6 is not compatible with TFLite. I have not yet confirmed the operation of this one either.

AlexeyAB commented 4 years ago

I made such conversion of pt-weights to tflite-weights for EfficientNet-Lite3 successfully, and TFlite model works well:

I only converted weights, but the structure is taken from the official repository, there is such ReLU6 implementation tf.nn.relu6:

The same as in your repo tf.nn.relu6:

PINTO0309 commented 4 years ago

@AlexeyAB Conv2D groups - TFLite

Unfortunately, it seems this is not currently supported.

AlexeyAB commented 4 years ago

PINTO0309 commented 4 years ago

@AlexeyAB I implemented GroupConvolution with the standard Conv2D plus Split and Concat, although I may have failed to transpose the weights. This model is chaotic.

Midasnet - Float32 - GroupConvolution - TFLite(.tflite) Screenshot 2020-10-22 00:15:48
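The Split / Conv2D / Concat decomposition mentioned above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not the converter's code: `conv2d_nhwc` is a naive hypothetical convolution, and the weights are assumed to arrive as OIHW with `I = C / groups`, so each group's slice still has to be transposed to HWIO.

```python
import numpy as np


def conv2d_nhwc(x, kernel_hwio):
    """Naive stride-1, 'valid' convolution on NHWC input (illustration only)."""
    n, h, w, _ = x.shape
    kh, kw, _, out_ch = kernel_hwio.shape
    out_h, out_w = h - kh + 1, w - kw + 1
    out = np.zeros((n, out_h, out_w, out_ch), dtype=x.dtype)
    for i in range(kh):
        for j in range(kw):
            patch = x[:, i:i + out_h, j:j + out_w, :]
            out += np.einsum("nhwc,co->nhwo", patch, kernel_hwio[i, j])
    return out


def grouped_conv2d_nhwc(x, weight_oihw, groups):
    """Grouped convolution built from Split -> Conv2D per group -> Concat."""
    x_parts = np.split(x, groups, axis=-1)           # split input channels
    w_parts = np.split(weight_oihw, groups, axis=0)  # split output filters
    outs = [
        conv2d_nhwc(xp, np.transpose(wp, (2, 3, 1, 0)))  # OIHW -> HWIO per group
        for xp, wp in zip(x_parts, w_parts)
    ]
    return np.concatenate(outs, axis=-1)             # concat along channels


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 5, 5, 4))

# groups == channels with a 1x1 kernel reduces to a per-channel scale
w_depthwise = rng.standard_normal((4, 1, 1, 1))
y_depthwise = grouped_conv2d_nhwc(x, w_depthwise, groups=4)

# 2 groups: 4 input channels -> 6 output channels, 3x3 kernel
w_two_groups = rng.standard_normal((6, 2, 3, 3))
y_two_groups = grouped_conv2d_nhwc(x, w_two_groups, groups=2)
print(y_depthwise.shape, y_two_groups.shape)
```

If the per-group transpose or the axis used for the weight split is wrong, the graph still runs but produces garbage, which matches the kind of symptom discussed in the following comments.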

AlexeyAB commented 4 years ago

@PINTO0309 Thanks! Yes, it works, but it seems there is something wrong with weights (result at the end):


But result should be: output

PINTO0309 commented 4 years ago

I still think I'm transposing the weights the wrong way. It's 1AM in Japan, so I'll try again tomorrow. :smile:

PINTO0309 commented 4 years ago

@AlexeyAB I don't know if the conversion was successful, but the result looks good. Is this the result you were hoping for? Since I have directly replaced the Google Drive model, you can simply rerun the Notebook you provided and it should produce the same results.

Please correct just one line below.

img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)# / 255.0

Screenshot 2020-10-22 20:50:23

AlexeyAB commented 4 years ago

@PINTO0309 Great! Yes, it is very similar to baseline result: There is a difference, possibly due to slightly different models.

Also did you try to convert default classifier EfficientNet-lite?

PINTO0309 commented 4 years ago

@AlexeyAB I'm going to start trying to convert EfficientNet-lite today. Currently there seems to be a problem with Gather or Reshape or ShapeOf conversions, so I need to debug it.

PINTO0309 commented 4 years ago

@AlexeyAB Btw, I successfully completed the conversion process to various frameworks two months ago.

I did a crazy implementation, but the conversion to tflite appears to have succeeded. I have not checked the operation. What are the benefits of successfully completing this conversion process?

AlexeyAB commented 4 years ago


What are the benefits of successfully completing this conversion process?

So your converter can be very useful

AlexeyAB commented 4 years ago


Great! It seems your model works well. I compared 3 EfficientNet-Lite3 models on the same dog.jpg image:

  1. 256x256: your tf_efficientnet_lite3_256x256_float32.tflite

  2. 280x280: TensorFlow-Hub

  3. 256x256: Pytorch:

I added Softmax at the end of 1 and 3, because 2 uses Softmax.

There are some differences here, possibly due to different normalization and different network resolutions:

1. TF-cst =  [[839 266 912 505 891]]
2. TF-Hub =  [[839 277 995 912 298]]
3. PT-hub =  [[266 839 376 958 850]]
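Top-5 lists like the ones above can be produced from raw classifier outputs with a softmax followed by a sort. The sketch below uses hypothetical logits, not the outputs of any of the three models, and only shows the mechanics.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def top_k_classes(logits, k=5):
    """Indices of the k most probable classes, most probable first."""
    probs = softmax(logits)
    return np.argsort(-probs, axis=-1)[..., :k]


# Hypothetical logits for a 1000-class classifier
logits = np.zeros((1, 1000), dtype=np.float32)
logits[0, [839, 266, 912]] = [9.0, 7.0, 5.0]

top3 = top_k_classes(logits, k=3)
print(top3)
```

Softmax does not change the ranking, so adding it to models 1 and 3 affects only the reported probabilities, not which class indices appear.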
PINTO0309 commented 4 years ago

@AlexeyAB I spent an hour or so looking for errors in the model, but I couldn't find any mistakes. I converted it to saved_model and then back-converted it to OpenVINO IR to check the weights and structure. There seem to be no errors in the weights or structure; only the final processing part of my model seems to have been restructured.

AlexeyAB commented 4 years ago

@PINTO0309 Sorry, it seems there is no error, my mistake ) Great work!

AlexeyAB commented 4 years ago

@PINTO0309 Hi,

I just can't come up with an idea for implementing Split... If I could, YoloV4 PyTorch could be converted automatically.

What is the problem with 'split'? As I understand you successfully used split for Grouped Convolution.

What YoloV4 PyTorch repository do you mean? or or or ?

PINTO0309 commented 4 years ago

@AlexeyAB Yes, it wasn't hard to break GroupConvolution down into Split and Concat. However, when I try to convert the ONNX in the following repository, there is a standalone Split whose number of outputs is indeterminate, which makes my implementation difficult. This is because the outputs of a single Split are not necessarily concatenated afterwards. Screenshot 2020-11-03 06:50:21

PINTO0309 commented 4 years ago

@AlexeyAB It's not perfect, but I've written a workflow for converting PyTorch(NCHW) to TensorFlow(NHWC) in an article. [English] Converting PyTorch, ONNX, Caffe, and OpenVINO (NCHW) models to Tensorflow / TensorflowLite (NHWC) in a snap - Qiita

Unfortunately, there is still a bug in the Reshape operation of the 5D tensor that causes YoloV4 and ShuffleNet conversions to fail.

AlexeyAB commented 4 years ago

@PINTO0309 Hi, Thanks, Great!

Unfortunately, there is still a bug in the Reshape operation of the 5D tensor that causes YoloV4 and ShuffleNet conversions to fail.

Are you talking about YOLOv4 or CSP-P5-P7 models? Where is this bug, is it in Pytorch, TFlite, or your script? Can you please give a link to a line of code, where is the problem?

PINTO0309 commented 4 years ago


Are you talking about YOLOv4 or CSP-P5-P7 models?

I am testing using the models in the following repositories

Where is this bug, is it in Pytorch, TFlite, or your script?

This is a bug in my openvino2tensorflow.

Can you please give a link to a line of code, where is the problem?

Of course. I'm trying every night, but it's hard to solve the problem. If you combine Reshape and Transpose, and the tensor to be transformed is 5D or 6D, the transposition operation is difficult. So far, I haven't come up with any good ideas.

For example, I feel that converting [1,256,13,13] to [1,256,13,1,13,1] would be a very complex operation in TensorFlow, as shown below. Screenshot 2020-11-16 22:24:37
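To make the difficulty concrete, here is a NumPy sketch of that specific reshape on both layouts. It covers only the easy case where size-1 axes are inserted; the function names are hypothetical, and the general Reshape + Transpose handling in openvino2tensorflow is not reproduced here.

```python
import numpy as np


def expand_nchw(x_nchw):
    """[N, C, H, W] -> [N, C, H, 1, W, 1], as in the NCHW graph."""
    n, c, h, w = x_nchw.shape
    return x_nchw.reshape(n, c, h, 1, w, 1)


def expand_nhwc(x_nhwc):
    """Channels-last counterpart: [N, H, W, C] -> [N, H, 1, W, 1, C]."""
    n, h, w, c = x_nhwc.shape
    return x_nhwc.reshape(n, h, 1, w, 1, c)


x_nchw = np.arange(1 * 256 * 13 * 13, dtype=np.float32).reshape(1, 256, 13, 13)
x_nhwc = np.transpose(x_nchw, (0, 2, 3, 1))

y_nchw = expand_nchw(x_nchw)
y_nhwc = expand_nhwc(x_nhwc)

# 6-D permutation that moves the channel axis last: (N, C, H, 1, W, 1) -> (N, H, 1, W, 1, C)
perm = (0, 2, 3, 4, 5, 1)
same = np.array_equal(np.transpose(y_nchw, perm), y_nhwc)
print(y_nchw.shape, y_nhwc.shape, same)
```

Here the target shape on the NHWC side can be derived by moving the channel axis, but a Reshape that merges or splits the channel axis with spatial axes has no such simple counterpart, which is where a converter has to reason about each 5D or 6D permutation explicitly.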

PINTO0309 commented 4 years ago

Converting ONNX generated by the old branch master to .pb is successful, but converting it to tflite seems to cause an error. Hmmm... It's troubling.

AlexeyAB commented 4 years ago


For example, I feel that converting [1,256,13,13] to [1,256,13,1,13,1] would be a very complex operation in TensorFlow, as shown below.

Yes, there are quite complex transformations here when objects from different branches are merged.

Error dump when converting to TensorFlow Lite

It seems that there is also an issue - TFlite doesn't support all TF operations.

Do you get the same issue with u5 branch?

PINTO0309 commented 4 years ago


Do you get the same issue with u5 branch?

I first tried to generate onnx from the u5 branch, but couldn't export to onnx in the first place. I'll try a few more things with the u5 branch.

PINTO0309 commented 4 years ago

I fixed a bug in openvino2tensorflow and succeeded in converting YOLOv4 to tflite, although I have not verified that the converted model works correctly.

I used the onnx YOLOv4 below.

In anticipation of the conversion to the EdgeTPU model, the PReLU is deliberately changed to a combination of Maximum and Minimum.
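The replacement relies on the identity PReLU(x) = max(x, 0) + alpha * min(x, 0). A minimal NumPy sketch of that identity follows; the function name is made up for illustration, and the converter's own graph construction is not shown.

```python
import numpy as np


def prelu_via_max_min(x, alpha):
    """PReLU expressed with only Maximum, Minimum, Mul and Add."""
    return np.maximum(x, 0.0) + alpha * np.minimum(x, 0.0)


def prelu_reference(x, alpha):
    """Textbook definition: x if x > 0 else alpha * x."""
    return np.where(x > 0, x, alpha * x)


x = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
alpha = 0.25
print(prelu_via_max_min(x, alpha))
```

Both functions agree for any input and slope, so the rewritten graph computes the same activation using only element-wise operators.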

Model structure diagram of YOLOv4 tflite (image: yolov4_float32 tflite)

AlexeyAB commented 4 years ago

@PINTO0309 Great!

In anticipation of the conversion to the EdgeTPU model, the PReLU is deliberately changed to a combination of Maximum and Minimum.

Is it because EdgeTPU doesn't support PReLU?

Can you try to convert yolov4x-mish.onnx to yolov4x-mish.tflite ?

PINTO0309 commented 4 years ago


Is it because EdgeTPU doesn't support PReLU?

Yes. The PReLU was not present in the supported OPs listed at the following URL.

The Transpose at the end was in the way, so I edited OpenVINO's .xml to remove it and then converted it to .tflite. It looks structurally sound, but I'm not sure if it works correctly.

Model structure diagram of yolov4x-mish tflite (image: yolov4x-mish_float32 tflite)

PINTO0309 commented 4 years ago

It was very hard work, but it looks like I was able to refurbish openvino2tensorflow to generate the EdgeTPU model of YOLOv4-tiny. I found that there is a bug regarding the Resize OP conversion in either edgetpu_compiler or TFLiteConverter.

AlexeyAB commented 4 years ago

@PINTO0309 Great!

I found that there is a bug regarding the Resize OP conversion in either edgetpu_compiler or TFLiteConverter.

How did you solve or avoid it?

It was very hard work, but it looks like I was able to refurbish openvino2tensorflow to generate the EdgeTPU model of YOLOv4-tiny.

Did you check it, does it produce approximately the same result as source yolov4-tiny model?

What source model do you use, is it yolov4-tiny? Is it Pytorch URL or TensorFlow URL or Darknet URL or OpenVINO URL or TensorRT/ONNX URL yolov4-tiny model?

PINTO0309 commented 4 years ago


How did you solve or avoid it?

I used the TensorFlow v2.x converter to pass a full-integer-quantized model containing resize_nearest_neighbor or resize_bilinear upsampling to edgetpu_compiler, and I noticed that the resize op is not properly converted to an op for the EdgeTPU. The problem was caused by the half_pixel_centers of the resize op being true when doing full integer quantization.

So I combined a Lambda OP with tf.compat.v1.image.resize_bilinear or tf.compat.v1.image.resize_nearest_neighbor to force half_pixel_centers to be set to False. Below are the changes I made to openvino2tensorflow.
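What the half_pixel_centers flag changes is the mapping from output pixel index to source coordinate during resizing. The NumPy sketch below illustrates the two mappings for bilinear resizing, assuming the usual conventions (`dst * scale` without the flag, `(dst + 0.5) * scale - 0.5` with it); it is not the TensorFlow kernel itself, and `source_coords` is a hypothetical helper.

```python
import numpy as np


def source_coords(out_size, in_size, half_pixel_centers):
    """Source coordinate sampled for each output index along one axis."""
    scale = in_size / out_size
    dst = np.arange(out_size, dtype=np.float64)
    if half_pixel_centers:
        # Pixel centers sit at index + 0.5
        return (dst + 0.5) * scale - 0.5
    # Legacy behaviour: pixel "centers" sit at the integer index
    return dst * scale


# 2x upsampling of a length-2 axis to length 4
legacy = source_coords(4, 2, half_pixel_centers=False)
half_pixel = source_coords(4, 2, half_pixel_centers=True)
print(legacy)
print(half_pixel)
```

The two settings sample different positions, so forcing the flag to False changes the resize attributes in the exported graph and can shift the upsampled values slightly relative to the original model.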

I have been playing with converting models that are committed to various repositories, so in this case I converted the models in the following repositories. keras -> openvino -> openvino2tensorflow -> EdgeTPU

The work I carry out is always fickle.

itsmasabdi commented 4 years ago

Hi @PINTO0309

I tried converting the Keras model from but had no luck in the end: I received a "model not quantized" error when passing the model to the edgetpu_compiler.

Here is the process I followed.

1- Convert the Keras model to a frozen graph (.pb)
2- Convert the .pb model to OpenVINO using python --input_model {pb_file} --output_dir {output_dir} --input_shape {input_shape_str}
3- openvino2tensorflow --model_path={model_path} --output_weight_quant_tflite True
4- Run the edge_tpu compiler on the resulting file: edgetpu_compiler model_weight_quant.tflite

Is there anything I'm missing here?

Here is a glance to the output model file.

