bytedeco / javacv

Java interface to OpenCV, FFmpeg, and more

Invalid Values When Converting Blob to FloatBuffer for ONNX Inference #1951

Closed: kono94 closed this issue 1 year ago

kono94 commented 1 year ago

Hello, I am trying to get a self-trained YOLOv7-tiny network to run in a Java application. I tried using OpenCV's dnn module, but it does not currently include the NMS module, and exporting the ONNX model with the detection layer results in a weird error, so I switched to ONNX Runtime for inference.

In short: with OpenCV's inference the problem is the output (the input is easy, just use the blob created by OpenCV's "blobFromImage"), while with the ONNX approach the input is cumbersome, as it does not accept Mats, only float arrays or FloatBuffers.

The problem arises when converting the blob (an NCHW Mat) to a FloatBuffer: somehow its values are all 0.0?! What am I doing wrong?

See the code and output as comments:

import ai.onnxruntime.*;
import org.bytedeco.opencv.opencv_core.Mat;
import org.bytedeco.opencv.opencv_core.Scalar;
import org.bytedeco.opencv.opencv_core.Size;
import java.nio.ByteBuffer;
import java.nio.FloatBuffer;
import static org.bytedeco.opencv.global.opencv_core.CV_32F;
import static org.bytedeco.opencv.global.opencv_imgcodecs.imread;
import static org.bytedeco.opencv.global.opencv_dnn.blobFromImage;
(..)

      SessionOptions opts = new SessionOptions();
      opts.setOptimizationLevel(OptLevel.BASIC_OPT);
      OrtSession session = OnnxNet.ortEnvironment.createSession("model.onnx", opts);

      String inputName = session.getInputNames().iterator().next();

      int batchSize = 1;
      int nrChannels = 3;
      int width = 480;
      int height = 480;
      Mat image = imread("test.jpg");

      // resizing image to 480x480 with border to keep aspect ratio
      // copyMakeBorder(inp, inp, top, bottom, left, right, BORDER_CONSTANT, new Scalar(114,114, 114,0));
      LetterBoxResult letterBoxResult = Utility.makeLetterBox(image, 480);

      Mat inputBlob = blobFromImage(image,
          1/255,
          new Size(480, 480),
          new Scalar(0.0),
          true, false, CV_32F);

      System.out.println("Total flatten size: " + image.total() * image.channels()); // 691200
      byte[] return_buff = new byte[(int) (image.total() * image.channels())];
      ByteBuffer bb = image.createBuffer();
      System.out.println("byte buffer get(0): " + bb.get(0)); // 114, first channel of the first pixel of the border
      bb.get(return_buff);

      System.out.println("Blob depth: " + inputBlob.arrayDepth()); // 32 => CV_32F
      FloatBuffer fb = inputBlob.createBuffer();
      System.out.println("Floatbuffer capacity: " + fb.capacity()); // 691200 => correct size
      System.out.println("FloatBuffer get(0): " + fb.get(0)); // 0.0, indeed 0.0 for all entries

      OnnxTensor test = OnnxTensor.createTensor(OnnxNet.ortEnvironment, fb, new long[]{batchSize, nrChannels, height, width});

It works perfectly when I "convert" the Mat to a 4D Java float array "by hand", but I want to avoid this costly looping...

      float[][][][] testData = new float[batchSize][nrChannels][height][width];

      for (int h = 0; h < height; h++) {
          for (int w = 0; w < width; w++) {
              for (int c = 0; c < nrChannels; c++) {
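                  // note: Java bytes are signed; (return_buff[i] & 0xFF) is the usual
                  // unsigned-byte conversion, while "+ 128" shifts every value by 128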
                  testData[0][c][h][w] = (float) ((int) (return_buff[width * nrChannels * h + w * nrChannels + c]) + 128) / 255;
              }
          }
      }

      OnnxTensor test2 = OnnxTensor.createTensor(OnnxNet.ortEnvironment, testData); // works perfectly fine

Thank you in advance for any help!

saudet commented 1 year ago

That's a lot easier to do with indexers, and they are more efficient as well: http://bytedeco.org/news/2014/12/23/third-release/
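For reference, here is a minimal sketch of that approach, assuming the 1x3x480x480 CV_32F blob and the testData array from the original post; FloatIndexer reads the native memory directly, so there is no JNI call per element:

      import org.bytedeco.javacpp.indexer.FloatIndexer;

      // createIndexer() returns a FloatIndexer for a CV_32F Mat
      try (FloatIndexer idx = inputBlob.createIndexer()) {
          for (int c = 0; c < nrChannels; c++) {
              for (int h = 0; h < height; h++) {
                  for (int w = 0; w < width; w++) {
                      // NCHW access via the varargs get(long...)
                      testData[0][c][h][w] = idx.get(0, c, h, w);
                  }
              }
          }
      }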

But if you need to use the "official" Java API, @Craigacp would be the guy to ask about that.

Craigacp commented 1 year ago

So presumably the Mat.createBuffer call hits this - https://github.com/bytedeco/javacpp-presets/blob/master/opencv/src/main/java/org/bytedeco/opencv/opencv_core/AbstractArray.java#L46? If you give ORT a valid buffer it should be fine, but I don't know how JavaCPP converts from the C++ Mat object into a java.nio.Buffer, as it happens before any ORT code executes.

saudet commented 1 year ago

Adam, how many times do I need to tell you before you remember? JavaCPP doesn't do anything with data from OpenCV, ORT, TF, PyTorch, or any other library mapped in the presets. But the data from OpenCV, for example, might not be "compact", that is, it can have arbitrary strides. I don't believe ORT offers any support for that, so that's why I recommend using JavaCPP to work around that, but I may be mistaken, so that's why I'm asking you. What are the limitations of data supported by your bindings?

Craigacp commented 1 year ago

ORT needs dense row major data. However I think @kono94 is saying that when they convert the data from the Mat into a FloatBuffer it's all zeros when inspected using the methods on the FloatBuffer, which is before any ORT code is executed.
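For the strides point above, a minimal guard is sketched below, assuming the Mat from the original post; isContinuous() and clone() are standard cv::Mat methods exposed by the presets:

      // clone() produces a dense copy when the original has padded/arbitrary
      // strides, so its buffer is row-major compact as ORT expects
      Mat dense = inputBlob.isContinuous() ? inputBlob : inputBlob.clone();
      FloatBuffer fb = dense.createBuffer();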

Craigacp commented 1 year ago

One thing I would check is that the ONNX model you want to use does in fact require float inputs. Many image models accept int8 or uint8 inputs, in which case you can wrap the byte array you already have in a ByteBuffer and create the tensor from that. If the model does require float inputs, then it's a fairly straightforward transformation to put the cast & rescale at the front of the ONNX model by rewriting its protobuf slightly.
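A minimal sketch of that uint8 path, assuming (hypothetically) that the model takes uint8 NHWC input, and reusing the return_buff byte array from the original post:

      import ai.onnxruntime.OnnxJavaType;

      // wrap the interleaved HWC bytes; the OnnxJavaType overload declares them UINT8
      ByteBuffer pixels = ByteBuffer.wrap(return_buff);
      OnnxTensor u8Input = OnnxTensor.createTensor(OnnxNet.ortEnvironment, pixels,
          new long[]{1, height, width, nrChannels}, OnnxJavaType.UINT8);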

Is there a way to validate that inputBlob has the right values in it? The codepath you use to access the image elements in the array is quite different to the one which emits the float buffer, and so maybe one step of that transformation is failing.
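One cheap validation, reusing the createBuffer() call from the original post: after a 1/255 rescale every element should fall in [0, 1], so printing the min and max immediately exposes an all-zeros blob.

      FloatBuffer check = inputBlob.createBuffer();
      float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
      for (int i = 0; i < check.capacity(); i++) {
          min = Math.min(min, check.get(i));
          max = Math.max(max, check.get(i));
      }
      System.out.println("blob min=" + min + ", max=" + max);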

saudet commented 1 year ago

> ORT needs dense row major data. However I think @kono94 is saying that when they convert the data from the Mat into a FloatBuffer it's all zeros when inspected using the methods on the FloatBuffer, which is before any ORT code is executed.

Ah, I see, looks like the problem is this line:

     1/255,

@kono94 That's equal to 0, you probably wanted to write 1.0/255.0.
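That is, the corrected call would be:

      Mat inputBlob = blobFromImage(image,
          1.0 / 255.0,            // double literal; the integer division 1/255 yields 0
          new Size(480, 480),
          new Scalar(0.0),
          true, false, CV_32F);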

Still, it's probably faster to use indexers anyway since it doesn't look like we can coerce blobFromImage() to use external buffers either.

kono94 commented 1 year ago

Oh man, such a silly mistake! Thank you very much @saudet and @Craigacp for the help!

Using the blob approach instead of the for-loops even improved the detection results somehow.

Two questions are still on my mind:

  1. In terms of performance, should I be using the official Microsoft Maven Java package (ai.onnxruntime.*) or org.bytedeco.onnxruntime.*, like in the example https://github.com/bytedeco/javacpp-presets/blob/master/onnxruntime/samples/CXXApiSample.java?
  2. @saudet Isn't every Indexer .get() call a single JNI call, so I end up with 700k JNI calls? Or do I misunderstand you? You want me to loop through the data with an indexer to fill the Java 4D array, no?
saudet commented 1 year ago

> 1. In terms of performance, should I be using the official Microsoft Maven Java package (ai.onnxruntime.*) or org.bytedeco.onnxruntime.*, like in the example https://github.com/bytedeco/javacpp-presets/blob/master/onnxruntime/samples/CXXApiSample.java?

It's only "official" in the sense that @Craigacp is from Oracle, and it's designed to be easy to use with a small number of features, making it easier for Microsoft to accept pull requests, but it's not designed for performance, for example, we can't access buffers allocated on GPU memory. For most workloads the overhead probably doesn't matter though.

> 2. @saudet Isn't every Indexer .get() call a single JNI call, so I end up with 700k JNI calls? Or do I misunderstand you? You want me to loop through the data with an indexer to fill the Java 4D array, no?

No, that uses sun.misc.Unsafe, it's fast: http://bytedeco.org/news/2014/12/23/third-release/
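As a sketch, the same indexer can also copy a whole row per call through its bulk get(long[], float[], int, int) overload, assuming the blob and testData array from above, which removes even the per-element accessor calls:

      import org.bytedeco.javacpp.indexer.FloatIndexer;

      try (FloatIndexer idx = inputBlob.createIndexer()) {
          for (int c = 0; c < nrChannels; c++) {
              for (int h = 0; h < height; h++) {
                  // copies `width` floats starting at blob position (0, c, h, 0)
                  idx.get(new long[]{0, c, h, 0}, testData[0][c][h], 0, width);
              }
          }
      }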

Craigacp commented 1 year ago

> 1. In terms of performance, should I be using the official Microsoft Maven Java package (ai.onnxruntime.*) or org.bytedeco.onnxruntime.*, like in the example https://github.com/bytedeco/javacpp-presets/blob/master/onnxruntime/samples/CXXApiSample.java?

It's only "official" in the sense that @Craigacp is from Oracle, and it's designed to be easy to use with a small number of features, making it easier for Microsoft to accept pull requests, but it's not designed for performance, for example, we can't access buffers allocated on GPU memory. For most workload the overhead probably doesn't matter though.

It's official in the sense that it is the supported ORT Java API, and developed in the ORT source tree as part of the ORT project (along with the Python & C# interfaces). The fact that it has an Oracle copyright on it is irrelevant.

The plan is to eventually add IOBinding support which will allow the persistence of tensors on GPUs, though not to the extent that JavaCPP provides. Training support and better CPU memory pinning are going to happen first though.

I'll also point out that it is used in production at several massive tech companies, and they don't complain about performance.

saudet commented 1 year ago

> It's official in the sense that it is the supported ORT Java API, and developed in the ORT source tree as part of the ORT project (along with the Python & C# interfaces). The fact that it has an Oracle copyright on it is irrelevant.

No, it's very relevant. Microsoft can't easily accept pull requests from random dudes. The fact that you are at Oracle does help a lot.

> The plan is to eventually add IOBinding support which will allow the persistence of tensors on GPUs, though not to the extent that JavaCPP provides. Training support and better CPU memory pinning are going to happen first though.

> I'll also point out that it is used in production at several massive tech companies, and they don't complain about performance.

Yes, I know, the overhead of your bindings is typically not relevant for server applications, but it is very relevant in, for example, embedded or mobile applications. The fact that Oracle, Microsoft, and other large corporations run ONNX Runtime in the cloud, where this overhead isn't too relevant, doesn't change the fact that these corporations don't care about embedded and mobile applications, at least when it comes to Java. Please keep an open mind!