ARM-software / armnn

Arm NN ML Software. The code here is a read-only mirror of https://review.mlplatform.org/admin/repos/ml/armnn
https://developer.arm.com/products/processors/machine-learning/arm-nn
MIT License

Get error with PadV2 unsupported #603

Closed fullm00n1 closed 2 years ago

fullm00n1 commented 2 years ago

I get the following error when trying to load a tflite graph that was generated from converting an ONNX model. It would be helpful if someone could help point me in the right direction:

Error: Failed to parse operator #23 within subgraph #0 error: Operator not supported. subgraph:0 operator:23 opcode_index:6 opcode:60 / PADV2 at function ParseUnsupportedOperator [armnn-devenv/armnn/src/armnnTfLiteParser/TfLiteParser.cpp:923]

I am running ArmNN 21.11 and Compute Library 21.11.

MikeJKelly commented 2 years ago

Hi @fullm00n1

I've just looked and we don't currently support PADV2 in the TfLiteParser, but we do support PAD, so it should be easy to add support for PADV2. I'll see if I can get a patch up today.

ArmNN can only handle PADV2's with a single constant_value but as you're coming from ONNX that shouldn't be a problem.

Best regards, Mike

fullm00n1 commented 2 years ago

How can I tell whether my model only has PADV2s with a single constant value? We believe the PadV2 instances come from the onnx-tf converter functionality in the TensorFlow tools, but we aren't aware of how to avoid the PadV2 as part of the conversion process.

MikeJKelly commented 2 years ago

Hi @fullm00n1

There is a review with PADV2 support at:

https://review.mlplatform.org/c/ml/armnn/+/6908

Can you see if it solves your problem?

From my reading of the ONNX documentation (https://github.com/onnx/onnx/blob/master/docs/Operators.md#Pad), ONNX only supports Pad with a single constant value. If there are multiple constant_values present in your model then the TFLite parser will throw an exception: "Multiple padding values are not supported in PADV2".
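
To illustrate the restriction, here is roughly the kind of check involved once the constant_values tensor has been read out (a sketch only, not the actual parser code; the function name and the float element type are placeholders):

    #include <stdexcept>
    #include <vector>

    // Sketch only: enforce the "single constant_value" restriction once the
    // PADV2 constant_values tensor has been read into a vector.
    float GetSinglePadValueOrThrow(const std::vector<float>& constantValues)
    {
        if (constantValues.empty())
        {
            return 0.0f; // no explicit value supplied: pad with zero
        }
        for (float v : constantValues)
        {
            if (v != constantValues.front())
            {
                throw std::runtime_error("Multiple padding values are not supported in PADV2");
            }
        }
        return constantValues.front();
    }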

Best regards, Mike

fullm00n1 commented 2 years ago

I integrated the changes and ran the application I have that uses the ArmNN libraries. I first verified the application and parser with a tflite model that doesn't have the PadV2. That model ran as expected. I then ran the newer model with the PadV2 layers. I hit an assertion in the AddPadLayer of the TfLiteParser. It looks to be something in the pad descriptor, based on the call stack and the compiler's limited symbol output.

My data type for all layers in the model is Float32.

The specific assertion came from malloc:

malloc.c:2406: sysmalloc: Assertion `(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)' failed.

MikeJKelly commented 2 years ago

That might be a problem trying to copy the data for the constant_value or the paddings from the TFLite model. Can you run your application under valgrind? That might give us a better idea of what's going wrong. https://valgrind.org/

Otherwise, can you tell us a little more about your PADV2 layer? You can visualise your TFLite model using Netron: https://netron.app/

Then look for the PADV2 nodes in the graph, select them, and confirm that they have values for the pad constant value and the paddings.

fullm00n1 commented 2 years ago

Running valgrind might be hard as we are running on a Xilinx Zynq UltraScale+ (custom ARM64 environment). Also, Netron isn't a real option as our network is proprietary and we have restrictions on what tools/apps we can use with our graph/model. I have attached a sanitized TensorFlow report of the graph post ONNX-to-TF translation, with all the Pad and PadV2 layer info from our model. Does this help?

[Attachment: tensorflow_report]

fullm00n1 commented 2 years ago

I tried compiling and running valgrind, but it chokes on my custom Linux build for the custom platform we are running on.

fullm00n1 commented 2 years ago

I can add some old-school print statements in the ParsePad function if that would help. I'm not exactly sure what to instrument, but it appears to break at the m_Network->AddPadLayer call. Is there something specific to instrument on the inputs to that call?
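
For example, something like this is the kind of instrumentation I had in mind (a rough sketch; padTensorInfo here stands for whatever the paddings TensorInfo is called inside ParsePad, which is an assumption on my part):

    #include <iostream>
    #include <armnn/Tensor.hpp>
    #include <armnn/TypesUtils.hpp>

    // Illustrative debug prints only: dump the basic properties of the
    // paddings tensor just before m_Network->AddPadLayer is called.
    void DumpPadTensorInfo(const armnn::TensorInfo& padTensorInfo)
    {
        std::cout << "paddings dtype:    "
                  << armnn::GetDataTypeName(padTensorInfo.GetDataType()) << "\n";
        std::cout << "paddings dims:     " << padTensorInfo.GetNumDimensions() << "\n";
        std::cout << "paddings elements: " << padTensorInfo.GetNumElements() << "\n";
        std::cout << "paddings bytes:    " << padTensorInfo.GetNumBytes() << "\n";
    }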

MikeJKelly commented 2 years ago

It looks like the const_values are missing from the PADV2 (Tensor 6 PadV2/constant_values has a Shape of []). I've updated https://review.mlplatform.org/c/ml/armnn/+/6908 to hopefully better handle this. Could you please check if it prevents the issue with the failed assertion?

fullm00n1 commented 2 years ago

Mike,

The updated code shows the same sysmalloc assertion. After tracing things down, it appears that the following line is the problem for the PadV2 case:

      ::memcpy(padBuffer.data(), bufferPtr->data.data(), padTensorInfo.GetNumBytes());

The memcpy seems to be the cause of the sysmalloc assertion in malloc.c.

If I wrap that line in a conditional like this:

      if (opcode == tflite::BuiltinOperator_PAD)
      {
          ::memcpy(padBuffer.data(), bufferPtr->data.data(), padTensorInfo.GetNumBytes());
      }

Then the sysmalloc assertion does not occur. However, when I place the above conditional, as you might expect, the model doesn't pass validation.

Is it possible your assumption about the structure of the tensors is not correct for the PadV2?

Thanks for your help on this, and I look forward to helping get this solved.

fullm00n1 commented 2 years ago

I was able to get tensorboard to show the graph of the tensorflow model that we converted to tensorflow lite. I have attached what the PadV2 looks like in tensorboard.

[Screenshot: PadV2 node as shown in TensorBoard]

And below are the attributes TensorBoard reports for each part of the PadV2:

[Screenshots: TensorBoard attribute details for each part of the PadV2]

MikeJKelly commented 2 years ago

It looks like the problem is the paddings data type (int64). I'm not sure why that's int64 rather than int32, but looking at https://www.tensorflow.org/api_docs/python/tf/pad TF seems to expect paddings to be "A Tensor of type int32", in Python at least.

In our TFLite parser we end up trying to memcpy the uint64 values there into a uint32 vector which causes a crash. I've updated the review with code to work around this and a test where the paddings are int64 and it seems to be working now. Can you try it with your model please?

https://review.mlplatform.org/c/ml/armnn/+/6908
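
Conceptually, the change is along these lines (a simplified sketch of the idea, not the exact patch; the function name and parameters are placeholders):

    #include <cstdint>
    #include <cstring>
    #include <vector>

    // Sketch: when the paddings tensor is int64, copy it element by element
    // into the 32-bit vector the parser works with, instead of memcpy'ing
    // raw bytes into a buffer of the wrong size.
    std::vector<uint32_t> ReadPaddings(const uint8_t* rawData,
                                       size_t numElements,
                                       bool isInt64)
    {
        std::vector<uint32_t> paddings(numElements);
        if (isInt64)
        {
            std::vector<int64_t> wide(numElements);
            std::memcpy(wide.data(), rawData, numElements * sizeof(int64_t));
            for (size_t i = 0; i < numElements; ++i)
            {
                paddings[i] = static_cast<uint32_t>(wide[i]);
            }
        }
        else // int32 paddings, as PAD/PADV2 normally provide
        {
            std::memcpy(paddings.data(), rawData, numElements * sizeof(int32_t));
        }
        return paddings;
    }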

fullm00n1 commented 2 years ago

Mike,

Again, thanks for your help on this issue.

The TfLiteParser fix appears to have solved the memory allocation issue. Now, my call to CreateNetworkFromBinaryFile executes successfully. However, I was not able to run inference using the model because my call to "armnn::Optimize" fails. It fails in Layer::VerifyShapeInferenceType. I placed a print statement at the beginning of that method to determine which layer is the issue. This was the output:

[Screenshot: output identifying the failing layer]

I am not sure why the Sub op has a problem. To review my conversion process: I first take our ONNX model and run onnx-tf on it to convert it to a TensorFlow .pb saved model. I then import the saved model in TensorFlow and use the converter within TensorFlow to convert it to a TensorFlow Lite model. That is what I am using with the ArmNN library when I call CreateNetworkFromBinaryFile.

So at this point I am still not able to run the model for inference and verify that the PadV2 works. Any suggestions on what might help diagnose the problem during validation?

There is only one Sub op in the model. In the Tensorflow model, the Sub op shows up in Tensorboard as shown below:

[Screenshots: Sub op as shown in TensorBoard]

Running tensorflow.lite.tools.visualize on the TensorFlow Lite model, I get this info regarding the Sub op in the converted model:

[Screenshot: Sub op details from tensorflow.lite.tools.visualize]

I feel like we are getting closer.

MikeJKelly commented 2 years ago

If you're using ShapeInferenceMethod::ValidateOnly then ArmNN won't be able to load models with dynamic shapes.

If you set the ShapeInferenceMethod to ShapeInferenceMethod::InferAndValidate instead of ShapeInferenceMethod::ValidateOnly then ArmNN will attempt to figure out valid shapes when the shapes are dynamic.

Are you running your model through ExecuteNetwork? If you are, you can run it with the --infer-output-shape command line switch to use ShapeInferenceMethod::InferAndValidate.

If you're running it from your own code then you'll have to switch your ShapeInferenceMethod to InferAndValidate while you're optimizing your networks. Something like this:

OptimizerOptions optOptions;
optOptions.m_shapeInferenceMethod = armnn::ShapeInferenceMethod::InferAndValidate;
IOptimizedNetworkPtr optNet = Optimize(*net, backends, runtime->GetDeviceSpec(), optOptions);

fullm00n1 commented 2 years ago

Mike, we are running the API from our own code. I tried adding the OptimizerOptions to the Optimize call but got the same error:

[Screenshot: same error output as before]

Here is a snippet of code from our application that I am using to create and load the network. Am I missing something?

      // Import the TensorFlowLite model.
      using IParser = armnnTfLiteParser::ITfLiteParser;
      auto armnnparser(IParser::Create());
      armnn::INetworkPtr network = armnnparser->CreateNetworkFromBinaryFile(programOptions.modelPath.c_str());

      // Find the binding points for the input and output nodes
      using BindingPointInfo = armnnTfLiteParser::BindingPointInfo;
      const std::vector<BindingPointInfo> inputBindings  = { armnnparser->GetNetworkInputBindingInfo(0, inputName) };
      const std::vector<BindingPointInfo> outputBindings = { armnnparser->GetNetworkOutputBindingInfo(0, outputName) };

      // ------------------------------------------------------------------------
      // Optimize graph and load the optimized graph onto a compute device
      // ------------------------------------------------------------------------

      // Optimize the network for a specific runtime compute device, e.g. CpuAcc, GpuAcc
      armnn::IRuntime::CreationOptions options;
      armnn::IRuntimePtr runtime(armnn::IRuntime::Create(options));
      armnn::OptimizerOptions optOptions;
      optOptions.m_shapeInferenceMethod = armnn::ShapeInferenceMethod::InferAndValidate;
      armnn::IOptimizedNetworkPtr optimizedNet = armnn::Optimize(*network,
                                programOptions.computeDevice,
                                runtime->GetDeviceSpec(),
                                optOptions);
      // Load the optimized network onto the runtime device
      armnn::NetworkId networkId;
      runtime->LoadNetwork(networkId, std::move(optimizedNet));

fullm00n1 commented 2 years ago

I may not be understanding something correctly, but the call to Optimize with the InferAndValidate option seems to trigger Optimize to call Graph::InferTensorInfos. At the end of the loop in InferTensorInfos, it appears that the ValidateOnly path executes, because it looks at the layer->m_ShapeInferenceMethod member, which is set to ValidateOnly by default when the layer is created.

Am I missing some step that would change the layer->m_ShapeInferenceMethod to something other than ValidateOnly?

MikeJKelly commented 2 years ago

Apologies, I asked around and there's a flag you need to set on the TFLiteParser too.

     armnnTfLiteParser::ITfLiteParser::TfLiteParserOptions parserOptions;
     parserOptions.m_InferAndValidate = true;

     // Import the TensorFlowLite model.
     using IParser = armnnTfLiteParser::ITfLiteParser;
     auto armnnparser(IParser::Create(parserOptions));

Then layer->m_ShapeInferenceMethod will be InferAndValidate on all created layers.
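
Putting the two flags together, the flow would look roughly like this (a sketch based on the snippets above; the function name, model path handling and the CpuAcc backend choice are just placeholders):

    #include <armnn/ArmNN.hpp>
    #include <armnnTfLiteParser/ITfLiteParser.hpp>

    // Rough sketch of the combined flow: parser flag plus optimizer option.
    // Usage would be something like: auto optNet = LoadAndOptimize(path, *runtime);
    armnn::IOptimizedNetworkPtr LoadAndOptimize(const char* modelPath,
                                                armnn::IRuntime& runtime)
    {
        // Tell the parser to infer and validate shapes while building the network.
        armnnTfLiteParser::ITfLiteParser::TfLiteParserOptions parserOptions;
        parserOptions.m_InferAndValidate = true;
        auto parser = armnnTfLiteParser::ITfLiteParser::Create(parserOptions);

        armnn::INetworkPtr network = parser->CreateNetworkFromBinaryFile(modelPath);

        // Tell the optimizer to do the same when it re-checks the shapes.
        armnn::OptimizerOptions optOptions;
        optOptions.m_shapeInferenceMethod = armnn::ShapeInferenceMethod::InferAndValidate;

        return armnn::Optimize(*network,
                               {armnn::Compute::CpuAcc},
                               runtime.GetDeviceSpec(),
                               optOptions);
    }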

However, I can't be sure that we will be able to support your dynamic tensor shapes as our support for dynamic tensors is quite simple. Can you output your model with static shapes?

fullm00n1 commented 2 years ago

Do you know how I can determine where the dynamic tensor shapes are being inserted? We go from MATLAB to ONNX to TensorFlow to TensorFlow Lite. It appears that the onnx-tf converter may be the one creating the dynamic shapes when going from ONNX to TensorFlow, and you can't really control the conversion. Do you have any suggestions for a simple way within TensorFlow to convert the dynamic tensors to static ones?

fullm00n1 commented 2 years ago

I was able to get Netron installed locally. Is there a way within Netron to see if a tensor is dynamic or static?

MikeJKelly commented 2 years ago

Are you still getting an error when parsing the TFLite model?

Dynamic tensors will have one or more -1 in their shape_signatures.
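
If you'd rather check programmatically after parsing, something along these lines should work (a sketch only; I'm assuming TensorShape::AreAllDimensionsSpecified and the parser's GetSubgraphInputTensorNames behave as I remember):

    #include <iostream>
    #include <string>
    #include <armnn/Tensor.hpp>
    #include <armnnTfLiteParser/ITfLiteParser.hpp>

    // Sketch: report which inputs of subgraph 0 the parser sees as dynamic.
    // 'parser' is assumed to be the ITfLiteParser created in your snippet.
    void ReportDynamicInputs(armnnTfLiteParser::ITfLiteParser& parser)
    {
        for (const std::string& name : parser.GetSubgraphInputTensorNames(0))
        {
            armnn::TensorInfo info = parser.GetNetworkInputBindingInfo(0, name).second;
            if (!info.GetShape().AreAllDimensionsSpecified())
            {
                std::cout << "input '" << name << "' has a dynamic shape\n";
            }
        }
    }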

fullm00n1 commented 2 years ago

I was able to finally run our model and got reasonable results. Interestingly, in Netron the ONNX model shows no dynamic tensors, if a -1 in the shape signature is the indicator. It appears that the onnx-tf converter is the one inserting the dynamic tensors. From what we can tell, this is a change made to that converter in the last year or two.

TeresaARM commented 2 years ago

Hi @fullm00n1,

I understand your problem is resolved. Please let us know otherwise.

TeresaARM commented 2 years ago

Hi @fullm00n1,

I am closing the issue as it seems your problem has been resolved. Please re-open the issue otherwise, and thank you for reporting this.

Kindest regards