Unity-Technologies / barracuda-release


Warning when importing .onnx neural network concerning Resize layer #159

Closed CoaNewco closed 3 years ago

CoaNewco commented 3 years ago

When I import an ONNX model that incorporates a Resize layer I get a warning. Now, I know that internally the model gets converted to Barracuda's NN format and only then used, so I tried to manually convert and save it using ModelWriter.Save(). When I do this, the Resize layer becomes Resample2D, but I still get warnings pointing to the Resize layer. ModelWarning

When we discard the Resize layers from the network everything works fine, but the output is truncated and we don't get the desired results. Could you help me with this issue?

FlorentGuinier commented 3 years ago

Hi CoaNewco,

In the Resize node ONNX definition https://github.com/onnx/onnx/blob/master/docs/Operators.md#Resize there is a nearest_mode parameter. This warning means we only support round_prefer_floor for it. In other words, if the network is using something else, like round_prefer_ceil, we will still apply round_prefer_floor (hence the warning). A workaround, if possible, is to change the network to use round_prefer_floor on those Resize nodes. Is that something you can do?
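As a rough illustration of why this substitution is often harmless (a plain-Python sketch of the two ONNX rounding rules, not Barracuda's actual code), the two nearest_mode options only disagree when the computed source coordinate lands exactly on a half-integer:

```python
import math

# Sketch of the two ONNX nearest_mode rounding rules (illustrative only).
def round_prefer_floor(x):
    # ties like 2.5 round down to 2
    return math.ceil(x - 0.5)

def round_prefer_ceil(x):
    # ties like 2.5 round up to 3
    return math.floor(x + 0.5)

for x in (2.3, 2.5, 2.7):
    print(x, round_prefer_floor(x), round_prefer_ceil(x))
```

So forcing round_prefer_floor only moves source pixels whose sampling coordinate is an exact .5 tie; all other coordinates round identically.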

Have a great day Florent

CoaNewco commented 3 years ago

I will get right back to you. Thanks for the advice!

CoaNewco commented 3 years ago

On an unrelated issue, can I truncate or somehow optimize the ArgMax() function? Currently it's giving me the biggest overhead, 100+ milliseconds. Both of my networks run in the range of 5-10 milliseconds, so ArgMax is giving me the biggest headache. I'm trying to lower the execution time of the whole pipeline to approximately 40ms or lower.

FlorentGuinier commented 3 years ago

Hmm, interesting. There are a bunch of performance caveats with ArgMax() at the moment; for example, on the Compute/GPU backends it introduces extra transposes when the reduction is not over the channel axis. However, those timings seem worse than that. Is it possible for you to share your network, please? I would like to take a look.

Also what backend are you using?

Thanks:)

CoaNewco commented 3 years ago

I've shared the decoder.onnx net via WeTransfer. I'm using the ComputePrecompiled backend. Any advice on these issues is more than welcome. Thank you a lot!

AlexRibard commented 3 years ago

@CoaNewco I do not see ArgMax in your model. Are you using Tensor.ArgMax on the final output? If you append an ArgMax directly in your model it will use the GPU backend, not the CPU one. Can you tell me if this function is optimized enough for you? I've been putting off optimizations for ArgMax due to time constraints for 1.3.3, so if the current implementation is not fast enough I'll push some optimizations for ArgMax/Min next release.

AlexRibard commented 3 years ago

FYI @CoaNewco, I replaced nearest_mode with round_prefer_floor and coordinate_transformation_mode with half_pixel, and your model runs and is valid to within 10^-5 compared to onnxruntime.

CoaNewco commented 3 years ago

Dear Alex, on behalf of the NewCo company, we are extremely satisfied with everything Barracuda is offering. We managed to execute two intertwined models in less than 10ms, which is extraordinary. This is all running on PC, and our goal, also our next step, is to push everything to the Android platform (we're talking about real-time semantic segmentation).

I don't have ArgMax inside my model; I'm using TensorExtensions.ArgMax() on the final output. We had some problems implementing ArgMax directly in our model; essentially, we are using the TensorFlow framework to create a model which we then export to ONNX and use in Unity/Barracuda. We are extremely satisfied with Barracuda's inference time. Compared to the OpenCVforUnity plug-in's implementation and usage of models, we are currently recording 30 times faster inference. Our main problem is definitely ArgMax, which is running on the CPU for now. Can I replace the attribute values inside of Unity? And would you be so kind as to share the corrected model with me?

AlexRibard commented 3 years ago

Yes, ok, that's what I thought. Here is what you can do:

Model model = ModelLoader.Load(...);

// appending an additional layer
ModelBuilder modelBuilder = new ModelBuilder(model);
// Reducing over H and W (NHWC, so axis:1 then axis:2)
modelBuilder.Reduce(Layer.Type.ArgMax, "output_argmax1", "output", axis:1); // NHWC -> N1WC
// maybe you need to reduce over two dimensions
modelBuilder.Reduce(Layer.Type.ArgMax, "output_argmax2", "output_argmax1", axis:2); // N1WC -> N11C
modelBuilder.Output("output_argmax2");

worker = WorkerFactory.CreateWorker(workerType, modelBuilder.model);

let me know

AlexRibard commented 3 years ago

Also, if you want a bit more perf, try changing ComputeInfo.channelsOrder to ChannelsOrder.NCHW; it should speed up convolutions. FYI, this mode is a bit experimental, so do check that the inference is correct. (I've checked with your model; things look fine, however.)

CoaNewco commented 3 years ago

Dear Alex, I've had some problems with the modification of my model. TBH, I haven't used onnxruntime. We export our models from PyTorch directly to ONNX, and in the PyTorch framework we don't have the option to change nearest_mode and coordinate_transformation_mode. I've been having some trouble setting up onnxruntime, so I would be much obliged if you could pass me the modified model or point me to some documentation that could help me get things rolling. Thank you VERY MUCH for all the help so far!

AlexRibard commented 3 years ago

Ok, two parts. For editing the model so that it runs, here is the model, and below is the code to edit it: https://drive.google.com/file/d/1EJqSKjL_AqugS2brbL1NP2DbqKsAOHlX/view?usp=sharing

import onnx

model = onnx.load("CoaNewco_decoder.onnx")

graph = model.graph
for node in graph.node:
    if node.op_type in ['Resize']:
        for i in range(len(node.attribute)):
            if(node.attribute[i].name == 'nearest_mode'):
                node.attribute[i].s = b'round_prefer_floor'
            if(node.attribute[i].name == 'coordinate_transformation_mode'):
                node.attribute[i].s = b'half_pixel'
            print(node.attribute[i])

onnx.save(model, "CoaNewco_decoder_2.onnx")

AlexRibard commented 3 years ago

And here is the inference code, appending ArgMax as an op to the model. I checked that it runs on the GPU (the slowish path, as I mentioned, but much faster than CPU). Inference.txt

CoaNewco commented 3 years ago

Thank you greatly @AlexRibard ! I will set it all up and get back to you!

CoaNewco commented 3 years ago

Hello @AlexRibard. We've spent the last week trying to set it all up, and we've had some successes and some difficulties. We used the code you provided to implement ArgMax inside our model. That worked like a charm, but we still have some problems reading the values from ArgMax, so the benefits aren't all that great. With the current model, our output is 38x23, so we still have 874 values that we're putting inside a Mat (OpenCV for Unity plugin), and we're still losing between 100-140ms, depending on the frame. Is there some way of plucking out the values for individual axes? Or any other way whatsoever of optimizing this step?
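For reference, the per-pixel class lookup described here can be expressed as a single vectorized argmax over the channel axis. A minimal NumPy sketch, assuming an NHWC output of 23x38 pixels and a hypothetical 150 class channels (the real channel count may differ):

```python
import numpy as np

# Hypothetical decoder output: batch 1, 23x38 spatial, 150 class scores (NHWC).
logits = np.random.rand(1, 23, 38, 150).astype(np.float32)

# One argmax over the channel axis collapses the scores to a single
# class index per pixel: shape (1, 23, 38), i.e. 874 values total.
class_map = logits.argmax(axis=-1)
print(class_map.shape)
```

Done this way on the final tensor, the 874 indices come out in one contiguous array that can be copied into a Mat without per-value loops.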

Concerning the layers, we've stumbled into some problems using the Resize layer. Even after setting the proper modes for the Resize attributes (coordinate_transformation_mode to half_pixel and nearest_mode to round_prefer_floor), our execution time jumps up to 4500ms. A similar decoder model without Resize layers works excellently, and both models combined run in as little as 3-4ms.

On another point: we've tried using Upsample instead of the Resize layer. First of all, we get an 'undefined' shape for output[, , , ], and second, we have a problem with scales being of type float32[] while Barracuda needs an integer type. How can we fix this problem? Do you know how big the result difference may be? And how can we amortize the impact on results?

Bearing in mind that our goal is real-time image semantic segmentation, we currently have really big issues with models containing Resize or Upsample layers. Other than these issues, we're utterly satisfied with Barracuda and really looking forward to fixing them, working together or individually.

Coa, with the whole NewCo team, thank you a lot. We are approaching the finalization of the product involving Barracuda and neural networks, so once we get direction from you, we will be closer to the release date of the application.

AlexRibard commented 3 years ago

A few points: I would need to fully profile your network to be sure about the numbers, but let me try to answer them.

AlexRibard commented 3 years ago

I'll give a shot at profiling your model to see where the bottlenecks are.

AlexRibard commented 3 years ago

@CoaNewco update on this. I've profiled your model:

I think all of these are on us, really. There isn't that clean a workaround for the Softmax. The transposes should be automatically removed by us, but at the moment Softmax is blocking them.

CoaNewco commented 3 years ago

Dear @AlexRibard, first of all, many thanks for all your help so far.

In this link I have shared two models with you using WeTransfer. There you can find the encoder and decoder ONNX models that we're using for semantic segmentation. Bear in mind that we've converted these models from PyTorch using the torch.onnx module in Python. Training and re-training will be done in PyTorch as well. The ultimate goal is real-time semantic segmentation on mobile devices.

Considering everything said so far, when we import our PyTorch-exported ONNX models in Unity we run into some problems. First of all, our inference time is between 7000-8000ms. We also get slightly different precision results when running our code in Python vs. Unity.

The models we've shared with you definitely work, but we have an attribute mismatch on the Resize layer (coordinate_transformation_mode is originally pytorch_half_pixel and the nearest_mode attribute is floor by default). Also, when we tried replacing Resize with Upsample we had problems with the scales data type, which is natively float32[] while Barracuda expects integer values; you're aware of this too. When we modify the Resize layer attributes using the code you've provided, we also get different results compared to the Python code. PyTorch currently does not support the round_prefer_floor and half_pixel attribute values, so there's that. Do you have any pointers on how we can modify our models to get closer to the Python results without changing the already-trained model structure (at least not drastically)? This model structure is going to be implemented in our final product.

Thank you for all your time and patience. We're eager to hear from you soon!

Cheers from Coa with all of NewCo team :)

AlexRibard commented 3 years ago

Ok, thanks. I'll take a look at it.

CoaNewco commented 3 years ago

UPDATE

Good day @AlexRibard, sorry if I'm bothering you, but I just wanted to give you a little update on what we've done. In this link you can find the updated decoder model. We've removed the softmax from our model to gain some performance, and so far it's looking better. Could you share any advice on how to squeeze out further optimizations, bearing in mind that our goal is to push the models with all of the post-processing to mobile devices (Android and iOS)? So far the performance is really slow (testing on a Samsung S20 Ultra and a Huawei P30 Pro; we haven't done any testing on iOS so far).

And lastly, could you give me any pointers on how to use StartManualSchedule() effectively? We had no luck finding any closer explanation in Barracuda docs.

Once again, sorry if I'm bothering or overwhelming you, and again, thanks for all your help so far!

AlexRibard commented 3 years ago

Thanks. I'll check your new model. I've optimized softmax a bit for your last model. It's now running better, but not at mobile speed, I'm afraid.

AlexRibard commented 3 years ago

The main performance problem with your model is layer 135, a Conv: it's doing a convolution on a 1x80x80x4096 input with a 3x3x4096x512 kernel. That number of channels results in a very costly operation: it takes 235ms, compared to layer 111 (512 -> 150 channels). Layer 73, for example (2048 -> 512), takes 0.22ms.

We could probably have a more optimized operation for this case (high input channels, low-ish spatial dims and output channels), but in your case it will most likely still be slow on mobile. I'd recommend splitting up that convolution into smaller convolutions (channel-wise) if possible.

AlexRibard commented 3 years ago

StartManualSchedule details can be found here: https://github.com/Unity-Technologies/barracuda-release/blob/release/1.0.0/Documentation~/ModelExecution.md The way to use it is the following:

    private Tensor ExecuteInParts(IWorker worker, Tensor I, int syncEveryNthLayer = 5)
    {
        var schedule = worker.StartManualSchedule(I);
        var it = 0;
        bool hasMoreWork;

        do
        {
            hasMoreWork = schedule.MoveNext();
            if (++it % syncEveryNthLayer == 0)
                worker.FlushSchedule(blocking:true);

        } while (hasMoreWork);

        return worker.CopyOutput();
    }

This schedules the whole model but synchronizes with the GPU every 5 layers. In your case it would not help that much, because it's really that one op, layer 135, that is the bottleneck.

CoaNewco commented 3 years ago

Thank you a lot @AlexRibard. One last (hopefully, and big sorry if I'm annoying) question for you: how do you debug models? How can I check which layer takes how much time, etc.? Every bit of information is important. Thanks again.

AlexRibard commented 3 years ago

We should probably have a dedicated page in the docs explaining this. You need to use the Unity Profiler: https://docs.unity3d.com/Manual/Profiler.html It can show CPU/GPU timings and memory use. We have counters per layer, so you can get a sense of what takes what amount of time. You can also profile on mobile: https://docs.unity3d.com/Manual/profiler-profiling-applications.html although on Apple you need Xcode.

CoaNewco commented 3 years ago

Thank you again for all your advice.