Closed: CoaNewco closed this issue 3 years ago.
Hi CoaNewco,
In the Resize node ONNX definition (https://github.com/onnx/onnx/blob/master/docs/Operators.md#Resize) there is a nearest_mode parameter. This warning means we only support round_prefer_floor for it. In other words, if the network is using something else, like round_prefer_ceil, we will still apply round_prefer_floor (and thus the warning). A workaround, if possible, is to change the network to use round_prefer_floor on those Resize nodes; is that something you can do?
Have a great day, Florent
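For intuition, the two nearest_mode rounding rules only disagree at exact half-way coordinates, which is why the round_prefer_floor fallback usually (but not always) produces identical output. A small sketch following the ONNX Resize definitions (the helper names are mine, not part of any API):

```python
import math

def round_prefer_floor(x):
    # ONNX Resize nearest_mode=round_prefer_floor: halves round down
    return math.ceil(x - 0.5)

def round_prefer_ceil(x):
    # ONNX Resize nearest_mode=round_prefer_ceil: halves round up
    return math.floor(x + 0.5)

# The two modes agree everywhere except at .5 coordinates:
print(round_prefer_floor(2.3), round_prefer_ceil(2.3))  # 2 2
print(round_prefer_floor(2.5), round_prefer_ceil(2.5))  # 2 3
```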
I will get right back to you. Thanks for the advice!
On an unrelated issue, can I truncate or somehow optimize the ArgMax() function? Currently it's giving me the biggest overhead, 100+ milliseconds. Both of my networks run in the 5-10 millisecond range, so ArgMax is giving me the biggest headache. I'm trying to lower the execution time of the whole code to approximately 40ms or lower.
Hmm, interesting. There are a number of performance caveats with ArgMax() at the moment; for example, on the Compute/GPU backends it introduces extra transposes when the reduction is not done over the channel axis. However, those timings seem worse than that would explain. Is it possible for you to share your network, please? I would like to take a look.
Also what backend are you using?
Thanks:)
I've shared the decoder.onnx net on WeTransfer. I'm using the ComputePrecompiled backend. Any advice on these issues is more than welcome. Thank you a lot!
@CoaNewco I do not see ArgMax in your model. Are you using Tensor.ArgMax on the final output? If you append an ArgMax directly in your model, it will use the GPU backend and not the CPU one. Can you tell me if that is optimized enough for you? I've been putting off optimizations for ArgMax due to time constraints for 1.3.3, so if the current implementation is not fast enough I'll push some optimizations for ArgMax/Min next release.
FYI @CoaNewco: I replaced nearest_mode with round_prefer_floor and coordinate_transformation_mode with half_pixel; your model now runs and is valid to 10^-5 compared to onnxruntime.
Dear Alex, on behalf of the NewCo company, we are extremely satisfied with everything Barracuda is offering. We managed to execute two intertwined models in less than 10ms, which is extraordinary. This is all running on PC, and our goal, also our next step, is to push everything to the Android platform (we're talking about real-time semantic segmentation).
I don't have ArgMax inside my model. I'm using TensorExtensions.ArgMax() on the final output. We had some problems implementing ArgMax directly in our model; essentially, we are using the TensorFlow framework to create a model which we then export to onnx and use in Unity/Barracuda. We are extremely satisfied with Barracuda's inference time. Compared to the OpenCVforUnity plug-in's implementation and usage of models, we are currently recording 30 times faster inference. Our main problem is definitely ArgMax, which is running on the CPU for now. Can I replace attribute values inside of Unity? And would you be so kind as to share the corrected model with me?
Yes, OK, that's what I thought. Here is what you can do:
Model model = ModelLoader.Load(...);
// appending additional ArgMax layers
ModelBuilder modelBuilder = new ModelBuilder(model);
// Reducing over H and W (NHWC, so axis=1 then axis=2)
modelBuilder.Reduce(Layer.Type.ArgMax, "output_argmax1", "output", axis:1); // NHWC -> N1WC
// you may need to reduce over both spatial dimensions
modelBuilder.Reduce(Layer.Type.ArgMax, "output_argmax2", "output_argmax1", axis:2); // N1WC -> N11C
modelBuilder.Output("output_argmax2");
worker = WorkerFactory.CreateWorker(workerType, modelBuilder.model);
let me know
Also, if you want a bit more perf, try changing ComputeInfo.channelsOrder to ChannelsOrder.NCHW; it should speed up convolutions.
FYI, this mode is a bit experimental, so do check that the inference is correct. (I've checked with your model; things look fine, however.)
Dear Alex, I've had some problems with the modification of my model. To be honest, I haven't used onnxruntime. We export our models from PyTorch directly to onnx, and in the PyTorch framework we don't have the option to change nearest_mode and coordinate_transformation_mode. I've been having some trouble setting up onnxruntime, so I would be much obliged if you could pass me the modified model or point me to some documentation that could help me get things rolling. Thank you VERY MUCH for all the help so far!
OK, two parts. For editing the model so that it runs, here are the model and the code to edit it: https://drive.google.com/file/d/1EJqSKjL_AqugS2brbL1NP2DbqKsAOHlX/view?usp=sharing
import onnx

# Load the exported model and rewrite the Resize attributes to the
# modes Barracuda supports
model = onnx.load("CoaNewco_decoder.onnx")
graph = model.graph
for node in graph.node:
    if node.op_type in ['Resize']:
        for i in range(len(node.attribute)):
            if node.attribute[i].name == 'nearest_mode':
                node.attribute[i].s = b'round_prefer_floor'
            if node.attribute[i].name == 'coordinate_transformation_mode':
                node.attribute[i].s = b'half_pixel'
            print(node.attribute[i])
onnx.save(model, "CoaNewco_decoder_2.onnx")
And here is the inference code, appending ArgMax as an op to the model. I checked that it runs on the GPU (the slowish path I mentioned, but much faster than CPU): Inference.txt
Thank you greatly @AlexRibard ! I will set it all up and get back to you!
Hello @AlexRibard. We've used the last week to try to set it all up, and we've had some successes and some difficulties. We used the code you provided to implement ArgMax inside our model. That worked like a charm, but we still have some problems reading the values from ArgMax, so the benefits aren't all that great. With the current model, our output is 38x23, so we still have 874 values that we're putting inside a Mat (OpenCV for Unity plugin), and we're still losing between 100-140ms, depending on the frame. Is there some way of plucking out the values for an individual axis? Or any other way whatsoever of optimizing this step?
Concerning the layers, we've stumbled into some problems using the Resize layer. Even after setting the proper modes for the Resize attributes (coordinate_transformation_mode to half_pixel and nearest_mode to round_prefer_floor), our execution time jumps up to 4500ms. A similar decoder model without Resize layers works excellently, and both models combined run in as little as 3-4ms.
On another point: we've tried using an Upsample layer instead of Resize, and first of all we get an 'undefined' shape for output[, , , ], and second we have a problem with scales being of type float32[] while Barracuda needs an integer type. How can we fix this problem? Do you know how big the difference in results may be? And how can we amortize those impacts on the results?
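On the scales question, it may help to first verify whether a given Resize's float32[] scales are at least numerically whole numbers, since only those could be rewritten as the integers Barracuda expects. A minimal sketch (the helper name and tolerance are my own, not a Barracuda or ONNX API):

```python
def scales_castable_to_int(scales, tol=1e-6):
    # True when every scale factor is numerically a whole number and could
    # therefore be rewritten as an integer scale; fractional scales cannot.
    return all(abs(s - round(s)) < tol for s in scales)

# A typical Resize doubling H and W is castable:
print(scales_castable_to_int([1.0, 1.0, 2.0, 2.0]))  # True
# A fractional upscale is not:
print(scales_castable_to_int([1.0, 1.0, 1.5, 1.5]))  # False
```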
Bearing in mind that our goal is real-time image semantic segmentation, we currently have really big issues with models containing Resize or Upsample layers. Other than these issues, we're utterly satisfied with Barracuda and really looking forward to fixing them, working together or individually.
Coa with whole NewCo, thank you a lot. We are approaching the finalization of the product involving Barracuda and neural networks so when we get the direction from you, we will be closer to the release date of the application.
A few points: I would need to fully profile your network to be certain about the numbers, but let me try to answer them.
100-140ms lost to reading the result: do note that fetching the result implies a sync between the CPU and GPU. So after we schedule the network, you're going to need to wait for the GPU to finish in order to get the result on this frame. If you are not reading the result and everything stays on the GPU, your true execution time might be hidden because of that. One solution would be to have a one-frame delay between the CPU and GPU, i.e. the CPU schedules frame 1; on frame 2, before scheduling again, read the previous frame's GPU result. Or keep everything on the GPU (i.e. dump the tensor into a texture).
For the Resize perf, that is indeed interesting. Did you profile how much time we spend on Resize? Or could it be because of the shape difference?
Float Upsample is not supported at the moment, so there isn't much of a workaround for now. I wouldn't focus too much on Upsample vs Resize; internally they call the same kernel.
I'll give profiling your model a shot to see where the bottlenecks are.
@CoaNewco update on this. I've profiled your model:
- the Softmax at the end is hitting a slow CPU path
- the Transposes flanking it are also expensive
- Resize is not that costly: the last one takes 5ms, which isn't much compared to Conv_81, which takes 33ms
Overall, without those ops the model takes 44ms total.
I think all of these are on us, really. There isn't that clean a workaround for the Softmax. The Transposes should be automatically removed by us, but at the moment the Softmax is blocking them.
Dear @AlexRibard, first of all, many thanks for all your help so far.
In this link I have shared two models with you using WeTransfer. There you can find the encoder and decoder onnx models that we're using for semantic segmentation. Bear in mind that we've converted these models from PyTorch using the torch.onnx module in Python. Training and re-training will be done in PyTorch as well. The ultimate goal is real-time semantic segmentation on mobile devices.
Considering everything said so far: when we import our PyTorch-exported onnx models in Unity, we run into some problems. First of all, our inference time is between 7000-8000ms. We also get slightly different precision results when running our code in Python vs. Unity.
The models we've shared with you definitely work, but we have an attribute mismatch in the Resize layer (coordinate_transformation_mode is originally pytorch_half_pixel and nearest_mode is floor by default). Also, when we tried replacing Resize with Upsample, we had problems with the scales data type, which is natively float32[] while Barracuda expects integer values; you're aware of this too. When we modify the Resize layer attributes using the code you provided, we also get different results compared to the Python code. PyTorch currently does not support the round_prefer_floor and half_pixel attribute values, so there's that. Do you have any pointers on how we can modify our models to get closer to the Python results without changing the already-trained model structure (at least not drastically)? This model structure is going to be implemented in our final product.
Thank you for all your time and patience. We're eager to hear from you soon!
Cheers from Coa with all of NewCo team :)
Ok, thanks. I'll take a look at it.
UPDATE
Good day @AlexRibard, sorry if I'm bothering you, but I just wanted to give you a little update on what we've done. In this link you can find the updated decoder model. We've removed the softmax from our model to gain some performance, and so far it's looking better. Could you share any advice on how to maximize optimization, bearing in mind that our goal is to push the models, with all of the post-processing, to mobile devices (Android and iOS)? So far the performance is really slow (testing on a Samsung S20 Ultra and a Huawei P30 Pro; we haven't done any testing on iOS so far).
And lastly, could you give me any pointers on how to use StartManualSchedule() effectively? We had no luck finding a closer explanation in the Barracuda docs.
Once again, sorry if I'm bothering or overwhelming you, and again, thanks for all your help so far!
Thanks. I'll check your new model.
I've optimized softmax a bit for your last model. It's now running better, but not at mobile speed, I'm afraid.
The main performance problem in your model is layer 135, a Conv: it convolves a 1x80x80x4096 input with a 3x3x4096x512 kernel. That number of channels makes it a very costly operation: it takes 235ms. Compare layer 111 (512 -> 150 channels), or layer 73 (2048 -> 512), for example, which takes 0.22ms.
We could probably have a more optimized operation for this case (high input channels, low-ish spatial dims and output channels). But in your case it will most likely still be slow on mobile. I'd recommend splitting that convolution up into smaller convolutions (channel-wise) if possible.
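To put numbers on why that one layer dominates, a back-of-the-envelope multiply-accumulate count can help. The grouped 3x3 + 1x1 split below is only one hypothetical way to realize the channel-wise splitting suggested above, and it would change the model (so it requires retraining):

```python
def conv_macs(cin, cout, k, h, w, groups=1):
    # Multiply-accumulates for a k x k convolution producing an h x w output
    return h * w * cout * (cin // groups) * k * k

# Layer 135 as described: 80x80 spatial, 4096 -> 512 channels, 3x3 kernel
full = conv_macs(4096, 512, 3, 80, 80)

# A hypothetical channel-wise split: an 8-group 3x3 conv, then a 1x1 conv
# to mix the groups back together
split = conv_macs(4096, 512, 3, 80, 80, groups=8) + conv_macs(512, 512, 1, 80, 80)

print(f"full: {full / 1e9:.1f} GMACs, split: {split / 1e9:.1f} GMACs")
# full: 120.8 GMACs, split: 16.8 GMACs
```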
Details on StartManualSchedule can be found here:
https://github.com/Unity-Technologies/barracuda-release/blob/release/1.0.0/Documentation~/ModelExecution.md
The way to use it is the following:
private Tensor ExecuteInParts(IWorker worker, Tensor I, int syncEveryNthLayer = 5)
{
    var schedule = worker.StartManualSchedule(I);
    var it = 0;
    bool hasMoreWork;
    do
    {
        hasMoreWork = schedule.MoveNext();
        if (++it % syncEveryNthLayer == 0)
            worker.FlushSchedule(blocking:true);
    } while (hasMoreWork);
    return worker.CopyOutput();
}
This will flush the schedule every 5 layers. In your case it would not help that much, because it's really that one op, layer 135, that is the bottleneck.
Thank you a lot @AlexRibard. One last (hopefully, and a big sorry if I'm annoying) question for you: how do you debug models? How can I check which layer takes how much time, and so on? Every bit of information is important. Thanks again.
We should probably have a dedicated page in the docs explaining this. You need to use the Unity Profiler: https://docs.unity3d.com/Manual/Profiler.html It can show CPU/GPU timings and memory use. We have counters per layer, so you can get a sense of what takes what amount of time. You can also profile on mobile: https://docs.unity3d.com/Manual/profiler-profiling-applications.html (although on Apple devices you need Xcode).
Thank you again for all your advice.
When I import an onnx model that incorporates a Resize layer, I get a warning. Now, I know that internally the model gets converted to NN format and only then is it used, so I tried to manually convert and save it using ModelWriter.Save(). When I do this, the Resize layer becomes Resample2D, but I still get warnings that point me to the Resize layer.
When we discard the Resize layers from the network, everything works fine, but the output is truncated and we don't get the wanted results. Could you help me with this issue?