Unity-Technologies / barracuda-release


Tensor.AsFloats() getting 2 floats twice slower than Execute time of custom CNN #251

Closed Tigran1983 closed 2 years ago

Tigran1983 commented 2 years ago

Hello dear Team,

Is there any way to get the output of a neural network faster?

For my custom networks, getting 1 or 2 floats of network output takes 2-3 times longer than executing the networks themselves. I have done a lot of research but still can't find a way around this. Please let me know how to solve this issue.

That can't be right: network execution takes 3 ms, while output.AsFloats() (returning only 2 floats) takes 6-7 ms.

Best Regards, Tigran

AlexRibard commented 2 years ago

Are you executing the model on the GPU and measuring the inference speed on the CPU? (I'll assume yes for both; feel free to provide more details about your use case.)

If you are measuring execution time with a CPU clock, you are only measuring the CPU overhead of scheduling the model. The GPU execution time is not captured, since it runs asynchronously in the background; you would only capture part of that cost if inference lasts more than ~30 ms.

By downloading the data from the GPU to the CPU on the same frame, you are forcing the CPU to wait for the GPU to finish and then for the download. So the time you are measuring now captures the whole execution cost, which differs greatly from the scheduling cost you measured previously.

You can check the total inference time with the Unity Profiler: it shows both the CPU time (model scheduling cost) and the GPU time (model execution time). These should more or less match the time you are reading with the download included.
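To make the distinction concrete, here is a minimal sketch of the two measurements (`m_Worker` and `input_wrist` are placeholder names for an existing worker and input tensor):

```csharp
// Timing 1: Execute() returns after *scheduling* the work, so a CPU clock
// around it only captures the scheduling overhead.
var t0 = Time.realtimeSinceStartup;
m_Worker.Execute(input_wrist);
Debug.Log($"Schedule: {(Time.realtimeSinceStartup - t0) * 1000f} ms");

// Timing 2: downloading the output forces the CPU to wait until execution
// has finished, so this also captures the execution + readback cost.
var t1 = Time.realtimeSinceStartup;
Tensor output = m_Worker.PeekOutput();
float[] values = output.AsFloats(); // blocking sync point
Debug.Log($"Download: {(Time.realtimeSinceStartup - t1) * 1000f} ms");
```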

AlexRibard commented 2 years ago

If you could share your model, we can check it out and see if there are any issues with it, or if there are optimizations we can do to make inference faster for you. On what hardware are you running inference?

Tigran1983 commented 2 years ago

Hello, many thanks for the quick response. I am sending the ONNX file. I am measuring on the CPU, in real time in Unity code, using Time.realtimeSinceStartup. I measure single lines of code:

```csharp
var temp = Time.realtimeSinceStartup;
m_Worker.Execute(input_wrist);
Print(Time.realtimeSinceStartup - temp);
```

and

```csharp
var temp = Time.realtimeSinceStartup;
float[] outp = Output.AsFloats(); // or Output.ToReadOnlyArray();
Print(Time.realtimeSinceStartup - temp);
```

And I created the worker as shown below:

```csharp
m_Worker = WorkerFactory.CreateWorker(m_RuntimeModel, WorkerFactory.Device.CPU);
```

because this gives the fastest inference time of the methods I tried; for example, it is faster than:

```csharp
m_Worker = WorkerFactory.CreateWorker(WorkerFactory.Type.CSharp, m_RuntimeModel);
```

epoch_3300.zip

Best Regards, Tigran

Tigran1983 commented 2 years ago

Hello dear team,

As you can see from my example, Tensor.AsFloats() takes 2-3 times longer than the network inference itself, and this is the difference compared with OpenCVDnn. If you can resolve this, then in my opinion the total inference time with Barracuda will be no slower than with OpenCVDnn. Please let me know if you resolve this issue.

Best Regards, Tigran

FlorentGuinier commented 2 years ago

Hi @Tigran1983

I would advise looking into what exactly is taking the time, using the Unity Profiler: https://docs.unity3d.com/Manual/Profiler.html.

However, the behavior you see is probably expected: Worker.Execute() on the CPU (as above) uses Burst and only schedules the work in the form of many jobs. When AsFloats() is called, the main thread is forced to block until all those jobs are done, which results in the main thread being actively stalled and thus the timing you see.
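One way to avoid paying that stall is to schedule the work early and read the result back only once the jobs have had time to finish. A minimal sketch, assuming your Barracuda version exposes Tensor.PrepareCacheForAccess for a non-blocking readiness check:

```csharp
// Schedule the Burst jobs; this call returns quickly.
m_Worker.Execute(input_wrist);
Tensor output = m_Worker.PeekOutput();

// ...do other main-thread work here instead of blocking immediately...

// Non-blocking poll: returns true once the data is available on the CPU.
if (output.PrepareCacheForAccess(blocking: false))
{
    float[] values = output.AsFloats(); // no stall: jobs already completed
}
```

Reading the output later in the frame (or on the next frame) hides the job completion time instead of stalling the main thread for it.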

Hope it helps, Florent

Tigran1983 commented 2 years ago

Hi Florent,

In this case, if Execute() is only scheduling jobs, then why does this process take about 2 ms? That is a very long time just for scheduling work. In general I understand that there is no way to optimize the AsFloats() call itself (for example, for my attached network). You can close the issue.

Thanks and Regards, Tigran

FlorentGuinier commented 2 years ago

@Tigran1983

if Execute() is only scheduling jobs, then why does this process take about 2 ms?

Very good point! I looked at how the model performs locally, and the 2 ms actually come from a memory optimization we perform when executing Conv2D on the CPU via Burst. In short, our main convolution algorithm uses temporary memory buffers that it explicitly releases after ensuring the related jobs are done. This is a tradeoff to keep peak memory low during execution of the model, but it does create synchronization points between the main thread and the worker threads. This explains the time you see on the main thread even without requesting the output of the network. On the bright side, on my system (i7) with this network, I can see that the worker threads still have quite high occupancy.
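If that ~2 ms of main-thread scheduling matters for your frame budget, the scheduling itself can also be spread across several frames. A sketch assuming the IWorker.StartManualSchedule API, inside a MonoBehaviour coroutine:

```csharp
using System.Collections;
using Unity.Barracuda;
using UnityEngine;

public class AmortizedInference : MonoBehaviour
{
    IWorker m_Worker; // created elsewhere, e.g. via WorkerFactory.CreateWorker

    IEnumerator RunInference(Tensor input, int layersPerFrame = 5)
    {
        // StartManualSchedule yields once per layer instead of scheduling
        // the whole network in one go on the main thread.
        IEnumerator schedule = m_Worker.StartManualSchedule(input);
        int scheduled = 0;
        while (schedule.MoveNext())
        {
            if (++scheduled % layersPerFrame == 0)
                yield return null; // continue scheduling on the next frame
        }

        // PeekOutput's tensor is owned by the worker; no Dispose needed here.
        Tensor output = m_Worker.PeekOutput();
        float[] values = output.AsFloats(); // sync point, cost now amortized
        Debug.Log($"Got {values.Length} output values");
    }
}
```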

In general I understand that there is no way to optimize the AsFloats() call

We are always looking to optimize Barracuda in terms of both speed and memory, and you can expect general improvements in future releases :) That being said, the current behavior is indeed as expected, so I'm closing the issue. Thanks a lot for your feedback!