NickSwardh / YoloDotNet

YoloDotNet - A C# .NET 8.0 project for Classification, Object Detection, OBB Detection, Segmentation and Pose Estimation in both images and videos.

Benchmarking GPU vs. CPU: Unexpected Results #11

Open NickSwardh opened 4 months ago

NickSwardh commented 4 months ago

Originally posted by @niclasastrom in https://github.com/NickSwardh/YoloDotNet/issues/9#issuecomment-2024052658

Performance using the GPU is worse than using the CPU. I have an RTX 4070 running Windows 11 Pro, with the latest OS and NVIDIA driver updates installed.

I expected higher throughput when using the GPU, but I could be wrong. What performance can I expect, CPU vs GPU?

For example, the classification test took 130ms on the CPU and 572ms on the GPU. Do you know if this is expected?

I added a couple of lines to measure compute time:

var stopWatch = new Stopwatch();
stopWatch.Start();
List<Classification> results = yolo.RunClassification(image, 3); // Get top 3 classifications. Default = 1
stopWatch.Stop();
Console.WriteLine("Elapsed time: " + stopWatch.ElapsedMilliseconds + " ms");

Thanks for your input. If this follow-up question doesn't fit the topic, please forgive me and I'll file it somewhere else.

NickSwardh commented 4 months ago

This is normal behavior when running the very first inferences on the GPU. During startup, the CPU has to prepare and prime the GPU by copying the inputs over, which increases the execution time of that first run. Subsequent inferences run as fast as expected.

Quote from the ONNX Runtime docs about CPU vs. GPU execution:

When working with non-CPU execution providers, it’s most efficient to have inputs (and/or outputs) arranged on the target device (abstracted by the execution provider used) prior to executing the graph (calling Run()). When the input is not copied to the target device, ORT copies it from the CPU as part of the Run() call. Similarly, if the output is not pre-allocated on the device, ORT assumes that the output is requested on the CPU and copies it from the device as the last step of the Run() call. This eats into the execution time of the graph, misleading users into thinking ORT is slow when the majority of the time is spent in these copies.

As the docs also state, this can be addressed by allocating memory on the GPU prior to execution.
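For reference, the sketch below shows roughly what device-side allocation looks like with ONNX Runtime's C# IoBinding API. This is a hedged illustration, not YoloDotNet code: the model path, the "images"/"output0" tensor names, the 1x3x640x640 shape, and the exact OrtMemoryInfo construction are placeholder assumptions that may vary by model and ORT version.

using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Create a CUDA-backed session (device 0).
using var options = SessionOptions.MakeSessionOptionWithCudaProvider(deviceId: 0);
using var session = new InferenceSession("model.onnx", options);

// Pre-allocate the input tensor once and bind it, so Run() doesn't
// have to set it up from scratch on every call.
// Placeholder name/shape: "images", 1x3x640x640.
var input = new DenseTensor<float>(new[] { 1, 3, 640, 640 });
using var inputValue = FixedBufferOnnxValue.CreateFromTensor(input);

using var binding = session.CreateIoBinding();
binding.BindInput("images", inputValue);

// Ask ORT to allocate the output on the CUDA device instead of
// copying it back to the CPU as part of Run().
binding.BindOutputToDevice("output0",
    new OrtMemoryInfo(OrtMemoryInfo.allocatorCUDA, OrtAllocatorType.DeviceAllocator, 0, OrtMemType.Default));

using var runOptions = new RunOptions();
session.RunWithBinding(runOptions, binding);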

I'm currently adding a new option to YoloDotNet to prime the GPU with allocated memory before execution. In my own tests with my RTX 3060, I get these approximate results for the first inference:

Classification:

CPU: 66 ms
GPU without allocated GPU memory: 541 ms
GPU with allocated GPU memory: 13 ms

Object Detection:

CPU: 245 ms
GPU without allocated GPU memory: 4715 ms
GPU with allocated GPU memory: 60 ms
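Until that option is published, you can get representative timings today by running one untimed warm-up inference before starting the stopwatch. A minimal sketch, reusing the yolo and image objects from your snippet (the 10-run average is an arbitrary choice):

using System.Diagnostics;

// Warm-up: the first call absorbs the one-time GPU priming cost.
_ = yolo.RunClassification(image, 3);

// Now time steady-state inference over several runs.
const int runs = 10;
var stopWatch = Stopwatch.StartNew();

for (var i = 0; i < runs; i++)
    yolo.RunClassification(image, 3);

stopWatch.Stop();
Console.WriteLine($"Average: {stopWatch.ElapsedMilliseconds / (double)runs:F1} ms over {runs} runs");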

niclasastrom commented 4 months ago

Wow! That's impressive! Thanks for the explanation. I will certainly download the new version as soon as it becomes available.