GPU Performance Worse than CPU Performance

toita86 commented 1 month ago

Hello,

I am currently using the YoloDotNet NuGet package to test the performance of YOLO models for my degree thesis. However, I have encountered an issue where GPU performance is significantly worse than CPU performance.

Environment:

YoloDotNet version: v2.0
GPU: 4090
CUDA/cuDNN version: cuda 11.8 and cudnn 8.9.7
.NET version: 8

Steps to Reproduce:

var sw = new Stopwatch();
for (var i = 0; i < 500; i++)
{
    var file = $@"C:\Users\Utente\Documents\assets\images\input\frame_{i}.jpg";

    using var image = SKImage.FromEncodedData(file);
    sw.Restart();
    var results = yolo.RunObjectDetection(image, confidence: 0.25, iou: 0.7);
    sw.Stop();
    image.Draw(results);

    image.Save(file.Replace("input", $"output_{yolo_version}{version}_{target}").Replace(".jpg", $"_detect_{yolo_version}{version}_{target}.jpg"),
        SKEncodedImageFormat.Jpeg);
    times.Add(sw.Elapsed.TotalMilliseconds);
    Console.WriteLine($"Time taken for image {i}: {sw.Elapsed.TotalMilliseconds:F2} ms");
}

This is how I measure the time taken by each detection.

Expected behavior: inference on the GPU should be faster than inference on the CPU. However, performance does not improve when using the GPU.

To load the model, I use this setup in the GPU case:

yolo = new Yolo(new YoloOptions
{
    OnnxModel = @$"C:\Users\Utente\Documents\assets\model\yolov{yolo_version}{version}_{target}.onnx",
    ModelType = ModelType.ObjectDetection,  // Model type
    Cuda = true,                           // Use CPU or CUDA for GPU accelerated inference. Default = true
    GpuId = 0,                               // Select Gpu by id. Default = 0
    PrimeGpu = true,                       // Pre-allocate GPU before first inference. Default = false
});
Console.WriteLine(yolo.OnnxModel.ModelType);
Console.WriteLine($"Using GPU for version {yolo_version}{version}");

Performance Metrics:

GPU Inference Time: Total time taken for version m: 21124.52 ms Average time per image for version m: 42.25 ms

CPU Inference Time: Total time taken for version m: 18869.73 ms Average time per image for version m: 37.74 ms

@NickSwardh I would appreciate any assistance or guidance in resolving this issue. Please let me know if you need any further information.

Thank you.

toita86 commented 1 month ago

To visualize the problem better, I have plotted these graphs:

[Images: time_taken_comparison_yolo8_CPU, time_taken_comparison_yolo8_GPU]

toita86 commented 1 month ago

@NickSwardh sorry to bother you, but I'm a little bit in a hurry.

NickSwardh commented 1 month ago

Hi @toita86, I'm currently out of town for a couple of weeks. Weird, I've never seen this before. Are you using CUDA 11.8 and cuDNN 8.9?

Try disabling the GPU memory allocation (PrimeGpu = false) and let ONNX Runtime allocate memory automatically on the first inference. This is the default behaviour. It makes the very first inference take longer because of the allocation process, but every inference after that will be fast. Measure your performance from the second inference onward and see how that goes.
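One way to measure from the second inference onward is to run a few untimed warm-up calls before starting the stopwatch. The following is a minimal sketch, not part of YoloDotNet: `MeasureSteadyState` and `runInference` are hypothetical names, and the `Thread.Sleep` delegate is only a placeholder for the real `yolo.RunObjectDetection(...)` call.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

// Hypothetical stand-in for the real inference call, e.g.
// () => yolo.RunObjectDetection(image, confidence: 0.25, iou: 0.7)
Action runInference = () => Thread.Sleep(1);

List<double> timings = MeasureSteadyState(runInference, warmupRuns: 1, measuredRuns: 20);
Console.WriteLine($"Average after warm-up: {timings.Average():F2} ms over {timings.Count} runs");

// Runs the action warmupRuns times without timing it, then times measuredRuns runs.
static List<double> MeasureSteadyState(Action action, int warmupRuns, int measuredRuns)
{
    for (var i = 0; i < warmupRuns; i++)
    {
        action(); // the first call(s) include one-off allocation cost and are not recorded
    }

    var times = new List<double>(measuredRuns);
    var sw = new Stopwatch();
    for (var i = 0; i < measuredRuns; i++)
    {
        sw.Restart();
        action();
        sw.Stop();
        times.Add(sw.Elapsed.TotalMilliseconds);
    }
    return times;
}
```

Keeping the warm-up outside the recorded list means the average reflects only steady-state inference, whichever value PrimeGpu has.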

toita86 commented 1 month ago

Thank you for your suggestion and for your time. This behaviour was really confusing at first. The first runs were really good, with performance as expected. After two weeks, this anomaly started. I will post updates as soon as possible!

toita86 commented 1 month ago

@NickSwardh A small update: no luck, I get the same output with PrimeGpu = false. Inference on the GPU still performs worse.

toita86 commented 1 month ago

I have tried running the test on a different machine (with a fresh installation of Windows, CUDA and cuDNN), but it shows the same anomalous behaviour. It's driving me crazy, to be honest. I suspect ONNX Runtime, but I really don't know how to test that or how to fix it.

@NickSwardh

toita86 commented 1 month ago

@NickSwardh I am really sorry to disturb you, but I have looked a little deeper: approximately the first 50-60 inferences perform well, and then performance slowly degrades until it stabilizes at around 75-77 ms per inference. [Image: photo_2024-07-29_13-15-34]

Is it possible that this has something to do with the memory allocation of the model or of the images?
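A quick way to confirm this kind of gradual slowdown is to average the recorded timings in fixed-size blocks instead of over the whole run. The sketch below uses synthetic numbers purely for illustration (60 fast runs followed by slower ones); `AverageByBlock` is a hypothetical helper, and in practice the `times` list filled by the benchmark loop would be passed in.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

const int BlockSize = 50;

// Synthetic timings for illustration only: 60 fast runs, then 140 slower runs.
List<double> times = Enumerable.Repeat(25.0, 60)
    .Concat(Enumerable.Repeat(76.0, 140))
    .ToList();

List<double> blockAverages = AverageByBlock(times, BlockSize);
for (var b = 0; b < blockAverages.Count; b++)
{
    var lastRun = Math.Min((b + 1) * BlockSize, times.Count) - 1;
    Console.WriteLine($"Runs {b * BlockSize} to {lastRun}: {blockAverages[b]:F2} ms");
}

// Splits the values into consecutive blocks of the given size and averages each block.
static List<double> AverageByBlock(IReadOnlyList<double> values, int size)
{
    var averages = new List<double>();
    for (var start = 0; start < values.Count; start += size)
    {
        var count = Math.Min(size, values.Count - start);
        averages.Add(values.Skip(start).Take(count).Average());
    }
    return averages;
}
```

If the per-block averages climb and then level off, the slowdown is building up over time rather than being a fixed per-call cost, which points toward resource accumulation.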

toita86 commented 1 month ago

Hi I have an update.

Thanks to @ChristophRackwitz on Stack Overflow, who suggested disabling this piece of code:

image.Draw(results);

image.Save(file.Replace("input", $"output_{yolo_version}{version}_{target}").Replace(".jpg", $"_detect_{yolo_version}{version}_{target}.jpg"),
    SKEncodedImageFormat.Jpeg);

After disabling this part, the timings per detection were on point. I'm not really sure why, but I suspect that SkiaSharp's I/O operations used a huge amount of memory, slowing down the transfer of images from the CPU to the GPU.

This is the code I have written so that as much as possible is prepared in advance and the loop does nothing but detections.

I have also forced the disposal of the images after the loop, to be sure the memory is freed before the next model's inference.

// Load all images into memory
for (var i = 1; i < 500; i++)
{
    string file = $@"C:\Users\Eduard\Documents\assets\images\input\frame_{i}.jpg";
    using var stream = new FileStream(file, FileMode.Open, FileAccess.Read);
    var image = SKImage.FromEncodedData(stream);
    images.Add(i, image);
}

// Run inference on in-memory images
foreach (var kvp in images)
{
    int i = kvp.Key;

    sw.Restart();
    var results = yolo.RunObjectDetection(kvp.Value, confidence: 0.25, iou: 0.7);
    sw.Stop();
    Console.WriteLine($"Time taken for image {i}: {sw.Elapsed.TotalMilliseconds:F2} ms");
    rawResults.Add((i, results, sw.Elapsed.TotalMilliseconds));
}

// Dispose of images to free memory
foreach (var kvp in images)
{
    kvp.Value.Dispose();
}

// Process results outside the loop
foreach (var rawResult in rawResults)
{
    int imageIndex = rawResult.imageIndex;
    var results = rawResult.results;
    double elapsedMilliseconds = rawResult.elapsedMilliseconds;

    times.Add(elapsedMilliseconds);

    if (results.Count > 0)
    {
        var labels = new List<string>();
        var confidences = new List<double>();
        foreach (var result in results)
        {
            labels.Add(result.Label.Name.ToString());
            confidences.Add(result.Confidence);
        }
        detections.Add((
           imageIndex,
           labels,
           confidences
         ));
    }
}

Now the times for YOLOv8 size m are:

Total execution time: 11825.34 ms
Average time per image: 23.70 ms

toita86 commented 1 month ago

@NickSwardh When you have some time, a more in-depth explanation would be really helpful for understanding the situation better.

toita86 commented 1 month ago

@NickSwardh, sorry to bother you again, but I have one last question before closing this issue. Is there any particular reason why yolov9 is not compatible? or not implemented?

toita86 commented 2 weeks ago

Hi @NickSwardh, I want to be quick; when you have time, please answer my question about YOLOv9. I need it to write a paragraph for my thesis.

NickSwardh commented 1 week ago

Hi @toita86,

I have tried to reproduce the results you are getting, but to no avail. I noticed one thing in your code that might cause the problems you're having (the snippet you commented out):

image.Draw(results);

image.Save(file.Replace("input", $"output_{yolo_version}{version}_{target}").Replace(".jpg", $"_detect_{yolo_version}{version}_{target}.jpg"),
    SKEncodedImageFormat.Jpeg);

The issue is that you are missing using statements, which might cause problems since the objects are not disposed of. Many SkiaSharp objects must be disposed of when you're done with them to prevent unexpected behaviour and performance issues. A using statement ensures the correct use of an IDisposable instance. Another problem is that you're not drawing the results on your image correctly.

When loading an image, make sure you either add a using statement or invoke Dispose() on the objects manually, so that resources are released and nothing unexpected happens, e.g.:

// Create a new yolo instance with a using statement to ensure the object is disposed of correctly
using var yolo = new Yolo(options);

// Load image with a using-statement to ensure the object is disposed of correctly
using var image = SKImage.FromEncodedData(file);

...

The same goes for drawing the detected results on your image, e.g.:

// Draw the results to a new image with a using statement to ensure the object is disposed of correctly
using var resultImage = image.Draw(results);
resultImage.Save($@"C:\Users\Eduard\Documents\assets\images\input\frame_{i}.jpg", SKEncodedImageFormat.Jpeg);
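To illustrate why the using declaration matters inside a loop, here is a small self-contained sketch. It uses `MemoryStream` as a stand-in for a SkiaSharp object (an assumption made so the example runs without extra packages) and shows that an object declared with `using var` in the loop body is disposed at the end of every iteration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Keep references only so we can inspect the streams after the loop.
var seen = new List<MemoryStream>();

for (var i = 0; i < 3; i++)
{
    // "using var" disposes the object when the enclosing scope ends,
    // which here is the end of each loop iteration.
    using var buffer = new MemoryStream(new byte[1024]);
    seen.Add(buffer);
    Console.WriteLine($"Iteration {i}: open inside the loop body = {buffer.CanRead}");
}

// After the loop every stream has been disposed, so CanRead is false.
bool allDisposed = seen.All(s => !s.CanRead);
Console.WriteLine($"All disposed after the loop: {allDisposed}");
```

Without the using declaration, each iteration would leave its object alive until the garbage collector gets around to it, which for objects backed by native memory can mean resources piling up over hundreds of frames.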

Is there any particular reason why yolov9 is not compatible? or not implemented?

I'll add it to the next update ;)

toita86 commented 1 week ago

Hi @NickSwardh,

Thank you so much for the suggestion! After a more in-depth look, it was exactly a memory management problem. This was one of my first attempts at using C#, so I made many mistakes. However, I ended up with a nice performance and consistency.

I'll add it to the next update ;)

Perfect, just what I had imagined.

I'm putting the finishing touches on my thesis. These answers complete the evaluation. Thank you again!

NickSwardh commented 1 week ago

That's awesome, @toita86, thank you for letting me know :)

NickSwardh / YoloDotNet

GPU Performance Worse than CPU Performance #19