Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Running YOLO-NAS GIL free #1730

Closed: BennySemyonovAB closed this issue 9 months ago

BennySemyonovAB commented 9 months ago

💡 Your Question

Hi, I want to run YOLO-NAS for inference on huge sets of images really, really fast. I'm thinking about running it in a GIL-free interpreter to get a truly multi-threaded environment. Has anybody tried it? Any common issues? Which GIL-free interpreter would you suggest?

thanks!

Versions

No response

BloodAxe commented 9 months ago

"Really Really fast" and "Python" in once sentence is a rare combination. GIL probably isn't a major bottleneck to be honest. A python inference based on pyre pytorch model is never going to even close to what you can get with specialized inference packages. In order to get the maximum inference efficiency my advice would be to export model to ONNX format and then use ONNXRuntime or TensorRT or OpenVINO for maximum inference speed. You can check how to export model here https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/models_export.md and then choose inference engine that fits your needs.

BennySemyonovAB commented 9 months ago

Hi, appreciate the quick answer! We tried using ONNX, but unfortunately our images come in different sizes, which rules out ONNX for us. We can't even process in batches because of inconsistencies in the inference results. It would be helpful to get a better understanding of why "GIL probably isn't a major bottleneck to be honest", so we can come to a better solution in our case. Thanks :) Benny

@BloodAxe I'm not sure how to reopen the issue.

BloodAxe commented 9 months ago

it would be helpful to get a better understanding of why "GIL probably isn't a major bottleneck to be honest"

To start, I should ask: how did you come up with the conclusion that going GIL-free is going to help? Do you have any profiling stats that prove waiting for the GIL lock is actually hurting performance?

Model inference (it's safe to say this is true for any CNN/Transformer model) is a computation-heavy task, so the most likely bottleneck is the CPU or GPU. If you are using a CPU for inference, PyTorch already uses OMP-based multithreading for the computation, so adding multiprocessing may help, but only to some extent; do not expect an N-times increase in requests per second from having N processes instead of 1.

My best suggestion would be to benchmark the code first and see whether it is a CPU-bound task and, if so, what the slowest parts of your pipeline are.
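As a starting point, a rough profiling sketch along these lines (YOLO-NAS S and a random batch are stand-ins for your real pipeline) can show whether the time is spent inside PyTorch/CUDA kernels or in Python-side code:

```python
import torch
from torch.profiler import ProfilerActivity, profile
from super_gradients.common.object_names import Models
from super_gradients.training import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco").eval().to(device)
batch = torch.rand(8, 3, 640, 640, device=device)  # stand-in for a real preprocessed batch

print("intra-op threads:", torch.get_num_threads())  # OMP threads PyTorch already uses on CPU

activities = [ProfilerActivity.CPU] + ([ProfilerActivity.CUDA] if device == "cuda" else [])
with torch.no_grad(), profile(activities=activities, record_shapes=True) as prof:
    for _ in range(10):
        model(batch)

# If the table is dominated by CUDA kernels / CPU GEMM ops, waiting on the GIL is not
# the bottleneck; if Python-side pre/post-processing dominates, parallelism may help.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```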

unfortunately our images come in different sizes, which rules out ONNX for us.

If your input images tend to have different resolutions, then you are either limited to inference with bs=1, or you may actually want to preprocess your inputs to a common size for inference. This actually may greatly improve the performance of the inference, especially on GPU. Let's say you limit the maximum inference size to 1024x1024 px. At inference time you can then either resize, or resize & pad, the images to 1024x1024, which allows you to form a batch. Obviously, you should then scale / unpad the coordinates of the predictions back to the original resolution of each input.

Sometimes this batching can give a significant speed improvement when using a GPU, especially if you do the actual inference with ONNXRuntime or TensorRT. This should be benchmarked to find the optimal inference image size and batch size.
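For illustration, a minimal resize-and-pad helper of this kind (not a super-gradients API, just a sketch of the idea) could look like:

```python
import cv2
import numpy as np

def letterbox(image: np.ndarray, target: int = 1024):
    """Resize keeping aspect ratio, then pad (bottom/right) to (target, target)."""
    h, w = image.shape[:2]
    scale = min(target / h, target / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image, (new_w, new_h))
    canvas = np.zeros((target, target, 3), dtype=image.dtype)
    canvas[:new_h, :new_w] = resized
    return canvas, scale

def boxes_to_original(boxes_xyxy: np.ndarray, scale: float) -> np.ndarray:
    """Undo the letterbox scaling so boxes match the original image resolution."""
    return boxes_xyxy / scale
```

Letterboxed images can then be stacked into fixed-size batches; after inference, each image's boxes are mapped back with its own scale factor.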

BennySemyonovAB commented 9 months ago

hi thanks for the response :)

To start, I should ask: how did you come up with the conclusion that going GIL-free is going to help? Do you have any profiling stats that prove waiting for the GIL lock is actually hurting performance?

I want to write the inference section "GIL-free" and compare it against the "regular" inference to test whether the GIL is really problematic or not. If you know of a better way of testing GIL locks, it would be really helpful. We are running with GPUs and we want to try to make it faster.
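One way to set up that comparison (a minimal sketch only; YOLO-NAS S and random tensors are stand-ins for the actual data) would be to time the same thread-pool inference loop on the regular interpreter and on a free-threaded build:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import torch
from super_gradients.common.object_names import Models
from super_gradients.training import models

device = "cuda" if torch.cuda.is_available() else "cpu"
model = models.get(Models.YOLO_NAS_S, pretrained_weights="coco").eval().to(device)
images = [torch.rand(1, 3, 640, 640, device=device) for _ in range(64)]

def infer(x):
    with torch.no_grad():
        return model(x)

def throughput(num_workers: int) -> float:
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        list(pool.map(infer, images))
    if device == "cuda":
        torch.cuda.synchronize()
    return len(images) / (time.perf_counter() - start)

# Run this script on the regular interpreter and on a free-threaded build and compare.
# If extra threads barely help even under the GIL, the time is already spent inside
# kernels that release the GIL, and a GIL-free interpreter has little to gain.
for workers in (1, 4):
    print(f"{workers} worker(s): {throughput(workers):.1f} img/s")
```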

This actually may greatly improve the performance of the inference, especially on GPU.

It is a good idea indeed, but unfortunately, as my colleague discussed in #1690, it is not possible in our case :(

thanks again! Benny :)

@BloodAxe