Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Low FPS with predict_webcam. Running on RTX 4090 #959

Closed: niconielsen32 closed this issue 1 year ago

niconielsen32 commented 1 year ago

Describe the bug

Have tried out all the models and can't get over 20 FPS with the predict_webcam function on an RTX 4090 GPU. For comparison, the YOLOv8 models run at 100-120 FPS.

Video

https://www.youtube.com/watch?v=_ON9oiT_G0w&t=7s

shaydeci commented 1 year ago

Hey @niconielsen32, first of all, thanks for the support in your YouTube video :)

Generally, benchmarking prediction on the PyTorch model is bad practice… The model changes substantially when converted to ONNX and then compiled to TRT. For example, the RepVGG blocks and BN are folded; this has a huge effect on the inference time.
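
To make that concrete, here is a rough sketch of such a conversion path (assuming super-gradients and PyTorch are installed; the `yolo_nas_s` checkpoint name and opset version are illustrative, and this is plain `torch.onnx.export` rather than the library's official export utility):

```python
import torch
from super_gradients.training import models

# Load a pretrained YOLO-NAS checkpoint (model name chosen for illustration).
model = models.get("yolo_nas_s", pretrained_weights="coco").eval()

# Fold RepVGG branches and BatchNorm into plain convolutions, as described above.
model.prep_model_for_conversion(input_size=(640, 640))

# Export the fused graph; the resulting ONNX file can then be compiled
# with TensorRT (e.g. via trtexec).
dummy = torch.randn(1, 3, 640, 640)
torch.onnx.export(model, dummy, "yolo_nas_s.onnx", opset_version=14)
```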

We implemented predict mainly for visual demonstration of the model's capabilities rather than for benchmarking. Nevertheless, there is still work to be done to make predict() run faster, and we will update it soon.

If you want to experience YOLO-NAS where it really shines, I suggest following our QAT/PTQ tutorial (we will soon also add a notebook for it) and observing its performance on a T4. Let me know if you have any other questions.

niconielsen32 commented 1 year ago

Yeah, I'm not looking to benchmark any models, nor was I referring to that. But people are using these functions and the PT models for initial testing and to see how the models work. I will definitely use TensorRT for optimization and utilize this model's quantization, but people looking at these models for smaller projects will most likely not go down that path when they can use other models directly with good performance.

The issue was mainly opened because I can't see how it's possible to only get 20 FPS on a 4090, even with the PT models; I just want to make sure the functionality around the model and the function is not causing a huge FPS drop. Even though it's just a function, it's still the main function for testing out the model when just playing around.

Louis-Dupont commented 1 year ago

Hi @niconielsen32, to add some more information: it seems that YOLOv8 fuses its Conv2d and BatchNorm2d by default, while YoloNAS requires it to be done explicitly. You can fuse some of the blocks yourself by calling model.prep_model_for_conversion(input_size=(640, 640)), which gives a performance boost; (640, 640) is the model's input size. A minimal example is sketched below.

Also note that in our benchmark we fuse more blocks, but we need to update our API to allow users to do that themselves. It would give an extra boost when running predict on the torch model. You can find the implementation details here if you are interested.
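
For reference, a minimal sketch of that workflow (assuming a CUDA machine; the `yolo_nas_s` checkpoint name is just an example):

```python
from super_gradients.training import models

# Load a pretrained YOLO-NAS model (model name chosen for illustration).
model = models.get("yolo_nas_s", pretrained_weights="coco").cuda().eval()

# Fuse Conv2d + BatchNorm2d and re-parameterizable blocks in place;
# (640, 640) matches the model's input size, as noted above.
model.prep_model_for_conversion(input_size=(640, 640))

# Predictions now run on the fused model.
model.predict_webcam()
```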

Eventually, we will do all of this automatically when calling model.predict(...).

I think @shaydeci already covered everything else I had in mind. Feel free to reach out if you have more questions.

ofrimasad commented 1 year ago

Fixed in this PR: https://github.com/Deci-AI/super-gradients/pull/998

One more thing that I think we overlooked here: when you are predicting on a webcam stream, you are bounded by the webcam's FPS. The MacBook Pro M1, for example, limits you to around 25 FPS, and most other laptops give around 20 FPS. The most high-end webcams I could find on Amazon limit you to 30-60 FPS, so I am not sure how other YOLOs presented 120 FPS.
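
If you want to verify that bound yourself, a quick OpenCV snippet (unrelated to super-gradients, just standard cv2 calls) can measure what the camera actually delivers:

```python
import time
import cv2

cap = cv2.VideoCapture(0)
print("Driver-reported FPS:", cap.get(cv2.CAP_PROP_FPS))

# Time 100 real frame grabs to see what the camera actually sustains.
frames, start = 0, time.time()
while frames < 100:
    ok, _ = cap.read()
    if not ok:
        break
    frames += 1
print("Measured FPS:", frames / (time.time() - start))
cap.release()
```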