Deci-AI / super-gradients

Easily train or fine-tune SOTA computer vision models with one open source training library. The home of Yolo-NAS.
https://www.supergradients.com
Apache License 2.0

Speed is considerably slower than advertised #1633

Closed: siddagra closed this issue 10 months ago

siddagra commented 11 months ago

💡 Your Question

I have tried the .predict() function as well as ONNXRuntime with the TensorRT/CUDA backends. The speed of YoloNAS-S is abysmally low compared to yolov7-tiny: it's running at 10 ms/img using ONNX (int8) and 20 ms with the .predict() function, way slower than advertised.

I am also unsure how to actually run this code using a TensorRT engine with valid inputs; otherwise I would test that as well.

FYI, yolov7-tiny was running at 1 ms per image using their TensorRT notebook example.

The benchmark at https://docs.deci.ai/super-gradients/latest/documentation/source/BenchmarkingYoloNAS.html#step-1-export-yolonas-to-onnx is fine and all, but how do you run actual image inputs with the TensorRT-exported model?
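
For context, the ONNXRuntime timing loop I used was essentially the sketch below (the model path, the 640x640 float32 input, and the provider list are assumptions from my setup; the expected input dtype depends on the export settings):

    import time

    import numpy as np
    import onnxruntime as ort

    # Assumed export path; the TensorRT EP falls back to CUDA/CPU if unavailable.
    session = ort.InferenceSession(
        "yolo_nas_s_int8.onnx",
        providers=["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"],
    )
    input_name = session.get_inputs()[0].name

    # Assumed 640x640 float32 input; use uint8 if preprocessing was baked into the export.
    dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)

    # Warm-up first: the TensorRT EP builds its engine on the first runs.
    for _ in range(10):
        session.run(None, {input_name: dummy})

    start = time.perf_counter()
    for _ in range(100):
        session.run(None, {input_name: dummy})
    print(f"{(time.perf_counter() - start) / 100 * 1000:.2f} ms/img")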

Versions

No response

siddagra commented 11 months ago

trtexec is not compiling for me, so I tried using the Python tensorrt library instead. I get the following error:

    [11/11/2023-19:31:19] [TRT] [E] 1: [runtime.cpp::parsePlan::314] Error Code 1: Serialization (Serialization assertion plan->header.magicTag == rt::kPLAN_MAGIC_TAG failed.)
Traceback (most recent call last):
  File "/media/user123/WD_SN550/yolov7/test_speed.py", line 89, in <module>
    model = TrtModel(trt_engine_path)
  File "/media/user123/WD_SN550/yolov7/test_speed.py", line 31, in __init__
    self.inputs, self.outputs, self.bindings, self.stream = self.allocate_buffers()
  File "/media/user123/WD_SN550/yolov7/test_speed.py", line 51, in allocate_buffers
    for binding in self.engine:
TypeError: 'NoneType' object is not iterable

Essentially the following code returns None:

        # deserialize_cuda_engine returns None (instead of raising) when deserialization fails
        with open(engine_path, 'rb') as f:
            engine_data = f.read()
        engine = trt_runtime.deserialize_cuda_engine(engine_data)
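
For what it's worth, that magicTag assertion (and the None return) usually indicates the engine file was serialized by a different TensorRT release than the Python bindings loading it. A minimal, self-contained check along those lines (engine_path as above; the error message text is just illustrative):

    import tensorrt as trt

    # deserialize_cuda_engine logs the error through the Logger and returns None
    # on failure rather than raising, which is why iterating over self.engine blew up.
    print("Python TensorRT version:", trt.__version__)

    trt_logger = trt.Logger(trt.Logger.WARNING)
    trt_runtime = trt.Runtime(trt_logger)
    with open(engine_path, "rb") as f:
        engine = trt_runtime.deserialize_cuda_engine(f.read())
    if engine is None:
        raise RuntimeError(
            "Engine failed to deserialize; rebuild it (e.g. re-run trtexec) with "
            "the same TensorRT release as the Python package."
        )
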
siddagra commented 11 months ago

Just tested TensorRT (trtexec) and it is considerably faster than ONNX: about 1 ms (GPU compute latency) / 1.7 ms (full latency) for YoloNAS-S int8. But there is no documentation on how to use TensorRT for images.

dizcza commented 11 months ago

Same here. I've quantized YoloV8-Pose Nano to INT8 with NMS built-in. The inference engine is C++ ONNXRuntime in both cases. Below are FPS:

Original YoloV8-Pose-Nano + manual NMS: 7.0
YOLO-NAS V8 Pose Nano INT8 + built-in NMS: 3.4

Same for YoloV8 Pose S.

Even the OpenCV DNN engine with custom NMS does the inference faster...

Come on guys...

BloodAxe commented 11 months ago

We report numbers for the TensorRT inference engine on GPU and OpenVINO on CPU. I'm not sure why you're expecting the advertised latency in a different inference engine.

BloodAxe commented 11 months ago

Just tested TensorRT (trtexec) and it is considerably faster than ONNX: about 1 ms (GPU compute latency) / 1.7 ms (full latency) for YoloNAS-S int8. But there is no documentation on how to use TensorRT for images.

@siddagra TensorRT itself has plenty of documentation on how to run inference. I agree that giving the user an ONNX file and leaving them alone with TensorRT may not be the smoothest experience for new users, and the documentation on our side can be improved. At this point I can't give you an estimate of when this documentation will be added, as we already have quite a lot of work ahead of us. However, we are open to external contributions, so if you are willing to contribute docs explaining how to run inference using TRT, that would be a great win for everyone.

siddagra commented 11 months ago

Thank you for getting back to me. I got it working using the yolov7 code. Perhaps I can submit a PR, if reusing yolov7 code is not against licensing rules (it is under the GNU General Public License): https://github.com/WongKinYiu/yolov7/blob/main/LICENSE.md

Speed is solid at 1.6 ms for YoloNAS-S int8.

Though at the moment it's producing too many bounding boxes, so I should retest until I get better results.
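
For reference, the yolov7-style loop I adapted boils down to something like the sketch below (the engine path, the 640x640 input, the static-shape assumption, the input being binding 0, and the pre-8.5 TensorRT binding API are all assumptions carried over from that example; decoding the outputs depends on the format chosen at export time):

    import cv2
    import numpy as np
    import pycuda.autoinit  # noqa: F401 -- creates a CUDA context on import
    import pycuda.driver as cuda
    import tensorrt as trt

    ENGINE_PATH = "yolo_nas_s_int8.engine"  # assumed path of the trtexec-built engine

    logger = trt.Logger(trt.Logger.WARNING)
    with open(ENGINE_PATH, "rb") as f, trt.Runtime(logger) as runtime:
        engine = runtime.deserialize_cuda_engine(f.read())
    context = engine.create_execution_context()
    stream = cuda.Stream()

    # One pinned host buffer and one device buffer per binding
    # (static-shape engine, pre-TensorRT-8.5 binding API, as in the yolov7 example).
    host_bufs, dev_bufs, bindings = [], [], []
    for binding in engine:
        size = trt.volume(engine.get_binding_shape(binding))
        dtype = trt.nptype(engine.get_binding_dtype(binding))
        host_mem = cuda.pagelocked_empty(size, dtype)
        dev_mem = cuda.mem_alloc(host_mem.nbytes)
        host_bufs.append(host_mem)
        dev_bufs.append(dev_mem)
        bindings.append(int(dev_mem))

    # Preprocess: resize to the assumed 640x640 input, BGR -> RGB, HWC -> CHW.
    # Normalization (or a uint8 input) depends on how the model was exported.
    img = cv2.imread("test.jpg")
    blob = cv2.resize(img, (640, 640))[:, :, ::-1].transpose(2, 0, 1)
    np.copyto(host_bufs[0], blob.ravel().astype(host_bufs[0].dtype))

    # Host -> device copy, inference, device -> host copies (input assumed to be binding 0).
    cuda.memcpy_htod_async(dev_bufs[0], host_bufs[0], stream)
    context.execute_async_v2(bindings=bindings, stream_handle=stream.handle)
    for host_mem, dev_mem in zip(host_bufs[1:], dev_bufs[1:]):
        cuda.memcpy_dtoh_async(host_mem, dev_mem, stream)
    stream.synchronize()

    # host_bufs[1:] now hold the flattened outputs; how to reshape/decode them
    # depends on the output format chosen at export time.

The boxes still have to be rescaled from the 640x640 network input back to the original image size, the same way the yolov7 postprocessing does it.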

siddagra commented 11 months ago

Same here. I've quantized YoloV8-Pose Nano to INT8 with NMS built-in. The inference engine is C++ ONNXRuntime in both cases. Below are FPS:

Original YoloV8-Pose-Nano + manual NMS: 7.0
YOLO-NAS V8 Pose Nano INT8 + built-in NMS: 3.4

Same for YoloV8 Pose S.

Even the OpenCV DNN engine with custom NMS does the inference faster...

Come on guys...

Isn't yolov8 a different model?

TensorRT gives a significant speed-up. With image loading and full end-to-end inference, it can be about 2 ms, I think. I will post how to do TensorRT inference shortly, if/when possible.

dizcza commented 11 months ago

Isn't yolov8 a different model?

I don't know if the authors of YOLO-NAS reused the weights of the original YoloX series; it doesn't matter as long as their benchmark is accurate, because they do compare their solution with YoloV8 in particular. I'm referring to this pic.

I haven't tried the OpenVINO CPU engine in C++: porting my ONNXRuntime code to OpenVINO and making sure it runs as expected on all three platforms (Linux, Windows, and Android) would require quite a lot of extra work. I do see from the picture I linked that YOLO-NAS is actually slower, in exchange for better accuracy.

@BloodAxe are the official benchmarks (as in the picture) for YOLO-NAS-POSE-XXX performed with or without quantization?

dizcza commented 11 months ago

TensorRT gives a significant speed-up.

Is it true for CPU as well?

BloodAxe commented 11 months ago

@dizcza TensorRT works only on NVIDIA GPUs; for CPU-accelerated inference you can use ONNXRuntime or (recommended) OpenVINO.

The benchmarks that we report in the picture and table are for fp16 models, without NMS, using batch size 1 (to have an apples-to-apples comparison with YoloV8).

For quantized models the speed gain would be higher, similar to YoloNAS.

Btw, this page should be helpful: https://github.com/Deci-AI/super-gradients/blob/master/documentation/source/BenchmarkingYoloNAS.md
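
For a quick CPU latency baseline, something along these lines works with OpenVINO's Python API (a rough sketch; the ONNX path and the 640x640 float32 input are assumptions that depend on how the model was exported):

    import time

    import numpy as np
    from openvino.runtime import Core

    core = Core()
    # OpenVINO can load the exported ONNX file directly; the path here is assumed.
    compiled = core.compile_model("yolo_nas_s.onnx", "CPU")

    # Assumed 640x640 float32 input; use uint8 if preprocessing was baked into the export.
    dummy = np.random.rand(1, 3, 640, 640).astype(np.float32)

    for _ in range(10):  # warm-up
        compiled([dummy])

    start = time.perf_counter()
    for _ in range(100):
        compiled([dummy])
    print(f"{(time.perf_counter() - start) / 100 * 1000:.2f} ms/img on CPU")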

Phyrokar commented 10 months ago

Same here. I've quantized YoloV8-Pose Nano to INT8 with NMS built-in. The inference engine is C++ ONNXRuntime in both cases. Below are FPS:
Original YoloV8-Pose-Nano + manual NMS: 7.0
YOLO-NAS V8 Pose Nano INT8 + built-in NMS: 3.4
Same for YoloV8 Pose S.
Even the OpenCV DNN engine with custom NMS does the inference faster...
Come on guys...

Isn't yolov8 a different model?

TensorRT gives a significant speed-up. With image loading and full end-to-end inference, it can be about 2 ms, I think. I will post how to do TensorRT inference shortly, if/when possible.

I've been struggling with the same problems you describe here for months. If you could provide that, it would be amazing!

BloodAxe commented 10 months ago

Closing as stale