GroundingDINO Inference speed

IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"

https://arxiv.org/abs/2303.05499

Apache License 2.0

6.74k stars, 684 forks

GroundingDINO Inference speed #132

Open lsn199603 opened 1 year ago

lsn199603 commented 1 year ago

GroundingDINO's inference results are very good. However, the inference speed is only 5 FPS. Is it possible to improve the inference speed by pre-encoding the text? Looking forward to your reply!

SlongLiu commented 1 year ago

That is a good point. I believe we can improve the throughput with technical optimizations. It would be helpful if you'd like to provide PRs.
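For anyone who wants to try the pre-encoding idea, here is a minimal sketch of the caching pattern, assuming the prompt stays fixed across frames. `fake_text_encoder` and `make_cached_encoder` are hypothetical stand-ins, not part of the GroundingDINO API. As far as I understand the architecture, only the text backbone output is independent of the image; the later text-image fusion layers would still have to run per frame, so the gain may be modest.

```python
from functools import lru_cache
from typing import Callable, Tuple

def make_cached_encoder(encode: Callable[[str], Tuple[float, ...]], maxsize: int = 32):
    """Wrap a text encoder so each distinct prompt string is encoded only once."""
    @lru_cache(maxsize=maxsize)
    def cached(prompt: str) -> Tuple[float, ...]:
        return encode(prompt)
    return cached

# Hypothetical stand-in for an expensive text backbone (NOT the real model).
calls = {"n": 0}

def fake_text_encoder(prompt: str) -> Tuple[float, ...]:
    calls["n"] += 1  # count how often the "expensive" encoder actually runs
    phrases = [p.strip() for p in prompt.split(".") if p.strip()]
    return tuple(float(len(p)) for p in phrases)

encode = make_cached_encoder(fake_text_encoder)

# Simulate 100 video frames that all use the same prompt.
for _ in range(100):
    text_features = encode("cat . dog .")

print(calls["n"])  # number of real encoder invocations
```

With a real model, the cached value would be the text backbone's output tensors, kept on the same device as the model, and the cache key would have to include anything that changes the tokenization.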

kenhuang1964 commented 1 year ago

Hey @lsn199603, does GroundingDINO work on live video captures?

lsn199603 commented 1 year ago

Hello, I only tested an MP4 video file, not an RTSP video stream.

kenhuang1964 commented 1 year ago

Awesome, thanks! Is the implementation for MP4 video files similar to a YOLO video object detection implementation?

lsn199603 commented 1 year ago

thanks

Yes, the prompt needs to be configured in advance

kenhuang1964 commented 1 year ago

Thank you!

Nancis1130 commented 1 year ago

Have you made any progress on pre-encoding?

farukcankaya commented 2 weeks ago

Hey @lsn199603, if you don’t mind, could you share the specifications you used to achieve 5 FPS? Specifically:

What were the image dimensions and the size of the text prompt?
Which GPU and CPU did you use for the test?

In my test, with an input image of 1200x1800, DINO detects 5 objects, and the prompt includes 13 categories (e.g., "xxx., yyy., zzz.,...") totaling 133 characters.

Each inference takes approximately 2 seconds on a Tesla T4 GPU (g4dn.xlarge).
I converted the model to ONNX (thanks to @wenyi5608) and ran it on Triton, where the GPU inference takes around 500 ms. I can’t reach 5 FPS, only getting close to 3 FPS.
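To make numbers like these easier to compare, here is a rough timing sketch, assuming sequential single-image inference. `benchmark` and `latency_ms_to_fps` are hypothetical helpers, and the `time.sleep` call only stands in for a real model call. For reference, one request at a time at 500 ms works out to 2 FPS and 2 s to 0.5 FPS, so anything above that presumably relies on batching or concurrent requests.

```python
import time
from statistics import median
from typing import Callable, Dict

def latency_ms_to_fps(latency_ms: float) -> float:
    """Convert per-inference latency to sequential frames per second."""
    return 1000.0 / latency_ms

def benchmark(fn: Callable[[], object], warmup: int = 3, runs: int = 20) -> Dict[str, float]:
    """Time repeated calls to fn and report median latency and FPS."""
    for _ in range(warmup):
        fn()  # warm-up runs are excluded from the measurement
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    med = median(latencies)
    return {"median_ms": med * 1000.0, "fps": 1.0 / med}

# Stand-in workload; replace with the actual inference call.
# For GPU models, fn should wait for the device to finish
# (e.g. torch.cuda.synchronize() in PyTorch) before returning.
result = benchmark(lambda: time.sleep(0.01), warmup=1, runs=5)
print(f"median {result['median_ms']:.1f} ms, {result['fps']:.1f} FPS")

print(latency_ms_to_fps(500.0))   # 2.0
print(latency_ms_to_fps(2000.0))  # 0.5
```

Reporting the image size, prompt length, GPU, and whether preprocessing is included alongside these figures makes results from different setups comparable.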