IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

What is load_image doing internally, and how can the same operation be applied to frames from a video? #370

Open llealgt opened 16 hours ago

llealgt commented 16 hours ago

While testing, I noticed that inference returns very different results for the same image depending on how it is loaded: once with load_image (method 1) and once with a different loading method (method 2).

As I said, both methods give you a tensor to pass to the model, but they produce very different results (the results from method 2 are usually bad). I inspected the shape of the image returned in both cases and the shapes differ, so there are definitely transformations going on inside load_image. My question is: what is happening inside load_image, so that I can replicate it in other scripts?

My end goal is to run the model on video, that is, on individual frames of the video. I cannot use load_image for this because the frames are not image files on disk; they are read from the video. So I need to understand what is happening inside load_image in order to emulate that behavior on the video frames.

Thanks

llealgt commented 14 hours ago

Ignore my message; for a moment I forgot that the code is available for me to read: https://github.com/IDEA-Research/GroundingDINO/blob/856dde20aee659246248e20734ef9ba5214f5e44/groundingdino/util/inference.py#L39
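
For anyone landing here with the same question, below is a minimal sketch of how the preprocessing could be reproduced for in-memory video frames. It assumes, from my reading of the linked function, that load_image converts the image to RGB, resizes it so the shorter side is 800 pixels with the longer side capped at 1333, scales pixel values to [0, 1], and normalizes with the ImageNet mean and standard deviation. Please check those values against the linked source. The names `resized_size` and `preprocess_frame` are hypothetical helpers and not part of the repository, and the rounding in `resized_size` may differ by a pixel from the repository's own resize transform.

```python
from typing import Tuple

# ImageNet statistics, assumed to match the Normalize step in load_image.
MEAN = (0.485, 0.456, 0.406)
STD = (0.229, 0.224, 0.225)


def resized_size(width: int, height: int, short: int = 800,
                 max_size: int = 1333) -> Tuple[int, int]:
    """Return (new_width, new_height): shorter side -> `short`,
    shrunk further if the longer side would exceed `max_size`."""
    lo, hi = min(width, height), max(width, height)
    if hi / lo * short > max_size:
        short = int(round(max_size * lo / hi))
    if width < height:
        return short, int(short * height / width)
    return int(short * width / height), short


def preprocess_frame(frame_bgr):
    """Turn one OpenCV frame (H x W x 3, BGR, uint8) into a normalized
    3 x H' x W' float32 tensor (hypothetical helper, not from the repo)."""
    # Imported here so the size helper above works without these packages.
    import numpy as np
    import torch
    from PIL import Image

    # OpenCV decodes frames as BGR; the model expects RGB.
    rgb = np.ascontiguousarray(frame_bgr[:, :, ::-1])
    h, w = rgb.shape[:2]
    # PIL's resize takes (width, height).
    image = Image.fromarray(rgb).resize(resized_size(w, h), Image.BILINEAR)
    arr = np.asarray(image, dtype=np.float32) / 255.0  # scale to [0, 1]
    arr = (arr - np.array(MEAN, dtype=np.float32)) / np.array(STD, dtype=np.float32)
    # HWC -> CHW, as the model expects channels first.
    return torch.from_numpy(np.ascontiguousarray(arr.transpose(2, 0, 1)))
```

In a video loop, each frame returned by `cv2.VideoCapture.read()` would be passed through `preprocess_frame`, and the resulting tensor used wherever the second return value of load_image was used before. Keep the original frame around as well if you need to draw the predicted boxes on it.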