NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

OWLv2 with Input box image guided detection #364

Closed · theodu closed this issue 8 months ago

theodu commented 8 months ago

Hi @NielsRogge, thank you for your great work bringing OWLv2 to HuggingFace Transformers, and for the very nice tutorial.

I am interested in running image-guided detection with an input bounding box around an object, as done in the original OWL-ViT authors' notebook.

However, in my experiments, simply passing the box-cropped image as the query image does not work (it gives very bad results), and the HuggingFace code does not accept an input box.
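For reference, the naive approach I tried looks roughly like this (a minimal sketch; the checkpoint is the public OWLv2 one, the crop coordinates are illustrative):

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("scene.jpg")
# Naive idea: crop the object of interest out of the image and use
# the crop itself as the query image -- this is what gives bad results.
query_image = image.crop((100, 150, 300, 350))  # (left, top, right, bottom)

inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)
```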

In this line of code, we can see that all the logic is implemented, but the query box is always hardcoded to (0, 0, 1, 1). Are you planning to extend image-guided detection so that this input can be changed?

Thank you!

NielsRogge commented 8 months ago

Hi,

I've just updated my OWLv2 notebook to look exactly like the one you linked, so there shouldn't be any differences. The original notebook also does not work on a box-cropped image; the reason is that the model expects a certain patch embedding as conditioning.

Hence you first need to run the model on the query image to get a patch embedding. The authors had a hardcoded method for that in v1 (the line of code you refer to); for v2, however, they take the patch with the highest objectness score (as v2 has an objectness head on top of the vision encoder). My PR here adds support for that.
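For completeness, end-to-end usage looks roughly like this (a minimal sketch; the thresholds are illustrative):

```python
import torch
from PIL import Image
from transformers import Owlv2Processor, Owlv2ForObjectDetection

processor = Owlv2Processor.from_pretrained("google/owlv2-base-patch16-ensemble")
model = Owlv2ForObjectDetection.from_pretrained("google/owlv2-base-patch16-ensemble")

image = Image.open("target.jpg")       # image to detect objects in
query_image = Image.open("query.jpg")  # image containing an example of the object

# Internally the model embeds the query image and selects a patch embedding
# (for v2: the patch with the highest objectness score) as conditioning.
inputs = processor(images=image, query_images=query_image, return_tensors="pt")
with torch.no_grad():
    outputs = model.image_guided_detection(**inputs)

# Rescale boxes to the original image size and apply score threshold + NMS.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_image_guided_detection(
    outputs=outputs, threshold=0.6, nms_threshold=0.3, target_sizes=target_sizes
)
print(results[0]["boxes"], results[0]["scores"])
```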

theodu commented 8 months ago

I had a look at the notebook and the PR. It works great when the query image contains one big object at the center of the image.

In my case, I am trying to detect similar objects in images like the following one from the DOTAv2 dataset. The idea would be to locate just one plane on the input image and then find most of the other planes in the same image.

When I take the top-objectness patches on this image, I get boxes that are far from the input I would like to give to the model (just a single plane). Your image_guided_detection_v2 takes the box with the highest objectness, so it wouldn't address this issue, or am I missing something? Would a solution be to follow what you do in your notebook with the cats and take a box that has both a high IoU with the input box I give and a high objectness score? (See the rough sketch after the images below.)

[image: small_planes]

[screenshot: 2023-10-31 19:20:44]
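Concretely, a rough sketch of that selection, assuming access to the per-patch predicted boxes and objectness scores from a forward pass on the query image (tensor names are illustrative, and all boxes are assumed to share the same (x0, y0, x1, y1) format):

```python
import torch

def box_iou(box, boxes):
    """IoU between one box (4,) and many boxes (N, 4), all in (x0, y0, x1, y1)."""
    x0 = torch.maximum(box[0], boxes[:, 0])
    y0 = torch.maximum(box[1], boxes[:, 1])
    x1 = torch.minimum(box[2], boxes[:, 2])
    y1 = torch.minimum(box[3], boxes[:, 3])
    inter = (x1 - x0).clamp(min=0) * (y1 - y0).clamp(min=0)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def select_query_patch(pred_boxes, objectness, user_box, iou_threshold=0.5):
    """Pick the patch with the highest objectness among those whose predicted
    box overlaps the user-provided box; fall back to best IoU otherwise."""
    ious = box_iou(user_box, pred_boxes)
    mask = ious > iou_threshold
    if not mask.any():
        return int(torch.argmax(ious))
    scores = objectness.clone().float()
    scores[~mask] = float("-inf")
    return int(torch.argmax(scores))
```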
NielsRogge commented 8 months ago

It seems that the images you're working with have pretty small objects, so I'd advise using SAHI (slicing-aided hyper inference) to perform object detection on smaller cropped images.
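The idea behind SAHI is to slice the large image into overlapping tiles, run detection on each tile, and merge the results. A minimal hand-rolled version of the slicing step (not using the sahi library itself) could look like this:

```python
from PIL import Image

def slice_image(image, tile_size=1024, overlap=0.2):
    """Yield (tile, left, top) crops covering the image with some overlap,
    so objects sitting on tile borders are not missed."""
    step = int(tile_size * (1 - overlap))
    width, height = image.size
    for top in range(0, max(height - tile_size, 0) + step, step):
        for left in range(0, max(width - tile_size, 0) + step, step):
            tile = image.crop((left, top,
                               min(left + tile_size, width),
                               min(top + tile_size, height)))
            yield tile, left, top

# Run the detector on each tile, shift the resulting boxes back by
# (left, top), and merge overlapping detections across tiles with NMS.
```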

Additionally, as the notebook shows, you can also take the index of another patch with a high objectness score; you don't necessarily need to take the one with the highest score (although the method I'm adding in the PR above will take the one with the highest score).
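For instance (a minimal sketch, where `objectness` stands in for the per-patch objectness logits computed in the notebook):

```python
import torch

# Stand-in for the (num_patches,) objectness logits of the query image.
objectness = torch.randn(3600)

top_scores, top_indices = torch.topk(objectness, k=5)
patch_index = int(top_indices[2])  # e.g. take the third-best patch instead of the best
```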

theodu commented 8 months ago

Thanks a lot for your answers!