NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

Running OWLv2 on non-square images produces upward-shifted bounding boxes #356

Open Thomas-Malchers opened 11 months ago

Thomas-Malchers commented 11 months ago

Hey, I am trying to run your OWLv2 notebook on my local machine. Whenever I use an image that is not square, such as the cat example that you also use, the resulting bounding boxes are slightly shifted upwards. I tried multiple images, file types, etc., and the problem persists. Is there some parameter I am missing? The problem does not seem to be in the plotting itself, as the bounding box coordinates are already wrong.
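A minimal sketch of the arithmetic that would produce exactly this symptom, assuming (as OWLv2's preprocessor does) that the image is padded to a square on the bottom/right and the model returns boxes normalized to that padded square. The box values here are made up for illustration; `scale_boxes` is a hypothetical helper mimicking what `post_process_object_detection` does with `target_sizes`:

```python
def scale_boxes(boxes_norm, target_h, target_w):
    """Scale normalized (x0, y0, x1, y1) boxes by a target (height, width),
    the way object-detection post-processing typically does."""
    return [(x0 * target_w, y0 * target_h, x1 * target_w, y1 * target_h)
            for x0, y0, x1, y1 in boxes_norm]

# Hypothetical box, normalized relative to the 640x640 padded square
box = [(0.25, 0.50, 0.50, 0.75)]

# Original landscape image: 480 high, 640 wide, padded on the bottom to 640x640.
# Scaling by the padded square puts the box where it belongs:
print(scale_boxes(box, 640, 640))  # [(160.0, 320.0, 320.0, 480.0)]

# Scaling by the original (non-square) size shrinks the y coordinates,
# which is the upward shift reported in this issue:
print(scale_boxes(box, 480, 640))  # [(160.0, 240.0, 320.0, 360.0)]
```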

(screenshot: detections with bounding boxes shifted upwards)

(screenshot: second example showing the same shift)

NielsRogge commented 11 months ago

Thanks for reporting, will look into it.

In the meantime, could you try using OwlViTProcessor instead of Owlv2Processor and see whether you get better results?

Thomas-Malchers commented 11 months ago

Thanks for the quick reply. In the meantime, I had a look at the OWL-ViT v1 processor, and when interchanging the processor in this notebook: https://github.com/huggingface/notebooks/blob/main/examples/zeroshot_object_detection_with_owlvit.ipynb with the updated v2 models/processors, it seems to work. I assume that accessing the results directly, instead of calling `results = processor.post_process_object_detection(outputs=outputs, target_sizes=target_sizes, threshold=0.2)`, makes the difference.

Edit: Never mind, my notebook was still in an older state; using the above code also does not solve the problem.

NielsRogge commented 11 months ago

Yeah, I've run the original Colab by the authors on the cats image; when visualizing the bounding boxes, they draw them on the preprocessed (padded + resized) image, not on the original image.

(screenshot: authors' Colab drawing boxes on the padded, resized image)

So we need to do the postprocessing relative to the padded image rather than the original one. Will look more into this over the weekend.
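One possible workaround sketch, assuming the returned boxes are normalized to the padded square and the padding sits only on the bottom/right (so the origin does not move): scale by the padded side length instead of the original height/width, then clip away anything that lands in the padded region. `unpad_boxes` is an illustrative helper, not part of the Transformers API:

```python
def unpad_boxes(boxes_norm, orig_h, orig_w):
    """Map (x0, y0, x1, y1) boxes normalized to the padded square back to
    pixel coordinates of the original (possibly non-square) image."""
    side = max(orig_h, orig_w)  # side length the preprocessor padded to
    out = []
    for x0, y0, x1, y1 in boxes_norm:
        x0, y0, x1, y1 = x0 * side, y0 * side, x1 * side, y1 * side
        # clip coordinates that fall inside the padded (gray) region
        out.append((min(x0, orig_w), min(y0, orig_h),
                    min(x1, orig_w), min(y1, orig_h)))
    return out

# 480x640 landscape image, padded on the bottom to 640x640;
# the box's lower edge pokes into the padding and gets clipped to 480
print(unpad_boxes([(0.25, 0.5, 0.5, 0.9)], 480, 640))
# [(160.0, 320.0, 320.0, 480)]
```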