IDEA-Research / GroundingDINO

[ECCV 2024] Official implementation of the paper "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection"
https://arxiv.org/abs/2303.05499
Apache License 2.0

Poor detections with SwinB backbone #128

Open danigarciaoca opened 1 year ago

danigarciaoca commented 1 year ago

Hi everyone,

I was testing Grounding DINO with the SwinB backbone and I found that, unexpectedly, it detects fewer objects than Grounding DINO with SwinT. For example, when asking for "light bulb", G-DINO SwinT finds almost all bulbs but G-DINO SwinB finds none. Any explanation?

IMAGE_PATH = "weights/dog-3.jpeg"
TEXT_PROMPT = "light bulb"
BOX_TRESHOLD = 0.35
TEXT_TRESHOLD = 0.25

G-DINO SwinT (backbone = "swin_T_224_1k"): [annotated image 1]

G-DINO SwinB (backbone = "swin_B_384_22k"): [annotated image 2]

I also tried tuning the thresholds, but then it randomly detects other objects. Any idea? Am I missing any configuration detail?

Thanks in advance!

rentainhe commented 1 year ago

Hello, did you use the same box_threshold for both Swin-T and Swin-B? You can set a lower threshold for the Swin-B model.

danigarciaoca commented 1 year ago

Hi! Yes, I also tried lowering the threshold and a lot of false positives are triggered. Some examples:

BOX_TRESHOLD = 0.10: [annotated image]

BOX_TRESHOLD = 0.15: [annotated image]

I also tried some other thresholds, but I can't reach detection results for "light bulb" similar to those of the SwinT version.

Maybe I'm missing some other code change? I just changed the load_model line to look like this: model = load_model("groundingdino/config/GroundingDINO_SwinB_cfg.py", "weights/groundingdino_swinb_cogcoor.pth"), with both files downloaded from the links found in the repo.
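
For completeness, the full inference flow I'm running looks roughly like this (a minimal sketch using the repo's groundingdino.util.inference helpers, with the same paths and thresholds as above):

from groundingdino.util.inference import load_model, load_image, predict, annotate
import cv2

# SwinB config and checkpoint instead of the SwinT ones used in most tutorials
model = load_model("groundingdino/config/GroundingDINO_SwinB_cfg.py", "weights/groundingdino_swinb_cogcoor.pth")

image_source, image = load_image("weights/dog-3.jpeg")

# same prompt and thresholds as above
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="light bulb",
    box_threshold=0.35,
    text_threshold=0.25,
)

annotated_frame = annotate(image_source=image_source, boxes=boxes, logits=logits, phrases=phrases)
cv2.imwrite("annotated_image.jpg", annotated_frame)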

It seems odd to me because swin_B_384_22k is supposed to be pre-trained on 22k classes and with a larger input image size, so I was expecting clearly better results. Perhaps you can replicate these same results with the image of the dog.

Many thanks for the support!

rohit901 commented 1 year ago

Hi @dankresio, I tried the swin_B model on your image with the default threshold values and I'm getting the same results as you. It is not able to detect the light bulb. I even tried changing the prompt to just "bulb"; that also did not work. However, it is able to detect other classes like dog and sunglasses. TEXT_PROMPT = "dog . bulb . sunglasses ." [annotated image]

psykokwak-com commented 1 year ago

Same here. I tested the two models on thousands of images and swin_T works better than swin_B.

danigarciaoca commented 1 year ago

Hi @rohit901. The problem is that:

  1. Even if it can detect some other classes like "dog" or "sunglasses" correctly, they are detected with lower confidence when using SwinB than when using SwinT (although the confidence change could be due to the backbone architecture change), and some other objects are missed.
  2. The SwinB model is supposed to be more powerful than SwinT.

Just to give some context:

When increasing the architecture size and the number of classes used for pre-training the backbone, you let the backbone learn richer features, which leads to better performance. When increasing the input image size, you let the model improve its spatial resolution and thus detect smaller objects.

So, to sum up, it does not make sense that the "G-DINO SwinB" flavour works worse than "G-DINO SwinT".

Btw, @psykokwak-com thanks for corroborating the issue!

@rentainhe @HaoLiuHust, is there any plan to release a SwinL backbone for G-DINO, or an improved version of the SwinB backbone?

Thanks in advance, and great work on this repo! I'm enjoying testing it a lot.

laisimiao commented 1 year ago

I have also found this problem after testing some pictures.

rohit901 commented 1 year ago

Guys, was any one of you able to replicate the results mentioned in the paper on the LVIS or LVIS minival dataset? Unfortunately, I'm not able to replicate those results and I'm getting very poor AP on LVIS when evaluating with SwinB. Is it because of my box/text thresholds? I'm using a 0.32 box threshold and a 0.3 text threshold, and splitting the 1203 LVIS classes into chunks of 125 each to pass as text prompts due to the max token limit of BERT.

I'm passing the batched data as follows:

image = image.repeat(len(TEXT_PROMPT_LIST), 1, 1, 1)  # repeat the same image along the batch dimension so each element covers one chunk of LVIS classes
with torch.no_grad():
    output = model(image, captions=TEXT_PROMPT_LIST)
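
For context, TEXT_PROMPT_LIST is built roughly like this (just a sketch; lvis_class_names here is a placeholder for the list of 1203 LVIS category name strings):

CHUNK_SIZE = 125  # keep each caption below BERT's max token limit

TEXT_PROMPT_LIST = []
for i in range(0, len(lvis_class_names), CHUNK_SIZE):
    chunk = lvis_class_names[i:i + CHUNK_SIZE]
    # GroundingDINO expects the categories of a prompt separated by " . " and ending with "."
    TEXT_PROMPT_LIST.append(" . ".join(chunk) + " .")
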
danigarciaoca commented 1 year ago

@rohit901 I think the main point is that the Swin-T weights are fine-tuned on COCO while the Swin-B weights are not (just my conclusion; it doesn't have to be the correct one).

rohit901 commented 1 year ago

@danigarciaoca, I wanted to replicate the zero-shot results mentioned in Table 3 of the paper on LVIS, where I think even Swin-T gets an AP of 25.6 / 27.4. From my experiments I was not able to replicate that AP.

I will check their latest code for zero-shot COCO eval and try to report back.

danigarciaoca commented 1 year ago

I'll stay tuned! Thanks @rohit901

rohit901 commented 1 year ago

I'm discussing the mAP result replication in this separate thread: https://github.com/IDEA-Research/GroundingDINO/issues/147, if you're interested in following along, @danigarciaoca.

It seems we aren't supposed to filter out low-confidence/noisy box predictions with thresholds when calculating the mAP metric; the thresholds are only for visualization purposes, I guess?
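
If that's the case, I guess the evaluation side would look roughly like this (just a sketch with placeholder names, not the repo's actual eval code): keep the top-k highest-scoring boxes per image instead of thresholding, and hand them to the COCO/LVIS evaluator together with their scores.

import torch

K = 300  # typical number of predictions kept per image for COCO/LVIS-style mAP

# boxes: (N, 4) predicted boxes for one image, scores: (N,) per-box confidences
# (placeholders for whatever the model returns after post-processing)
def select_topk(boxes: torch.Tensor, scores: torch.Tensor, k: int = K):
    topk = torch.topk(scores, k=min(k, scores.numel()))
    return boxes[topk.indices], topk.values  # pass these (plus labels) to the evaluator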

aixiaodewugege commented 1 year ago

> @rohit901 I think the main point is that the Swin-T weights are fine-tuned on COCO while the Swin-B weights are not (just my conclusion; it doesn't have to be the correct one).

I think the Swin-B weights are fine-tuned on COCO, so their open-set detection ability is heavily affected, since COCO only has 80 classes. That is why you can detect dog and glasses well with the Swin-B weights, but not light bulb.

rohit901 commented 1 year ago

> @rohit901 I think the main point is that the Swin-T weights are fine-tuned on COCO while the Swin-B weights are not (just my conclusion; it doesn't have to be the correct one).

> I think the Swin-B weights are fine-tuned on COCO, so their open-set detection ability is heavily affected, since COCO only has 80 classes. That is why you can detect dog and glasses well with the Swin-B weights, but not light bulb.

The authors @rentainhe @SlongLiu can confirm.

DianCh commented 1 year ago

May I ask where you guys downloaded the SwinB weight? The instructions in the README only provide this link:

mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
iamnotthatbob commented 1 year ago

> May I ask where you guys downloaded the SwinB weight? The instructions in the README only provide this link:
>
> mkdir weights
> cd weights
> wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
> cd ..

It's in the latest release (i.e. v0.1.0-alpha2). The direct link is https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth.

Also, you would need to use the SwinB config file (groundingdino/config/GroundingDINO_SwinB_cfg.py) instead of the one used in most tutorials with SwinT.