danigarciaoca opened this issue 1 year ago
Hello, did you use the same box_threshold for both Swin-T and Swin-B? You can try setting a lower threshold for the Swin-B model.
Hi! Yes, I also tried lowering the threshold and a lot of false positives are triggered. Some examples:
BOX_TRESHOLD = 0.10
BOX_TRESHOLD = 0.15
I also tried some other thresholds and I can't reach detection results (for "light bulb") similar to the Swin-T version's.
Maybe I'm missing some other code change? I just changed the load_model line to look like this:
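For reference, the effect of BOX_TRESHOLD can be sketched as a plain confidence filter (a simplification of what the repo's predict() does on the per-token logits; the helper name and data shapes here are illustrative, not the repo's actual API):

```python
def filter_boxes(boxes, scores, box_threshold):
    """Keep only predictions whose confidence exceeds box_threshold.

    Illustrative helper: lowering the threshold keeps more (and
    noisier) boxes, which matches the false positives seen at
    BOX_TRESHOLD = 0.10 / 0.15.
    """
    kept = [(b, s) for b, s in zip(boxes, scores) if s > box_threshold]
    return [b for b, _ in kept], [s for _, s in kept]
```
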
model = load_model("groundingdino/config/GroundingDINO_SwinB_cfg.py", "weights/groundingdino_swinb_cogcoor.pth")
both files downloaded from the links found at the repo.
It looks weird to me because swin_B_384_22k is supposed to be pre-trained on 22k classes with a bigger input image size, so I was expecting clearly better results. Perhaps you can replicate these same results with the image of the dog.
Many thanks for the support!
Hi @dankresio
I tried the Swin-B model on your image with the default threshold values and I'm getting the same results as you: it is not able to detect the light bulb. I even tried changing the prompt to just "bulb"; that also did not work. However, it is able to detect other classes like dog and sunglasses.
TEXT_PROMPT = "dog . bulb . sunglasses ."
Same here. I tested the two models on thousands of images and Swin-T works better than Swin-B.
Hi @rohit901, that is exactly the problem. Just to give some context:
When increasing the architecture size and the number of classes used for pre-training the backbone, you let the backbone learn richer features, which leads to better performance. When increasing the input image size, you let the model improve its spatial resolution and thus detect smaller objects.
So, to sum up: rationally, it does not make sense that the "G-DINO SwinB" flavour works worse than "G-DINO SwinT".
Btw, @psykokwak-com thanks for corroborating the issue!
@rentainhe @HaoLiuHust, are there any plans to release a SwinL backbone for G-DINO, or an improved version of the SwinB backbone?
Thanks in advance, and great work with this repo! I'm enjoying testing it a lot.
I have also found this problem after testing some pictures.
Guys, was any of you able to replicate the results mentioned in the paper on the LVIS or LVIS minival dataset? Unfortunately I'm not able to replicate those results and I'm getting very poor AP on LVIS when trying to evaluate with Swin-B. Is it because of my box/text thresholds? I'm using a 0.32 box and 0.3 text threshold, and splitting the 1203 LVIS classes into chunks of 125 each to pass as the text prompt, due to the max token limit of BERT.
I'm passing the batched data as follows:
image = image.repeat(len(TEXT_PROMPT_LIST), 1, 1, 1)  # repeat the same image along the batch dim, one copy per LVIS class chunk
with torch.no_grad():
    output = model(image, captions=TEXT_PROMPT_LIST)
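The class-splitting step described above can be sketched like this (chunk_prompts is a hypothetical helper, not part of the repo; the " . " separator and trailing " ." follow the prompt format shown earlier in this thread):

```python
def chunk_prompts(class_names, chunk_size=125):
    """Split a long class list into prompt strings short enough
    to stay under BERT's max token limit."""
    chunks = [class_names[i:i + chunk_size]
              for i in range(0, len(class_names), chunk_size)]
    # join each chunk with " . " and add the trailing " ." the model expects
    return [" . ".join(chunk) + " ." for chunk in chunks]
```

With the 1203 LVIS classes and chunk_size=125 this yields 10 prompts (nine full chunks of 125 classes and a final chunk of 78), which would then be passed as TEXT_PROMPT_LIST to the batched call above.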
@rohit901 I think the main point is that the Swin-T weights are fine-tuned on COCO while the Swin-B weights are not (just my conclusion, it doesn't have to be the correct one).
@danigarciaoca, I wanted to replicate the zero-shot results mentioned in Table 3 of the paper on LVIS, where I think even Swin-T gets an AP of 25.6 / 27.4. From my experiments I was not able to replicate that AP.
I will be checking their latest code on zero-shot COCO eval and will try to report back.
I'll stay tuned! thanks @rohit901
I'm discussing the mAP result replication in this separate thread https://github.com/IDEA-Research/GroundingDINO/issues/147 if you're interested in following along, @danigarciaoca.
It seems we aren't supposed to filter out low-confidence/noisy box predictions with thresholds when calculating the mAP metric; it is only for visualization purposes, I guess?
@rohit901 I think the main point is that Swin-T weights are fine tuned in COCO while Swin-B weights are not (just my conclusion, doesn't have to be the correct one)
I think the Swin-B weights are fine-tuned on COCO, so their open-set detection ability is heavily affected, since COCO only has 80 classes. That is why the Swin-B weights detect dog and glasses well but not light bulb~~
The authors @rentainhe @SlongLiu can confirm.
May I ask where you guys downloaded the SwinB weight? The instruction in README only provided this link:
mkdir weights
cd weights
wget -q https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
cd ..
It's in the latest release (i.e. v0.1.0-alpha2). The direct link being: https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth
Also, you would need to use this config file instead of the one used in most tutorials with SwinT.
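Putting the pieces together, the config/checkpoint pairing might look like the lookup below (a sketch: the Swin-T config path is assumed from the repo layout, while the Swin-B paths come from the messages above; load_model is the repo's loader):

```python
# hypothetical lookup pairing each backbone variant with its
# config file and checkpoint; unpack the pair into load_model()
MODEL_VARIANTS = {
    "swin_t": ("groundingdino/config/GroundingDINO_SwinT_OGC.py",
               "weights/groundingdino_swint_ogc.pth"),
    "swin_b": ("groundingdino/config/GroundingDINO_SwinB_cfg.py",
               "weights/groundingdino_swinb_cogcoor.pth"),
}

def paths_for(variant):
    config_path, checkpoint_path = MODEL_VARIANTS[variant]
    return config_path, checkpoint_path

# usage: model = load_model(*paths_for("swin_b"))
```
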
Hi everyone,
I was testing Grounding DINO with the Swin-B backbone and I found that, unexpectedly, it detects fewer objects than Grounding DINO with Swin-T. For example, when asking for "light bulb", G-DINO SwinT finds almost all bulbs but G-DINO SwinB finds none. Any explanation?
G-DINO SwinT (backbone = "swin_T_224_1k"):
G-DINO SwinB (backbone = "swin_B_384_22k"):
I also tried tuning the thresholds, but then it randomly detects other objects. Any idea? Am I missing any configuration detail?
Thanks in advance!