When I was reproducing the results on OV-LVIS, I found that the default config (e.g., the R50 OV-LVIS one) sets text guidance to False. Is this a mistake, or is there another reason for this choice? In the paper, text guidance has a significant impact on OV-COCO.
Yes, we do not use text guidance for the OV-LVIS setting by default, because we observed that text guidance has little impact in this case. As discussed in the paper, COCO Caption suffers from data bias (there is usually more than one co-occurring concept in a sub-group), and text guidance is particularly useful for overcoming that bias. But CC3M in OV-LVIS has no such bias, so text guidance is not necessary.
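For anyone who wants to re-enable text guidance and compare, here is a minimal sketch that writes a modified copy of the OV-LVIS yaml rather than editing the repo's config in place. Both the config file name and the `USE_TEXT_GUIDANCE` key are assumptions for illustration; check the actual R50 OV-LVIS yaml for the real field name.

```python
# Minimal sketch, assuming a detectron2-style yaml config: write a copy
# of the OV-LVIS config with text guidance switched back on.
# "USE_TEXT_GUIDANCE" is a hypothetical key name; replace it with
# whatever field the R50 OV-LVIS yaml actually uses.
import yaml

src = "configs/CoDet_OVLVIS_R50.yaml"  # hypothetical file name
dst = "configs/CoDet_OVLVIS_R50_textguide.yaml"

with open(src) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("MODEL", {})["USE_TEXT_GUIDANCE"] = True  # hypothetical key

with open(dst, "w") as f:
    yaml.safe_dump(cfg, f)
print(f"Wrote {dst}")
```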
@machuofan Thanks for your timely reply. I am now reproducing the results on OV-LVIS, and the AP on novel classes is around 22.0. I am wondering if my CC3M copy has some issues: I downloaded the dataset from Hugging Face, and it contains around 2.9M images, which is fewer than the official 3.3M. How many images did you download? Could you share some advice?
Actually, I only got 2.8M images for the CC3M dataset. But the LVIS APr metric typically has high variance. Maybe you can have another try to see if the results are different.
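In case it helps with debugging the download, a quick sketch for counting how many CC3M images actually landed on disk is below. The root path and file layout are assumptions, so adjust them to match your Hugging Face download.

```python
# Sanity check: count downloaded CC3M images against the official 3.3M.
# Assumes the images were extracted to individual files under one root
# directory; "datasets/cc3m" is a hypothetical path.
from pathlib import Path

root = Path("datasets/cc3m")
exts = {".jpg", ".jpeg", ".png"}
n_images = sum(1 for p in root.rglob("*") if p.suffix.lower() in exts)
print(f"{n_images:,} images found ({n_images / 3_300_000:.1%} of 3.3M)")
```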
Hi @machuofan, how many GPUs are needed to train the EVA-Large model, and how long does training take? Are the transfer experiments based on EVA-Large or Swin-Base?
Hi @kinredon, I cannot clearly remember the numbers; it takes roughly 8 V100s training for 3-5 days. The transfer experiments are based on the R50 backbone.
@machuofan Thanks for your quick response. I tried the default EVA-L config in this repo, https://github.com/CVMI-Lab/CoDet/blob/main/configs/CoDet_OVLVIS_EVA_4x.yaml, but the model cannot train on V100s because the memory cost exceeds 32GB. Even with 65GB free on one GPU, training still fails. Is this config meant for 8 A100 GPUs?
Oh, I see. Then it should be A100 80G GPUs. Sorry for the mistake.
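To confirm which GPUs (and how much memory) a job actually sees before launching, a small PyTorch probe like the following can help; it only assumes a CUDA-enabled torch install.

```python
# Print each visible GPU and its total memory, to confirm whether the
# job will run on 32GB V100s or 80GB A100s before starting training.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```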