When I was reproducing the results on OV-LVIS, I found that the default config (e.g., the R50 OV-LVIS one) sets text guidance to False. Is this a mistake, or is there another reason for this choice? In the paper, text guidance has a significant impact on OV-COCO.
Yes, we do not use text guidance for the OV-LVIS setting by default, because we observed that text guidance has little impact in this case. As discussed in the paper, COCO Caption suffers from data bias (there is usually more than one co-occurring concept in a sub-group), and text guidance is particularly useful for overcoming that bias. But CC3M in OV-LVIS has no such bias, so text guidance is not necessary.
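For anyone who wants to re-enable text guidance and compare, here is a minimal sketch that writes a modified copy of the OV-LVIS yaml rather than editing the repo's config in place. Both the config file name and the `USE_TEXT_GUIDANCE` key are assumptions for illustration; check the actual R50 OV-LVIS yaml for the real field name.

```python
# Minimal sketch, assuming a detectron2-style yaml config: write a copy
# of the OV-LVIS config with text guidance switched back on.
# "USE_TEXT_GUIDANCE" is a hypothetical key name; replace it with
# whatever field the R50 OV-LVIS yaml actually uses.
import yaml

src = "configs/CoDet_OVLVIS_R50.yaml"  # hypothetical file name
dst = "configs/CoDet_OVLVIS_R50_textguide.yaml"

with open(src) as f:
    cfg = yaml.safe_load(f)

cfg.setdefault("MODEL", {})["USE_TEXT_GUIDANCE"] = True  # hypothetical key

with open(dst, "w") as f:
    yaml.safe_dump(cfg, f)
print(f"Wrote {dst}")
```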
@machuofan Thanks for your timely reply. I am now reproducing the results on OV-LVIS, and the AP on novel classes is around 22.0. I am wondering if my CC3M copy has some issues: I downloaded the dataset from Hugging Face, and it contains around 2.9M images, which is fewer than the official 3.3M. How many images did you download? Could you share some advice?
Actually, I only got 2.8M images for the CC3M dataset. But the LVIS APr metric typically has high variance. Maybe you can have another try to see if the results are different.
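In case it helps with debugging the download, a quick sketch for counting how many CC3M images actually landed on disk is below. The root path and file layout are assumptions, so adjust them to match your Hugging Face download.

```python
# Sanity check: count downloaded CC3M images against the official 3.3M.
# Assumes the images were extracted to individual files under one root
# directory; "datasets/cc3m" is a hypothetical path.
from pathlib import Path

root = Path("datasets/cc3m")
exts = {".jpg", ".jpeg", ".png"}
n_images = sum(1 for p in root.rglob("*") if p.suffix.lower() in exts)
print(f"{n_images:,} images found ({n_images / 3_300_000:.1%} of 3.3M)")
```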
Hi @machuofan, how many GPUs are needed to train the EVA-Large model, and how long does training take? Are the transfer experiments based on EVA-Large or Swin-Base?
Hi @kinredon, I cannot clearly remember the numbers; it takes roughly 8 V100s training for 3-5 days. The transfer experiments are based on the R50 backbone.
@machuofan Thanks for your quick response. I tried the default EVA-L config in this repo, https://github.com/CVMI-Lab/CoDet/blob/main/configs/CoDet_OVLVIS_EVA_4x.yaml, but the model cannot train on V100s because the memory cost exceeds 32GB. Even with 65GB free on one GPU, training still fails. Is this config meant for 8 A100 GPUs?
Oh, I see. Then it should be A100 80G GPUs. Sorry for the mistake.
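To confirm which GPUs (and how much memory) a job actually sees before launching, a small PyTorch probe like the following can help; it only assumes a CUDA-enabled torch install.

```python
# Print each visible GPU and its total memory, to confirm whether the
# job will run on 32GB V100s or 80GB A100s before starting training.
import torch

for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```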