ExplainableML / Vision_by_Language

[ICLR 2024] Official repository for "Vision-by-Language for Training-Free Compositional Image Retrieval"
MIT License

prompt problem #4

Closed zhaojingj closed 4 months ago

zhaojingj commented 4 months ago

When I tried to replicate your experiments, I could not reach the numbers reported in the paper using the individual prompts provided in prompt.py. For example, on CIRCO's test set, using the LAION-2B ViT-G/14 model, GPT-4, and BLIP-2, I only reached 22.76, while the paper reports 26.77. What might be wrong with my experiment, and which prompt should be used to reproduce the results in the paper?

sgk98 commented 4 months ago

Hey, firstly thanks for your interest in our work! I just ran this experiment again; the only difference was that I used gpt-3.5-turbo instead of gpt-4, and the results are quite close to the ones reported in the paper.

Command:
python src/main.py --dataset circo --split test --dataset-path $datapath --preload img_features captions mods --llm_prompt prompts.structural_modifier_prompt --clip ViT-bigG-14

Results:

Recall@10: 48.88
Recall@25: 63.25
Recall@50: 73.38

mAP@5: 25.77
mAP@10: 26.64
mAP@25: 29.05
mAP@50: 30.14

This is quite close to the result in the paper (mAP@5 of 26.77); I am guessing the minor difference is due to the changing nature of the LLMs. Feel free to re-open the issue if you have any other concerns.
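For anyone comparing their own numbers against the ones above, here is a minimal sketch of how mAP@k is typically computed for a multi-ground-truth benchmark like CIRCO. This is not the repository's evaluation script; the function names and the data layout (`ranked_ids`, `gt_ids`) are illustrative assumptions.

```python
# Hedged sketch: mAP@k for retrieval with multiple ground-truth images per query
# (CIRCO-style). Names and data layout are assumptions for illustration, not the
# repository's actual evaluation code.
from typing import Dict, List, Set


def average_precision_at_k(ranked_ids: List[str], gt_ids: Set[str], k: int) -> float:
    """AP@k: sum of precision@r over ranks r <= k that hit a ground truth,
    normalized by min(k, number of ground truths)."""
    hits, precision_sum = 0, 0.0
    for r, retrieved_id in enumerate(ranked_ids[:k], start=1):
        if retrieved_id in gt_ids:
            hits += 1
            precision_sum += hits / r
    return precision_sum / min(k, len(gt_ids)) if gt_ids else 0.0


def mean_average_precision_at_k(rankings: Dict[str, List[str]],
                                ground_truth: Dict[str, Set[str]],
                                k: int = 5) -> float:
    """mAP@k: mean of AP@k over all queries."""
    aps = [average_precision_at_k(ranked, ground_truth[query], k)
           for query, ranked in rankings.items()]
    return sum(aps) / len(aps)
```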

zhaojingj commented 4 months ago

I’m very interested in your work and would like to know more details. In Table 1, the ViT-G/14∗ CIReVL row reports 26.77 / 27.59 / 29.96 / 31.03. Is this model ViT-bigG-14 or ViT-G-14? And for the ViT-B/32 CIReVL row (14.94 / 15.42 / 17.00 / 17.82), does this model use OpenCLIP weights (LAION-2B)?

sgk98 commented 4 months ago

Good question, we just realized recently that this actually makes a big difference. In the paper, the ViT-B/32 and the ViT-L/14 both use the OpenAI CLIP weights, and I believe you should be able to get very similar numbers to the ones reported in the paper if you rerun the command now.

Interestingly, the OpenCLIP models are actually much better for text-image retrieval: if you use the OpenCLIP ViT-L-14, the mAP@5 is around 21 (as opposed to 18 for the OpenAI model). On Fashion-IQ, the improvements from the OpenCLIP models are even more pronounced (more than a 5 percentage point difference).

For the G/14, we use ViT-bigG-14 with the LAION-2B weights, which is what you can also see in src/main.py. Again, I believe this should be close enough to the reported results. The DataComp checkpoints or the EVA-CLIP models might perform even better, but we haven't tried those out.
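To make the backbone distinction concrete, here is a hedged sketch of how the three checkpoints discussed above could be loaded with the open_clip library. It is not a copy of the repository's own model-loading code, and the LAION-2B pretrained tags are assumptions based on common open_clip checkpoints; verify them with open_clip.list_pretrained().

```python
# Hedged sketch of loading the CLIP backbones discussed above with open_clip.
# The pretrained tags are assumptions; check open_clip.list_pretrained().
import open_clip

# OpenAI weights (used for the ViT-B/32 and ViT-L/14 rows in Table 1)
model_b32, _, preprocess_b32 = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)

# OpenCLIP ViT-L-14 trained on LAION-2B (noticeably stronger for retrieval)
model_l14, _, preprocess_l14 = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="laion2b_s32b_b82k"
)

# ViT-bigG-14 with LAION-2B weights (the G/14 entry)
model_bigg, _, preprocess_bigg = open_clip.create_model_and_transforms(
    "ViT-bigG-14", pretrained="laion2b_s39b_b160k"
)

tokenizer = open_clip.get_tokenizer("ViT-bigG-14")
```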

I hope this helps, and I would be happy to hear from you about your experiments here!

zhaojingj commented 4 months ago

Thank you very much for your reply. Your work has been a great inspiration to me for applying LLMs to ZS-CIR. I wish you every success in your future research.