chunmeifeng / SPRC

【ICLR 2024, Spotlight】Sentence-level Prompts Benefit Composed Image Retrieval

Clarification on Vision Backbone Architecture in SPRC (Paper Reported: ViT-L/14, Source Code: ViT-g/14) #4

Open tjddus9597 opened 5 months ago

tjddus9597 commented 5 months ago

I recently had the pleasure of reading your paper submitted to ICLR, which was selected as a spotlight. The insights and methodologies discussed were both enlightening and inspiring.

However, upon examining the source code and associated checkpoint files, I discovered a significant discrepancy that could potentially impact the integrity of the reported results and the fairness of comparisons made within the paper.

The paper states that the SPRC model employs ViT-L/14 as its vision backbone. Yet the default settings in the source code and the architecture details inferred from the checkpoint files suggest the use of the EVA-CLIP ViT-g/14 model instead. This was confirmed by examining the vision model's weights, which correspond to a depth of 40 and a dimension of 6144, characteristics unique to ViT-g/14. The performance gap between EVA-CLIP ViT-g/14 and CLIP ViT-L/14 is substantial, leading to potentially unfair comparisons with existing composed image retrieval methods.
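
For reference, the check can be reproduced roughly as follows (a minimal sketch; the checkpoint file name and the LAVIS-style `visual_encoder.blocks.*` key layout are assumptions on my part):

```python
# Sketch of the checkpoint inspection described above. The file name and the
# key layout ("visual_encoder.blocks.*", as used by LAVIS's EVA ViT) are
# assumptions; adjust them to the actual checkpoint.
import torch

ckpt = torch.load("sprc_checkpoint.pt", map_location="cpu")
state_dict = ckpt.get("model", ckpt)  # some checkpoints nest weights under "model"

# Number of transformer blocks under the visual encoder prefix.
block_ids = {
    int(key.split(".")[2])
    for key in state_dict
    if key.startswith("visual_encoder.blocks.")
}
depth = max(block_ids) + 1

# Hidden width of the first block's MLP.
mlp_dim = state_dict["visual_encoder.blocks.0.mlp.fc1.weight"].shape[0]

# EVA-CLIP ViT-g/14 has 40 blocks and a 6144-d MLP; CLIP ViT-L/14 has
# 24 blocks and a 4096-d MLP (stored under different key names in LAVIS).
print(f"depth = {depth}, MLP hidden dim = {mlp_dim}")
```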

I believe this reporting error was not intentional. My guess is that the default loading call for BLIP-2 in the LAVIS library, `load_model_and_preprocess(name=args.blip_model_name, model_type="pretrain")`, was used without recognizing that the "pretrain" argument selects the ViT-g/14 model. Given the significant performance improvements and the influence your paper has already had, this oversight could lead to misunderstandings and inadvertently set a misleading benchmark for subsequent research.
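
For illustration, here is a minimal sketch (using LAVIS's stock `blip2` model rather than the repository's registered model) showing that the `model_type` string is what selects the vision backbone:

```python
# Minimal sketch: model_type decides which vision encoder LAVIS's BLIP-2 loads.
import torch
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# model_type="pretrain" loads BLIP-2 with the EVA-CLIP ViT-g/14 encoder.
model_g, _, _ = load_model_and_preprocess(
    name="blip2", model_type="pretrain", is_eval=True, device=device
)

# model_type="pretrain_vitL" loads BLIP-2 with the CLIP ViT-L/14 encoder.
model_l, _, _ = load_model_and_preprocess(
    name="blip2", model_type="pretrain_vitL", is_eval=True, device=device
)

# Comparing the visual encoders makes the difference visible.
for tag, model in [("pretrain", model_g), ("pretrain_vitL", model_l)]:
    n_params = sum(p.numel() for p in model.visual_encoder.parameters())
    print(f"{tag}: {type(model.visual_encoder).__name__}, {n_params / 1e6:.0f}M params")
```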

In light of the above, I respectfully suggest that the experiments be re-conducted using ViT-L/14 as initially reported, and that the findings be updated accordingly. Should the performance with ViT-L/14 drop significantly or fail to outperform existing methods, it would be important to address this in the spirit of scientific accuracy and fairness.

I want to emphasize that my intention is not to criticize but to ensure the integrity and reliability of influential research within our community. Correcting this discrepancy is not only in the best interest of maintaining scientific accuracy but also serves as a constructive step towards enhancing the credibility and utility of the findings for future explorations.

If there has been any misunderstanding on my part regarding the architecture used, I am open to correction and deeply apologize for any confusion caused.

Thank you for your attention to this matter. I look forward to your response and any corrective actions you deem appropriate.

chunmeifeng commented 5 months ago

Hi Sungyeon Kim,

Thanks for your email. We will check it and get back to you.

Thanks again for your attention. Any questions, feel free to ask.

Best regards,
Chunmei


tjddus9597 commented 4 months ago

Thank you for your quick reply. I would appreciate it if you could leave an answer that clarifies this issue.

chunmeifeng commented 4 months ago

Hi Kim, Thanks for your follow-up! We will update you when we get back to the office.


baiyang4 commented 4 months ago

Hi Kim,

Thank you for your insightful comments and for bringing the discrepancy in our manuscript to our attention. We sincerely apologize for the erroneous statement regarding the vision backbone architecture. We had assumed that utilizing model_type="pretrain" in the LAVIS BLIP-2 framework for model loading would default to the ViT-L model, which led to the misrepresentation in our manuscript. We have since revised the statement in the manuscript to reflect the accurate architecture used.

In response to your concerns, we have compared results for the ViT-L and ViT-G architectures. Despite the observed performance difference, ViT-L remains competitive. Below is a summary of the comparison:

| Backbone | Recall@1 | Recall@5 | Recall@10 | Recall@50 | Recall_sub@1 | Recall_sub@2 | Recall_sub@3 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-L | 50.70 | 80.65 | 88.77 | 97.64 | 79.59 | 91.90 | 96.77 | 80.12 |
| ViT-G | 51.96 | 82.12 | 89.74 | 97.69 | 80.65 | 92.31 | 96.60 | 81.39 |

Different vision backbones on the CIRR test set.
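
(Avg. follows the standard CIRR protocol, i.e. the mean of Recall@5 and Recall_sub@1: for ViT-L, (80.65 + 79.59) / 2 = 80.12; for ViT-G, (82.12 + 80.65) / 2 ≈ 81.39.)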

| Backbone | Dress R@10 | Dress R@50 | Shirt R@10 | Shirt R@50 | Toptee R@10 | Toptee R@50 | Average R@10 | Average R@50 | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ViT-L | 45.81 | 70.40 | 51.62 | 72.52 | 55.69 | 77.21 | 51.04 | 73.38 | 62.21 |
| ViT-G | 49.18 | 72.43 | 55.64 | 73.89 | 59.35 | 78.58 | 54.92 | 74.97 | 64.85 |

Different vision backbones on the FashionIQ dataset.

We acknowledge the importance of ensuring the accuracy and integrity of our research findings, and we deeply appreciate your efforts in bringing this matter to our attention. Your feedback will undoubtedly contribute to the refinement of our work and enhance its credibility within the research community.

Additionally, to rectify the oversight, we've uploaded the ViT-L pretrained model and corresponding code to ensure transparency and reproducibility.

Thank you once again for your diligence and understanding. Please do not hesitate to reach out if you have any further questions or concerns.

Best regards,

yytinykd commented 4 months ago

I want to use the clip_L model, so I modified `vit_model="clip_L"` in the blip2_qformer_cir_align_prompt.py file. However, when running the code, it still uses the eva_clip_g model. Can you help me with this?

baiyang4 commented 4 months ago

To run clip_L, add `--backbone pretrain_vitL` to your training script. See the argument definition in blip_fine_tune_2.py: `parser.add_argument("--backbone", type=str, default="pretrain", help="pretrain for vit-g, pretrain_vitL for vit-l")`.
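
A rough sketch of how this flag is meant to reach the model loader (the `--backbone` argument is quoted from blip_fine_tune_2.py; the loading call below is an assumption based on LAVIS's standard API, not the repository's exact code):

```python
# Sketch only: shows the intended wiring of --backbone into LAVIS's
# model_type. The actual training script may differ in its details.
import argparse

from lavis.models import load_model_and_preprocess

parser = argparse.ArgumentParser()
parser.add_argument("--backbone", type=str, default="pretrain",
                    help="pretrain for vit-g, pretrain_vitL for vit-l")
# The default model name here is illustrative; the repository registers its
# own BLIP-2 variant and passes it via --blip_model_name.
parser.add_argument("--blip_model_name", type=str, default="blip2")
args = parser.parse_args()

# Passing the backbone string as model_type is what switches the encoder:
# "pretrain" -> EVA-CLIP ViT-g/14, "pretrain_vitL" -> CLIP ViT-L/14.
model, vis_processors, txt_processors = load_model_and_preprocess(
    name=args.blip_model_name,
    model_type=args.backbone,
    is_eval=False,
    device="cpu",
)
```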

yytinykd commented 4 months ago

Hello, in your paper you mention "Ours+CLIP" and "Ours+BLIP". Could you please specify which versions of the pre-trained visual encoders were used for CLIP and BLIP?

tjddus9597 commented 4 months ago

Hello, Baiyang. Thank you for your prompt reply.

It seems that using a ViT-L backbone results in a significant performance drop. On FashionIQ, its performance (62.21) is competitive with TG-CIR (62.21, ViT-B/16) and Re-ranking (62.15, ViT-B/16 with 384 input resolution). Additionally, I have confirmed that the last column of TG-CIR's reported averages is inconsistent: the average R@10, the average R@50, and the final average do not align. On CIRR, the performance (80.12) also falls short of CoVR-BLIP (80.81, ViT-L/16) and Re-ranking (80.90, ViT-B/16 with 384 input resolution).

Although the paper claims that the proposed method achieves state-of-the-art results, a fair comparison suggests that it is competitive but does not achieve the best performance.

The paper has not been updated on arXiv yet. Do you have any plans to revise the manuscript?

Best regards,