LCFractal / TGDT

Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training
MIT License

The results cannot be reproduced #3

Open Cross-ModalRetrieval opened 5 months ago

Cross-ModalRetrieval commented 5 months ago

What an outstanding job. I tried to reproduce the results with the code in this repository and found, painfully, that I cannot reproduce the results of the paper: the rsum differs by tens of points. What can I do to get the results reported in the paper? Remarkably, the results reported for Flickr30K and COCO 1K are exactly the same, which is surprising.

JlandJ commented 5 months ago

I have found the same issues. First, the results for Flickr30K and COCO 1K are exactly the same in the paper. Second, the results cannot be reproduced: my results are about 10 points lower.

LCFractal commented 5 months ago

Thank you for your interest in our work. We apologize: because of an editing error in the camera-ready version, the Flickr30K results in the paper are wrong (it is anomalous for GLS-GL to score much higher than GLS-L). The correct results are as follows (you can report them when citing the paper):

| Model  | T2I R@1 | T2I R@5 | T2I R@10 | I2T R@1 | I2T R@5 | I2T R@10 | R@sum |
|--------|---------|---------|----------|---------|---------|----------|-------|
| GLS-G  | 55.6    | 83.1    | 89.4     | 70.3    | 91.4    | 95.5     | 483.1 |
| GLS-L  | 61.3    | 86.0    | 91.4     | 76.8    | 93.2    | 96.4     | 505.1 |
| GLS-GL | 62.8    | 87.6    | 92.9     | 78.0    | 94.4    | 97.3     | 513.0 |

We provide trained models for replication. You may need to check your features and environment. Our experiments were run on V100 GPUs under CentOS.
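
For readers comparing their own numbers against the corrected table: R@sum (rsum) is conventionally the sum of the six recall values, R@1, R@5, and R@10 in both retrieval directions. The following is a minimal sketch of that check, not code from the TGDT repository; the names `reported`, `rsum`, and `check_rsum` are hypothetical, and the values are copied from the table as posted. Under this definition the GLS-L and GLS-GL rows add up to their listed R@sum, while the six GLS-G values add up to 485.3 rather than the listed 483.1, so one figure in that row may contain a typo.

```python
# Recall values copied from the corrected Flickr30K table in this thread.
# Column order: T2I R@1, R@5, R@10, then I2T R@1, R@5, R@10.
# Each entry pairs the six recalls with the R@sum listed in the table.
reported = {
    "GLS-G": ([55.6, 83.1, 89.4, 70.3, 91.4, 95.5], 483.1),
    "GLS-L": ([61.3, 86.0, 91.4, 76.8, 93.2, 96.4], 505.1),
    "GLS-GL": ([62.8, 87.6, 92.9, 78.0, 94.4, 97.3], 513.0),
}


def rsum(recalls):
    """Sum the six recall values, rounded to one decimal like the table."""
    return round(sum(recalls), 1)


def check_rsum(table, tol=0.05):
    """Return {model: (computed, listed, matches)} for every row."""
    result = {}
    for model, (recalls, listed) in table.items():
        computed = rsum(recalls)
        result[model] = (computed, listed, abs(computed - listed) <= tol)
    return result


if __name__ == "__main__":
    for model, (computed, listed, ok) in check_rsum(reported).items():
        status = "matches" if ok else "differs"
        print(f"{model}: computed {computed}, listed {listed} ({status})")
```

The same `rsum` helper can be applied to the six recalls printed by a local evaluation run, which makes it easier to state exactly how far a reproduction is from the listed numbers.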

JlandJ commented 5 months ago

Thank you for your response! Why is there such a big difference when I train with the training command provided in the README and then test with the parameters saved by that training run? The results on Flickr30K are extremely low on my local machine. Do I need to add or change anything else to train from scratch? Looking forward to hearing from you!