FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Visualized BGE on COCO/Flickr #975

zwhus opened this issue 1 month ago

zwhus commented 1 month ago

Thank you for your excellent work. I tested Visualized BGE on COCO/Flickr with the Stage 2 weights, but the retrieval performance was quite poor. Have you tried evaluating the Stage 1 weights on COCO/Flickr? If so, how were the results?
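For reference, my evaluation followed roughly this pattern (a minimal sketch: the loading and `encode` calls follow the Visualized BGE usage in this repo's README, while the file paths, checkpoint name, and query text are placeholders):

```python
import torch
from FlagEmbedding.visual.modeling import Visualized_BGE

# BGE text backbone plus the visual adapter weights (checkpoint path is a placeholder).
model = Visualized_BGE(model_name_bge="BAAI/bge-base-en-v1.5",
                       model_weight="Visualized_base_en_v1.5.pth")
model.eval()

with torch.no_grad():
    # Text-to-image retrieval: one text query vs. image candidates.
    query_emb = model.encode(text="a dog catching a frisbee in the park")
    candi_embs = torch.cat([model.encode(image=p)
                            for p in ["img_0.jpg", "img_1.jpg"]], dim=0)

# Embeddings are normalized as in the README example, so the dot
# product is cosine similarity; rank candidates to compute Recall@K.
scores = query_emb @ candi_embs.T
ranking = scores.argsort(descending=True)
```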

JUNJIE99 commented 1 month ago

Weak performance on COCO/Flickr is expected: the motivation for Visualized BGE is to provide a powerful general text embedding model with visual capabilities, thereby enabling hybrid-modal retrieval (e.g., composed image-plus-text queries), rather than to target cross-modal retrieval between text and images.
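By contrast, the intended use is a composed (hybrid-modal) query, where an image and an instruction text are encoded jointly into a single embedding. A minimal sketch, reusing the model object from the snippet above (paths and query text are placeholders):

```python
with torch.no_grad():
    # Hybrid-modal query: reference image plus modification text, encoded jointly.
    query_emb = model.encode(image="ref_dress.jpg",
                             text="the same dress but in red")
    candi_emb = model.encode(image="candidate_dress.jpg")

score = query_emb @ candi_emb.T  # higher score = better match
```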

The Stage 1 model weights of Visualized-BGE-base are still slightly weaker than OpenAI CLIP-base on COCO. This is because our first stage of contrastive learning used neither a batch size as large as CLIP's nor as many training steps; in addition, our text encoder was kept frozen, which preserves the original text embedding capability of BGE.
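As a rough illustration of that setup (not our actual training code; the encoder callables and the temperature value are placeholders), an in-batch contrastive step with a frozen text tower would look like:

```python
import torch
import torch.nn.functional as F

def contrastive_step(image_encoder, text_encoder, images, texts, temperature=0.02):
    # Frozen text tower: no gradients, so BGE's text embeddings are preserved.
    with torch.no_grad():
        t = F.normalize(text_encoder(texts), dim=-1)
    v = F.normalize(image_encoder(images), dim=-1)

    logits = v @ t.T / temperature  # in-batch image-to-text similarities
    labels = torch.arange(len(images), device=logits.device)
    # Gradients flow only through the image tower.
    return F.cross_entropy(logits, labels)
```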

zwhus commented 1 month ago

Thanks. Could you share the Stage 1 hyperparameters, such as the learning rate and batch size?

JUNJIE99 commented 1 month ago

The batch size is 16K and the learning rate is 2e-5 with linear decay. For a more comprehensive account of the training details, please refer to Appendix B of our paper.
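For illustration only (the warmup and total step counts below are placeholders, not values from the paper), such a linear-decay schedule at lr=2e-5 can be wired up with the standard transformers scheduler:

```python
import torch
from transformers import get_linear_schedule_with_warmup

# `model` stands in for the trainable (visual) parameters.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000)

for step in range(10_000):
    ...  # forward/backward on a 16K (global) batch
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```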