FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs

Visualized BGE Evaluation #948

Open zwhus opened 3 months ago

zwhus commented 3 months ago

Hi, I want to reproduce the results of Visualized BGE, but the zero-shot benchmarks (such as WebQA) are not clearly documented. Can you provide the evaluation datasets and code for the zero-shot benchmarks? Thanks!

JUNJIE99 commented 3 months ago

Sure, I will release the relevant evaluation datasets and evaluation code soon. I will inform you when it is complete. Thank you for your attention and patience.

zwhus commented 3 months ago

Thanks, can you give me a rough timeline?

JUNJIE99 commented 3 months ago

Hello, the WebQA dataset and evaluation code have been made available here. Should you have any further questions, please feel free to reach out.

We will also be progressively uploading other evaluation datasets.

Thanks.

zwhus commented 3 months ago

Thanks, I will try it

zwhus commented 2 months ago

Thanks. Can you also provide the other benchmarks, such as FashionIQ and CIRR? I notice the CIRR score for Pixword is 23.9 (R@1), but in the paper it is 23.42 (R@5).

JUNJIE99 commented 2 months ago

I believe you're referring to Pic2Word. The R@1 results of Pic2Word are based on the test set, and the test corpus of CIRR only contains 2,316 images. However, our tests are conducted over the entire CIRR image corpus, which includes 21,551 images. Since this corpus is nearly ten times the size of the test corpus, retrieval is harder and the metrics inevitably differ.

The datasets for CIRR and FashionIQ (including label files and all images) have been updated in this link. The format of all benchmark files is similar, so if you're in a hurry, you can make simple adjustments based on the WebQA code.
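For reference, here is a minimal sketch of how Recall@K over the full image corpus can be computed with FAISS. The file names, array shapes, and ground-truth layout below are illustrative assumptions, not the exact format of the released evaluation code.

```python
# Minimal sketch: queries are fixed, but candidates are ranked against every
# image in the corpus, so the full 21,551-image corpus is strictly harder
# than the 2,316-image test split. File names and shapes are placeholders.
import numpy as np
import faiss

query_embs = np.load("cirr_query_embs.npy").astype("float32")    # (n_queries, dim)
corpus_embs = np.load("cirr_corpus_embs.npy").astype("float32")  # (n_images, dim), full corpus
gt_indices = np.load("cirr_gt_indices.npy")                      # (n_queries,) index of the target image

faiss.normalize_L2(query_embs)
faiss.normalize_L2(corpus_embs)

index = faiss.IndexFlatIP(corpus_embs.shape[1])  # inner product == cosine after L2 normalization
index.add(corpus_embs)

k = 100  # only the top-100 is needed for the reported metrics
_, topk = index.search(query_embs, k)

for r in (1, 5, 10):
    recall = float(np.mean([gt_indices[i] in topk[i, :r] for i in range(len(gt_indices))]))
    print(f"Recall@{r}: {recall:.4f}")
```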

zwhus commented 2 months ago

Thank you for your quick response. I had noticed this detail. However, there are no CIRR images in the link. After I set up Pic2Word and downloaded the dataset, I have 16,939 training images and 2,315 each for testing and validation, for a total of 21,569 images. It seems there is a slight difference. Is any additional processing required?

JUNJIE99 commented 2 months ago

I apologize; it seems the CIRR image upload was interrupted earlier due to network issues. The re-upload has now been completed, and you should be able to see the images in this link.

Regarding the slight difference, we did not perform any additional operations on the CIRR dataset. I checked the CIRR dataset paper, and they reported a total of 21,552 images in Table 2. This number is closer to the size of our corpus.

zwhus commented 2 months ago

Thanks. When evaluating the other datasets, are the parameters consistent with those used for WebQA, such as k? I made some simple modifications for FashionIQ, but the metrics show a slight difference. If the parameters are consistent, I will carefully check my code.

JUNJIE99 commented 2 months ago


Yes, since we only calculate up to the top-100, setting k to 100 would be sufficient.

In addition, pay attention to the arguments passed to model.encode_* within the index and search functions. For example, for FashionIQ, the corpus_type in the search function should be changed to mm_it because its queries are image-text pairs.
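To make the composed-query case concrete, here is a rough sketch of encoding a FashionIQ-style query (reference image plus modification text) alongside a plain candidate image with Visualized BGE. The checkpoint and image paths are placeholders, and you should double-check the encode() argument names against the released evaluation code.

```python
import torch
from FlagEmbedding.visual.modeling import Visualized_BGE

# Placeholder checkpoint path; use the Visualized BGE weight you downloaded.
model = Visualized_BGE(
    model_name_bge="BAAI/bge-base-en-v1.5",
    model_weight="Visualized_base_en_v1.5.pth",
)
model.eval()

with torch.no_grad():
    # Composed query: reference image + modification text
    # (this is what corpus_type="mm_it" corresponds to in the search step).
    query_emb = model.encode(
        image="fashioniq/ref_dress.jpg",
        text="is darker and has long sleeves",
    )
    # Candidate images in the corpus are encoded from the image alone.
    cand_emb = model.encode(image="fashioniq/candidate_dress.jpg")

# Score the candidate with a dot product, as in the repo's Visualized BGE usage example.
score = (query_emb @ cand_emb.T).item()
print(f"similarity: {score:.4f}")
```

For the actual evaluation you would batch-encode the whole candidate corpus, build an index, and search with k=100 as described above.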

zwhus commented 2 months ago

Thank you, I was able to reproduce the results. I had mistakenly compared the m3 results with the base model's.

JUNJIE99 commented 2 months ago

Great!