This repository contains the evaluation code, fine-tuning code and datasets for reproducing the results presented in the ACL 2024 paper, VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval.
📢 The original inference code of VISTA (also known as Visualized BGE) can be found in FlagEmbedding.
Please follow the steps below to reproduce the results of Visualized-BGE-M3 on the WebQA dataset in the zero-shot evaluation setting:
Download the WebQA dataset here.
For evaluation, we build the retrieval corpus from all candidates in the WebQA dataset, covering both the training and validation sets and including both text-only and image-text candidates. To keep the results accurate, we deduplicate all candidates: each unique piece of text appears only once among the text-only candidates, and an image-text candidate is considered a duplicate only when both its image ID and its associated text match. Candidates that share the same image ID but have different texts are therefore kept as distinct candidates.
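A minimal sketch of this deduplication logic (the candidate structure and the field names "image_id" and "text" are assumptions for illustration, not the exact WebQA schema):

```python
def deduplicate_candidates(candidates):
    """Build the hybrid corpus from WebQA candidates with deduplication.

    Assumes each candidate is a dict with a "text" field and, for
    image-text candidates, an "image_id" field (hypothetical field names).
    """
    seen_texts = set()   # text-only candidates: unique by text
    seen_pairs = set()   # image-text candidates: unique by (image_id, text)
    corpus = []
    for cand in candidates:
        image_id = cand.get("image_id")
        if image_id is None:                  # text-only candidate
            key = cand["text"]
            if key in seen_texts:
                continue
            seen_texts.add(key)
        else:                                 # image-text candidate
            key = (image_id, cand["text"])    # same image ID, different text -> kept
            if key in seen_pairs:
                continue
            seen_pairs.add(key)
        corpus.append(cand)
    return corpus
```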
Clone the repository from FlagEmbedding, and place all files from the webqa/BGE_M3 directory into the ./FlagEmbedding/Visual directory.
Configure the paths for the model weights, the image directory, and the .jsonl files in eval_webqa.py, then run eval_webqa.py. The corresponding result in the paper is the Hybrid Corpus Recall@5.
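For reference, Recall@K is commonly computed as the fraction of queries for which at least one ground-truth candidate appears among the top-K retrieved corpus entries. The sketch below illustrates that common definition on toy data; the names are illustrative and the exact computation in eval_webqa.py may differ:

```python
def recall_at_k(ranked_ids_per_query, gold_ids_per_query, k=5):
    """Fraction of queries whose top-k retrieved candidates contain at
    least one gold candidate (illustrative helper, not eval_webqa.py)."""
    hits = 0
    for ranked_ids, gold_ids in zip(ranked_ids_per_query, gold_ids_per_query):
        if set(ranked_ids[:k]) & set(gold_ids):
            hits += 1
    return hits / len(ranked_ids_per_query)

# Toy usage with two queries over a hybrid (text + image-text) corpus.
ranked = [
    ["img_3", "txt_9", "img_1", "txt_2", "txt_7"],   # query 1: gold found at rank 3
    ["txt_4", "txt_8", "img_6", "img_2", "txt_1"],   # query 2: gold not in top 5
]
gold = [["img_1"], ["txt_5"]]
print(recall_at_k(ranked, gold, k=5))  # 0.5
```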
We will continue to organize and upload more datasets and related code. If you have any questions or encounter any issues, please feel free to raise an issue.
We have released the core code for fine-tuning VISTA, covering both the Stage-2 training phase and the downstream task fine-tuning described in our paper. The bash scripts in the provided folder show how the various training parameters are configured. Note that the Stage-2 training phase uses a multi-task alternating training approach and the dataset file relies on a relatively complex invocation strategy, so you must set dataloader_num_worker to 1; otherwise the code may malfunction.
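One common reason for such a constraint, sketched as a toy example below (this is not VISTA's actual dataset code): a task-alternating dataset keeps internal state that decides which task the next sample is drawn from, and each additional dataloader worker process holds its own copy of that state, breaking the intended alternation.

```python
import itertools
from torch.utils.data import Dataset, DataLoader

class AlternatingTaskDataset(Dataset):
    """Toy stand-in for a multi-task alternating dataset (illustrative only).

    The task cycle below is per-process state: with num_workers > 1, every
    worker process keeps its own copy, so samples no longer alternate the
    way a single shared cycle would."""

    def __init__(self, tasks):
        self.tasks = tasks
        self._task_cycle = itertools.cycle(range(len(tasks)))

    def __len__(self):
        return sum(len(task) for task in self.tasks)

    def __getitem__(self, idx):
        task_id = next(self._task_cycle)   # stateful: advances on every call
        task = self.tasks[task_id]
        return task[idx % len(task)], task_id

if __name__ == "__main__":
    # Two toy "tasks"; a single worker keeps the alternation state in one place.
    tasks = [list(range(10)), list(range(100, 110))]
    loader = DataLoader(AlternatingTaskDataset(tasks), batch_size=4, num_workers=1)
    for samples, task_ids in loader:
        print(task_ids)  # alternates between task 0 and task 1 within each batch
```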
The fine-tuning data format for the CIRR dataset is provided in the downstream fine-tuning folder. We encourage you to refer to it, along with the dataset file, and adapt it to your specific requirements.
If you find this repository useful, please consider giving it a star ⭐ and a citation:
@article{zhou2024vista,
  title={VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval},
  author={Zhou, Junjie and Liu, Zheng and Xiao, Shitao and Zhao, Bo and Xiong, Yongping},
  journal={arXiv preprint arXiv:2406.04292},
  year={2024}
}