Wangt-CN / EqBen

[ICCV'23 Oral] The introduction and toolkit for EqBen Benchmark
Apache License 2.0


Equivariant Similarity for Vision-Language Foundation Models

Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
Nanyang Technological University, Microsoft Corporation

Our proposed EqBen is the first benchmark to focus on "visual-minimal change" for diagnosing Vision-Language foundation models.


News


About

This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function, which is not only the major training objective but also the core capability delivered to downstream tasks. Unlike the existing image-text similarity objective, which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires the similarity to vary faithfully according to semantic changes. Our key contributions are three-fold:

  1. A novel benchmark named EqBen (Equivariant Benchmark) to evaluate VLMs with visual-minimal-change samples.
  2. A plug-and-play regularization loss, EqSim (Equivariant Similarity Learning), to improve the equivariance of current VLMs.
  3. A toolkit (this repo) providing one-stop evaluation: not only for our EqBen, but also for previous related benchmarks (Winoground, VALSE, etc.).



ToDo List

What can you get from this Repo?



EqBen

Welcome to EqBen, which helps you benchmark your Vision-Language Pretrained (VLP) model effectively and efficiently with a fine-grained image-text matching task. Compared to recent works (Winoground and VALSE) that focus on minimal semantic changes in captions, EqBen pivots on diverse visual-minimal changes, automatically curated from the time-varying visual content of natural videos and from synthetic engines that offer precise control.



Core Design of our EqBen: "Visual-Minimal Change"


This repo provides a one-stop, ready-to-use PyPI toolkit supporting multiple evaluation needs.


Installation & Usage

pip install eqben

It's all set! The toolkit can then be inserted into your VL model framework with only a small code addition. Here we provide a code template and examples (#1 and #2) for two popular VL models (CLIP and FIBER).
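If you want a feel for what that code addition looks like, below is a minimal sketch of a CLIP scoring wrapper in the spirit of the template, built on the Hugging Face `transformers` CLIP API. The `CLIPScorer` name and `score` signature are our illustration here, not the toolkit's actual interface; please follow the linked template and examples for the real integration points.

```python
# Minimal sketch of a scoring wrapper (hypothetical names, not the eqben API).
import torch
from transformers import CLIPModel, CLIPProcessor

class CLIPScorer:
    """Wraps CLIP into a single image-text scoring function."""

    def __init__(self, name="openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(name).eval()
        self.processor = CLIPProcessor.from_pretrained(name)

    @torch.no_grad()
    def score(self, image, text):
        # Returns the raw image-text similarity logit used for ranking.
        inputs = self.processor(text=[text], images=image,
                                return_tensors="pt", padding=True)
        return self.model(**inputs).logits_per_image.item()
```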

For the actual evaluation step, you will need to further download the data. Please check the following sections for details.


EqBen

An overview of our proposed benchmark EqBen, which consists of 5 sub-datasets and can be categorized into natural and synthetic sources.


1. Data Download
2. Modify Data Path

Please refer to the template (example) to modify the data path and annotation path, as sketched below. Then follow the example to insert the EqBen evaluation code into your VL model framework.
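In practice the edit is usually as small as pointing two variables at your local copies; the names below are placeholders, and the template defines the actual ones.

```python
# Placeholder path settings; the real variable names live in the template.
DATA_ROOT = "/path/to/eqben/images"       # root folder of the downloaded images
ANNO_PATH = "/path/to/eqben/annotation"   # annotation file location
```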

3. Submit to Server for Score

Run the evaluation script to obtain the score.npy file, zip it, and submit it to our CodaLab server to get the final score. For more details about the server evaluation, please check the CodaLab website.
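For instance, the packing step might look like the sketch below; the submission.zip name is our assumption here, so follow the CodaLab page for the exact required layout.

```python
# Illustrative packing step; check the CodaLab page for the required layout.
import zipfile
import numpy as np

scores = np.load("score.npy")              # produced by the evaluation script
print("score array shape:", scores.shape)  # sanity check before submitting

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("score.npy")
```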

[UPDATE 2023-09]: Since the original annotation file is now fully public, you have two new options for getting the results:


Winoground & VALSE

An overview of the VALSE evaluation set, which focuses on textual-minimal change.


Our toolkit also supports the previous Winoground and VALSE benchmarks. You can easily import them with the following steps.

1. Data Download

You can download the raw data by following the official websites of Winoground and VALSE.

2. Modify Data Path

Please refer to the template (example) to modify the data path and annotation path. Then follow the example to insert the EqBen evaluation toolkit into your VL model framework.

3. Run the Script and Check the Score

No server submission is needed here: simply run the script and check the offline score output.
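For reference, the sketch below shows how the standard Winoground metrics for a single example are defined from the four image-caption similarities; this is the quantity the offline output reports (the function is our own illustration, not the toolkit's code).

```python
# Winoground metrics for one example: s[i][j] = similarity(caption_i, image_j).
def winoground_scores(s):
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]   # right caption wins for each image
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]  # right image wins for each caption
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

print(winoground_scores([[0.9, 0.4], [0.3, 0.8]]))
# {'text': True, 'image': True, 'group': True}
```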



EqSim

Our EqSim stems from the intuitive example below, where we depict the similarity scores produced by FIBER (pre-trained on open-source data), a current SOTA VLM.

We can see that FIBER mistakenly assigns a higher similarity score to $\{I_1, T_2\}$ than to $\{I_1, T_1\}$ ($3.83$ vs. $3.79$). Furthermore, the changes in similarity score induced by the semantic change (2$\leftrightarrow$3) are highly inconsistent ($+0.04$ vs. $-1.81$). The key idea of our EqSim is therefore to regularize the consistency between the two similarity changes.
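The snippet below is a conceptual sketch of that idea under our reading of the figure, not the released implementation (the sub-folder is the reference): given two image-text pairs related by a minimal semantic change, penalize the gap between the two similarity changes.

```python
# Conceptual sketch of an EqSim-style regularizer, not the released code.
import torch
import torch.nn.functional as F

def eqsim_regularizer(s11, s12, s21, s22):
    # s_ij: similarity tensor between image I_i and caption T_j,
    # where (I_1, T_1) and (I_2, T_2) are the matched pairs.
    delta_1 = s11 - s12   # similarity change for I_1 when T_1 is swapped to T_2
    delta_2 = s22 - s21   # the corresponding change measured with I_2
    # Equivariance asks the two changes to be consistent; penalize their gap.
    return F.smooth_l1_loss(delta_1, delta_2)

s = torch.randn(4)  # stand-in similarities for a quick check
loss = eqsim_regularizer(s[0], s[1], s[2], s[3])
```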

Please check the sub-folder for implementation.


Acknowledgement

We thank Ziyi Dou for the valuable discussions. We also thank the open-source projects Winoground, VALSE, METER, FIBER, and CLIP.