Wangt-CN / EqBen

[ICCV'23 Oral] The introduction and toolkit for EqBen Benchmark
Apache License 2.0


Equivariant Similarity for Vision-Language Foundation Models

Tan Wang, Kevin Lin, Linjie Li, Chung-Ching Lin, Zhengyuan Yang, Hanwang Zhang, Zicheng Liu, Lijuan Wang
Nanyang Technological University, Microsoft Corporation

Our proposed EqBen is the first benchmark to focus on "visual-minimal change" for diagnosing Vision-Language foundation models.


News


About

This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function, which is not only the major training objective but also the core capability delivered to downstream tasks. Unlike the existing image-text similarity objective, which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires the similarity to vary faithfully according to semantic changes. Our key contributions are three-fold:

  1. A novel benchmark named EqBen (Equivariant Benchmark) to evaluate VLMs with visual-minimal-change samples.
  2. A plug-and-play regularization loss, EqSim (Equivariant Similarity Learning), to improve the equivariance of current VLMs.
  3. A toolkit (this repo) providing one-stop evaluation: not only for our EqBen, but also for previous related benchmarks (Winoground, VALSE, etc.).



ToDo List

What can you get from this Repo?



EqBen

Welcome to EqBen, which helps you benchmark your Vision-Language Pretrained (VLP) model effectively and efficiently with a fine-grained image-text matching task. Compared to recent works (Winoground and VALSE) that focus on minimal semantic changes in captions, EqBen pivots on diverse visual-minimal changes, automatically curated from the time-varying visual content of natural videos and from synthetic engines that offer precise control.



Core Design of our EqBen: "Visual-Minimal Change"


This repo provides a one-stop, ready-to-use PyPI toolkit supporting multiple evaluation needs.


Installation & Usage

pip install eqben

It's all set! The toolkit can then be inserted into your VL model framework with only a small code addition. Here we provide a code template and examples (#1 and #2) for two popular VL models (CLIP and FIBER).
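If you want a feel for what that code addition looks like, below is a minimal sketch of a CLIP scoring wrapper in the spirit of the template, built on the Hugging Face `transformers` CLIP API. The `CLIPScorer` name and `score` signature are our illustration here, not the toolkit's actual interface; please follow the linked template and examples for the real integration points.

```python
# Minimal sketch of a scoring wrapper (hypothetical names, not the eqben API).
import torch
from transformers import CLIPModel, CLIPProcessor

class CLIPScorer:
    """Wraps CLIP into a single image-text scoring function."""

    def __init__(self, name="openai/clip-vit-base-patch32"):
        self.model = CLIPModel.from_pretrained(name).eval()
        self.processor = CLIPProcessor.from_pretrained(name)

    @torch.no_grad()
    def score(self, image, text):
        # Returns the raw image-text similarity logit used for ranking.
        inputs = self.processor(text=[text], images=image,
                                return_tensors="pt", padding=True)
        return self.model(**inputs).logits_per_image.item()
```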

For the actual evaluation step, you will need to further download the data. Please check the following sections for details.


EqBen

An overview of our proposed benchmark EqBen, which consists of 5 sub-datasets and can be categorized into natural and synthetic sources.


1. Data Download
2. Modify Data Path

Please refer to the template (example) to modify the data path and annotation path, as sketched below. Then follow the example to insert the EqBen evaluation code into your VL model framework.
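In practice the edit is usually as small as pointing two variables at your local copies; the names below are placeholders, and the template defines the actual ones.

```python
# Placeholder path settings; the real variable names live in the template.
DATA_ROOT = "/path/to/eqben/images"       # root folder of the downloaded images
ANNO_PATH = "/path/to/eqben/annotation"   # annotation file location
```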

3. Submit to Server for Score

Run the evaluation script to obtain the score.npy file, zip it, and submit it to our CodaLab server to get the final score. For more details about the server evaluation, please check the CodaLab website.
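For instance, the packing step might look like the sketch below; the submission.zip name is our assumption here, so follow the CodaLab page for the exact required layout.

```python
# Illustrative packing step; check the CodaLab page for the required layout.
import zipfile
import numpy as np

scores = np.load("score.npy")              # produced by the evaluation script
print("score array shape:", scores.shape)  # sanity check before submitting

with zipfile.ZipFile("submission.zip", "w") as zf:
    zf.write("score.npy")
```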

[UPDATE 2023-09]: Since the original annotation file is now fully public, you have two new options for getting the results:


Winoground & VALSE

An overview of the VALSE evaluation set, which focuses on textual-minimal change.


Our toolkit also supports the previous Winoground and VALSE benchmarks. You can easily import them with the following steps.

1. Data Download

You can download the raw data by following the official websites of Winoground and VALSE.

2. Modify Data Path

Please refer to the template (example) to modify the data path and annotation path. Then follow the example to insert the EqBen evaluation toolkit into your VL model framework.

3. Run the Script and Check the Score

No server submission is needed here: simply run the script and check the offline score output.
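For reference, the sketch below shows how the standard Winoground metrics for a single example are defined from the four image-caption similarities; this is the quantity the offline output reports (the function is our own illustration, not the toolkit's code).

```python
# Winoground metrics for one example: s[i][j] = similarity(caption_i, image_j).
def winoground_scores(s):
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]   # right caption wins for each image
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]  # right image wins for each caption
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}

print(winoground_scores([[0.9, 0.4], [0.3, 0.8]]))
# {'text': True, 'image': True, 'group': True}
```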



EqSim

Our EqSim stems from the intuitive example below, where we depict the similarity scores produced by FIBER (pre-trained on open-source data), a current SOTA VLM.

We can see that FIBER mistakenly assigns a higher similarity score to $\{I_1, T_2\}$ than to $\{I_1, T_1\}$ ($3.83$ vs. $3.79$). Furthermore, the changes in similarity score induced by the semantic change (2$\leftrightarrow$3) are highly inconsistent ($+0.04$ vs. $-1.81$). The key idea of our EqSim is therefore to regularize the consistency between the two similarity changes.
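The snippet below is a conceptual sketch of that idea under our reading of the figure, not the released implementation (the sub-folder is the reference): given two image-text pairs related by a minimal semantic change, penalize the gap between the two similarity changes.

```python
# Conceptual sketch of an EqSim-style regularizer, not the released code.
import torch
import torch.nn.functional as F

def eqsim_regularizer(s11, s12, s21, s22):
    # s_ij: similarity tensor between image I_i and caption T_j,
    # where (I_1, T_1) and (I_2, T_2) are the matched pairs.
    delta_1 = s11 - s12   # similarity change for I_1 when T_1 is swapped to T_2
    delta_2 = s22 - s21   # the corresponding change measured with I_2
    # Equivariance asks the two changes to be consistent; penalize their gap.
    return F.smooth_l1_loss(delta_1, delta_2)

s = torch.randn(4)  # stand-in similarities for a quick check
loss = eqsim_regularizer(s[0], s[1], s[2], s[3])
```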

Please check the sub-folder for implementation.


Acknowledgement

We thank Ziyi Dou for the valuable discussions. We also thank the open-source projects Winoground, VALSE, METER, FIBER, and CLIP.