[Clarification] Discrepancy between VALSE's Single Image Setup and Winoground's Two-Image Setup

Wangt-CN / EqBen

[ICCV'23 Oral] The introduction and toolkit for EqBen Benchmark

Apache License 2.0


[Clarification] Discrepancy between VALSE's Single Image Setup and Winoground's Two-Image Setup #4

Closed (rabiulcste closed this 9 months ago)

rabiulcste commented 11 months ago

I've been exploring the VALSE dataset and I've observed that its structure seems similar to the Winoground dataset. However, I noticed a key difference: while Winoground employs two images, the VALSE dataset appears to use only one.

Could someone clarify how the second image is derived or if there's an underlying reason for this design choice?

Thanks in advance!

Wangt-CN commented 9 months ago

In Winoground, the second image is manually collected, so the dataset is very small. For our EqBen, by contrast, we borrow the idea from the video data format, which provides large-scale paired images that can be used to diagnose MLLMs.
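For readers comparing the two setups, here is a minimal sketch of how a single-image sample and a two-image sample might each be scored. It assumes Winoground-style text, image, and group criteria for the paired case and a caption-versus-foil comparison for the single-image case. The function names, dictionary keys, and similarity values are hypothetical illustrations, not the EqBen toolkit's API.

```python
from typing import Dict


def single_image_correct(sim_caption: float, sim_foil: float) -> bool:
    """Single-image setup: one image, a correct caption, and a foil caption.

    The sample counts as correct when the model scores the true caption
    higher than the foil for the same image.
    """
    return sim_caption > sim_foil


def paired_scores(sim: Dict[str, float]) -> Dict[str, bool]:
    """Two-image setup: captions c0, c1 and images i0, i1, where c0 matches i0
    and c1 matches i1.

    `sim` maps the hypothetical keys "c0_i0", "c0_i1", "c1_i0", "c1_i1"
    to model similarity scores.
    """
    # Text criterion: for each image, the matching caption must win.
    text_ok = sim["c0_i0"] > sim["c1_i0"] and sim["c1_i1"] > sim["c0_i1"]
    # Image criterion: for each caption, the matching image must win.
    image_ok = sim["c0_i0"] > sim["c0_i1"] and sim["c1_i1"] > sim["c1_i0"]
    # Group criterion: both of the above must hold.
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


# Made-up similarity values, for illustration only.
example = {"c0_i0": 0.9, "c1_i0": 0.4, "c0_i1": 0.85, "c1_i1": 0.8}
result = paired_scores(example)
```

With these made-up values the image criterion passes but the text criterion fails, because caption c0 still scores higher than c1 on image i1. A single-image check has no second image and so cannot surface that failure, which is one reason the paired format is useful for diagnosis.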