Closed rabiulcste closed 9 months ago
In Winoground, the second image is collected manually, so the dataset is very small. For our EqBen, by contrast, we borrow the idea from the video data format, which provides large-scale paired images that can be used to diagnose MLLMs.
I've been exploring the VALSE dataset and I've observed that its structure seems similar to the Winoground dataset. However, I noticed a key difference: while Winoground employs two images, the VALSE dataset appears to use only one.
Could someone clarify how the second image is derived or if there's an underlying reason for this design choice?
Thanks in advance!