Hi team,
I'd be interested to see whether we could add the MobileCaptureVQA dataset to this benchmark.
This VQA dataset focuses on mobile capture (i.e., images taken with a phone) and aims at assessing models' extraction capabilities specifically on mobile-captured images.
Unlike existing VQA benchmarks (DocVQA, ChartVQA), it puts the emphasis on mobile-capture-specific noise such as bad lighting and document skew, and it provides much higher variability of text in the wild (a receipt, a bottle of wine, food packaging, etc.). Like other VQA datasets, it is meant to be purely extractive, i.e., the answer to the question is written somewhere in the image, which allows for easy scoring.
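For context, here is a minimal sketch of the kind of extractive scoring this enables; the function names and the normalization are illustrative assumptions on my part, not something shipped with the dataset:

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences are not counted as errors.
    return " ".join(text.lower().split())

def exact_match(prediction: str, answers: list[str]) -> bool:
    # A prediction is correct if it matches any reference answer.
    # This is well-defined because answers appear verbatim in the image.
    return normalize(prediction) in {normalize(a) for a in answers}
```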
Hi, feel free to contribute datasets and benchmarks to our pipeline. Once you create a PR, we will review your code and work with you toward a merge.
The dataset is already available on HuggingFace: https://huggingface.co/datasets/arnaudstiegler/mobile_capture_vqa
It contains ~850 questions for ~120 unique images.
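A quick way to inspect it with the `datasets` library (the `"train"` split name below is my assumption about the dataset layout):

```python
from datasets import load_dataset

# Pull the dataset from the Hub; split name is assumed to be "train".
ds = load_dataset("arnaudstiegler/mobile_capture_vqa", split="train")
print(ds)     # overview of features and row count
print(ds[0])  # one sample: image plus its question/answer fields
```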
I'd be happy to contribute the code to add the dataset if there's any interest!
Here's one sample from the dataset (the question/answers are at the top):