Hi team,
I'd be interested to see whether we could add the MobileCaptureVQA dataset to this benchmark.
This VQA dataset focuses on mobile capture (i.e., images taken with a phone) and aims at assessing models' extraction capabilities specifically on mobile-captured images.
Unlike existing VQA benchmarks (DocVQA, ChartVQA), it puts the emphasis on mobile-capture-specific noise such as bad lighting and document skew, and it provides much higher variability of text in the wild (a receipt, a bottle of wine, food packaging, etc.). Like other VQA datasets, it is meant to be purely extractive, i.e., the answer to the question is written somewhere in the image, which allows for easy scoring.
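For context, here is a minimal sketch of the kind of extractive scoring this enables; the function names and the normalization are illustrative assumptions on my part, not something shipped with the dataset:

```python
def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences are not counted as errors.
    return " ".join(text.lower().split())

def exact_match(prediction: str, answers: list[str]) -> bool:
    # A prediction is correct if it matches any reference answer.
    # This is well-defined because answers appear verbatim in the image.
    return normalize(prediction) in {normalize(a) for a in answers}
```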
Hi, feel free to contribute datasets and benchmarks to our pipeline. Once you create a PR, we will review your code and work with you toward a merge.
The dataset is already available on HuggingFace: https://huggingface.co/datasets/arnaudstiegler/mobile_capture_vqa
It contains ~850 questions for ~120 unique images.
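A quick way to inspect it with the `datasets` library (the `"train"` split name below is my assumption about the dataset layout):

```python
from datasets import load_dataset

# Pull the dataset from the Hub; split name is assumed to be "train".
ds = load_dataset("arnaudstiegler/mobile_capture_vqa", split="train")
print(ds)     # overview of features and row count
print(ds[0])  # one sample: image plus its question/answer fields
```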
I'd be happy to contribute the code to add the dataset if there's any interest!
Here's one sample from the dataset (the question/answers are at the top):