Thank you for your work, this package has been very helpful! However, I noticed an issue with the expected number of examples for POPE when using the full version:
As you can see here in the dataset configurations file, pope-full expects 9000 total examples, including 3000 each from the adversarial/popular/random splits.
However, the dataset registry's download URL links to a file with 2910 examples for POPE's random split, not 3000. When we download this file, we encounter an inconsistency at the scoring step when checking for the expected number of examples.
The official POPE repo's file for the random split actually does contain 3000 examples. A commit from October 2023 appears to have updated that split from 2910 to 3000 examples (here), which may be how the inconsistency arose. I tried overriding the dataset registry URL to download this file instead, but then hit an issue elsewhere in your repo: the POPE eval harness (correctly?) expects just 2910 examples for the random split, which is hardcoded here.
It seems like the best solution would be to use the most up-to-date POPE dataset and then update that line in the eval harness to reflect the correct number of examples. I can make a PR if that sounds right, but I thought I'd run it by you first.
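For reference, a quick way to reproduce the mismatch is to count records per split against the expected totals. This is just a minimal sketch, not your harness's actual check: the `coco_pope_{split}.json` file names and the newline-delimited JSON layout are assumptions based on the official POPE repo, and `EXPECTED` encodes the 3000-per-split counts from the dataset configurations file.

```python
# Hypothetical sketch of the count check; file layout is an assumption
# (official POPE repo ships one JSON object per line per split file).
EXPECTED = {"adversarial": 3000, "popular": 3000, "random": 3000}

def count_examples(path):
    """Count newline-delimited JSON records in a POPE split file."""
    with open(path) as f:
        return sum(1 for line in f if line.strip())

def check_splits(paths):
    """Return {split: (actual, expected)} for every split whose count differs.

    `paths` maps split name -> path to that split's file.
    """
    mismatches = {}
    for split, expected in EXPECTED.items():
        actual = count_examples(paths[split])
        if actual != expected:
            mismatches[split] = (actual, expected)
    return mismatches
```

With the file currently linked from the dataset registry, this would report `{"random": (2910, 3000)}`, matching the error seen at the scoring step.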