allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Add ms marco dev small subset? #24

Closed jobergum closed 3 years ago

jobergum commented 3 years ago

Thanks a lot for a great resource!

It seems the small dev set with 6,980 questions is commonly used, so it would be nice to have that split.

seanmacavaney commented 3 years ago

Yeah, that one is missing. Thanks for bringing it up!

It would make sense to add as msmarco-passage/dev/small (seeing as it's a subset of msmarco-passage/dev).

From what I can tell, that one is only available bundled up with collectionandqueries.tar.gz. There's both qrels.dev.small.tsv and queries.dev.small.tsv in there.

At the same time, could also add queries.eval.small.tsv as msmarco-passage/eval/small (queries only). I'm not sure what it's used for, though.
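For reference, a minimal sketch of reading those two files with only the standard library. It assumes queries.dev.small.tsv is two tab-separated columns (query ID, text) and qrels.dev.small.tsv is four whitespace-separated columns (query ID, iteration, passage ID, relevance); neither layout is confirmed in this thread. The helper names `read_queries` and `read_qrels` and the sample rows are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def read_queries(handle):
    """Return {query_id: text} from a tab-separated queries file (assumed layout)."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def read_qrels(handle):
    """Return {query_id: {passage_id: relevance}} from a 4-column qrels file (assumed layout)."""
    qrels = defaultdict(dict)
    for line in handle:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, passage_id, relevance = parts
        qrels[query_id][passage_id] = int(relevance)
    return dict(qrels)


# Made-up sample rows standing in for the real files.
sample_queries = io.StringIO("1\tfirst example query\n2\tsecond example query\n")
sample_qrels = io.StringIO("1\t0\t101\t1\n2\t0\t202\t1\n")

queries = read_queries(sample_queries)
qrels = read_qrels(sample_qrels)

# Keep only the queries that have at least one judgment.
judged = {qid: text for qid, text in queries.items() if qid in qrels}
print(len(queries), len(judged))
```

With the real files, the handles would come from something like `open("queries.dev.small.tsv", encoding="utf-8", newline="")` after extracting the archive.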

jobergum commented 3 years ago

Thanks for the quick feedback! This is a wonderful project. Thanks for making it.

seanmacavaney commented 3 years ago

@jobergum added! They'll be rolled in with a bunch of other additions in the (now poorly-named) clueweb12 branch.

One thing that's missing is the scoreddocs for msmarco-passage/dev/small and msmarco-passage/eval/small: the document IDs to use when testing under the official re-ranking setting. I couldn't find such files prepared anywhere. They wouldn't be so hard to create on-the-fly, but I'm also not sure how useful they would be here.
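A rough sketch of what creating them on-the-fly could look like: filter a larger re-ranking candidate file down to the query IDs of the small subset. This is not how ir_datasets does it. It assumes a tab-separated candidate file whose first two columns are query ID and passage ID, and the function name `filter_scoreddocs` and the sample rows are hypothetical.

```python
import io


def filter_scoreddocs(candidates_handle, query_ids):
    """Yield unique (query_id, passage_id) pairs for the wanted queries.

    Assumes each line is tab-separated with the query ID in column 1 and
    the passage ID in column 2; any further columns are ignored.
    """
    wanted = set(query_ids)
    seen = set()
    for line in candidates_handle:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue  # skip blank or malformed lines
        pair = (cols[0], cols[1])
        if pair[0] in wanted and pair not in seen:
            seen.add(pair)
            yield pair


# Made-up candidate rows: query ID, passage ID, query text, passage text.
sample_candidates = io.StringIO(
    "1\t101\tfirst query\tsome passage\n"
    "1\t102\tfirst query\tanother passage\n"
    "9\t900\tquery outside the subset\tunrelated passage\n"
    "2\t202\tsecond query\tthird passage\n"
)
small_query_ids = ["1", "2"]  # would come from the small queries file

scoreddocs = list(filter_scoreddocs(sample_candidates, small_query_ids))
print(scoreddocs)
```

Streaming line by line keeps memory use proportional to the number of retained pairs rather than the size of the candidate file.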

jobergum commented 3 years ago

Nice, thanks a lot @seanmacavaney!