allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

Add ClueWeb09/ClueWeb12 diversity track data #197

Closed grodino closed 2 years ago

grodino commented 2 years ago

Dataset Information: ClueWeb09/12 (see #1), diversity tracks

The ClueWeb datasets and the corresponding trec-web tasks are already implemented in ir_datasets, but the diversity track is not. I have begun an implementation of the trec-web-2013/2014 diversity tasks (see fork).

Links to Resources:

Dataset ID(s) & supported entities:

The diversity track relies on the already implemented queries and documents. However, the standard qrels are not suitable, since here the qrels are also relative to subtopics. This is why, in the implementation, I created a new entity, TrecSubQrel (for TREC subtopic query relevance).
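To make the idea concrete, here is a minimal sketch of what such an entity could look like. It is not the code from the fork: the field names and their order are assumptions, `parse_sub_qrels` is a hypothetical helper, the four-column "topic subtopic docno judgment" layout is assumed for the diversity qrels files, and the sample lines are made up for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class TrecSubQrel(NamedTuple):
    # Assumed fields: a standard qrel plus the subtopic it was judged against.
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str


def parse_sub_qrels(lines: Iterable[str]) -> Iterator[TrecSubQrel]:
    """Hypothetical helper: parse 'topic subtopic docno judgment' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        query_id, subtopic_id, doc_id, relevance = line.split()
        # IDs stay strings; only the relevance judgment is converted to int.
        yield TrecSubQrel(query_id, doc_id, int(relevance), subtopic_id)


# Made-up sample lines, for illustration only.
sample = [
    "201 1 clueweb12-0000tw-05-12114 1",
    "201 2 clueweb12-0000tw-05-12114 0",
]
sub_qrels = list(parse_sub_qrels(sample))
```

With this shape, the same document can carry a different judgment for each subtopic of a query, which a plain query/document qrel cannot express.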

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Additional comments/concerns/ideas/etc.

This is a first draft of this proposal; I'd be glad to hear any ideas or suggestions! Once we reach a suitable structure, I can also handle the TREC Web diversity track related to ClueWeb09 (2009-2012).

seanmacavaney commented 2 years ago

This looks great, thanks for contributing! Just a few minor notes:

grodino commented 2 years ago

Hi, thanks for these comments!

This is also the case for the query_id of TrecQuery and TrecQrel (str instead of int).

Would you prefer to preserve the current ID types documented in the code (str) or correct them?

In the meantime, I'm opening a PR (with the subtopic ID as a str).

seanmacavaney commented 2 years ago

Nice catch with the subtopic_id. I think it's best as a string as well. Could you make that correction?