Add ClueWeb09/ClueWeb12 diversity track data

grodino commented 2 years ago

Dataset Information: ClueWeb09/12 (see #1) diversity tracks

The ClueWeb datasets and the corresponding trec-web tasks are already implemented in ir_datasets, but the diversity track is not. I have begun an implementation of the trec-web-2013/2014 diversity tasks (see fork).

Links to Resources:

Dataset ID(s) & supported entities:

The diversity track relies on the already implemented queries and documents. However, the standard qrels are not suitable, because here the relevance judgments are also made per subtopic. This is why, in the implementation, I created a new entity, TrecSubQrel (for TREC subtopic query relevance).
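A minimal sketch of what such an entity might look like, assuming a NamedTuple in the style of the existing qrel types. The field names, their order, and the example values are illustrative assumptions and are not taken from the fork:

```python
from typing import NamedTuple


class TrecSubQrel(NamedTuple):
    """One relevance judgment for a (query, subtopic, document) triple."""
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str  # assumed field name; typed as str like the other IDs in this sketch


# Made-up record, for illustration only.
example = TrecSubQrel(
    query_id="201",
    doc_id="clueweb12-0000tw-00-00000",
    relevance=1,
    subtopic_id="2",
)
print(example.query_id, example.subtopic_id, example.relevance)
```

Compared with a plain qrel, the only addition is the subtopic identifier, so a document can carry a separate judgment for each subtopic of the same query.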

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

[x] Dataset definition (in ir_datasets/datasets/[topid].py)
[ ] Tests (in tests/integration/[topid].py)
[x] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
[x] Documentation (in ir_datasets/etc/[topid].yaml)
- [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
[x] Downloadable content (in ir_datasets/etc/downloads.json)
- [x] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- [x] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

This is a first draft of this proposal; I'd be glad to hear any ideas or suggestions! Once we reach a suitable structure, I can also handle the TREC Web diversity track for ClueWeb09 (2009-2012).

seanmacavaney commented 2 years ago

This looks great, thanks for contributing! Just a few minor notes:

- It looks like the subtopic ID is a str not an int -- can change type definition in TrecSubQrel
- Looks like downloads.json was reformatted, making the changes difficult to see. Mind reverting the file to its previous version and making the changes in the file without reformatting it?
I am happy to take care of generating the remaining checklist items if you open a PR.

grodino commented 2 years ago

Hi, thanks for these comments!

I pulled the latest commits and reverted the JSON formatting; it should be clearer now.
For the subtopic ID type, I agree: the type I declared is inconsistent with the type in TrecSubtopic. However, there seems to be an inconsistency between the data and the TrecSubtopic.number type:
- In TrecWeb2013-2014 (clueweb12), the file qrels.all.txt (the one downloaded by TrecSubQrels) contains tuples (query_id: int, subtopic_id: int, doc_id: str, relevance: int).
- The same holds for TrecWeb2009-2012 (clueweb09).

This is also the case for the query_id of TrecQuery and TrecQrel (str instead of int).
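To make the mismatch concrete, the following sketch parses rows shaped like the tuples above while keeping both IDs as strings. It assumes whitespace-separated columns in the order listed above; the SubQrelRow name, the parse_subqrels helper, and the sample rows are hypothetical and are not taken from ir_datasets or from the actual file:

```python
from typing import Iterable, Iterator, NamedTuple


class SubQrelRow(NamedTuple):  # hypothetical name, for illustration only
    query_id: str
    subtopic_id: str
    doc_id: str
    relevance: int


def parse_subqrels(lines: Iterable[str]) -> Iterator[SubQrelRow]:
    """Yield one row per line, leaving IDs as str even if they look numeric."""
    for line in lines:
        cols = line.split()
        if len(cols) != 4:
            continue  # skip blank or malformed lines
        query_id, subtopic_id, doc_id, relevance = cols
        # Only the relevance grade is converted to int.
        yield SubQrelRow(query_id, subtopic_id, doc_id, int(relevance))


# Made-up sample rows in the assumed column order.
sample = [
    "201 1 clueweb12-0000tw-00-00000 1",
    "201 2 clueweb12-0000tw-00-00000 0",
]
rows = list(parse_subqrels(sample))
for row in rows:
    print(row)
```

Keeping the IDs as strings means a numeric-looking value in the file is passed through unchanged, so the parsed rows stay consistent with identifiers that are declared as str elsewhere.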

Would you prefer to keep the current ID types documented in the code (str) or correct them?

In the meantime, I'm opening a PR (with the subtopic ID as a str).

seanmacavaney commented 2 years ago

Nice catch with the subtopic_id. I think it's best as a string as well. Could you make that correction?

allenai / ir_datasets

Add ClueWeb09/ClueWeb12 diversity track data #197