Add clueweb12 diversity task datasets

grodino commented 2 years ago

Following the issue #197

Add dataset clueweb12/trec-web-2013/diversity
Add dataset clueweb12/trec-web-2014/diversity

Question: should we correct the types of query_id and subtopic_id? (I'd be glad to do it.)

Dataset Information: ClueWeb09/12 diversity tracks (see #1)

The clueweb datasets and the corresponding trec-web tasks are already implemented in ir_datasets, but the diversity track is not. I began an implementation of the trec-web-2013/2014 diversity tasks (see fork).

Links to Resources:

Dataset ID(s) & supported entities:

The diversity track relies on the already implemented queries and documents. However, the standard qrels are not suitable here, because the diversity qrels are also relative to subtopics. This is why, in the implementation, I created a new entity, TrecSubQrel (for TREC subtopic query relevance).
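To illustrate the idea, here is a minimal sketch of a subtopic-aware qrel record and a parser for it. It is not the code from the fork: the class name SubtopicQrel, the function parse_subtopic_qrels, the field names other than query_id and subtopic_id, the whitespace-separated column order (query, subtopic, document, relevance), and the sample document IDs are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, NamedTuple


class SubtopicQrel(NamedTuple):
    """Hypothetical stand-in for TrecSubQrel; the field layout is assumed."""
    query_id: str
    doc_id: str
    relevance: int
    subtopic_id: str


def parse_subtopic_qrels(lines: Iterable[str]) -> Iterator[SubtopicQrel]:
    """Parse lines assumed to look like: query_id subtopic_id doc_id relevance."""
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 4:
            raise ValueError(f"expected 4 columns, got {len(parts)}: {line!r}")
        query_id, subtopic_id, doc_id, relevance = parts
        yield SubtopicQrel(query_id, doc_id, int(relevance), subtopic_id)


# Made-up sample judgments, used only to exercise the parser.
sample = [
    "201 1 clueweb12-0000tw-00-00000 1",
    "201 2 clueweb12-0000tw-00-00000 0",
]
qrels = list(parse_subtopic_qrels(sample))
```

Here the IDs are kept as strings and only the relevance is converted to an integer; whether query_id and subtopic_id should be strings or integers is exactly the open question raised above.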

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

[x] Dataset definition (in ir_datasets/datasets/[topid].py)
[ ] Tests (in tests/integration/[topid].py)
[x] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
[x] Documentation (in ir_datasets/etc/[topid].yaml)
- [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
[x] Downloadable content (in ir_datasets/etc/downloads.json)
- [x] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- [x] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

This is a first draft of the proposal; I'd be glad to hear any ideas or suggestions! Once we reach a suitable structure, I can also handle the TREC Web diversity tracks that use clueweb09 (2009-2012).

seanmacavaney commented 2 years ago

This looks great to me!

I added tests for the datasets that have been implemented and added the diversity qrels (for all 6 years) to irds-mirror: https://github.com/seanmacavaney/irds-mirror/commit/c607cf875d2b1abd394a1b208cfbee34a07bb4de

Looks fine to go ahead and add the remaining years. I'll hold off until the end to generate the documentation.

I also added you to the list of contributors in the readme. Please make sure that the name and affiliation are correct.

Thanks again for your help!

grodino commented 2 years ago

Hi! Thanks for adding the tests. I tried running them myself but kept getting relative import errors. Is there a specific way to run the tests locally?

I added the TrecWeb2009-2012 diversity tasks and corrected the affiliation (I have to update my github :D). Because I did not manage to run the tests locally, I did not write tests for the new datasets. However, with some pointers, I could write a "Running tests locally" section in the readme and write the tests for these datasets.

seanmacavaney commented 2 years ago

Sorry for the delay in the review. It looks great -- just had to import TrecSubQrel in the cw12 test.

The trick for running tests locally is to invoke it as a module like so:

python -m test.integration.clueweb12

There's probably a better way. Feel free to create a PR for a "Running tests locally" section in the README :D.

allenai / ir_datasets

Add clueweb12 diversity task datasets #198