Closed grodino closed 2 years ago
This looks great, thanks for contributing! Just a few minor notes:
TrecSubQrel
Hi, thanks for these comments!
TrecSubtopic
. However, there seems to be an inconsistency between the data and the TrecSubtopic.number type:
TrecSubQrels
) contains tuples (query_id: int, subtopic_id: int, doc_id: str, relevance: int)
This is also the case for the query_id of TrecQuery and TrecQrel (str instead of int).
Would you prefer to preserve the current ID types documented in the code (str) or correct them?
In the meantime, I'm opening a PR (with the subtopic ID as a str).
Nice catch with the subtopic_id. I think it's best as a string as well. Could you make that correction?
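To make the ID type question above concrete, here is a minimal sketch of a subtopic-level qrel record whose IDs are kept as strings. The `SubQrel` class and `parse_subqrel_line` helper are hypothetical names for illustration, not the ir_datasets implementation. The whitespace-separated column order (topic, subtopic, document, judgment) is an assumption about the diversity qrels file, and the sample line is made up.

```python
from typing import NamedTuple


class SubQrel(NamedTuple):
    """Hypothetical stand-in for the TrecSubQrel entity discussed above."""
    query_id: str
    subtopic_id: str
    doc_id: str
    relevance: int


def parse_subqrel_line(line: str) -> SubQrel:
    # Assumed column order: topic, subtopic, document, judgment.
    query_id, subtopic_id, doc_id, relevance = line.split()
    # IDs stay as strings; only the judgment is converted to int.
    return SubQrel(query_id, subtopic_id, doc_id, int(relevance))


# Illustrative line only; not taken from the real qrels file.
example = parse_subqrel_line("201 1 clueweb12-0000tw-05-12114 1")
print(example)
```

Keeping `query_id` and `subtopic_id` as `str` means the records compare equal to string IDs from other entities without casting.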
Dataset Information: ClueWeb09/12, (see #1) diversity tracks
The ClueWeb datasets and the corresponding trec-web tasks are already implemented in ir_datasets, but the diversity track is not. I have begun an implementation of the trec-web-2013/2014 diversity tasks (see fork).
Links to Resources:
Dataset ID(s) & supported entities:
The diversity track relies on the already implemented queries and documents. However, the standard qrels are not suitable, because the diversity qrels are also defined relative to subtopics. This is why the implementation adds a new entity, TrecSubQrel (for TREC subtopic query relevance).
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
ir_datasets/datasets/[topid].py
tests/integration/[topid].py
ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json
ir_datasets/etc/[topid].yaml
ir_datasets/etc/downloads.json
.github/workflows/verify_downloads.yml (only one needed per topid)
downloads.json
Additional comments/concerns/ideas/etc.
This is a first draft of the proposal; I'd be glad to hear any ideas or suggestions! Once we reach a suitable structure, I can also handle the TREC Web diversity track for ClueWeb09 (2009-2012).