Open SimonLupart opened 5 months ago
Awesome!
What's the corpus download process like? We can handle the case like we do for other licensed datasets: provide instructions in the software, and ask them to link the downloaded file somewhere that ir-datasets can pick it up.
As far as the structure goes -- can you clarify if the dataset is a typical clueweb22 split, or a special subset for ikat?
If the former, we have a PR already set for CW22, and it should go under there. Something like clueweb22/trec-ikat-2023 and clueweb22/trec-ikat-2024.
If the latter, then it's probably a different top-level dataset? And it'd probably be structured like: ikat/trec-2023 and ikat/trec-2024 (or similar).
I realized that we already have an agreement for cw22, so I can request a copy and check :)
Yes, the raw dataset is included in clueweb22 (clueweb22-iKAT), but it needs a lot of processing to create the passage splits. So instead, we had a processed version, hosted on your server https://ikattrecweb.grill.science/, that could be accessed by contacting Andrew Ramsay to get the credentials.
As for the hierarchy, if we want to add the queries and qrels, I don't think we can do it under clueweb22/trec-ikat-2023, so it might be better to have a dedicated one?
I have integrated the code, @seanmacavaney can you take a look? I am not sure about the next steps for the PR to be accepted.
The hierarchy is as follows:
trec-ikat/2023
-> doc collection - qrels - both train and test queries
trec-ikat/2023/judged
-> subset of queries with relevance judgements in the qrels from NIST assessors.
trec-ikat/2023/judged/ptkb
-> qrels of the ptkb (see ikat description)
As for the doc collection, we kindly ask people to link the 16 downloaded chunks of the collection (.jsonl.bz2) in the folder .ir_datasets/trec-ikat/TREC-Ikat-CW22-passage/
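The linking step can be sketched roughly as follows. This is only an illustration, not part of the PR: `link_chunks` and `default_target_dir` are hypothetical helper names, and the sketch assumes the ir_datasets home directory defaults to `~/.ir_datasets` unless the `IR_DATASETS_HOME` environment variable is set. No chunk file names are assumed; anything ending in `.jsonl.bz2` is linked.

```python
import os
from pathlib import Path

EXPECTED_CHUNKS = 16  # number of chunks mentioned in this thread


def default_target_dir():
    """Folder where the chunks are expected (assumes the default ir_datasets home)."""
    home = Path(os.environ.get("IR_DATASETS_HOME", Path.home() / ".ir_datasets"))
    return home / "trec-ikat" / "TREC-Ikat-CW22-passage"


def link_chunks(download_dir, target_dir):
    """Symlink every .jsonl.bz2 file from download_dir into target_dir.

    Returns the sorted list of linked file names so the caller can
    compare the count against EXPECTED_CHUNKS.
    """
    download_dir, target_dir = Path(download_dir), Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    chunks = sorted(download_dir.glob("*.jsonl.bz2"))
    for chunk in chunks:
        link = target_dir / chunk.name
        # Skip names that are already present (including dangling symlinks).
        if not (link.is_symlink() or link.exists()):
            link.symlink_to(chunk.resolve())
    return [chunk.name for chunk in chunks]
```

A caller would typically run `names = link_chunks("/path/to/downloads", default_target_dir())` and warn if `len(names) != EXPECTED_CHUNKS`.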
Getting the license to use the collection can be time-consuming and would be handled by CMU, not the iKAT organizers. Please follow these steps to get your data license ASAP:
- Sign the license form available on the ClueWeb22 project web page: https://lemurproject.org/clueweb22/obtain.php and send the form to CMU for approval (jlm4@andrew.cmu.edu).
- Once you have the license, send a mail to Andrew Ramsay [andrew.ramsay@glasgow.ac.uk](mailto:andrew.ramsay@glasgow.ac.uk) to get access to a download link for the preprocessed iKAT passage collection (these are the 16 chunks)
Dataset Information:
The purpose is to add the processed TREC iKAT collection (same collection for years 2023 and 2024 [subset of ClueWeb22-B]). The iKAT shared task can be defined as personalized, retrieval-based "candidate response retrieval" in the context of a conversation. The collection contains 116,838,987 passages (with IDs of the form clueweb22-en0004-50-00170:0).
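To make the passage ID format concrete, here is a minimal sketch. It assumes the part after the last colon is the passage index within the ClueWeb22 document, which is how the example ID above reads. `split_passage_id` and `iter_passages` are hypothetical helper names, not ir_datasets API, and since the record fields of the chunks are not specified in this thread, the reader simply yields each JSON line as a dict.

```python
import bz2
import json


def split_passage_id(passage_id):
    """Split 'clueweb22-en0004-50-00170:0' into (document ID, passage index)."""
    doc_id, sep, index = passage_id.rpartition(":")
    if not sep or not doc_id or not index.isdigit():
        raise ValueError(f"unexpected passage id: {passage_id!r}")
    return doc_id, int(index)


def iter_passages(path):
    """Yield one dict per non-empty line of a .jsonl.bz2 chunk."""
    with bz2.open(path, "rt", encoding="utf-8") as lines:
        for line in lines:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `split_passage_id("clueweb22-en0004-50-00170:0")` returns `("clueweb22-en0004-50-00170", 0)`.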
Links to Resources:
- Guidelines from year 2023: https://www.trecikat.com/guidelines/
- Overview of year 2023: https://arxiv.org/abs/2401.01330
- Github of year 2023: https://github.com/irlabamsterdam/iKAT
- Test topics and qrels 2023: https://trec.nist.gov/data/ikat2023.html
Dataset ID(s) & supported entities:
We can provide the documents and a flattened version of the conversations:
- trec_ikat23/doc: collection of passages, 116M passages
- trec_ikat23/queries: the flattened conversations (156 entries from the 24 topics)
- trec_ikat23/qrels: qrels from the flattened conversations
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
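- Dataset definition (in ir_datasets/datasets/[topid].py)
- Tests (in tests/integration/[topid].py)
- Metadata generated (using the ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
- Documentation (in ir_datasets/etc/[topid].yaml)
- Downloadable content (in ir_datasets/etc/downloads.json)
- Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.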
downloads.json.
Additional comments/concerns/ideas/etc.
The collection requires a licence approved by CMU. Is it possible to restrict access to the collection? (more details below)