TREC iKAT 2023/2024

SimonLupart commented 5 months ago

Dataset Information:

The purpose is to add the processed TREC iKAT collection (the same collection is used for 2023 and 2024; it is a subset of ClueWeb22-B). The iKAT shared task can be described as personalized, retrieval-based "candidate response retrieval" in the context of a conversation. The collection contains around 116,838,987 passages, with IDs of the form clueweb22-en0004-50-00170:0.
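For illustration, a passage ID of this form can be split into its document part and its trailing number. This is a minimal sketch: the names `PassageId` and `parse_passage_id` are hypothetical, and reading the number after the colon as the passage's index within the document is an assumption based only on the example ID above.

```python
from typing import NamedTuple


class PassageId(NamedTuple):
    doc_id: str         # e.g. "clueweb22-en0004-50-00170"
    passage_index: int  # assumed: index of the passage within the document


def parse_passage_id(passage_id: str) -> PassageId:
    """Split '<doc_id>:<index>' on the last colon."""
    doc_id, sep, index = passage_id.rpartition(":")
    if not sep or not doc_id or not index.isdigit():
        raise ValueError(f"unexpected passage id: {passage_id!r}")
    return PassageId(doc_id, int(index))


parsed = parse_passage_id("clueweb22-en0004-50-00170:0")
```

Splitting on the last colon keeps the sketch robust if a document ID ever contains a colon itself.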

Links to Resources:

- Guidelines from year 2023: https://www.trecikat.com/guidelines/
- Overview of year 2023: https://arxiv.org/abs/2401.01330
- Github of year 2023: https://github.com/irlabamsterdam/iKAT
- Test topics and qrels 2023: https://trec.nist.gov/data/ikat2023.html

Dataset ID(s) & supported entities:

We can provide the documents and a flattened version of the conversations:

- trec_ikat23/doc: collection of passages (116M passages)
- trec_ikat23/queries: the flattened conversations (156 entries from the 24 topics)
- trec_ikat23/qrels: qrels for the flattened conversations
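To make "flattened" concrete, here is a rough sketch of turning multi-turn conversations into one query entry per turn. The input structure, the field names (`topic_id`, `turns`, `turn_id`, `utterance`), the `<topic>_<turn>` ID scheme, and the toy data are all hypothetical and are not taken from the actual iKAT topic files.

```python
from typing import Dict, Iterable, Iterator


def flatten_conversations(topics: Iterable[Dict]) -> Iterator[Dict]:
    """Yield one query entry per conversation turn (hypothetical schema)."""
    for topic in topics:
        for turn in topic["turns"]:
            yield {
                # Hypothetical ID scheme: "<topic_id>_<turn_id>"
                "query_id": f'{topic["topic_id"]}_{turn["turn_id"]}',
                "text": turn["utterance"],
            }


# Toy data for illustration only, not real iKAT topics.
toy_topics = [
    {"topic_id": "1", "turns": [
        {"turn_id": 1, "utterance": "first question"},
        {"turn_id": 2, "utterance": "follow-up question"},
    ]},
    {"topic_id": "2", "turns": [
        {"turn_id": 1, "utterance": "another question"},
    ]},
]

flat = list(flatten_conversations(toy_topics))
```

With this shape, the number of query entries equals the total number of turns across all topics.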

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

- [x] Dataset definition (in ir_datasets/datasets/[topid].py)
- [x] Tests (in tests/integration/[topid].py)
- [x] Metadata generated (using ir_datasets generate_metadata command, should appear in ir_datasets/etc/metadata.json)
- [x] Documentation (in ir_datasets/etc/[topid].yaml)
- [ ] Documentation generated in https://github.com/seanmacavaney/ir-datasets.com/
- [x] Downloadable content (in ir_datasets/etc/downloads.json)
- [ ] Download verification action (in .github/workflows/verify_downloads.yml). Only one needed per topid.
- [ ] Any small public files from NIST (or other potentially troublesome files) mirrored in https://github.com/seanmacavaney/irds-mirror/. Mirrored status properly reflected in downloads.json.

Additional comments/concerns/ideas/etc.

The collection requires a license approved by CMU. Is it possible to restrict access to the collection? (More details below.)

💥 Document Collection: TREC iKAT 2023 ClueWeb22-B

The collection distribution is being handled directly by CMU and not the iKAT organizers. Please follow these steps to get your data license ASAP:

- Sign the license form available on the ClueWeb22 project web page.
- Send the form to CMU for approval (jlm4@andrew.cmu.edu).

Please give enough time to the CMU licensing office to accept your request. A download link will be sent to you by the ClueWeb22 team at CMU.

Note:

CMU requires a signature from the organization (i.e., the university or company), not an individual who wants to use the data. This can slow down the process at your end too. So, it’s useful to start the process ASAP. If you already have an accepted license for ClueWeb22, you don’t need a new form. Please let us know if that’s the case.

seanmacavaney commented 5 months ago

Awesome!

What's the corpus download process like? We can handle the case like we do for other licensed datasets: provide instructions in the software, and ask them to link the downloaded file somewhere that ir-datasets can pick it up.

seanmacavaney commented 5 months ago

As far as the structure goes -- can you clarify if the dataset is a typical clueweb22 split, or a special subset for ikat?

If the former, we have a PR already set for CW22, and it should go under there. Something like clueweb22/trec-ikat-2023 and clueweb22/trec-ikat-2024

If the latter, then it's probably a different top-level dataset? And it'd probably be structured like: ikat/trec-2023 and ikat/trec-2024 (or similar)

seanmacavaney commented 5 months ago

I realized that we already have an agreement for cw22, so I can request a copy and check :)

SimonLupart commented 5 months ago

Yes, the raw dataset is included in clueweb22 (clueweb22-iKAT), but it needs a lot of processing to create the passage splits. So instead, we have a processed version, hosted at https://ikattrecweb.grill.science/, that can be accessed by contacting Andrew Ramsay to get the credentials.

SimonLupart commented 5 months ago

As for the hierarchy, if we want to add the queries and qrels, I don't think we can do it under clueweb22/trec-ikat-2023, so it might be better to have a dedicated top-level dataset?

SimonLupart commented 5 months ago

I have integrated the code. @seanmacavaney, could you take a look? I am not sure about the next steps for the PR to be accepted.

The hierarchy is as follows:

- trec-ikat/2023: doc collection, qrels, and both train and test queries
- trec-ikat/2023/judged: subset of queries with relevance judgments in the qrels from NIST assessors
- trec-ikat/2023/judged/ptkb: qrels of the PTKB (see the iKAT description)
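As a sketch of how this hierarchy would be used from Python: the dataset IDs below are the ones proposed in this thread and only resolve if the PR is merged with them unchanged. `load_ikat` and `IKAT_2023_IDS` are hypothetical names for illustration; `ir_datasets.load`, `queries_iter`, and `qrels_iter` are the standard ir_datasets entry points.

```python
# Dataset IDs as proposed in this thread (assumption: merged unchanged).
IKAT_2023_IDS = {
    "full": "trec-ikat/2023",
    "judged": "trec-ikat/2023/judged",
    "ptkb": "trec-ikat/2023/judged/ptkb",
}


def load_ikat(subset: str = "judged"):
    """Load one of the proposed iKAT 2023 datasets via ir_datasets."""
    # Imported lazily so the sketch can be read without ir_datasets installed.
    import ir_datasets
    return ir_datasets.load(IKAT_2023_IDS[subset])


# Example usage (requires ir_datasets with the iKAT datasets registered):
# dataset = load_ikat("judged")
# for query in dataset.queries_iter():
#     print(query.query_id)
# for qrel in dataset.qrels_iter():
#     print(qrel.query_id, qrel.doc_id, qrel.relevance)
```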

As for the doc collection, we kindly ask people to link the 16 downloaded chunks of the collection in the folder .ir_datasets/trec-ikat/TREC-Ikat-CW22-passage/ (.jsonl.bz2)
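Before pointing ir_datasets at the folder, it can help to check that all chunks are in place and readable. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the chunk file names are not assumed (any `*.jsonl.bz2` file is counted), and no assumption is made about the fields inside each JSON record.

```python
import bz2
import json
from pathlib import Path
from typing import Dict, Iterator, List

EXPECTED_CHUNKS = 16  # number of chunks mentioned in this thread


def find_chunks(folder: Path) -> List[Path]:
    """Return the sorted .jsonl.bz2 chunks, failing if the count is off."""
    chunks = sorted(folder.glob("*.jsonl.bz2"))
    if len(chunks) != EXPECTED_CHUNKS:
        raise FileNotFoundError(
            f"expected {EXPECTED_CHUNKS} .jsonl.bz2 files in {folder}, "
            f"found {len(chunks)}"
        )
    return chunks


def iter_passages(chunk: Path) -> Iterator[Dict]:
    """Stream one JSON record per non-empty line of a bz2-compressed chunk."""
    with bz2.open(chunk, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Example usage (assumes the folder lives under the home directory):
# folder = Path.home() / ".ir_datasets" / "trec-ikat" / "TREC-Ikat-CW22-passage"
# first = next(iter_passages(find_chunks(folder)[0]))
```

Streaming line by line avoids decompressing a whole chunk into memory, which matters for a collection of this size.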

Getting the license to use the collection can be time-consuming and would be handled by CMU, not the iKAT organizers. Please follow these steps to get your data license ASAP:

Sign the license form available on the ClueWeb22 project web page: https://lemurproject.org/clueweb22/obtain.php and send the form to CMU for approval (jlm4@andrew.cmu.edu).

Once you have the license, email Andrew Ramsay [andrew.ramsay@glasgow.ac.uk](mailto:andrew.ramsay@glasgow.ac.uk) to get access to a download link for the preprocessed iKAT passage collection (the 16 chunks).

allenai / ir_datasets

TREC iKAT 2023/2024 #260