Open mam10eks opened 1 year ago
I have started working on this and have a first local prototype that uses TrecDocs and TsvQueries, so not much code should be needed here.
Awesome! Given LongEval's focus on the temporal, I think it should be encoded at a higher level in the dataset ids, e.g.:
longeval (placeholder)
  /[2023-07|2023-09|...] (placeholder)
    /[en|fr|...] (docs)
      /[train|heldout|eval|...] (docs, queries, qrels)
Though maybe I'm missing something about how the task is structured?
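To make the proposed nesting concrete, here is a minimal sketch that enumerates the ids the scheme above would produce. The segment values are assumptions that only fill in the examples given before each "...", and `dataset_ids` is a hypothetical helper, not part of ir_datasets.

```python
# Assumed segment values: the proposal only lists these as examples
# followed by "...", so the real lists may differ.
SNAPSHOTS = ["2023-07", "2023-09"]
LANGUAGES = ["en", "fr"]
SPLITS = ["train", "heldout", "eval"]


def dataset_ids(base="longeval"):
    """Return (dataset_id, entities) pairs for every level of the hierarchy."""
    ids = [(base, ())]  # top level: placeholder, no entities
    for snapshot in SNAPSHOTS:
        ids.append((f"{base}/{snapshot}", ()))  # placeholder
        for lang in LANGUAGES:
            ids.append((f"{base}/{snapshot}/{lang}", ("docs",)))
            for split in SPLITS:
                ids.append(
                    (f"{base}/{snapshot}/{lang}/{split}",
                     ("docs", "queries", "qrels"))
                )
    return ids


if __name__ == "__main__":
    for dataset_id, entities in dataset_ids():
        print(dataset_id, ", ".join(entities) or "(placeholder)")
```

With the assumed values, each snapshot contributes one placeholder id, one docs-only id per language, and one full id per language and split.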
Yes, that makes perfect sense. Shall I implement this ticket? (I already have a prototype; it is not much code, as LongEval comes in formats already supported by ir_datasets.)
That would be awesome! I love when folks release data in standard formats :-)
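For readers unfamiliar with the formats mentioned above, here is a rough standard-library sketch of reading tab-separated queries. It assumes a two-column layout (query id, then query text). The `Query` type, the `read_tsv_queries` helper, and the sample rows are all made up for illustration, and this is not the ir_datasets implementation.

```python
import csv
import io
from typing import Iterator, NamedTuple


class Query(NamedTuple):
    query_id: str
    text: str


def read_tsv_queries(stream) -> Iterator[Query]:
    """Yield one Query per line of a tab-separated stream (assumed: id<TAB>text)."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        yield Query(row[0], row[1])


# Made-up sample data, only to show the expected shape of the input.
sample = io.StringIO("q1\tcar rental paris\n\nq2\tweather tomorrow\n")
queries = list(read_tsv_queries(sample))
```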
If I may add something, the LongEval collection is subject to a custom license from Qwant (https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License, this is basically an extension of the CC-BY-NC License) that requires an explicit agreement as well as providing contact information. Is it something that is feasible within ir-datasets?
Dear Romain,
Thanks for reaching out. Yes, this is feasible.
The ir-datasets integration would expect the user to download the data manually (I already have a prototype implementation that assumes this). I.e., ir-datasets would not download the dataset itself, but would only show a message asking the user to obtain the data (thereby completing the explicit agreement and providing contact information) and then store it in a predefined directory.
Best regards,
Maik
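The manual-download flow described above could look roughly like the following sketch. The directory layout, the expected file names, and the `check_manual_download` helper are assumptions for illustration only. They are not taken from the prototype or from ir_datasets.

```python
from pathlib import Path


def check_manual_download(base_dir: Path,
                          expected=("collection", "queries.tsv")) -> Path:
    """Return base_dir if all expected entries exist, else raise with instructions.

    The entries in `expected` are hypothetical names, not the real LongEval layout.
    """
    missing = [name for name in expected if not (base_dir / name).exists()]
    if missing:
        raise FileNotFoundError(
            f"LongEval data not found in {base_dir} "
            f"(missing: {', '.join(missing)}). "
            "Please obtain the data manually, accepting the license agreement "
            "and providing contact information, then place it in this directory."
        )
    return base_dir
```

Nothing is downloaded here. The check only tells the user what to do and where to put the files, so the license agreement stays a manual step.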
Dataset Information:
The goal would be to integrate the LongEval data for Task 1 (retrieval).
The information from the official task description:
Links to Resources:
https://clef-longeval.github.io/
Dataset ID(s) & supported entities:
longeval/en/train: docs, queries, qrels
longeval/en/heldout: docs, queries
longeval/en/a-short-july: docs, queries
longeval/en/b-long-september: docs, queries
longeval/fr/train: docs, queries, qrels
longeval/fr/heldout: docs, queries
longeval/fr/a-short-july: docs, queries
longeval/fr/b-long-september: docs, queries
Checklist
Mark each task once completed. All should be checked prior to merging a new dataset.
ir_datasets/datasets/[topid].py
tests/integration/[topid].py
ir_datasets generate_metadata command (should appear in ir_datasets/etc/metadata.json)
ir_datasets/etc/[topid].yaml
ir_datasets/etc/downloads.json
.github/workflows/verify_downloads.yml (only one needed per topid)
downloads.json
Additional comments/concerns/ideas/etc.