allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

LongEval Retrieval (used at CLEF 2023) #234

Open mam10eks opened 1 year ago

mam10eks commented 1 year ago

Dataset Information:

The goal is to integrate the LongEval data for Task 1 (retrieval).

The information from the official task description:

The goal of Task 1 is to propose an information retrieval system which can handle changes over the time. The proposed retrieval system should follow the temporal timewise evolution of Web documents. The Longeval Websearch collection relies on a large set of data (corpus of pages, queries, user interaction) provided by a commercial search engine (Qwant). It is designed to reflect the changes of the Web across time, by providing evolving document and query sets. The queries in the collection were collected from Qwant's users over several months and can thus be expected to reflect the changes in the search preferences of the users. The documents in the collection were then selected to be able to well evaluate retrieval on these queries at the time they were collected, and thus also change over a time.

Links to Resources:

https://clef-longeval.github.io/

Dataset ID(s) & supported entities:

Checklist

Mark each task once completed. All should be checked prior to merging a new dataset.

Additional comments/concerns/ideas/etc.

mam10eks commented 1 year ago

I have started to work on this and have a first local prototype that uses TrecDocs and TsvQueries, so not much code should be needed here.
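To illustrate why little code is needed, here is a minimal stdlib-only sketch of what reading the two formats involves. It does not use the TrecDocs or TsvQueries classes from ir_datasets and is not the prototype mentioned above. The names `Doc`, `Query`, `iter_trec_docs`, and `iter_tsv_queries` are hypothetical, the sample records are made up, and the assumption that each document carries `<DOCNO>` and `<TEXT>` fields is mine rather than something confirmed for LongEval.

```python
import csv
import io
import re
from typing import Iterator, NamedTuple


class Doc(NamedTuple):
    doc_id: str
    text: str


class Query(NamedTuple):
    query_id: str
    text: str


# One <DOC>...</DOC> record; DOTALL lets "." span newlines.
_DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
_DOCNO_RE = re.compile(r"<DOCNO>(.*?)</DOCNO>", re.DOTALL)
_TEXT_RE = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)


def iter_trec_docs(raw: str) -> Iterator[Doc]:
    """Yield (doc_id, text) pairs from TREC-style SGML markup."""
    for match in _DOC_RE.finditer(raw):
        body = match.group(1)
        docno = _DOCNO_RE.search(body)
        if docno is None:
            continue  # skip records without an identifier
        text = _TEXT_RE.search(body)
        yield Doc(docno.group(1).strip(), text.group(1).strip() if text else "")


def iter_tsv_queries(raw: str) -> Iterator[Query]:
    """Yield (query_id, text) pairs from tab-separated lines."""
    reader = csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2:
            yield Query(row[0], row[1])


# Made-up sample data, for illustration only (not taken from LongEval).
SAMPLE_DOCS = """<DOC>
<DOCNO>doc1</DOCNO>
<TEXT>
first example document
</TEXT>
</DOC>
<DOC>
<DOCNO>doc2</DOCNO>
<TEXT>
second example document
</TEXT>
</DOC>
"""
SAMPLE_QUERIES = "q1\texample query\nq2\tanother query\n"

docs = list(iter_trec_docs(SAMPLE_DOCS))
queries = list(iter_tsv_queries(SAMPLE_QUERIES))
```

In an actual integration, the existing format classes would presumably replace both helper functions, which is why the dataset definition itself can stay short.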

seanmacavaney commented 1 year ago

Awesome! Given LongEval's focus on the temporal, I think it should be encoded at a higher level in the dataset ids, e.g.:

Though maybe I'm missing something about how the task is structured?

mam10eks commented 1 year ago

Yes, that makes perfect sense. Should I implement this ticket? (I already have a prototype; it is not much code, as LongEval comes in formats already supported in ir_datasets.)

seanmacavaney commented 1 year ago

That would be awesome! I love when folks release data in standard formats :-)

romaindeveaud commented 1 year ago

If I may add something: the LongEval collection is subject to a custom license from Qwant (https://lindat.mff.cuni.cz/repository/xmlui/page/Qwant_LongEval_BY-NC-SA_License ; this is basically an extension of the CC-BY-NC License) that requires an explicit agreement as well as contact information. Is that feasible within ir-datasets?

mam10eks commented 1 year ago

Dear Romain,

Thanks for reaching out. Yes, this is feasible.

The ir-datasets integration would expect the user to download the data manually (my prototype implementation already assumes this). That is, ir-datasets would not download the dataset itself; it would only show a message asking the user to obtain the data (filling out the explicit agreement and contact information in the process) and then store it in a predefined directory.
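As a rough illustration of that behaviour, the following stdlib-only sketch checks a predefined directory and, if the files are missing, raises an error telling the user how to obtain them. It is not the ir-datasets implementation: `ManualDownloadRequired`, `require_manual_download`, and the file name `queries.tsv` used in the usage lines are hypothetical, and the directory is passed in by the caller instead of being resolved the way ir-datasets would.

```python
import tempfile
from pathlib import Path
from typing import Sequence


class ManualDownloadRequired(RuntimeError):
    """Raised when data the user must fetch by hand is not in place."""


def require_manual_download(
    data_dir: Path, expected_files: Sequence[str], instructions_url: str
) -> Path:
    """Return data_dir if all expected files exist, else explain what to do."""
    missing = [name for name in expected_files if not (data_dir / name).is_file()]
    if missing:
        raise ManualDownloadRequired(
            f"Missing {', '.join(missing)} in {data_dir}. "
            "This data is not downloaded automatically: obtain it via "
            f"{instructions_url} (accepting the license and providing contact "
            f"information), then place the files in {data_dir}."
        )
    return data_dir


# Usage with a temporary directory standing in for the predefined location.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp)
    try:
        require_manual_download(target, ["queries.tsv"], "https://clef-longeval.github.io/")
    except ManualDownloadRequired as exc:
        message = str(exc)  # the text that would be shown to the user
    (target / "queries.tsv").write_text("q1\texample query\n")
    resolved = require_manual_download(
        target, ["queries.tsv"], "https://clef-longeval.github.io/"
    )
    found = resolved == target
```

Keeping the check separate from any download logic means the license agreement always happens outside the library, on the data provider's side.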

Best regards,

Maik