allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

AOL query log #109

Closed (seanmacavaney closed 2 years ago)

seanmacavaney commented 3 years ago

Dataset Information:

A lightning rod that may not be worth touching.

OTOH, it's still sometimes used by researchers.

Links to Resources:

Dataset ID(s):

Supported Entities

Additional comments/concerns/ideas/etc.

How should documents be handled? Some folks use only the document titles (?!?) and filter out those that do not show up in the top BM25 results. The common approach seems to be to fetch all clicked documents and use them as the corpus, but that clearly introduces a bunch of biases. Another dataset (ClueWeb? C4?) could serve as the source of the documents, though I have not seen anybody do it this way before.

Of course, this dataset could always just consist of queries and qrels, leaving it to the user to decide how to construct the documents.
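For illustration, here is a rough sketch of what the queries-and-qrels-only option might look like. It is not how ir_datasets implements anything. It assumes the log is tab-separated with the columns `AnonID`, `Query`, `QueryTime`, `ItemRank`, `ClickURL`, and the names `Query`, `Qrel`, `build_queries_and_qrels`, and the hash-based IDs are all hypothetical. Each click is treated as a positive judgment, and documents are identified only by a hash of the clicked URL, so no document text is fetched.

```python
import csv
import hashlib
import io
from typing import NamedTuple


class Query(NamedTuple):  # hypothetical record type for this sketch
    query_id: str
    text: str


class Qrel(NamedTuple):  # hypothetical record type for this sketch
    query_id: str
    doc_id: str
    relevance: int


def _short_hash(value: str) -> str:
    # Assumption: IDs are derived from a hash of the text/URL.
    return hashlib.sha256(value.encode("utf-8")).hexdigest()[:12]


def build_queries_and_qrels(log_file):
    """Turn click-log rows into deduplicated queries and click-based qrels."""
    # QUOTE_NONE: query text may contain stray quote characters.
    reader = csv.DictReader(log_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    queries = {}
    qrels = set()
    for row in reader:
        text = (row.get("Query") or "").strip()
        if not text:
            continue
        query_id = _short_hash(text)
        queries[query_id] = Query(query_id, text)
        url = (row.get("ClickURL") or "").strip()
        if url:  # rows without a click yield a query but no qrel
            qrels.add(Qrel(query_id, _short_hash(url), 1))
    return list(queries.values()), sorted(qrels)


# Made-up sample rows in the assumed format (not real log data).
SAMPLE = (
    "AnonID\tQuery\tQueryTime\tItemRank\tClickURL\n"
    "1\tir datasets\t2006-03-01 10:00:00\t1\thttp://example.com/a\n"
    "2\tir datasets\t2006-03-02 11:00:00\t1\thttp://example.com/a\n"
    "2\tbm25\t2006-03-02 11:05:00\t\t\n"
    "3\tbm25\t2006-03-03 09:00:00\t2\thttp://example.com/b\n"
)

if __name__ == "__main__":
    sample_queries, sample_qrels = build_queries_and_qrels(io.StringIO(SAMPLE))
    for q in sample_queries:
        print(q)
    for r in sample_qrels:
        print(r)
```

The set of `doc_id` values in the qrels doubles as the list of clicked documents, so a user who wants the "fetch all clicked documents" corpus could start from it, with the biases mentioned above.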

seanmacavaney commented 2 years ago

Added in #126