allenai / ir_datasets

Provides a common interface to many IR ranking datasets.
https://ir-datasets.com/
Apache License 2.0

TREC Podcasts #44

Open seanmacavaney opened 3 years ago

seanmacavaney commented 3 years ago

Dataset Information:

TREC Task in 2020-21. This is a placeholder as I learn more about this task.

Links to Resources:

Dataset ID(s):

<propose dataset ID(s), and where they fit in the hierarchy>

Supported Entities

Additional comments/concerns/ideas/etc.

seanmacavaney commented 3 years ago

I have a copy of the corpus. The interesting question is how to handle the fact that it's (essentially) a fixed-length passage retrieval task: should the documents be individual passages or entire episodes?

Following the lead from msmarco-passage, the individual passages could be the docs. But the dataset itself isn't split up that way: it comes as chunks of several sentences that do not necessarily line up with the 2-minute (overlapping) chunks.
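To make the misalignment concrete, here is a rough sketch of how a sentence chunk with start/end times might map onto fixed 2-minute windows. The 60-second stride and the `windows_for_chunk` helper are assumptions for illustration only; neither the exact overlap nor the corpus field names are stated in this thread.

```python
from typing import List

WINDOW = 120.0  # window length in seconds (the 2-minute chunks)
STRIDE = 60.0   # ASSUMED step between window starts; not confirmed here


def windows_for_chunk(start: float, end: float) -> List[float]:
    """Return the start offsets of every window overlapping [start, end).

    Hypothetical helper: a window [w, w + WINDOW) overlaps the chunk
    when w < end and w + WINDOW > start.
    """
    offsets = []
    w = 0.0
    while w < end:
        if w + WINDOW > start:
            offsets.append(w)
        w += STRIDE
    return offsets


if __name__ == "__main__":
    # A chunk of a few sentences running from 0:50 to 2:10 straddles
    # several overlapping windows, so there is no 1:1 chunk-to-passage map.
    print(windows_for_chunk(50.0, 130.0))
```

Under these assumptions a single chunk can belong to several windows, which is why building passage docs would mean re-segmenting the text rather than reusing the chunks as they are.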

So keeping entire episodes as documents may seem more natural. But there's a problem there too: then the qrels do not line up with the doc_ids (since the qrels include the timestamp).
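A minimal sketch of the mismatch, assuming the qrels identify a segment as `<episode_id>_<start_seconds>`. That ID layout, the example IDs, and the `split_segment_id` helper are all hypothetical and should be checked against the actual qrels file.

```python
from typing import NamedTuple


class SegmentRef(NamedTuple):
    episode_id: str
    start_seconds: float


def split_segment_id(segment_id: str) -> SegmentRef:
    """Split an ASSUMED '<episode_id>_<start_seconds>' ID into its parts."""
    # rpartition splits on the last underscore, so episode IDs that
    # themselves contain underscores are kept intact.
    episode_id, _, offset = segment_id.rpartition("_")
    if not episode_id:
        raise ValueError(f"no timestamp suffix in {segment_id!r}")
    return SegmentRef(episode_id, float(offset))


if __name__ == "__main__":
    episode_doc_ids = {"episode_abc"}      # episode-level doc_ids (made up)
    qrel_doc_id = "episode_abc_120.0"      # segment-level qrel ID (made up)

    # The raw qrel ID does not match any episode-level doc_id...
    print(qrel_doc_id in episode_doc_ids)
    # ...but its episode part does, once the timestamp is stripped.
    print(split_segment_id(qrel_doc_id).episode_id in episode_doc_ids)
```

Stripping the timestamp recovers the episode, but it also discards the information about which part of the episode was judged relevant, so episode-level docs alone cannot reproduce the task's evaluation.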

I think what I'll do is have both versions, something like this:

This setup has the following nice qualities: