TREC Podcasts - GitHub Issues

I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should the documents be individual passages or entire episodes?

Following the lead from msmarco-passage, the individual passages could be the docs. But the dataset itself isn't split up that way: it comes in chunks of several sentences that do not necessarily line up with the 2-minute (overlapping) chunks.

So keeping entire episodes as documents may seem more natural. But there's a problem there too: the qrels would not line up with the doc_ids, since the qrels include the timestamp.

I think what I'll do is have both versions, something like this:

spotify-podcasts (docs) -- full episodes, keeping everything from the original source
spotify-podcasts/chunked (docs) -- 2-minute chunks, starting on each minute. These will be heavily processed, with the fields doc_id, text, episode_id, and start_timestamp (though doc_id itself is just a concatenation of episode_id and start_timestamp)
spotify-podcasts/chunked/trec-podcasts-{2020,2021} (docs, queries, qrels)
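A rough sketch of how the chunked docs could be built, to make the proposal concrete. Nothing here is the actual ir_datasets implementation: the function name chunk_episode, the input format (a list of (start_seconds, word) pairs per episode), the "_" separator, and the one-decimal timestamp formatting in doc_id are all assumptions for illustration.

```python
from typing import Dict, Iterator, List, Tuple

CHUNK_SECONDS = 120  # each chunk covers 2 minutes
STRIDE_SECONDS = 60  # a new chunk starts on each minute, so chunks overlap


def chunk_episode(
    episode_id: str, words: List[Tuple[float, str]]
) -> Iterator[Dict[str, object]]:
    """Yield overlapping 2-minute chunks from (start_seconds, word) pairs.

    Hypothetical helper: the input format and the doc_id layout are
    assumptions, not taken from the corpus or from ir_datasets.
    """
    if not words:
        return
    last_start = max(t for t, _ in words)
    start = 0
    while start <= last_start:
        end = start + CHUNK_SECONDS
        # Keep every word whose start time falls inside [start, end).
        text = " ".join(w for t, w in words if start <= t < end)
        if text:
            yield {
                # Assumed layout: episode_id and start_timestamp joined by "_".
                "doc_id": f"{episode_id}_{start:.1f}",
                "text": text,
                "episode_id": episode_id,
                "start_timestamp": float(start),
            }
        start += STRIDE_SECONDS


if __name__ == "__main__":
    demo = [(5.0, "hello"), (70.0, "world"), (130.0, "again")]
    for doc in chunk_episode("ep1", demo):
        print(doc["doc_id"], "|", doc["text"])
```

Because doc_id is derived only from episode_id and start_timestamp, a qrel that names an episode and a chunk start time can be matched to a chunk by rebuilding the same string, provided both sides use the same formatting.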

This setup has the following nice qualities:

All source information is available (via spotify-podcasts)
Qrels have doc_ids that line up with the corpus (via spotify-podcasts/chunked)
Should be easy to use in the chunked setting, with a single simple text field

allenai / ir_datasets

TREC Podcasts #44