Open seanmacavaney opened 3 years ago
I have a copy of the corpus. I think there are interesting questions here about how to incorporate the fact that it's (essentially) a fixed-length passage retrieval task. I.e., should the documents be individual passages or entire episodes?
Following the lead from msmarco-passage, the individual passages could be the docs. But the dataset itself isn't split up that way -- it's chunks of several sentences that do not necessarily line up with the 2-minute (overlapping) chunks.
So keeping entire episodes as documents may seem more natural. But there's a problem there too: then the qrels do not line up with the doc_ids (since the qrels include the timestamp).
I think what I'll do is have both versions, something like this:
- spotify-podcasts (docs) -- full episodes, keeping everything from the original source
- spotify-podcasts/chunked (docs) -- 2-minute chunks, starting on each minute. These will be heavily processed, with fields being doc_id, text, episode_id, and start_timestamp (though doc_id itself is just a concatenation of episode_id and start_timestamp)
- spotify-podcasts/chunked/trec-podcasts-{2020,2021} (docs, queries, qrels)

This setup has the following nice qualities:
- (spotify-podcasts)
- (spotify-podcasts/chunked)
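To make the chunked variant more concrete, here is a rough sketch of how 2-minute chunks starting on each minute could be cut from a word-level transcript. Everything in it is an assumption for illustration: the `Word` and `ChunkDoc` types, the `chunk_episode` and `make_doc_id` helpers, the underscore separator in doc_id, and timestamps expressed in whole seconds. None of this reflects the corpus's actual file format or a settled design.

```python
from typing import Iterable, Iterator, NamedTuple


class Word(NamedTuple):
    """Hypothetical transcript token with its start time in seconds."""
    text: str
    start: float


class ChunkDoc(NamedTuple):
    """Fields proposed above for spotify-podcasts/chunked."""
    doc_id: str
    text: str
    episode_id: str
    start_timestamp: int  # seconds from the start of the episode (assumed unit)


def make_doc_id(episode_id: str, start_timestamp: int) -> str:
    # Assumption: doc_id is episode_id and start_timestamp joined by "_".
    return f"{episode_id}_{start_timestamp}"


def chunk_episode(
    episode_id: str,
    words: Iterable[Word],
    window: int = 120,  # 2-minute chunks
    stride: int = 60,   # a new chunk starts on each minute
) -> Iterator[ChunkDoc]:
    """Yield overlapping fixed-length chunks for one episode."""
    ordered = sorted(words, key=lambda w: w.start)
    if not ordered:
        return
    last_start = int(ordered[-1].start)
    for start in range(0, last_start + 1, stride):
        # Keep every word whose start time falls inside [start, start + window).
        text = " ".join(w.text for w in ordered if start <= w.start < start + window)
        if text:
            yield ChunkDoc(make_doc_id(episode_id, start), text, episode_id, start)
```

Because each chunk overlaps its neighbour by a minute, most words end up in two docs. Keeping episode_id and start_timestamp as separate fields means a chunk can still be traced back to the full episode without parsing doc_id.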
Dataset Information:
TREC Task in 2020-21. This is a placeholder as I learn more about this task.
Links to Resources:
Dataset ID(s):
<propose dataset ID(s), and where they fit in the hierarchy>
Supported Entities
Additional comments/concerns/ideas/etc.