castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0

passage_id is different from what is expected #1271

Closed (mfayoub closed this issue 2 years ago)

mfayoub commented 2 years ago

Hi everyone,

I've downloaded the prebuilt index named "msmarco-v2-passage" and tried a simple search. The resulting passage ids have a different format than what I was expecting. For example, one passage id that came out of my search is msmarco_passage_04_180136318, whereas I expected a passage id to be just a number (from 1 to 8.8 million). Is that right, or am I doing something wrong?

MXueguang commented 2 years ago

Hi @mfayoub

I think msmarco-v1-passage is the index you are looking for: it has 8.8 million passages, i.e. it is the original version of the MS MARCO passage ranking corpus. msmarco-v2-passage is a new corpus released in 2021; see https://microsoft.github.io/msmarco/TREC-Deep-Learning-2021.html#please-read-data-refresh
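
To tell which corpus a returned id comes from, a small check on the id format may help. This is only a sketch: `classify_passage_id` is a hypothetical helper, not part of Pyserini, and the v2 pattern is an assumption inferred from the single example id quoted in this thread (`msmarco_passage_04_180136318`).

```python
import re

# Assumed patterns, inferred from the ids mentioned in this thread:
#   v1: a plain integer string, e.g. "7187158"
#   v2: "msmarco_passage_<two digits>_<number>", e.g. "msmarco_passage_04_180136318"
V1_PASSAGE_ID = re.compile(r"[0-9]+")
V2_PASSAGE_ID = re.compile(r"msmarco_passage_[0-9]{2}_[0-9]+")


def classify_passage_id(docid: str) -> str:
    """Return "v1", "v2", or "unknown" based on the shape of the id (hypothetical helper)."""
    if V1_PASSAGE_ID.fullmatch(docid):
        return "v1"
    if V2_PASSAGE_ID.fullmatch(docid):
        return "v2"
    return "unknown"


if __name__ == "__main__":
    for docid in ["msmarco_passage_04_180136318", "7187158", "not-an-id"]:
        print(docid, "->", classify_passage_id(docid))
```

If ids from a search classify as "v2" while the qrels or run files use plain integers, the index and the evaluation data most likely belong to different versions of the corpus.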

mfayoub commented 2 years ago

Thanks, MXueguang, for your reply! Yes, I tried msmarco-v1-passage, and it seems to return the expected doc_ids.