castorini / pyserini

Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations.
http://pyserini.io/
Apache License 2.0
1.68k stars 374 forks

docs.json #830

Closed: Qiaoyf96 closed this issue 3 years ago

Qiaoyf96 commented 3 years ago

Hi,

I am trying to reproduce the MS MARCO document ranking results with TCT-ColBERT-V2 (zero-shot). I saw the following step:

Step 0: prepare docs.json by splitting docs into passage segments. Each line contains a JSON dict as follows: {"id": "[doc_id]#[seg_id]", "contents": "[url]\n[title]\n[seg_text]"}
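For illustration, one line in the format quoted above could be built roughly as follows. This is only a minimal sketch: the helper name `make_record` and the sample values are hypothetical and do not come from the Pyserini or docTTTTTquery code.

```python
import json

def make_record(doc_id, seg_id, url, title, seg_text):
    """Build one docs.json line (hypothetical helper, not part of Pyserini)."""
    return json.dumps({
        "id": f"{doc_id}#{seg_id}",
        # url, title, and segment text are joined with newlines, not spaces
        "contents": f"{url}\n{title}\n{seg_text}",
    })

# Made-up sample values, for illustration only
line = make_record("D123", 0, "http://example.com", "Example title", "First segment text.")
print(line)
```

Because `json.dumps` escapes the newline characters, each record still occupies a single physical line in the file.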

My question is, do you guys provide the script to produce this docs.json, or do I need to write it on my own?

Thanks, Yifan

MXueguang commented 3 years ago

Hi @Qiaoyf96, for now please follow the script in our docTTTTTquery repo.

Concretely, follow the instruction below:

In comparison with per-passage expansion, we will use per passage without expansion as the baseline. In this method, we will not append the predicted queries to the passages.

You will probably need to make some minor modifications to the script to get the required format above. Note that in the docTTTTTquery repo we use spaCy version 2.1.6.

Qiaoyf96 commented 3 years ago

Thanks!

xiahaoyun commented 3 years ago

You will probably need to make some minor modifications to the script to get the required format above. Note that in the docTTTTTquery repo we use spaCy version 2.1.6.

It does need some modification, because in this script the url, title, and seg_text are separated by spaces instead of \n.
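A quick way to see whether a generated line uses the newline separators is a check like the one below. It is a rough sketch under the assumption that the format is exactly the one quoted in the first comment; the function name `has_expected_format` and the sample records are hypothetical.

```python
import json

def has_expected_format(line):
    """Return True if a docs.json line has an id of the form doc_id#seg_id
    and contents made of url, title, and segment text separated by newlines."""
    record = json.loads(line)
    doc_id, sep, seg_id = record["id"].rpartition("#")
    if not (sep and doc_id and seg_id):
        return False
    # maxsplit=2 so that any newlines inside the segment text stay in the last part
    parts = record["contents"].split("\n", 2)
    return len(parts) == 3

# Made-up records: one joined with spaces, one joined with newlines
space_joined = json.dumps({"id": "D1#0", "contents": "http://example.com Title Some text."})
newline_joined = json.dumps({"id": "D1#0", "contents": "http://example.com\nTitle\nSome text."})

print(has_expected_format(space_joined))    # False
print(has_expected_format(newline_joined))  # True
```

The check only looks at the shape of each record; it does not verify that the segmentation itself matches what the docTTTTTquery script produces.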