
BERTserini
https://github.com/castorini/bertserini
Apache License 2.0

Take advantage of pyserini's new prebuilt index features #8

Closed by lintool 2 years ago

lintool commented 3 years ago

We can now do this in Pyserini:

>>> from pyserini.search import SimpleSearcher
>>> searcher = SimpleSearcher.from_prebuilt_index('trec45')
Downloading index at https://git.uwaterloo.ca/jimmylin/anserini-indexes/raw/master/index-robust04-20191213.tar.gz...
index-robust04-20191213.tar.gz: 1.70GB [00:50, 36.4MB/s]
Extracting /Users/jimmylin/.cache/pyserini/indexes/index-robust04-20191213.tar.gz into /Users/jimmylin/.cache/pyserini/indexes/index-robust04-2019121315f3d001489c97849a010b0a4734d018...
>>> searcher
<pyserini.search._searcher.SimpleSearcher object at 0x7fee58547ac8>
>>> hits = searcher.search('hubble space telescope')
>>> 
>>> # Print the first 10 hits:
... for i in range(0, 10):
...     print(f'{i+1:2} {hits[i].docid:15} {hits[i].score:.5f}')
... 
 1 LA071090-0047   16.85690
 2 FT934-5418      16.75630
 3 FT921-7107      16.68290
 4 LA052890-0021   16.37390
 5 LA070990-0052   16.36460
 6 LA062990-0180   16.19260
 7 LA070890-0154   16.15610
 8 FT934-2516      16.08950
 9 LA041090-0148   16.08810
10 FT944-128       16.01920
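
The loop above indexes `hits[0]` through `hits[9]`, so it would raise an error for a query that returns fewer than 10 hits. A minimal sketch of a safer formatter follows. The `Hit` class and the `format_hits` helper are hypothetical names used only for illustration; they are not part of Pyserini, and `Hit` only mimics the `docid` and `score` attributes used in the session above.

```python
from dataclasses import dataclass
from itertools import islice


@dataclass
class Hit:
    """Hypothetical stand-in for a search hit; only docid and score are modeled."""
    docid: str
    score: float


def format_hits(hits, k=10):
    """Format at most k hits as 'rank docid score' lines.

    islice stops early when there are fewer than k hits, so no IndexError
    is raised on short result lists.
    """
    return [
        f'{rank:2} {hit.docid:15} {hit.score:.5f}'
        for rank, hit in enumerate(islice(hits, k), start=1)
    ]


if __name__ == '__main__':
    sample = [Hit('LA071090-0047', 16.8569), Hit('FT934-5418', 16.7563)]
    for line in format_hits(sample):
        print(line)
```

With a real searcher, the same helper could presumably be called as `format_hits(searcher.search('hubble space telescope'))`, assuming the returned hits are iterable.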

Instead of downloading the indexes by hand, should we take advantage of this feature?

cc/ @MXueguang @qguo96

MXueguang commented 3 years ago

sure!

qguo96 commented 3 years ago

I did something similar in BERTserini, but it may be better to let Pyserini handle this now.
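
One way the hand-off to Pyserini could look is sketched below; this is an assumption for illustration, not the code from BERTserini. The helper names `resolve_index` and `build_searcher` are hypothetical. The sketch treats an existing local directory as a manually downloaded index and anything else as a prebuilt index name such as `trec45`, relying only on the `SimpleSearcher` constructor and the `from_prebuilt_index` call shown earlier in this thread.

```python
import os


def resolve_index(spec):
    """Classify an index spec (hypothetical helper).

    Returns ('local', spec) if spec is an existing directory, otherwise
    ('prebuilt', spec), on the assumption that it names a prebuilt index.
    """
    if os.path.isdir(spec):
        return ('local', spec)
    return ('prebuilt', spec)


def build_searcher(spec):
    """Build a searcher from a local index path or a prebuilt index name.

    Pyserini is imported lazily so that resolve_index stays usable
    without Pyserini (and its Java dependency) installed.
    """
    from pyserini.search import SimpleSearcher

    kind, value = resolve_index(spec)
    if kind == 'local':
        # A manually downloaded index directory.
        return SimpleSearcher(value)
    # Let Pyserini download and cache the index on first use.
    return SimpleSearcher.from_prebuilt_index(value)
```

Checking for a local directory first keeps existing setups with hand-downloaded indexes working, while new users only need to pass an index name.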

qguo96 commented 3 years ago

Here is a PR (https://github.com/rsvp-ai/bertserini/pull/10) that addresses this.