castorini / anserini

Anserini is a Lucene toolkit for reproducible information retrieval research
http://anserini.io/
Apache License 2.0

Retrieving by BM25 becomes slower when there are many queries #1703

Closed namespace-Pt closed 2 years ago

namespace-Pt commented 2 years ago

I ran the BM25 baseline for MSMARCO passage ranking and it worked. With hits=1000, the retrieval speed is about 0.001 s/query according to the terminal output.

But when I tried to retrieve for more queries (all 55k training queries) with the exact same index, retrieval got slower and slower until the whole program got stuck at 49.13% of the queries. Why would this happen? It doesn't seem reasonable for retrieval speed to drop just because there are more queries.

lintool commented 2 years ago

You're probably running out of memory. Since the SearchCollection implementation is multi-threaded, it keeps the hits in memory until all the queries are processed, and then writes them out to disk all at once. This simplifies thread synchronization.

Try running on smaller batches of queries.
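
One way to do the batching, as a rough sketch: split the queries file into smaller files before searching. This assumes the queries file has one query per line (e.g. a TSV of query id and text); the function name `split_queries` and the output file names are made up for illustration and are not part of Anserini.

```python
from itertools import islice
from pathlib import Path


def split_queries(queries_path, out_dir, batch_size):
    """Split a one-query-per-line file into files of at most batch_size lines.

    Hypothetical helper, not an Anserini API. Returns the batch file paths
    in the order they were written.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    batch_paths = []
    with open(queries_path, encoding="utf-8") as src:
        while True:
            # Read the next batch_size lines; an empty list means end of file.
            batch = list(islice(src, batch_size))
            if not batch:
                break
            # Illustrative naming scheme: queries.batch0000.tsv, queries.batch0001.tsv, ...
            batch_path = out_dir / f"queries.batch{len(batch_paths):04d}.tsv"
            with open(batch_path, "w", encoding="utf-8") as dst:
                dst.writelines(batch)
            batch_paths.append(batch_path)
    return batch_paths
```

Each batch file can then be searched in its own run, so only that batch's hits are held in memory at once, and the per-batch run files concatenated afterwards.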