dorianbrown / rank_bm25

A Collection of BM25 Algorithms in Python
Apache License 2.0

Support for vectorized/batch inference? #18

Open · Smu-Tan opened this issue 2 years ago

Smu-Tan commented 2 years ago

Hi, I'm just wondering: is there any method that can speed up the retrieval process, e.g. vectorized or batch inference (i.e., performing retrieval for a batch/list of queries at the same time)?

I'm trying to use BM25 to retrieve the top n docs for large data (over 10k queries against 50k docs), and if I do this by calling bm25.get_top_n() in a for loop, the inference time is unacceptably long.
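For reference, a minimal sketch of the per-query loop described above, using rank_bm25's documented BM25Okapi and get_top_n (the toy corpus, query list, and whitespace tokenization are stand-ins for the real data):

```python
from rank_bm25 import BM25Okapi

# Toy stand-ins for the real data (~50k docs, ~10k queries).
corpus = ["hello there good man", "it is quite windy in london", "how is the weather today"]
queries = ["windy london", "weather today"]

bm25 = BM25Okapi([doc.split() for doc in corpus])

# One get_top_n call per query: each call scores the entire corpus,
# so total cost grows as (number of queries) x (corpus size).
results = [bm25.get_top_n(q.split(), corpus, n=10) for q in queries]
```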

dorianbrown commented 2 years ago

Have you checked out the get_batch_scores method yet? It sounds like this might be what you're looking for.
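For context, get_batch_scores takes one tokenized query plus a list of corpus indices and returns scores for just that subset. A minimal sketch with toy data:

```python
from rank_bm25 import BM25Okapi

corpus = ["hello there good man", "it is quite windy in london", "how is the weather today"]
bm25 = BM25Okapi([doc.split() for doc in corpus])

# Scores a single query against the documents at the given corpus indices.
scores = bm25.get_batch_scores("windy london".split(), [0, 1])
```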

Smu-Tan commented 2 years ago

> Have you checked out the get_batch_scores method yet? It sounds like this might be what you're looking for.

I think get_batch_scores computes the BM25 scores between one query and a subset of the corpus? What I need is to compute the BM25 scores between a list of queries and the whole corpus. Because the query list is very large (10k queries), computing them one by one is very slow.
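One possible workaround, not part of rank_bm25 itself: fan the queries out across worker processes with the standard-library multiprocessing module. A minimal sketch with toy data, relying on rank_bm25's get_scores (which scores one query against the whole corpus):

```python
import numpy as np
from multiprocessing import Pool
from rank_bm25 import BM25Okapi

# Toy stand-ins; in practice `corpus` is ~50k docs and `queries` ~10k strings.
corpus = ["hello there good man", "it is quite windy in london", "how is the weather today"]
queries = ["windy london", "weather today"]

bm25 = BM25Okapi([doc.split() for doc in corpus])

def top_n_indices(query, n=10):
    # get_scores returns one BM25 score per corpus document.
    scores = bm25.get_scores(query.split())
    return np.argsort(scores)[::-1][:n]

if __name__ == "__main__":
    # Score queries in parallel. With a fork-based start method (Linux default)
    # the workers inherit the bm25 index; with spawn, each worker rebuilds it
    # at import time, which adds startup cost.
    with Pool() as pool:
        results = pool.map(top_n_indices, queries)
```

This only parallelizes the outer loop; each query still scores the full corpus, so it trades CPU cores for wall-clock time rather than reducing total work.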

puzzlecollector commented 2 years ago

Is this problem resolved? I am having the same sort of issue. I have 50k queries, and it takes a long time to compute: for me approximately 150k seconds, or almost 42 hours (about 3 seconds per query).

wise-east commented 2 years ago

@Smu-Tan @puzzlecollector were you able to find an alternative to this implementation to speed up the process?

Smu-Tan commented 2 years ago

> @Smu-Tan @puzzlecollector were you able to find an alternative to this implementation to speed up the process?

Check out Pyserini.
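For readers landing here, a minimal Pyserini sketch based on its documented LuceneSearcher API. The prebuilt index name is illustrative; indexing your own corpus is a separate step done with Pyserini's indexing CLI, and the module path may differ across versions:

```python
from pyserini.search.lucene import LuceneSearcher

# Illustrative only: load one of Pyserini's prebuilt Lucene indexes.
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

queries = ["what is bm25", "how does lucene rank documents"]
qids = ["q1", "q2"]

# batch_search runs many queries over multiple threads in a single call,
# which is the batch-inference behavior asked about in this issue.
results = searcher.batch_search(queries, qids, k=10, threads=8)
for qid, hits in results.items():
    print(qid, [(hit.docid, hit.score) for hit in hits])
```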

AmenRa commented 2 years ago

Hi @Smu-Tan, @puzzlecollector, and @wise-east,

I have just released a new Python-based search engine called retriv. It takes only ~40 ms to query 8M documents on my machine, and it can perform multiple searches in parallel. If you try it, please let me know whether it works for your use case.
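For anyone evaluating it, a sketch based on the retriv README at the time of writing (method names and signatures may have changed since):

```python
from retriv import SearchEngine

# Toy collection; each document is a dict with "id" and "text" fields.
collection = [
    {"id": "doc_1", "text": "it is quite windy in london"},
    {"id": "doc_2", "text": "how is the weather today"},
]

se = SearchEngine("toy-index").index(collection)

# Single query.
se.search("windy london")

# Batch of queries in one call: this is the parallel multi-search
# mentioned above.
queries = [
    {"id": "q_1", "text": "windy london"},
    {"id": "q_2", "text": "weather today"},
]
se.msearch(queries=queries, cutoff=10)
```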