Smu-Tan opened this issue 2 years ago

Hi, I'm just wondering: is there any method that can speed up the retrieval process, for example vectorized or batch inference (i.e., doing retrieval for a batch/list of queries at the same time)?

I'm trying to use BM25 to retrieve the top-n docs for large data (over 10k queries against 50k docs), and if I do this by calling bm25.get_top_n() in a for loop, the inference time is unacceptably long.
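(For context, the slow baseline described above looks roughly like this; a sketch with toy data, not the poster's actual code:)

```python
from rank_bm25 import BM25Okapi

corpus = ["it is quite windy in london", "the weather is nice today"]  # imagine ~50k docs
queries = ["windy london", "weather today"]                            # imagine ~10k queries

tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# One get_top_n() call per query: each call re-scores the full corpus,
# so total cost grows linearly with the number of queries.
all_top_n = [bm25.get_top_n(q.split(), corpus, n=1) for q in queries]
```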
Have you checked out the get_batch_scores method yet? It sounds like this might be what you're looking for.
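For reference, this is roughly how get_batch_scores is used (a minimal sketch with toy data; it scores one query against a subset of documents identified by index):

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "windy London".split(" ")
# Scores the single query against documents 0 and 1 only
scores = bm25.get_batch_scores(query, [0, 1])
```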
I think get_batch_scores computes the BM25 scores between one query and a subset of the corpus? What I need is to compute the BM25 scores between a list of queries and the whole corpus. And because the query list is very large (10k queries), computing them one by one is very slow.
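One workaround, not part of rank_bm25 itself, is to keep the per-query loop but spread it across processes. A minimal sketch, assuming the index fits in memory in each worker:

```python
from multiprocessing import Pool

from rank_bm25 import BM25Okapi

corpus = [
    "it is quite windy in london",
    "the weather is nice today",
    "bm25 is a bag-of-words ranking function",
]
tokenized_corpus = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

queries = ["windy london", "ranking function"]  # imagine ~10k queries

def top_n(query):
    # Each worker scores one query against the full corpus
    return bm25.get_top_n(query.split(), corpus, n=2)

if __name__ == "__main__":
    with Pool(processes=4) as pool:
        results = pool.map(top_n, queries)
    print(results)
```

On platforms that spawn rather than fork, each worker re-imports the module and rebuilds the index once, so the setup cost is paid per worker rather than per query.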
Is this problem resolved? I am having the same sort of issue: with 50k queries it takes a long time to compute (for me approximately 150k seconds, or almost 42 hours).
@Smu-Tan @puzzlecollector were you able to find an alternative to this implementation to speed up the process?
Check out Pyserini.
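For example, Pyserini's LuceneSearcher exposes a multithreaded batch_search. A rough sketch, assuming a prebuilt index (the index name here is just an example and is downloaded on first use):

```python
from pyserini.search.lucene import LuceneSearcher

# 'msmarco-v1-passage' is one of Pyserini's prebuilt indexes (example choice)
searcher = LuceneSearcher.from_prebuilt_index("msmarco-v1-passage")

queries = ["what is bm25", "how to speed up retrieval"]
qids = ["q1", "q2"]

# Runs all queries in parallel across threads; returns {qid: [hits]}
results = searcher.batch_search(queries, qids, k=10, threads=4)
for hit in results["q1"]:
    print(hit.docid, hit.score)
```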
Hi @Smu-Tan, @puzzlecollector, and @wise-east,
I have just released a new Python-based search engine called retriv.
It only takes ~40ms to query 8M documents on my machine, and it can perform multiple searches in parallel.
If you try it, please let me know whether it works for your use case.
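Based on the project's README, batch querying looks roughly like this (treat the exact API as an assumption and check the docs):

```python
from retriv import SearchEngine

collection = [
    {"id": "doc_1", "text": "it is quite windy in london"},
    {"id": "doc_2", "text": "the weather is nice today"},
]

se = SearchEngine("new-index").index(collection)

# msearch runs a batch of queries in one call
queries = [
    {"id": "q_1", "text": "windy london"},
    {"id": "q_2", "text": "weather today"},
]
results = se.msearch(queries=queries, cutoff=10)
```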