Open StalVars opened 2 years ago
@StalVars whats the latency you got for ~400K documents? Also whats the memory usage?
@StalVars @Witiko I found the library to be super slow during retrieval from ~350K samples. Please help ๐ ๐ ๐
@dorianbrown the library is slow to retrieval from ~350K samples. Can you please guide what to do here?
You may want to use a library such as Gensim, which builds a dictionary mapping from words to ids and then indexes the documents using a sparse matrix. This makes indexing slower, but retrieval is much faster than rank-bm25, because it can use fast matrix operations: https://github.com/RaRe-Technologies/gensim/pull/3304
Alternatively, use industry-strength packages such as pyserini or elasticsearch. Rank-bm25 is not built for speed.
thanks @Witiko. What's the downside of using pyserini?
Also what's the use case of Rank-bm25? Ease of use?
Pyserini is a python binding for the anserini java library. Therefore, you need to have Java installed, which makes pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.
Pyserini is a python binding for the anserini java library. Therefore, you need to have Java installed, which makes pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.
Rank-bm25 is a simple solution for use cases, where speed is not a concern.
@Witiko thanks a lot for the feedback, really appreciate it ๐
So there is no python solution that is fast?
Sounds like need to try with Gensim. But if the loading of documents are slow with Gensim, that's not a good fit for me either.
I need to <~250 ms response time during retrieval.
Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:
pip install git+https://github.com/witiko/gensim.git@feature/bm25
See https://github.com/RaRe-Technologies/gensim/pull/3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.
great. Would I get same retrieval accuracy with Gensim in comparison to rank_bm25
?
@Witiko what would be the performance of Gensim during loading the 500k documents? Would that be competitive with rank_bm25
?
The algorithm is exactly the same aa rank-bm25, so accuracy should also be the same. The loading may be slightly slower than in rank-bm25, because we need to build a dictionary, but the retrieval should be significantly faster. Try it out and let me know.
@Witiko
Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:
pip install git+https://github.com/witiko/gensim.git@feature/bm25
See RaRe-Technologies/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.
I ran the command pip install git+https://github.com/witiko/gensim.git@feature/bm25
.
But it does not install and fails with the following error message:
building 'gensim.models.nmf_pgd' extension
gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/models/nmf_pgd.c -o build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o
gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -shared -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o -o build/lib.linux-x86_64-cpython-39/gensim/models/nmf_pgd.cpython-39-x86_64-linux-gnu.so
building 'gensim.similarities.fastss' extension
creating build/temp.linux-x86_64-cpython-39/gensim/similarities
gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/similarities/fastss.c -o build/temp.linux-x86_64-cpython-39/gensim/similarities/fastss.o
gensim/similarities/fastss.c: In function โceditdistโ:
gensim/similarities/fastss.c:725:9: error: โforโ loop initial declarations are only allowed in C99 mode
for (WIDTH tmpi = 0; tmpi <= len_s1; tmpi++) row2[tmpi] = tmpi;
^
gensim/similarities/fastss.c:725:9: note: use option -std=c99 or -std=gnu99 to compile your code
gensim/similarities/fastss.c:727:9: error: โforโ loop initial declarations are only allowed in C99 mode
for (WIDTH i2 = 0; i2 < len_s2; i2++) {
^
gensim/similarities/fastss.c:738:13: error: โforโ loop initial declarations are only allowed in C99 mode
for (WIDTH i1 = 0; i1 < len_s1; i1++) {
^
error: command '/usr/bin/gcc' failed with exit code 1
[end of output]
note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure
ร Encountered error while trying to install package.
โฐโ> gensim
I am not sure what the issue is. It works fine in the python:3.7 Docker image.
@Witiko when I install like pip install --upgrade gensim
, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.
Thanks for the feedback. I will try with python:3.7
@Witiko when I install like
pip install --upgrade gensim
, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.
It's not an issue with the branch. The release version of gensim has a precompiled wheel, which circumvents your compiler.
A little tangential, but I found another interesting speed issue. I made a refactored/simplified version of BM25Okapi from rank_bm25 to https://github.com/jankovicsandras/plpgsql_bm25/blob/main/mybm25okapi.py
This is without numpy
(!), using just math
and precalculates stuff in __init__
to simplify the query function. Still it's 2-4x faster than rank_bm25 in this quick-and-dirty test, so it might be possible to speed up rank_bm25 a lot.
(E.g. it's possible to compute half of the lines 119-120 in rank_bm25.py beforehand.
score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
(q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
)
Sounds interesting, I haven't had a close look at the changes yet, but basically you're just precalculating all the static terms from the scoring function?
And I guess since the initial stuff was done with the math
functions, vectorizing it with numpy
would result in larger speed gains? Sounds feasible and promising, I'll try and double check the change when I've got some time, and run some tests. But feel free to create a PR if you're interested!
Thanks for your answer! ๐ I opened a PR: https://github.com/dorianbrown/rank_bm25/pull/46
I made an optimized rewrite with all 3 algorithms: https://github.com/jankovicsandras/bm25opt
Comparative testing shows it runs approx 30-40 x faster than rank_bm25 while producing exactly same scores.
I found the retrieval very slow for ~ 20 million documents (wikipedia). Is it the case?