Retrieving speed for large set of documents

StalVars commented 2 years ago

I found the retrieval very slow for ~ 20 million documents (wikipedia). Is it the case?

nashid commented 2 years ago

@StalVars whats the latency you got for ~400K documents? Also whats the memory usage?

ramsey-coding commented 2 years ago

@StalVars @Witiko I found the library to be super slow during retrieval from ~350K samples. Please help 🙏 🙏 🙏

ramsey-coding commented 2 years ago

@dorianbrown the library is slow to retrieval from ~350K samples. Can you please guide what to do here?

Witiko commented 2 years ago

You may want to use a library such as Gensim, which builds a dictionary mapping from words to ids and then indexes the documents using a sparse matrix. This makes indexing slower, but retrieval is much faster than rank-bm25, because it can use fast matrix operations: https://github.com/RaRe-Technologies/gensim/pull/3304

Alternatively, use industry-strength packages such as pyserini or elasticsearch. Rank-bm25 is not built for speed.

ramsey-coding commented 2 years ago

thanks @Witiko. What's the downside of using pyserini?

Also what's the use case of Rank-bm25? Ease of use?

Witiko commented 2 years ago

Pyserini is a python binding for the anserini java library. Therefore, you need to have Java installed, which makes pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.

Witiko commented 2 years ago

Pyserini is a python binding for the anserini java library. Therefore, you need to have Java installed, which makes pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.

Witiko commented 2 years ago

Rank-bm25 is a simple solution for use cases, where speed is not a concern.

ramsey-coding commented 2 years ago

@Witiko thanks a lot for the feedback, really appreciate it 🙏

So there is no python solution that is fast?

ramsey-coding commented 2 years ago

Sounds like need to try with Gensim. But if the loading of documents are slow with Gensim, that's not a good fit for me either.

I need to <~250 ms response time during retrieval.

Witiko commented 2 years ago

Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:

pip install git+https://github.com/witiko/gensim.git@feature/bm25

See https://github.com/RaRe-Technologies/gensim/pull/3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.

ramsey-coding commented 2 years ago

great. Would I get same retrieval accuracy with Gensim in comparison to rank_bm25?

ramsey-coding commented 2 years ago

@Witiko what would be the performance of Gensim during loading the 500k documents? Would that be competitive with rank_bm25?

Witiko commented 2 years ago

The algorithm is exactly the same aa rank-bm25, so accuracy should also be the same. The loading may be slightly slower than in rank-bm25, because we need to build a dictionary, but the retrieval should be significantly faster. Try it out and let me know.

ramsey-coding commented 2 years ago

@Witiko

Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:
pip install git+https://github.com/witiko/gensim.git@feature/bm25
See RaRe-Technologies/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.

I ran the command pip install git+https://github.com/witiko/gensim.git@feature/bm25.

But it does not install and fails with the following error message:

      building 'gensim.models.nmf_pgd' extension
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/models/nmf_pgd.c -o build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -shared -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o -o build/lib.linux-x86_64-cpython-39/gensim/models/nmf_pgd.cpython-39-x86_64-linux-gnu.so
      building 'gensim.similarities.fastss' extension
      creating build/temp.linux-x86_64-cpython-39/gensim/similarities
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/similarities/fastss.c -o build/temp.linux-x86_64-cpython-39/gensim/similarities/fastss.o
      gensim/similarities/fastss.c: In function ‘ceditdist’:
      gensim/similarities/fastss.c:725:9: error: ‘for’ loop initial declarations are only allowed in C99 mode
               for (WIDTH tmpi = 0; tmpi <= len_s1; tmpi++) row2[tmpi] = tmpi;
               ^
      gensim/similarities/fastss.c:725:9: note: use option -std=c99 or -std=gnu99 to compile your code
      gensim/similarities/fastss.c:727:9: error: ‘for’ loop initial declarations are only allowed in C99 mode
               for (WIDTH i2 = 0; i2 < len_s2; i2++) {
               ^
      gensim/similarities/fastss.c:738:13: error: ‘for’ loop initial declarations are only allowed in C99 mode
                   for (WIDTH i1 = 0; i1 < len_s1; i1++) {
                   ^
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> gensim

Witiko commented 2 years ago

I am not sure what the issue is. It works fine in the python:3.7 Docker image.

ramsey-coding commented 2 years ago

@Witiko when I install like pip install --upgrade gensim, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.

Thanks for the feedback. I will try with python:3.7

Witiko commented 2 years ago

@Witiko when I install like pip install --upgrade gensim, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.

It's not an issue with the branch. The release version of gensim has a precompiled wheel, which circumvents your compiler.

jankovicsandras commented 2 weeks ago

A little tangential, but I found another interesting speed issue. I made a refactored/simplified version of BM25Okapi from rank_bm25 to https://github.com/jankovicsandras/plpgsql_bm25/blob/main/mybm25okapi.py This is without numpy (!), using just math and precalculates stuff in __init__ to simplify the query function. Still it's 2-4x faster than rank_bm25 in this quick-and-dirty test, so it might be possible to speed up rank_bm25 a lot.

https://github.com/jankovicsandras/plpgsql_bm25/blob/main/plpgsql_bm25_comparison_with_paradedb_pg_search.ipynb

(E.g. it's possible to compute half of the lines 119-120 in rank_bm25.py beforehand.

     score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))

)

dorianbrown commented 2 weeks ago

Sounds interesting, I haven't had a close look at the changes yet, but basically you're just precalculating all the static terms from the scoring function?

And I guess since the initial stuff was done with the math functions, vectorizing it with numpy would result in larger speed gains? Sounds feasible and promising, I'll try and double check the change when I've got some time, and run some tests. But feel free to create a PR if you're interested!

jankovicsandras commented 2 weeks ago

Thanks for your answer! 😊 I opened a PR: https://github.com/dorianbrown/rank_bm25/pull/46

jankovicsandras commented 3 days ago

I made an optimized rewrite with all 3 algorithms: https://github.com/jankovicsandras/bm25opt

Comparative testing shows it runs approx 30-40 x faster than rank_bm25 while producing exactly same scores.

dorianbrown / rank_bm25

Retrieving speed for large set of documents #25