Quansight / ragna

RAG orchestration framework ⛵️
https://ragna.chat

Unify embedding model / tokenizer for builtin source storages? #71

Closed pmeier closed 1 year ago

pmeier commented 1 year ago

We currently use different embedding models and tokenizers for our builtin source storages.

So far I've just used whatever the documentation of the respective tool suggested. For Chroma that wasn't really an issue. However, for LanceDB, added in #66, this pulled in a ton of heavy dependencies:

https://github.com/Quansight/ragna/blob/e85129752682e38dc7f2ef9622446f3ba5a168e9/ragna/source_storage/_lancedb.py#L21-L25

Since we build ragna.builtin_config at import time

https://github.com/Quansight/ragna/blob/e85129752682e38dc7f2ef9622446f3ba5a168e9/ragna/__init__.py#L35-L38

and PackageRequirement.is_available() performs the import, we now incur significant overhead at import time.

That in itself wouldn't be an issue if this specific embedding model / tokenizer were required by LanceDB. But it isn't.
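For reference, a requirement check can be made cheap by only locating the package instead of importing it. Below is a minimal sketch of that standard importlib approach; it is not necessarily what ragna's PackageRequirement does, and unlike an actual import it does not verify a version constraint.

```python
# Sketch: check whether a package is installed without importing it, so heavy
# dependencies like torch are not loaded at import time.
import importlib.util


def package_is_available(name: str) -> bool:
    # find_spec only locates the top-level package on disk; it does not execute it.
    return importlib.util.find_spec(name) is not None


print(package_is_available("chromadb"))
```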

My proposal is twofold:

  1. We should use the same embedding model / tokenizer for all builtin source storages (see the sketch after this list). Since the source storages basically just store vectors, I currently can't imagine a case where one would require a specific configuration. Even then, using the same defaults for all other source storages means we keep our dependencies, and in turn the import time, to a minimum.
  2. Instead of using a "random" embedding model / tokenizer, we should use lightweight ones. The ones used by Chroma look like a good starting point, but maybe we can do better? @dillonroach do you have insights here?
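To make point 1 concrete, here is a rough sketch of what a single shared module could look like. The module name, the choice of cl100k_base, and Chroma's default embedding function are all illustrative assumptions, not decisions:

```python
# Hypothetical shared module, e.g. ragna/source_storage/_embedding.py, imported
# by all builtin source storages instead of each configuring its own defaults.
import functools

import tiktoken
from chromadb.utils import embedding_functions


@functools.cache
def get_tokenizer():
    # One tokenizer for every source storage, e.g. the cl100k_base encoding.
    return tiktoken.get_encoding("cl100k_base")


@functools.cache
def get_embedding_function():
    # One embedding function for every source storage. Chroma's default
    # (an ONNX port of all-MiniLM-L6-v2) is used here as a lightweight example.
    return embedding_functions.DefaultEmbeddingFunction()
```
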
dillonroach commented 1 year ago

Per our side conversation: https://huggingface.co/spaces/mteb/leaderboard has a set of benchmarks they run against a number of these. As they say, 'your mileage may vary', but it's a decent starting point. The bge family, and https://huggingface.co/BAAI/bge-small-en-v1.5 in particular, jumps out as striking a good balance between performance and size.
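For anyone who wants to try it, a quick sketch of using that model through sentence-transformers (note this still pulls in pytorch, so it doesn't by itself solve the dependency-size concern):

```python
# Sketch: embedding a few strings with BAAI/bge-small-en-v1.5.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("BAAI/bge-small-en-v1.5")
embeddings = model.encode(
    ["What is ragna?", "ragna is a RAG orchestration framework."],
    normalize_embeddings=True,  # bge models are usually used with normalized vectors
)
print(embeddings.shape)  # (2, 384); bge-small produces 384-dimensional vectors
```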

If the goal is to match what's used elsewhere, encoding = tiktoken.get_encoding("cl100k_base") is the 'default' for GPT-4 / GPT-3.5-turbo and text-embedding-ada-002; it's also worth noting that the tokenizer bundled with llama models is a BPE model based on sentencepiece. There's some good work specific to tokenizers in the latest transformers release, https://github.com/huggingface/transformers/releases/tag/v4.34.0, and one can go digging there for some of the latest changes.
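As a concrete illustration of that encoding (a small sketch, nothing ragna-specific):

```python
# Sketch: tokenizing a chunk with the cl100k_base encoding used by
# GPT-4 / GPT-3.5-turbo and text-embedding-ada-002.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")
tokens = encoding.encode("ragna chunks documents before embedding them.")
print(len(tokens))              # token count, e.g. for sizing chunks
print(encoding.decode(tokens))  # round-trips back to the original string
```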

pmeier commented 1 year ago

With #72 we no longer have the massive overhead at import, but we are still pulling in multiple GBs of dependencies. A Docker image based on python:3.11 is ~6 GB.

The folks over at Chroma had the same issue and solved it like this:

```python
# In order to remove dependencies on sentence-transformers, which in turn depends on
# pytorch and sentence-piece we have created a default ONNX embedding function that
# implements the same functionality as "all-MiniLM-L6-v2" from sentence-transformers.
# visit https://github.com/chroma-core/onnx-embedding for the source code to generate
# and verify the ONNX model.
```
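Assuming we go the same route, using that ONNX embedding function only requires onnxruntime and tokenizers rather than pytorch and sentence-transformers. A sketch against chromadb's embedding_functions module (the exact class name may differ between chromadb versions):

```python
# Sketch: Chroma's ONNX port of all-MiniLM-L6-v2; the model weights are
# downloaded on first use, and inference runs through onnxruntime.
from chromadb.utils import embedding_functions

ef = embedding_functions.ONNXMiniLM_L6_V2()
vectors = ef(["ragna is a RAG orchestration framework."])
print(len(vectors[0]))  # 384-dimensional vectors, same as all-MiniLM-L6-v2
```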