Per our side conversation: https://huggingface.co/spaces/mteb/leaderboard has a set of benchmarks they run against a number of these. As they say, "your mileage may vary", but it's a decent starting point. The BGE models, and https://huggingface.co/BAAI/bge-small-en-v1.5 in particular, jump out as striking a good balance between performance and size.
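For reference, here's a minimal sketch of what using that model looks like via sentence-transformers (just an illustration of the model, not a suggestion to keep the heavy dependency):

```python
# Minimal sketch, assuming sentence-transformers is installed.
from sentence_transformers import SentenceTransformer

# BAAI/bge-small-en-v1.5 is a ~30M parameter model producing 384-dim embeddings.
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# Normalizing lets a plain dot product act as cosine similarity downstream.
embeddings = model.encode(
    ["What does ragna do?", "ragna is a RAG orchestration framework."],
    normalize_embeddings=True,
)
print(embeddings.shape)  # (2, 384)
```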
If the goal is to match what's used elsewhere, encoding = tiktoken.get_encoding("cl100k_base") is the "default" for GPT-4 / GPT-3.5-turbo and text-embedding-ada-002. It's also worth noting that the tokenizer bundled with the Llama models is a BPE model based on sentencepiece. There's some good work specific to tokenizers in the latest transformers release, https://github.com/huggingface/transformers/releases/tag/v4.34.0, and one can go digging there for the latest changes.
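For illustration, a small example of counting tokens with that encoding using tiktoken's public API (nothing ragna-specific assumed):

```python
# Sketch: token counting with the cl100k_base encoding used by GPT-4 / GPT-3.5-turbo
# and text-embedding-ada-002.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens(text: str) -> int:
    # encode() returns the list of token ids; its length is the token count.
    return len(encoding.encode(text))

print(num_tokens("Chunking documents for retrieval works on token counts, not characters."))
```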
With #72 we no longer have the massive overhead at import, but we are still pulling in multiple GBs of dependencies. A Docker image based on python:3.11 ends up ~6GB in size.
The folks over at Chroma had the same issue and solved it:
# In order to remove dependencies on sentence-transformers, which in turn depends on
# pytorch and sentence-piece we have created a default ONNX embedding function that
# implements the same functionality as "all-MiniLM-L6-v2" from sentence-transformers.
# visit https://github.com/chroma-core/onnx-embedding for the source code to generate
# and verify the ONNX model.
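For context, here is a rough sketch of what such an ONNX-based embedding function can look like, using only onnxruntime and the tokenizers package. The file paths, input names, and pooling details are assumptions for illustration, not Chroma's exact code:

```python
# Sketch of an ONNX embedding function in the spirit of Chroma's default embedder.
# Assumes an exported all-MiniLM-L6-v2 ONNX model and its tokenizer.json on disk;
# the paths and input names below are placeholders.
import numpy as np
import onnxruntime
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("onnx/tokenizer.json")
tokenizer.enable_truncation(max_length=256)
tokenizer.enable_padding(length=256)
session = onnxruntime.InferenceSession("onnx/model.onnx")

def embed(texts: list[str]) -> np.ndarray:
    encodings = tokenizer.encode_batch(texts)
    input_ids = np.array([e.ids for e in encodings], dtype=np.int64)
    attention_mask = np.array([e.attention_mask for e in encodings], dtype=np.int64)
    onnx_inputs = {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": np.zeros_like(input_ids),
    }
    # Output is [batch, seq_len, hidden]; mean-pool over non-padding tokens, then L2-normalize.
    last_hidden = session.run(None, onnx_inputs)[0]
    mask = attention_mask[..., None]
    pooled = (last_hidden * mask).sum(axis=1) / np.clip(mask.sum(axis=1), 1e-9, None)
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
```

The upside of this approach is that the runtime dependencies shrink to onnxruntime, tokenizers, and numpy instead of pytorch and sentence-transformers.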
We currently use different embedding models and tokenizers for our built-in source storages:
So far I've just used whatever the documentation of the respective tool suggested. For Chroma that wasn't really an issue. However, for LanceDB, added in #66, this pulled in a ton of heavy dependencies:
https://github.com/Quansight/ragna/blob/e85129752682e38dc7f2ef9622446f3ba5a168e9/ragna/source_storage/_lancedb.py#L21-L25
Since we build ragna.builtin_config at import time (https://github.com/Quansight/ragna/blob/e85129752682e38dc7f2ef9622446f3ba5a168e9/ragna/__init__.py#L35-L38) and PackageRequirement.is_available() performs the actual import, we now have a crazy overhead at import time. That in itself wouldn't be an issue if this specific embedding model / tokenizer were required for LanceDB. But it isn't.
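For comparison, availability can be checked without importing the heavy package at all. A minimal sketch of that general technique using importlib (an illustration only, not necessarily how ragna's PackageRequirement should be implemented):

```python
# Sketch: check whether a package is installed (and which version) without importing it,
# so the check stays cheap even for heavy packages like torch or sentence-transformers.
import importlib.metadata
import importlib.util

def package_available(name: str) -> bool:
    # find_spec only consults the import machinery's metadata; it does not execute the package.
    return importlib.util.find_spec(name) is not None

def installed_version(distribution: str) -> str | None:
    try:
        return importlib.metadata.version(distribution)
    except importlib.metadata.PackageNotFoundError:
        return None

print(package_available("torch"), installed_version("sentence-transformers"))
```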
My proposal is twofold: