atroyn opened this issue 5 months ago
I've been investigating whether we should switch to Snowflake's Arctic models as our default embedding model. They currently have the best performance on the MTEB Leaderboard among models consuming under 1 GB of memory.
I've also been investigating llama.cpp as a new runtime for our built-in EFs, since it supports Metal out of the box for Mac users and is very minimal to install and run.
Our new benchmark evaluates retrieval in an AI application context, using synthetically generated queries against a document dataset (technical report and code coming soon). We use the default sentence-transformers implementation for each EF, and report recall only, using the same chunking strategy for each.
Evaluation was performed with the most appropriate chunk size for each model (~256 tokens for all-MiniLM-L6-v2, ~512 for the Snowflake models).
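For concreteness, recall here means: each synthetic query has a known source chunk, and we check whether that chunk shows up in the top-k retrieved results. A minimal sketch of the computation (the `retrieve` callable and field names are illustrative, not our actual benchmark code):

```python
# Illustrative recall@k computation for a synthetic-query benchmark.
# `retrieve(query, k)` stands in for whatever retrieval stack is under test
# (e.g. a Chroma collection queried with a given embedding function).

def recall_at_k(queries: list[dict], retrieve, k: int = 10) -> float:
    """Each query dict has 'text' and 'source_chunk_id' (the chunk it was generated from)."""
    hits = 0
    for q in queries:
        retrieved_ids = retrieve(q["text"], k)  # list of chunk ids, best first
        if q["source_chunk_id"] in retrieved_ids:
            hits += 1
    return hits / len(queries)
```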
We evaluate latency and memory consumption on the C++ implementation with llama.cpp's own llama-bench tool, which runs performance and stress tests. We convert models to their format using their Hugging Face to GGML converter.
⚠️ Note: We did not initially evaluate accuracy on the llama.cpp implementations of these EFs. We have observed small (~1e-5) cosine distances between the sentence-transformers and llama.cpp outputs, and expect the impact on recall to be small. Update: we have now evaluated recall with the llama.cpp implementation (see below).
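For reference, the discrepancy check is simply: embed the same text with both runtimes and compare cosine distance. A rough sketch, where `llama_cpp_embed` is a placeholder for however you obtain the llama.cpp-side embedding:

```python
import numpy as np
from sentence_transformers import SentenceTransformer


def llama_cpp_embed(text: str) -> list[float]:
    # Placeholder: replace with however you run the GGML-converted model
    # (llama.cpp's embedding example, custom bindings, etc.).
    raise NotImplementedError


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


text = "Chroma is an AI-native open-source embedding database."

st_vec = SentenceTransformer("Snowflake/snowflake-arctic-embed-s").encode(text)
llama_vec = np.asarray(llama_cpp_embed(text))

print(f"cosine distance: {cosine_distance(st_vec, llama_vec):.2e}")  # expect ~1e-5
```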
We evaluate latency for 256-token inputs, using the Metal backend since I ran this on my M1 Mac.
| Model | Recall | Memory | Throughput (Tokens/s) | Chunks/s |
|---|---|---|---|---|
| Default (all-MiniLM-L6-v2) | 0.477 | 43.48 MiB | 16701.57 ± 621.40 | 65.240 |
| snowflake-arctic-embed-s | 0.586 | 63.84 MiB | 8736.38 ± 16.73 | 34.126 |
| snowflake-arctic-embed-m | 0.603 | 208.68 MiB | 11516.83 ± 98.37 | 44.988 |
These results indicate that we can get a substantial win by upgrading to at least snowflake-arctic-embed-s as our default embedding model, pending evaluation of the recall of the llama.cpp-based model. The smaller difference between -s and -m on our benchmark also reflects the relatively small difference between these models on the MTEB retrieval task (51.98 vs 54.91), so roughly 3× the memory consumption doesn't seem worthwhile.
Throughput is high in all cases. The gap between -s and -m is surprising (the larger model is faster), which suggests a performance bottleneck we could likely address if we ever need more throughput. Note that this is already a huge speedup for Mac users, who would otherwise be on CPU with our current implementation.
This move would come with a few complications:
- The Snowflake models have different query and document embedding paths, since they are instruction fine-tunes: the query path prefixes the query text with `Represent this sentence for searching relevant passages: `. This means our EF class needs to handle query and document paths differently (sketched after this list), but the pattern is becoming more common, it's a relatively minor change, and we are changing the way EFs are integrated anyway.
- Moving to llama.cpp as the runtime would require us to extract a minimal embedding-oriented subset of the llama.cpp library and compile and distribute it ourselves; the existing Python bindings are quite heavyweight in terms of package size and requirements, so we would need to create our own. This is doable and not especially complicated, but adds some extra work for us.
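To make the first point concrete, here is a rough sketch of what a split query/document path could look like; the class and method names are illustrative, not the actual EF interface we'd ship:

```python
from sentence_transformers import SentenceTransformer

# Instruction prefix the Arctic models were fine-tuned with for queries.
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "


class ArcticEmbeddingFunction:
    """Sketch of an EF with separate query and document embedding paths."""

    def __init__(self, model_name: str = "Snowflake/snowflake-arctic-embed-s"):
        self._model = SentenceTransformer(model_name)

    def embed_documents(self, documents: list[str]) -> list[list[float]]:
        # Documents are embedded as-is.
        return self._model.encode(documents, normalize_embeddings=True).tolist()

    def embed_query(self, query: str) -> list[float]:
        # Queries get the instruction prefix before embedding.
        return self._model.encode(QUERY_PREFIX + query, normalize_embeddings=True).tolist()
```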
Update (07.31.24)
Recently Snowflake released the 1.5 version of the arctic-embed model, which I also benchmarked. The difference in recall between snowflake-arctic-embed-s and snowflake-arctic-embed-m is more significant on our benchmark (0.586 -> 0.637, roughly 9% better), but I still don't think it justifies the resource overhead.
Just ran the accuracy eval with the llama.cpp implementation, and recall results are the same to within 0.002.
@atroyn, what is the migration path for existing users with all-mini EF?
I've given some thought to that, and I think the best way is to expose a 're-embed' method on collections which lets users swap embedding functions and re-embeds their documents for them.
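Until such a method exists, the manual migration is essentially: read everything out of the old collection and re-add the raw documents to a new collection configured with the new EF. A simplified sketch against the current client API (collection names and the chosen EF are placeholders, and it assumes every record has a document and metadata):

```python
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma")

# Stand-in for whatever the new default EF ends up being.
new_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="Snowflake/snowflake-arctic-embed-s"
)

old = client.get_collection("docs")  # embedded with the old default EF
records = old.get(include=["documents", "metadatas"])  # ids are always returned

new = client.create_collection("docs_arctic", embedding_function=new_ef)

# Re-adding the raw documents re-embeds them with the new EF.
new.add(
    ids=records["ids"],
    documents=records["documents"],
    metadatas=records["metadatas"],
)
```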
I've split the inference environment implementation from this issue, see https://github.com/chroma-core/chroma/issues/2682.
Default Embedding Function
Embedding functions have a significant influence on the accuracy of retrieval, especially recall. Currently we use a fairly basic sentence-transformer model, but some better open-source models have recently been released in the same weight class in terms of memory and compute.
Additionally, it's easy for users to trip up with embedding functions: they typically have a fixed and relatively short context window which silently truncates documents, causing important information to be lost. They're also usually trained for a particular distance metric, which we currently rely on the user to set themselves.
| Complexity | Subtask |
|---|---|
| Low | Evaluate and swap to a new default EF. [Snowflake's models](https://huggingface.co/Snowflake/snowflake-arctic-embed-s) look particularly promising, and are already available as ONNX. |
| Low | Attach dimensionality and distance metric to the EF class. This would allow us to set both automatically on collection creation without the user having to think about passing HNSW params (see the sketch after this table). |
| Low | Attach context window length metadata to the EF class. This would allow the Smart Chunker to auto-parametrize to the right settings, and we could warn or error for users when it is exceeded. |
| Med | Expose the tokenizer when it's available. This would help the Smart Chunker count correctly. |
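As a rough illustration of what attaching this metadata to the EF class could look like (attribute names are made up for the sketch, and the values shown for snowflake-arctic-embed-s are illustrative):

```python
from dataclasses import dataclass


@dataclass
class EFMetadata:
    """Illustrative metadata an EF could expose so Chroma can configure itself."""
    dimensionality: int   # used to validate / size the index
    distance_metric: str  # e.g. "cosine" -- set the HNSW space automatically
    context_window: int   # max input tokens before truncation
    has_tokenizer: bool   # whether a tokenizer is exposed for chunking


ARCTIC_S_METADATA = EFMetadata(
    dimensionality=384,
    distance_metric="cosine",
    context_window=512,
    has_tokenizer=True,
)


def collection_metadata(meta: EFMetadata) -> dict:
    # What collection creation could derive from the EF without user input.
    return {"hnsw:space": meta.distance_metric}
```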