chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Update][Accuracy] Default Embedding Function #2284

Open atroyn opened 5 months ago

atroyn commented 5 months ago

Default Embedding Function

Embedding functions have significant influence on the accuracy of retrieval, especially recall. Currently we use a fairly basic sentence transformer model, but lately there have been some better open-source models released in the same weight-class in terms of memory and compute.

Additionally, it’s easy for users to trip up with embedding functions: they typically have a fixed and relatively short context window which truncates documents, causing important information to be silently lost. They’re also usually trained on only one distance metric, which we currently rely on the user to set themselves.
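To make the truncation pitfall concrete, here's a small sketch assuming the sentence-transformers package (the example document is made up):

```python
# Sketch: how a short context window silently drops information.
# all-MiniLM-L6-v2 truncates inputs past its max sequence length
# before embedding them.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
print(model.max_seq_length)  # 256 word-piece tokens

key_fact = "The database listens on port 5432."
long_doc = ("Some unrelated filler sentence. " * 300) + key_fact

emb = model.encode(long_doc)
# The trailing sentence falls outside the 256-token window, so `emb`
# carries no signal from it, and retrieval will silently miss that fact.
```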

[Complexity] Subtask

atroyn commented 4 months ago

Evaluating Snowflake Arctic Embed under llama.cpp

I've been investigating whether we ought to make the switch to Snowflake's arctic models as our default embedding model. Currently they have the best performance on the MTEB Leaderboard for models consuming under 1GB of memory.

I've also been investigating llama.cpp as a new runtime for our built-in EFs, since it supports Metal out of the box for Mac users and is very minimal to install and run.

Method

Accuracy

We use our new benchmark, which evaluates retrieval in an AI-application context using synthetically generated queries against a document dataset (technical report and code coming soon). We use the default sentence-transformers implementation of each EF, and we report recall only, with the same chunking strategy for each.

Evaluation was performed with the most appropriate chunk sizes (~256 tokens for all-MiniLM-L6-v2, ~512 for snowflake)
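In outline, the recall measurement looks something like the sketch below; `chunks`, `chunk_ids`, and `query_pairs` are hypothetical stand-ins for the benchmark data, which isn't published yet:

```python
# Sketch of recall@k over synthetic queries, each paired with the id of
# the chunk it was generated from. Inputs here are hypothetical.
import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
client = chromadb.Client()
collection = client.create_collection(name="eval", embedding_function=ef)
collection.add(ids=chunk_ids, documents=chunks)

k = 10
hits = 0
for query, relevant_id in query_pairs:
    result = collection.query(query_texts=[query], n_results=k)
    hits += relevant_id in result["ids"][0]
print(hits / len(query_pairs))  # recall@k
```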

Throughput and resources

We evaluate latency and memory consumption on the C++ implementation via llama.cpp's llama-bench tool, which runs performance and stress tests. We convert to their model format using their own Hugging Face to GGML converter.

⚠️ Note: We did not initially evaluate accuracy on the llama.cpp implementations of these EFs. We have observed small (~1e-5) cosine distances between the sentence transformer and llama.cpp outputs. We expect the impact on recall to be small, but this should be evaluated if we decide to proceed. Update: have now evaluated with the llama.cpp implementation; see the comment below.
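For reference, a minimal sketch of the kind of comparison involved, assuming the llama-cpp-python bindings and a GGUF conversion of the model (the model path is hypothetical):

```python
# Sketch: comparing a sentence-transformers embedding against a
# llama.cpp embedding of the same text. Assumes llama-cpp-python is
# installed and the model has been converted; the path is hypothetical.
import numpy as np
from llama_cpp import Llama
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("all-MiniLM-L6-v2")
llama_model = Llama(model_path="all-MiniLM-L6-v2.gguf", embedding=True)

text = "Chroma is an AI-native open-source embedding database."
a = st_model.encode(text)
b = np.asarray(llama_model.embed(text))

a, b = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(1.0 - float(a @ b))  # cosine distance between the two outputs
```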

We evaluate latency for 256-token inputs, using the Metal backend since I ran this on my M1 Mac.

Results

| Model | Recall | Memory | Throughput (tokens/s) | Chunks/s |
| --- | --- | --- | --- | --- |
| Default (all-MiniLM-L6-v2) | 0.477 | 43.48 MiB | 16701.57 ± 621.40 | 65.240 |
| snowflake-arctic-embed-s | 0.586 | 63.84 MiB | 8736.38 ± 16.73 | 34.126 |
| snowflake-arctic-embed-m | 0.603 | 208.68 MiB | 11516.83 ± 98.37 | 44.988 |
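(Chunks/s is the token throughput divided by the 256-token input length, e.g. 16701.57 / 256 ≈ 65.24.)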

Conclusion

These results indicate that we can get a substantial win by upgrading to at least snowflake-arctic-embed-s as our default embedding model, pending evaluation of the recall of the llama-based model. The small gap between s and m on our benchmark also mirrors the relatively small difference between these models on the MTEB retrieval task (51.98 vs 54.91), so the roughly 3x increase in memory consumption doesn't seem worthwhile.

Throughput is high in all cases, though it's surprising that m is faster than s; this suggests a performance bottleneck that could likely be addressed to push throughput higher should we need to. Note that this is a huge speedup for Mac users in any case, who would otherwise be on CPU with our current implementation.

This move would come with a few complications:

Update (07.31.24): Snowflake recently released the 1.5 version of the arctic-embed-m model, which I also benchmarked. The recall gap between snowflake-arctic-embed-s and the new m model on our benchmark is more significant (0.586 -> 0.637, roughly 9% better), but I still don't think it justifies the resource overhead.

atroyn commented 4 months ago

Just ran accuracy eval with the llama implementation and recall results are the same to within 0.002.

tazarov commented 4 months ago

@atroyn, what is the migration path for existing users with all-mini EF?

atroyn commented 4 months ago

I've given some thought to that, and I think the best way is to expose a 're-embed' method on collections, which would let users swap embedding functions and re-embed their documents for them.
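Until such a method exists, a manual migration along these lines should work with today's collection API (the proposed 're-embed' method is not implemented; names and paths below are illustrative):

```python
# Sketch: manually re-embedding a collection with a new embedding
# function, pending a built-in 're-embed' method.
import chromadb
from chromadb.utils import embedding_functions

client = chromadb.PersistentClient(path="./chroma")
old = client.get_collection("docs")

new_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="Snowflake/snowflake-arctic-embed-s"
)
new = client.create_collection(name="docs_v2", embedding_function=new_ef)

# Pull the raw documents (and metadata) out and re-add them; the new
# collection's EF re-embeds everything on insert. For large collections
# this should be batched; if there is no metadata, drop that argument.
batch = old.get(include=["documents", "metadatas"])
new.add(ids=batch["ids"], documents=batch["documents"],
        metadatas=batch["metadatas"])
```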

atroyn commented 2 months ago

I've split the inference environment implementation from this issue, see https://github.com/chroma-core/chroma/issues/2682.