chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: add_documents method run slowly #2853

Closed: david101-hunter closed this issue 4 days ago

david101-hunter commented 5 days ago

Description

The add_documents method in Chroma runs significantly slower than expected when processing large batches of documents, which creates a bottleneck in our document ingestion pipeline.

I have about 200 docs. Using the bge-m3 embedding model, I call add_documents to add all of them to the vector store like this:

self.vector_store.add_documents(self.completed_documents)

Steps to Reproduce

Expected Behavior

Based on our performance requirements, the add_documents method should process 200 documents in under 60 seconds.

Actual Behavior

The add_documents method is taking approximately 10 minutes to process 200 documents.

Versions

Environment

OS: Ubuntu 20.04 LTS
Python version: 3.9.11
Library version: langchain_chroma==0.1.2

Relevant log output

No response

tazarov commented 4 days ago

@david101-hunter, I think the slowness comes not from Chroma itself but from the embedding model. bge-m3 is a relatively large, heavy model, and unless you run it on a GPU it can be slow on modest hardware. I suggest you either pre-compute the embeddings for the batch, then add them and time only that step, or switch to the default embedding model (all-MiniLM-L6-v2) and time the run again.
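One way to check this is to time the embedding step and the insert step separately. The sketch below is only an illustration under stated assumptions: `time_stages`, `fake_embed`, and `fake_add` are hypothetical names, and the two stand-in functions replace the real model and the real vector store so that the snippet runs on its own.

```python
import time
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def time_stages(
    texts: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[Vector]],
    add_fn: Callable[[Sequence[str], List[Vector]], None],
) -> Dict[str, float]:
    """Run embedding and insertion as two separate steps and time each one."""
    t0 = time.perf_counter()
    embeddings = embed_fn(texts)   # stage 1: compute embeddings only
    t1 = time.perf_counter()
    add_fn(texts, embeddings)      # stage 2: store the pre-computed vectors
    t2 = time.perf_counter()
    return {"embed_seconds": t1 - t0, "add_seconds": t2 - t1}


# Hypothetical stand-ins so the sketch is runnable without a model or a database.
def fake_embed(texts: Sequence[str]) -> List[Vector]:
    time.sleep(0.05)  # pretend the embedding model is the slow part
    return [[float(len(t))] for t in texts]


store: list = []


def fake_add(texts: Sequence[str], embeddings: List[Vector]) -> None:
    store.extend(zip(texts, embeddings))


timings = time_stages(["a", "bb", "ccc"], fake_embed, fake_add)
print(timings)
```

In a real run, `embed_fn` could be the `embed_documents` method of the embedding object, and `add_fn` could wrap a Chroma collection's `add(ids=..., documents=..., embeddings=...)` call. Both are assumptions about the setup, since the issue does not show how the vector store is built. If `embed_seconds` dominates, the model is the bottleneck rather than Chroma.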

david101-hunter commented 4 days ago

Thanks for your insight!