david101-hunter commented 5 days ago

Description

The add_documents method in Chroma runs significantly slower than expected when processing large batches of documents. This is causing a bottleneck in our document ingestion pipeline.

I have about 200 docs. Using the bge-m3 embedding model, I call add_documents to add all of them to the vector store like this:

self.vector_store.add_documents(self.completed_documents)

Steps to Reproduce

Initialize a Chroma instance
Prepare a batch of 200 text documents (roughly 100 to 1000 characters each)
Call the add_documents method with this batch
Observe the time taken to complete the operation

Expected Behavior

Based on our performance requirements, the add_documents method should process 200 documents in under 60 seconds.

Actual Behavior

The add_documents method is taking approximately 10 minutes to process 200 documents.

Versions

Environment

OS: Ubuntu 20.04 LTS
Python version: 3.9.11
Library version: langchain_chroma==0.1.2

Relevant log output

No response

tazarov commented 4 days ago

@david101-hunter, I think the slowness comes not from Chroma itself but from the embedding model. bge-m3 is a relatively large, heavy model, and unless you run it on a GPU it can be slow on modest hardware. I'd suggest either pre-computing the embeddings for the batch, adding them directly, and measuring the time, or switching to the default embedding model (all-mini-lm) and measuring again.
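
One way to check this is to time the embedding step and the insert step separately. The following is a minimal sketch, not code from this issue: `time_ingestion`, `fake_embed`, and `fake_add` are hypothetical names, and the two stand-in functions only simulate a slow model and a fast store so the example runs without any model or database.

```python
import time
from typing import Callable, Dict, List, Sequence


def time_ingestion(
    texts: Sequence[str],
    embed: Callable[[Sequence[str]], List[List[float]]],
    add: Callable[[Sequence[str], List[List[float]]], None],
) -> Dict[str, float]:
    """Time the embedding step and the store-insert step separately."""
    t0 = time.perf_counter()
    vectors = embed(texts)   # step 1: compute embeddings only
    t1 = time.perf_counter()
    add(texts, vectors)      # step 2: insert pre-computed embeddings
    t2 = time.perf_counter()
    return {"embed_seconds": t1 - t0, "add_seconds": t2 - t1}


# Hypothetical stand-ins (assumptions, not real Chroma or bge-m3 calls).
def fake_embed(texts: Sequence[str]) -> List[List[float]]:
    time.sleep(0.05)  # pretend the model is the slow part
    return [[float(len(t))] for t in texts]


stored = []


def fake_add(texts: Sequence[str], vectors: List[List[float]]) -> None:
    # Pretend store: just keep (text, vector) pairs in memory.
    stored.extend(zip(texts, vectors))


timings = time_ingestion(["doc one", "doc two"], fake_embed, fake_add)
print(timings)
```

In a real pipeline, `embed` could be the `embed_documents` method of the embeddings object, and `add` could wrap a call to the underlying chromadb collection's `add(ids=..., embeddings=..., documents=...)`. If `embed_seconds` dominates, the model rather than Chroma is the likely bottleneck.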

david101-hunter commented 4 days ago

Thanks for your insight!

chroma-core / chroma

[Bug]: add_documents method run slowly #2853