Open saachishenoy opened 7 months ago
@tazarov would this be a good place to clean up the WAL? That's probably part of why it's getting slower.
WAL is definitely one thing. Yet, I feel there is something odd here. I've personally created collections of about 10M+ embeddings, and the approximate runtime for adding 1M embeddings goes up as the HNSW binary index increases. Also I need to check LC's impl, but it might be slower than adding the docs with Chroma persistent client directly.
@saachishenoy, when you create your collection, you can specify hnsw:batch_size and hnsw:sync_threshold. The batch size controls the size of the in-memory (aka brute force) buffer, whereas the threshold controls how frequently Chroma will dump the binary index to disk. The rule of thumb is batch size < threshold. That said, try bumping the batch size to 10k (or more) and the threshold to 20-50k.
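For illustration, here is a minimal sketch of passing those two settings as collection metadata at creation time. The helper names (hnsw_metadata, create_tuned_collection), the path, and the collection name are hypothetical, and the default values are just the ones suggested above. It assumes a Chroma release that reads HNSW settings from the metadata dict; other releases may expect a different configuration mechanism, so check the docs for your installed version.

```python
def hnsw_metadata(batch_size: int = 10_000, sync_threshold: int = 50_000) -> dict:
    """Build collection metadata carrying the two HNSW settings.

    Enforces the rule of thumb from the comment above: batch size < threshold.
    """
    if batch_size >= sync_threshold:
        raise ValueError("batch_size should be smaller than sync_threshold")
    return {
        "hnsw:batch_size": batch_size,
        "hnsw:sync_threshold": sync_threshold,
    }


def create_tuned_collection(path: str, name: str):
    """Create a persistent collection with the tuned settings (hypothetical helper)."""
    # Imported lazily so hnsw_metadata() can be used without chromadb installed.
    import chromadb

    client = chromadb.PersistentClient(path=path)
    # Assumption: this Chroma version picks up hnsw:* keys from metadata
    # when the collection is created.
    return client.create_collection(name=name, metadata=hnsw_metadata())
```

Something like create_tuned_collection("./chroma_db", "articles") would then be called once, before any documents are added.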
Still, the slowest part of Chroma is adding the vectors to the HNSW index; generally, that cannot be sped up too much. This brings me to my next question, @saachishenoy: What CPU arch are you running Chroma on? If it is Intel, then there is a good chance that rebuilding the HNSW lib for AVX support will boost performance.
Any progress on this? Trying to build a collection of similar size, and inserts are getting very slow.
What happened?
I have 2 million articles that are being chunked into roughly 12 million documents using langchain. I want to run a search over these documents, so ideally I would like to have them all in one chroma db. Would the quickest way to insert millions of documents into chroma db be to insert all of them upon db creation, or to use db.add_documents()? Right now I'm calling db.add_documents() in chunks of 100,000, but each call seems to take longer than the last. Should I just try inserting all 12 million chunks when I create the db? I have a GPU and a lot of storage. It used to take 30 min per 100k documents, but now we're at a little past an hour per 100k.
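To make the slowdown measurable, the chunked insertion described above could be wrapped in a small timing loop. This is only a sketch: batched and timed_insert are hypothetical helpers, and add_fn stands in for whatever call does the insert (for example db.add_documents).

```python
import time
from typing import Callable, Iterator, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], size: int) -> Iterator[Sequence[T]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def timed_insert(
    items: Sequence[T],
    add_fn: Callable[[Sequence[T]], object],
    size: int,
) -> List[Tuple[int, float]]:
    """Call `add_fn` once per batch; return (batch length, seconds) per call."""
    timings = []
    for batch in batched(items, size):
        started = time.perf_counter()
        add_fn(batch)  # e.g. db.add_documents in the setup described above
        timings.append((len(batch), time.perf_counter() - started))
    return timings
```

Calling timed_insert(chunks, db.add_documents, 100_000) and printing the result would show whether the per-batch time grows steadily or jumps at particular points.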
Versions
running coda on a VM, 1 GPU
Relevant log output