chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: add_documents gets slower with each call #1790

Open saachishenoy opened 7 months ago

saachishenoy commented 7 months ago

What happened?

I have 2 million articles that are being chunked into roughly 12 million documents using LangChain. I want to run a search over these documents, so I would ideally like to have them all in one Chroma DB. Would the quickest way to insert millions of documents into Chroma be to insert all of them when the DB is created, or to use db.add_documents()? Right now I'm adding them with db.add_documents() in chunks of 100,000, but each call seems to take longer than the last. Should I just try inserting all 12 million chunks when I create the DB? I have a GPU and plenty of storage. It used to take 30 minutes per 100K documents, but now each add_documents() call of 100K takes a little over an hour.

Versions

Running CUDA on a VM, 1 GPU

Relevant log output

from langchain.vectorstores import Chroma
from langchain.embeddings import SentenceTransformerEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Local sentence-transformers model, run on the GPU
model_path = "./multi-qa-MiniLM-L6-cos-v1/"
model_kwargs = {"device": "cuda"}
embeddings = SentenceTransformerEmbeddings(model_name=model_path, model_kwargs=model_kwargs)

# `documents` is the full list of raw article texts (loaded elsewhere, not shown)
documents_array = documents[0:100000]

# Split each article into ~500-character chunks with 50-character overlap
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    length_function=len,
    is_separator_regex=False,
)

docs = text_splitter.create_documents(documents_array)

persist_directory = "chroma_db"

# Build the initial persistent collection from the first 100K articles' chunks
vectordb = Chroma.from_documents(
    documents=docs, embedding=embeddings, persist_directory=persist_directory
)

vectordb.persist()
vectordb._collection.count()

# Chunk the next slice of articles
docs = text_splitter.create_documents(documents[500000:600000])

# Add the remaining chunks to the existing collection in fixed-size batches
def batch_process(documents_arr, batch_size, process_function):
    for i in range(0, len(documents_arr), batch_size):
        batch = documents_arr[i:i + batch_size]
        process_function(batch)

def add_to_chroma_database(batch):
    vectordb.add_documents(documents=batch)

batch_size = 41000

batch_process(docs, batch_size, add_to_chroma_database)

beggers commented 7 months ago

@tazarov would this be a good place to clean up the WAL? That's probably part of why it's getting slower.

tazarov commented 7 months ago

WAL is definitely one thing. Yet, I feel there is something odd here. I've personally created collections of 10M+ embeddings, and the approximate runtime for adding 1M embeddings does go up as the HNSW binary index grows. Also, I need to check LangChain's implementation, but it may be slower than adding the docs with the Chroma persistent client directly.
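
For reference, a minimal sketch of what adding the docs with the Chroma persistent client directly could look like, bypassing the LangChain wrapper. The collection name, ID scheme, and batch sizes are illustrative assumptions, not anything specified in this thread:

import chromadb
from sentence_transformers import SentenceTransformer

# Same local model and persist directory as in the report above
model = SentenceTransformer("./multi-qa-MiniLM-L6-cos-v1/", device="cuda")
client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(name="articles")  # illustrative name

def add_chunks(chunk_texts, batch_size=10_000):
    # Embed chunk texts on the GPU, then write them to Chroma in batches
    for start in range(0, len(chunk_texts), batch_size):
        batch = chunk_texts[start:start + batch_size]
        vectors = model.encode(batch, batch_size=256, show_progress_bar=False).tolist()
        collection.add(
            ids=[f"chunk-{start + i}" for i in range(len(batch))],  # hypothetical ID scheme
            documents=batch,
            embeddings=vectors,
        )

One advantage of this shape is that the GPU embedding batch size and the Chroma write batch size can be tuned independently.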

@saachishenoy, when you create your collection, you can specify hnsw:batch_size and hnsw:sync_threshold. The batch size controls the size of the in-memory (brute-force) buffer, whereas the sync threshold controls how frequently Chroma dumps the binary HNSW index to disk. The rule of thumb is batch size < sync threshold. That said, try bumping the batch size to 10k (or more) and the threshold to 20-50k.
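
As a sketch (assuming a chromadb version that accepts these keys as collection metadata; the collection name is illustrative), the settings above would be passed at creation time like this:

import chromadb

client = chromadb.PersistentClient(path="chroma_db")
collection = client.get_or_create_collection(
    name="articles",  # illustrative name
    metadata={
        "hnsw:batch_size": 10_000,      # size of the in-memory (brute-force) buffer
        "hnsw:sync_threshold": 50_000,  # flush the HNSW index to disk this often
    },
)

If you stay on the LangChain wrapper, its Chroma vectorstore exposes a collection_metadata argument that should pass the same dict through, though that is worth verifying against your installed version.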

Still, the slowest part of Chroma is adding the vectors to the HNSW index; generally, that cannot be sped up too much. This brings me to my next question, @saachishenoy: What CPU arch are you running Chroma on? If it is Intel, then there is a good chance that rebuilding the HNSW lib for AVX support will boost performance.
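
A quick, Linux-only way to check whether the VM's CPU advertises AVX before deciding on a rebuild (just a convenience check, not part of Chroma):

# Prints whether the CPU flags advertise AVX / AVX2
with open("/proc/cpuinfo") as f:
    for line in f:
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            print("avx:", "avx" in flags, "avx2:", "avx2" in flags)
            break

If AVX is available, rebuilding chroma-hnswlib from source (e.g. with pip's --no-binary option) is one way to get a build that uses the native instruction set.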

krish-bell commented 2 weeks ago

Any progress on this? I'm trying to build a collection of similar size, and inserts are getting very slow.