chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: add_documents should allow passing of embeddings #2404

Open 42degrees opened 6 days ago

42degrees commented 6 days ago

Describe the problem

I'm very new to all this, so maybe I'm missing something, but I can't figure out how to do what I want. The documentation shows that if I want to create a whole vector database at once, embeddings included, I can call Chroma.from_documents and pass a set of embeddings for each document. I can also call collection.add and pass a set of embeddings. So why can't I call db.add_documents(documents, ids, embeddings)? It seems like a reasonable request, but the only way to call add_documents is to let Chroma control the call to the embedding function. In my situation I already have precalculated embeddings, but I am gathering them in batches, so I don't have everything at once to call from_documents (and I'm not 100% sure that ends up producing the same database to use for RAG). Unless I'm missing something?

Describe the proposed solution

All methods of adding documents to Chroma should support the same ways of supplying embeddings.

Alternatives considered

I'm not sure whether calling db.add_documents actually adds to some default collection anyway, so maybe the solution is to get the default collection and then use collection.add()? I did some googling and didn't find anything about a "default" collection or, if one exists, how I would go about getting it with get_collection().

Importance

would make my life easier

Additional Information

No response

jeffchuber commented 5 days ago

@42degrees add_documents is a LangChain API.

You can absolutely add data directly to Chroma with precomputed embeddings. Check out this part of the docs:

https://docs.trychroma.com/reference/py-collection#add
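
For the batch case described in the issue, a minimal sketch might look like the following. It assumes the `chromadb` Python package and its `collection.add(ids=..., documents=..., embeddings=...)` call from the docs linked above. The helper `add_batches`, the `Batch` type, the collection name `my_docs`, and the sample vectors are all hypothetical, made up for illustration.

```python
from typing import Iterable, Sequence, TypedDict


class Batch(TypedDict):
    """One batch of records whose embeddings were computed elsewhere (hypothetical shape)."""
    ids: Sequence[str]
    documents: Sequence[str]
    embeddings: Sequence[Sequence[float]]


def add_batches(collection, batches: Iterable[Batch]) -> int:
    """Send each batch to collection.add with its precomputed embeddings.

    `collection` is anything exposing Chroma's `add(ids=, documents=, embeddings=)`.
    Returns the total number of records sent.
    """
    total = 0
    for batch in batches:
        n = len(batch["ids"])
        # The three lists are parallel, so they must line up one-to-one.
        if len(batch["documents"]) != n or len(batch["embeddings"]) != n:
            raise ValueError("ids, documents and embeddings must have equal length")
        collection.add(
            ids=list(batch["ids"]),
            documents=list(batch["documents"]),
            embeddings=[list(vec) for vec in batch["embeddings"]],
        )
        total += n
    return total


def main() -> None:
    import chromadb  # assumption: installed via `pip install chromadb`

    client = chromadb.Client()  # in-memory client, nothing is persisted
    # "my_docs" is a made-up name; use whatever collection your app reads from.
    collection = client.get_or_create_collection(name="my_docs")

    # Toy 3-dimensional vectors standing in for real precomputed embeddings.
    batches: list[Batch] = [
        {"ids": ["a", "b"], "documents": ["first doc", "second doc"],
         "embeddings": [[0.1, 0.2, 0.3], [0.9, 0.8, 0.7]]},
        {"ids": ["c"], "documents": ["third doc"],
         "embeddings": [[0.4, 0.5, 0.6]]},
    ]
    add_batches(collection, batches)
    print(collection.count())

    # Querying by vector as well, so no embedding function is needed here either.
    result = collection.query(query_embeddings=[[0.1, 0.2, 0.3]], n_results=1)
    print(result["ids"])
```

Calling `main()` runs the example against an in-memory Chroma client. If the data is later read through LangChain's wrapper, the collection name would need to match whatever name that wrapper is configured to use.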