chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Feature Request]: How to retrieve ids and metadata associated with embeddings of a particular file and not just for the entire collection? #1893

Open UsamaHussain8 opened 3 months ago

UsamaHussain8 commented 3 months ago

Describe the problem

I am working on a chat application in Python with LangChain. The user submits some PDF files that the chat model uses as its knowledge source and then asks the model questions about those documents. The embeddings are stored in a Chroma vector database, so it is effectively a RAG-based solution.

Creating and storing the embeddings works fine, and so does the chat. However, I also store custom metadata and ids with the embeddings. The code for that is given below:

# Import paths assume a recent LangChain package layout; adjust them to your installed version.
import uuid
from datetime import datetime

import chromadb
from chromadb.utils import embedding_functions
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# `config` (API keys) and `configs` (model settings) are defined elsewhere in the application.


def read_docs(pdf_file):
    # Load the PDF and split it into overlapping text chunks
    pdf_loader = PyPDFLoader(pdf_file)
    pdf_documents = pdf_loader.load()

    text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    documents = text_splitter.split_documents(pdf_documents)

    return documents


def generate_and_store_embeddings(documents, pdf_file, user_id):
    client = chromadb.PersistentClient(path="./trained_db")
    collection = client.get_or_create_collection(
        "PDF_Embeddings",
        embedding_function=embedding_functions.OpenAIEmbeddingFunction(
            api_key=config["OPENAI_API_KEY"],
            model_name=configs.EMBEDDINGS_MODEL,
        ),
    )
    now = datetime.now()

    # custom metadata and ids I want to store along with the embeddings for each pdf
    metadata = {"source": pdf_file.filename, "user": str(user_id), "created_at": now.strftime("%d/%m/%Y %H:%M:%S")}
    ids = [str(uuid.uuid4()) for _ in range(len(documents))]

    try:
        vectordb = Chroma.from_documents(
            documents,
            embedding=OpenAIEmbeddings(
                openai_api_key=config["OPENAI_API_KEY"],
                model=configs.EMBEDDINGS_MODEL,
            ),
            persist_directory="./trained_db",
            collection_name=collection.name,
            client=client,
            ids=ids,
            # passed as collection-level metadata
            collection_metadata=dict(metadata),
        )
        vectordb.persist()

    except Exception as err:
        print(f"An error occurred: {err=}, {type(err)=}")
        return {"answer": "An error occurred while generating embeddings. Please check terminal for more details."}
    return vectordb

What I want is to retrieve the ids and metadata associated with one PDF file rather than all the ids and metadata in the collection. That way, when a user names a PDF file whose embeddings should be deleted, I can retrieve the ids and metadata of that file only and then delete just those embeddings from the collection. In other words, I would provide the PDF file name (or something similar) and get the ids and/or metadata in return.

Describe the proposed solution

I would very much like a function that takes a document or its filename as a parameter and returns the embeddings, ids, and metadata associated with that file.
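For illustration, here is a minimal sketch of what such a helper could look like on top of the existing `get()` and `delete()` methods of a Chroma collection. The function names `get_file_records` and `delete_file_records` are hypothetical, and the sketch assumes every stored chunk already carries a `"source"` key in its own metadata; it is not an existing Chroma API.

```python
from typing import Any


def get_file_records(collection: Any, filename: str) -> dict:
    """Return ids and metadata of every chunk whose "source" equals filename.

    `collection` is assumed to be a chromadb Collection (hypothetical helper).
    """
    # ids are always part of the result; metadatas are requested explicitly
    result = collection.get(where={"source": filename}, include=["metadatas"])
    return {"ids": result["ids"], "metadatas": result["metadatas"]}


def delete_file_records(collection: Any, filename: str) -> list:
    """Delete every chunk of one file and return the ids that were removed."""
    ids = get_file_records(collection, filename)["ids"]
    if ids:  # skip the delete call when nothing matched
        collection.delete(ids=ids)
    return ids
```

With a real collection, this would be called as `delete_file_records(collection, pdf_file.filename)`. If the chunks do not have `"source"` in their metadata, the filter matches nothing and the helper returns an empty list.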

Alternatives considered

I tried using a where clause to filter on the metadata, as shown below:

print(vectordb.get(where={"source": pdf_file.filename}))

It returns:

{'ids': [], 'embeddings': None, 'metadatas': [], 'documents': [], 'uris': None, 'data': None}

Importance

I cannot use Chroma without it

tazarov commented 3 months ago

@UsamaHussain8, LangChain (LC) injects the source PDF as metadata into each document, and that metadata is then sent to Chroma - https://github.com/langchain-ai/langchain/blob/40f846e65da37a1c00d72da9ea64ebb0f295b016/libs/community/langchain_community/vectorstores/chroma.py#L777C46-L777C55

After processing the PDF, check that each of your documents has the required metadata field. If it does, your get() or query() should work fine.
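One way to run that check is sketched below. The helper name `documents_missing_field` is hypothetical; it only assumes that each document exposes a `metadata` dict, as LangChain `Document` objects do.

```python
def documents_missing_field(documents, field="source"):
    """Return the indexes of documents whose metadata lacks `field`."""
    return [
        index
        for index, doc in enumerate(documents)
        if field not in (doc.metadata or {})
    ]
```

An empty list means every chunk carries the field, so a `where={"source": ...}` filter has something to match against, provided the stored value equals the value being queried.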

Also, the way you add metadata to the collection overwrites existing data. That is fine only if you want one PDF per collection, in which case your code is OK as-is.

UsamaHussain8 commented 3 months ago

Yes, I am aware of that, but it wasn't helping me get the ids that I want. Anyway, I have found the solution: I attach the metadata and ids to the documents at the time they are generated by splitting the original document (in the read_docs(pdf_file) function). Then I can get the ids, metadata, etc. using vectordb.get(where={"source": pdf_file.filename}). Thanks for your reply, though.
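The exact code for this fix was not posted, so the following is only a sketch of the approach as described. The helper name `tag_documents` is hypothetical, and the metadata keys mirror the ones used in the original question.

```python
import uuid
from datetime import datetime


def tag_documents(documents, filename, user_id):
    """Attach per-chunk metadata in place and return one new id per chunk."""
    created_at = datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    for doc in documents:
        # Stored on each chunk, so a `where` filter can match it later.
        # Any existing "source" value is replaced by the bare filename.
        doc.metadata.update(
            {"source": filename, "user": str(user_id), "created_at": created_at}
        )
    return [str(uuid.uuid4()) for _ in documents]
```

The returned ids would then be passed as `ids=ids` to `Chroma.from_documents(...)` together with the tagged documents, after which `vectordb.get(where={"source": pdf_file.filename})` can find the chunks of that file.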