Open UsamaHussain8 opened 3 months ago
@UsamaHussain8, LangChain (LC) injects the source PDF as metadata into each document, and that metadata is then sent to Chroma: https://github.com/langchain-ai/langchain/blob/40f846e65da37a1c00d72da9ea64ebb0f295b016/libs/community/langchain_community/vectorstores/chroma.py#L777C46-L777C55
After processing the PDF, check that each of your documents has the required metadata field. If it does, your get() or query() should work fine.
Also, the way you add metadata to the collection overwrites existing data. That is only fine if you want one PDF per collection, in which case your code is OK as-is.
Yes, I am aware of that, but it wasn't helping me get the ids that I want. Anyway, I have found the solution: I store the metadata and ids at the time the documents are generated, when the original document is split into pages (in the read_docs(pdf_file) function). Then I can get the ids, metadata, etc. using vectordb.get(where={"source": pdf_file.filename}). Thanks for your reply, though.
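For anyone landing here later, this is a minimal sketch of the idea of assigning ids and metadata at split time. The read_docs function itself is not shown in this thread, so the helper name build_page_records, the id format, and the "page" key are assumptions made for illustration; only the "source" key comes from the where filter above.

```python
from typing import Dict, List, Tuple


def build_page_records(
    filename: str, pages: List[str]
) -> Tuple[List[str], List[Dict[str, object]]]:
    """Return one id and one metadata dict per page of a split PDF.

    Hypothetical helper: the id format and the "page" key are illustrative.
    """
    ids: List[str] = []
    metadatas: List[Dict[str, object]] = []
    for page_number, _text in enumerate(pages):
        # The id embeds the filename, so ids stay unique as long as filenames are.
        ids.append(f"{filename}-page-{page_number}")
        # "source" is the key later used in the where filter.
        metadatas.append({"source": filename, "page": page_number})
    return ids, metadatas


ids, metadatas = build_page_records(
    "report.pdf", ["first page text", "second page text"]
)
print(ids)
print(metadatas)
```

The two lists can then be handed to the vector store together with the page texts, for example through the metadatas and ids arguments that LangChain's Chroma wrapper accepts when adding texts.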
Describe the problem
I am working on a chat application in LangChain (Python). The idea is that the user submits some PDF files for the chat model to draw on and then asks the model questions about those documents. The embeddings are stored in a Chroma vector database, so it is effectively a RAG-based solution.
Both the creation and storage of the embeddings work fine, and the chat works well too. However, I am also storing custom metadata and some ids with the embeddings. The code for that is given below:
What I want is to retrieve only the ids and metadata associated with a given PDF file, rather than all the ids and metadata in the collection. That way, when a user names a PDF file whose embeddings should be deleted, I can retrieve the ids and metadata of that file only and then delete those embeddings from the collection. In other words, I would provide the PDF file name (or something along those lines) and get the ids and/or metadata in return.
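A rough sketch of that delete flow, assuming every entry carries a "source" metadata key holding the file name. The function name delete_embeddings_for_file is made up for illustration, and InMemoryStore is only a hypothetical stand-in so the snippet runs without a database; with a real store, the object passed in would need get(where=...) and delete(ids=...) methods, which both the Chroma collection and LangChain's Chroma wrapper provide.

```python
from typing import Any, Dict, List


def delete_embeddings_for_file(store: Any, filename: str) -> List[str]:
    """Delete every entry whose "source" metadata equals filename.

    Returns the ids that were deleted (empty if nothing matched).
    """
    result = store.get(where={"source": filename})
    ids = list(result["ids"])
    if ids:  # skip the delete call when no entry matches
        store.delete(ids=ids)
    return ids


class InMemoryStore:
    """Hypothetical stand-in for the vector store, used only to run the sketch."""

    def __init__(self, records: Dict[str, Dict[str, Any]]) -> None:
        self.records = records  # maps id -> metadata

    def get(self, where: Dict[str, Any]) -> Dict[str, List[Any]]:
        matched = [
            entry_id
            for entry_id, meta in self.records.items()
            if all(meta.get(key) == value for key, value in where.items())
        ]
        return {"ids": matched, "metadatas": [self.records[i] for i in matched]}

    def delete(self, ids: List[str]) -> None:
        for entry_id in ids:
            self.records.pop(entry_id, None)


store = InMemoryStore(
    {
        "a.pdf-page-0": {"source": "a.pdf"},
        "a.pdf-page-1": {"source": "a.pdf"},
        "b.pdf-page-0": {"source": "b.pdf"},
    }
)
deleted = delete_embeddings_for_file(store, "a.pdf")
print(deleted)
```

Looking the ids up first and deleting by id keeps the two steps visible, which makes it easy to log or confirm what is about to be removed.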
Describe the proposed solution
I would very much like a function that takes a document or its filename as a parameter and returns the embeddings, ids, and metadata associated with that file.
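Something along these lines is what I have in mind. This is only a sketch: get_file_records is a hypothetical name, it assumes the file name is stored under a "source" metadata key, and StubStore merely mimics the shape of a get() result so the snippet runs on its own. As far as I know, Chroma's get() always returns ids and accepts an include list with values such as "embeddings" and "metadatas", with embeddings left out unless requested.

```python
from typing import Any, Dict, List, Tuple


def get_file_records(store: Any, filename: str) -> Dict[str, Any]:
    """Return the ids, metadata, and embeddings stored for one file."""
    result = store.get(
        where={"source": filename},
        include=["embeddings", "metadatas"],  # embeddings must be asked for
    )
    return {
        "ids": result["ids"],
        "metadatas": result["metadatas"],
        "embeddings": result["embeddings"],
    }


class StubStore:
    """Hypothetical stand-in that mimics the shape of a get() result."""

    def __init__(self, rows: List[Tuple[str, Dict[str, Any], List[float]]]) -> None:
        self.rows = rows  # each row is (id, metadata, embedding)

    def get(self, where: Dict[str, Any], include: List[str]) -> Dict[str, Any]:
        hits = [
            row
            for row in self.rows
            if all(row[1].get(key) == value for key, value in where.items())
        ]
        out: Dict[str, Any] = {"ids": [row[0] for row in hits]}
        if "metadatas" in include:
            out["metadatas"] = [row[1] for row in hits]
        if "embeddings" in include:
            out["embeddings"] = [row[2] for row in hits]
        return out


stub = StubStore(
    [
        ("a-0", {"source": "a.pdf"}, [0.1, 0.2]),
        ("b-0", {"source": "b.pdf"}, [0.3, 0.4]),
    ]
)
records = get_file_records(stub, "a.pdf")
print(records)
```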
Alternatives considered
I tried using a where clause to filter on the metadata, as shown below:
print(vectordb.get(where={"source": pdf_file.filename}))
It returns:
Importance
i cannot use Chroma without it
Additional Information
No response