Closed noah53866 closed 8 months ago
Could you please check your version of chroma, and also provide your Python version? I tried loading a file multiple times, but didn't get an error; I only see these prints:
Insert of existing embedding ID: c739a952-e133-f4c3-0f78-c05927587fcd
Insert of existing embedding ID: ba060ef7-eec5-ddf9-8213-04c6001c288f
Insert of existing embedding ID: ed543fd0-3075-7b3e-e5bf-f50daa8f2c74
Insert of existing embedding ID: 63b47f06-b9c1-6c2e-573b-610850b57cdf
Insert of existing embedding ID: 5c0f40c9-d755-b157-0e11-1efb00d6a8ae
Insert of existing embedding ID: 652710fe-d51a-3858-44d9-9acac7d54438
Insert of existing embedding ID: d25693ed-ab08-3f62-10df-1ee97970a4e0
Insert of existing embedding ID: 2e4ce943-d706-2587-7800-629130d37c6e
Insert of existing embedding ID: ee3fa4ef-817e-ea09-dbee-68e44313d329
Insert of existing embedding ID: 6773ac43-e443-90e8-27ca-dfdbce37312e
Insert of existing embedding ID: 275c9be7-6aa7-902c-c436-cfe9697592f1
Insert of existing embedding ID: f578fe6a-53e2-8b42-f098-da574d18638e
Insert of existing embedding ID: dd08b25f-7d65-88d8-6ec1-a1345f8205d1
...
This is my version of chroma:

```python
>>> import chromadb
>>> chromadb.__version__
'0.4.22'
```
I am getting the same error. I checked my ChromaDB version and it is '0.4.22'.
@sarahwooders Happy to test it with my documents after you give me a green light if that's helpful. Thank you.
@vinayak-revelation thank you! Could you please try the latest 0.3.2 release?
Tried it, still getting the same error.
```
File "/home/ubuntu/projects/MemGPT-Prod/memgpt-prod-ve/lib/python3.11/site-packages/chromadb/api/types.py", line 240, in validate_ids
    raise errors.DuplicateIDError(message)
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: c53f9798-d3ac-6e53-84a2-1ff4f6fb2a4d, e5129649-5ba1-9216-695e-8206a1f8d366, 5096539b-1dfa-8b52-2012-fd0b7e707b97, 88eb4638-2328-d70b-b2e2-879e608ef581, 3697637a-ebb6-70b6-fdf5-e935285deb2e, a5a131e2-a480-8666-ae0c-7b303b8f112a, f576b3d4-b7c8-5a65-3150-bd8198029d61, 6381db94-edaf-75c0-fa82-68ac109f0bf4
```
Do I need to clear out the chromadb or delete some remnants from previous work before I do this?
Also, memgpt version now gives me: 0.3.2
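For context, the DuplicateIDError comes from Chroma validating that every ID in a single add batch is unique before inserting anything. A standalone sketch of that duplicate check (not Chroma's actual implementation) looks like:

```python
from collections import Counter

def find_duplicate_ids(ids):
    # Return IDs that occur more than once -- the ones Chroma's
    # validate_ids lists in the DuplicateIDError message.
    return [id_ for id_, count in Counter(ids).items() if count > 1]

ids = ["c53f9798", "e5129649", "c53f9798", "5096539b"]
print(find_duplicate_ids(ids))  # ['c53f9798']
```

Because validation happens before the insert, one duplicated ID makes the entire batch fail, which is why the whole data source becomes unusable.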
Hmm ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do it ASAP.
If you’d like, I could upload the file I was using so you can reproduce the error.
Yes that would be great! If you can upload here or DM me on discord that'd be very helpful.
I will try to get you one. The file I am using right now is protected, but let me find something and attach it here. Deeply appreciate your assistance!
I am trying to find free documents online, and the biggest ones I can find are .txt files (out-of-copyright books). Those digest with no issues despite being very large. I also tried converting those txt files into PDFs, and they work too.
The one interesting data point I do have: those PDFs (my personal health records) were parsing and embedding fine with the memgpt load command a couple of releases back, but the same file does not work now.
I will keep looking for a good example that can help us debug this issue in the meantime.
I think I found the issue. When you import a PDF, during its chunking, if it comes across pages that are just pasted images with little text, two different pages can produce the same chunk text (if the headers are the same), because the parser cannot extract the image content from the PDF and only picks up the header or whatever little text is there. For the error: chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: 7f3c6004-8008-9c9a-5655-edc5ee68164f - how are those IDs calculated? Is Chroma doing this, and if so, is there a way to skip the duplicates but import the rest?
My suspicion started with this one: https://github.com/embedchain/embedchain/issues/64, plus knowing that anything other than text in these PDFs is not parsed by the default parser.
The IDs are created by us (not Chroma) as a hash of the text and agent ID (https://github.com/cpacker/MemGPT/blob/main/memgpt/data_types.py#L308) - we implemented this to avoid duplication in the DB, but I think you're right that it's what's causing the issue. I still need to repro, but if we filter out duplicates before running the ChromaDB insert, the issue will probably be resolved.
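A minimal sketch of that filter-before-insert idea (helper names are hypothetical, and the hashing here is illustrative - MemGPT's actual scheme at data_types.py#L308 may differ in detail):

```python
import uuid

def passage_id(agent_id: str, text: str) -> str:
    # Deterministic ID derived from agent ID + chunk text, so identical
    # chunks always map to the same ID (illustrative, not MemGPT's exact hash).
    return str(uuid.uuid5(uuid.NAMESPACE_OID, agent_id + "::" + text))

def dedupe_for_insert(agent_id: str, chunks: list[str]):
    # Drop chunks whose derived ID was already seen, so the batch handed
    # to the ChromaDB insert contains unique IDs only.
    seen, ids, texts = set(), [], []
    for text in chunks:
        pid = passage_id(agent_id, text)
        if pid in seen:
            continue  # duplicate chunk text (e.g. image-only pages sharing a header)
        seen.add(pid)
        ids.append(pid)
        texts.append(text)
    return ids, texts

ids, texts = dedupe_for_insert("agent-1", ["Header", "Body text", "Header"])
# The repeated "Header" chunk is filtered out before the insert.
```

Since the IDs are a pure function of (agent ID, text), dropping repeated texts within a batch is enough to guarantee unique IDs.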
Hi @sarahwooders: I was able to generate an example. Notice the first page has an image and the second page is just text. The first page and second page have the same text. So I think the parser creates two embeddings with essentially the same ID and then breaks.
This one broke for me when I did:
memgpt load directory --name vd1 --input-files ~/Downloads/Example.pdf
with the error:
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: e87f024f-2b86-d25d-a381-8d02f038b61d
Hope this helps, and thank you for helping fix this. I agree that filtering duplicates before the insert will work :)
@vinayak-revelation thanks I was able to get the same error with that example! Fix should be in #1001
The fix should be in the nightly package which you can install with pip install pymemgpt-nightly
and will be in a release in the next 1-2 days
Describe the bug
When inputting my large .jsonl dataset, Chroma throws an error and the data source is unusable.
MemGPT Config: config.txt (attached)