cpacker / MemGPT

Letta (fka MemGPT) is a framework for creating stateful LLM services.
https://letta.com
Apache License 2.0

ChromaDB DuplicateIDError on `memgpt load` #986

Closed noah53866 closed 8 months ago

noah53866 commented 8 months ago

Describe the bug
When loading my large .jsonl dataset, Chroma throws a DuplicateIDError and the data source becomes unusable.

Please describe your setup

Screenshots

(screenshots of the error attached)

MemGPT Config config.txt



sarahwooders commented 8 months ago

Could you please check your version of chroma, and also provide your python version? I tried loading a file multiple times and didn't get an error, but I do see these prints:

Insert of existing embedding ID: c739a952-e133-f4c3-0f78-c05927587fcd
Insert of existing embedding ID: ba060ef7-eec5-ddf9-8213-04c6001c288f
Insert of existing embedding ID: ed543fd0-3075-7b3e-e5bf-f50daa8f2c74
Insert of existing embedding ID: 63b47f06-b9c1-6c2e-573b-610850b57cdf
Insert of existing embedding ID: 5c0f40c9-d755-b157-0e11-1efb00d6a8ae
Insert of existing embedding ID: 652710fe-d51a-3858-44d9-9acac7d54438
Insert of existing embedding ID: d25693ed-ab08-3f62-10df-1ee97970a4e0
Insert of existing embedding ID: 2e4ce943-d706-2587-7800-629130d37c6e
Insert of existing embedding ID: ee3fa4ef-817e-ea09-dbee-68e44313d329
Insert of existing embedding ID: 6773ac43-e443-90e8-27ca-dfdbce37312e
Insert of existing embedding ID: 275c9be7-6aa7-902c-c436-cfe9697592f1
Insert of existing embedding ID: f578fe6a-53e2-8b42-f098-da574d18638e
Insert of existing embedding ID: dd08b25f-7d65-88d8-6ec1-a1345f8205d1
...

This is my version of chroma:

```python
>>> import chromadb
>>> chromadb.__version__
'0.4.22'
```
vinayak-revelation commented 8 months ago

I am getting the same error. Checked my chromaDB version and it is '0.4.22'

vinayak-revelation commented 8 months ago

@sarahwooders Happy to test it with my documents after you give me a green light if that's helpful. Thank you.

sarahwooders commented 8 months ago

@vinayak-revelation thank you! Could you please try the latest 0.3.2 release?

vinayak-revelation commented 8 months ago

Tried it, still getting the same error.

File "/home/ubuntu/projects/MemGPT-Prod/memgpt-prod-ve/lib/python3.11/site-packages/chromadb/api/types.py", line 240, in validate_ids
    raise errors.DuplicateIDError(message)
chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: c53f9798-d3ac-6e53-84a2-1ff4f6fb2a4d, e5129649-5ba1-9216-695e-8206a1f8d366, 5096539b-1dfa-8b52-2012-fd0b7e707b97, 88eb4638-2328-d70b-b2e2-879e608ef581, 3697637a-ebb6-70b6-fdf5-e935285deb2e, a5a131e2-a480-8666-ae0c-7b303b8f112a, f576b3d4-b7c8-5a65-3150-bd8198029d61, 6381db94-edaf-75c0-fa82-68ac109f0bf4

Do I need to clear out the chromadb or delete some remnants from previous work before I do this?

Also, memgpt version now gives me: 0.3.2

sarahwooders commented 8 months ago

Hmm, ok, thanks for letting me know @vinayak-revelation - I don't have a large file to test with on hand, but it seems I need to actually repro this to fix it, so I will try to do that ASAP.

noah53866 commented 8 months ago

> Hmm ok thanks for letting me kno @vinayak-revelation - I don't have a large file to test with on hand, but seems I need to actually repro this to fix it so will try to do it asap.

If you’d like, I could upload the file I was using so you can reproduce the error.

sarahwooders commented 8 months ago

Yes that would be great! If you can upload here or DM me on discord that'd be very helpful.

vinayak-revelation commented 8 months ago

I will try to get you one. The file I am using right now is protected, but let me find something and attach it here. I deeply appreciate your assistance!

noah53866 commented 8 months ago

> Yes that would be great! If you can upload here or DM me on discord that'd be very helpful.

bigdata.csv

vinayak-revelation commented 8 months ago

I am trying to find free documents online, and the biggest ones I can find are .txt files (out-of-copyright books). Those ingest with no issues despite being very large. I tried converting those txt files into PDFs, and they work too.

The one interesting data point I do have: those PDFs (my personal health records) parsed and embedded fine with the memgpt load command a couple of releases back, but the same files do not work now.

I will keep looking for a good example that can help us debug this issue in the meantime.

vinayak-revelation commented 8 months ago

I think I found the issue. When you import a PDF, during its chunking, if the parser comes across a page that is just a pasted image with little text, that chunk can end up identical to another page's chunk (if the headers are the same), since the image cannot be parsed out of the PDF and only the header or whatever little text is there gets extracted. For the error:

chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: 7f3c6004-8008-9c9a-5655-edc5ee68164f

How are those IDs calculated? Is Chroma doing this, and if it is, is there a way to skip the duplicates but import the rest?

My suspicion started with https://github.com/embedchain/embedchain/issues/64 and the knowledge that anything other than text in these PDFs is not parsed by the default parser.
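To illustrate the suspected mechanism: if each chunk's ID is a deterministic hash of its text (plus some source/agent identifier), then two pages that parse to identical text will always collide. This is a minimal sketch under that assumption; the `chunk_id` function below is hypothetical, not MemGPT's actual implementation:

```python
import hashlib
import uuid

def chunk_id(text: str, source_id: str) -> uuid.UUID:
    # Hypothetical: derive a deterministic UUID from the chunk text
    # and a source identifier via an MD5 digest (16 bytes -> UUID).
    digest = hashlib.md5(f"{source_id}:{text}".encode("utf-8")).digest()
    return uuid.UUID(bytes=digest)

# Two pages whose parsed text is identical (e.g. an image-only page
# that yields just its header) produce the same ID...
a = chunk_id("Annual Report 2023", "source-1")
b = chunk_id("Annual Report 2023", "source-1")
print(a == b)  # True -> ChromaDB would reject the second insert

# ...while genuinely different text yields a different ID.
c = chunk_id("Annual Report 2023, page 2 body text", "source-1")
print(a == c)  # False
```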

sarahwooders commented 8 months ago

The IDs are created by us (not Chroma) as a hash of the text and agent ID (https://github.com/cpacker/MemGPT/blob/main/memgpt/data_types.py#L308) - we implemented this to avoid duplication in the DB, but I think you're right that it's what's causing the issue. I still need to repro, but if we filter out duplicates before running the ChromaDB insert, the issue will probably be resolved.
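A sketch of the kind of fix described here (drop duplicate IDs from the batch before handing it to the insert call); this is illustrative only, not the actual patch that landed:

```python
def dedupe_records(ids, documents, embeddings):
    # Keep only the first occurrence of each ID so the batch
    # passed to the ChromaDB insert contains no duplicates.
    seen = set()
    out_ids, out_docs, out_embs = [], [], []
    for id_, doc, emb in zip(ids, documents, embeddings):
        if id_ in seen:
            continue
        seen.add(id_)
        out_ids.append(id_)
        out_docs.append(doc)
        out_embs.append(emb)
    return out_ids, out_docs, out_embs

# Example: the third record repeats the first record's ID.
ids = ["a", "b", "a"]
docs = ["page 1", "page 2", "page 1 (image-only duplicate)"]
embs = [[0.1], [0.2], [0.1]]
u_ids, u_docs, u_embs = dedupe_records(ids, docs, embs)
print(u_ids)  # ['a', 'b']
```

The deduplicated lists can then be passed to the collection insert without tripping the uniqueness check.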

vinayak-revelation commented 8 months ago

Example.pdf

Hi @sarahwooders: I was able to generate an example. Notice the first page contains an image while the second page is just text, and both pages have the same text. So I think the parser creates two chunks that hash to the same ID, and the insert then breaks.

This one broke for me when I did:

memgpt load directory --name vd1 --input-files ~/Downloads/Example.pdf

with the error:

chromadb.errors.DuplicateIDError: Expected IDs to be unique, found duplicates of: e87f024f-2b86-d25d-a381-8d02f038b61d

Hope this helps, and thank you for working on the fix. I agree that filtering duplicates before inserting will work :)

sarahwooders commented 8 months ago

@vinayak-revelation thanks I was able to get the same error with that example! Fix should be in #1001

sarahwooders commented 8 months ago

The fix should be in the nightly package, which you can install with pip install pymemgpt-nightly, and it will be in a release in the next 1-2 days.