chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0
15.58k stars 1.3k forks

[Bug]: Client.DeleteCollection not deleting files, version 0.4.10 #1152

Open adriano515 opened 1 year ago

adriano515 commented 1 year ago

What happened?

When calling client.delete_collection("collection_name"), the entry is deleted from SQLite3, but the directory and its files are not deleted (their contents are not cleared either, since every file is still larger than 0 KB).

Versions

Chroma v0.4.10, Python 3.10, Windows 11

Relevant log output

No response

tazarov commented 1 year ago

@adriano515, this was fixed some time ago with https://github.com/chroma-core/chroma/pull/1080, and we did test against Win10. But if what you're saying is true, then there should be a UUID-named directory in the ./chroma dir that corresponds to the collection's segment.

Can you share some code to reproduce this?
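To check for the UUID-named directory mentioned above, a small helper along these lines could list the candidates. This is only a sketch: the function name uuid_named_dirs is hypothetical, and it assumes that segment directories sit directly under the persist directory and that their names parse as UUIDs.

```python
import os
import uuid


def uuid_named_dirs(persist_dir):
    """Return the sorted names of sub-directories whose name parses as a UUID."""
    found = []
    for name in os.listdir(persist_dir):
        # Plain files such as chroma.sqlite3 are not segment directories.
        if not os.path.isdir(os.path.join(persist_dir, name)):
            continue
        try:
            uuid.UUID(name)
        except ValueError:
            continue  # Directory name is not a UUID, so ignore it.
        found.append(name)
    return sorted(found)


# Example (the path is an assumption; use your own persist directory):
# print(uuid_named_dirs("./chroma"))
```

Running it before and after delete_collection should show whether a segment directory was left behind.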

adriano515 commented 1 year ago

Left a video here: https://discord.com/channels/1073293645303795742/1153432513641988137 @tazarov

alexgravx commented 6 months ago

Hi, I have exactly the same issue on macOS with ChromaDB v0.5.0 and a local persist directory. This is very strange, as this exact issue seems to have been reported in #1009 and solved by #1080.

I am using from langchain_community.vectorstores import Chroma; however, the issue seems to be linked to Chroma, not LangChain.

Steps to reproduce:

[Screenshot, 2024-05-16 at 00:31:53]

[Screenshot, 2024-05-16 at 00:18:17]

[Screenshot, 2024-05-16 at 00:19:10]

How I worked around it, in a way similar to #1080 (which unfortunately doesn't work for me):

I added this code:

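import os
import shutil
import sqlite3

def get_ids(path):
    # Return the ids of all vector segments still registered in chroma.sqlite3.
    database = sqlite3.connect(path)
    try:
        cursor = database.cursor()
        cursor.execute("SELECT id FROM segments WHERE scope = 'VECTOR'")
        return [row[0] for row in cursor.fetchall()]
    finally:
        database.close()

def delete_unexisting_files(path, ids):
    # Remove every directory that no longer has a matching segment id in the db.
    for el in os.listdir(path):
        full_path = os.path.join(path, el)
        # Skip plain files such as chroma.sqlite3 and .DS_Store.
        if os.path.isdir(full_path) and el not in ids:
            shutil.rmtree(full_path)

# chroma_db_dir is the path to the local persist directory.
ids = get_ids(os.path.join(chroma_db_dir, "chroma.sqlite3"))
delete_unexisting_files(chroma_db_dir, ids)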

Of course, it would be nice to add this directly to the library, by simply calling shutil.rmtree on the directories associated with the collection inside .delete_collection(), without having to look up the ids manually in the db...
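Until something like that exists in the library, one possible interim approach is to look up the collection's vector segment ids just before deleting it, then remove only those directories afterwards. The sketch below is not Chroma's API: the helper names are hypothetical, and it assumes that the segments table has a collection column referencing collections.id and that collections has a name column. Please verify this against your own chroma.sqlite3 before relying on it.

```python
import os
import shutil
import sqlite3


def vector_segment_ids(db_path, collection_name):
    """Return the ids of the VECTOR segments belonging to one collection.

    Assumes segments.collection references collections.id (unverified schema).
    """
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT s.id FROM segments s "
            "JOIN collections c ON s.collection = c.id "
            "WHERE c.name = ? AND s.scope = 'VECTOR'",
            (collection_name,),
        ).fetchall()
    finally:
        conn.close()
    return [row[0] for row in rows]


def remove_segment_dirs(persist_dir, segment_ids):
    """Delete the directories named after the given segment ids, if present."""
    removed = []
    for segment_id in segment_ids:
        path = os.path.join(persist_dir, segment_id)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(segment_id)
    return removed


# Intended order of operations (client and chroma_db_dir come from your code):
# ids = vector_segment_ids(os.path.join(chroma_db_dir, "chroma.sqlite3"), "collection_name")
# client.delete_collection("collection_name")
# remove_segment_dirs(chroma_db_dir, ids)
```

The ids have to be collected first because, once the collection is deleted, its rows are gone from the db and the directory can no longer be matched to it.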

Should I re-open an issue, @tazarov? The #1080 fix seems to act on .delete() in /segment/impl/vector/local_persistent_hnsw.py, but .delete_collection() is defined in /api/segment.py and in /db/mixins/sysdb.py.

vinay-kasireddy commented 5 months ago

Hi, I have the same issue in a Windows environment. All you need to do is create a collection, add some documents, and try to delete it. You will see the collection go away but not the directories. In our use case, we need to run embeddings on a daily basis, and as you can imagine this results in a proliferation of directories, leading to slow retrievals. So please fix this as soon as possible.

tazarov commented 4 months ago

@alexgravx, let me revisit this. Can you share some details about your OS version, Python version, and CPU (M or Intel)?

EDIT: Do you have an antivirus or similar that may scan open files, thus preventing Chroma from removing the dir?

alexgravx commented 4 months ago

Hi @tazarov,

Thanks for your reply! Here are the details:

OS version: macOS Sonoma 14.4 and 14.5
Python version: 3.9.6
CPU: M2 (ARM architecture), on a MacBook Air

I don’t have any antivirus. The only protections on my Mac are the ones from Apple. Moreover, I didn’t get any alert at the time, so I don't think it was linked to another app or process.