chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: client.delete_collection does not delete local segment directories in Flask. #1245

Open pranauv1 opened 1 year ago

pranauv1 commented 1 year ago

What happened?

The local segment file is not deleted.

Why? (my guess)

I referred to #1080, but this only happens in a Flask server, because the folder is being accessed by the server itself. I got no logs on the console, but I guess it should throw WinError 32 (the folder is being used by another process). I can't even delete this folder manually while the server is running, which is why I guessed the above!

Versions

chromadb 0.4.14, python 3.8, windows 11

Relevant log output

No response
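
The WinError 32 guess above can be checked outside of Chroma. The following is a minimal sketch, not Chroma's own code: `try_remove_dir` is a hypothetical helper that attempts to delete a directory tree and hands back the `OSError` if the operating system refuses. On Windows a sharing violation is normally reported as a `PermissionError` whose `winerror` attribute is 32; other platforms do not set that attribute.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Optional


def try_remove_dir(path: str) -> Optional[OSError]:
    """Try to delete a directory tree; return the OSError if the OS refuses, else None."""
    try:
        shutil.rmtree(path)
    except FileNotFoundError:
        return None  # already gone, nothing to report
    except OSError as exc:
        # On Windows a sharing violation surfaces as PermissionError with
        # exc.winerror == 32; other platforms do not set winerror at all.
        return exc
    return None


if __name__ == "__main__":
    # Hypothetical stand-in for a segment directory such as ./chroma/<uuid>.
    demo_dir = Path(tempfile.mkdtemp()) / "segment"
    demo_dir.mkdir()
    err = try_remove_dir(str(demo_dir))
    if err is None:
        print("removed")
    else:
        print("not removed:", getattr(err, "winerror", None), err)
```

Running this against the leftover folder while the Flask server is up should show whether the OS is the one blocking the delete.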

beggers commented 1 year ago

Hey @pranauv1 , it sounds like you're using chroma as a library in part of your Flask webapp, is that correct? Meaning, you're not using Chroma in client-server mode.

pranauv1 commented 1 year ago

Yup! That's right, I'm not using Chroma in client-server mode.

Yuhui0620 commented 1 year ago

I encountered the same problem while using the chromadb library (version 0.4.10) in a FastAPI webapp. The folder ./chroma/{uuid} was not deleted when I called delete_collection(collection_name).

jeffchuber commented 1 year ago

here is a minimal test case that does not reproduce this. If someone could help me get a reproduction here, that'd be great!

output

chromadb.__version__ 0.4.15
on script run 0.140625
after_reset 0.140625
before 1.7428932189941406
after 0.140625
diff 1.6022682189941406
after_reset end 0.140625

script

import os
import chromadb
from chromadb.config import Settings

print("chromadb.__version__", chromadb.__version__)

def get_folder_size(start_path: str) -> float:
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size / (1024 * 1024)  # convert bytes to megabytes

script_entry = get_folder_size("./chroma")
print("on script run", script_entry)

client = chromadb.PersistentClient(settings=Settings(allow_reset=True))

client.reset()

after_reset = get_folder_size("./chroma")
print("after_reset", after_reset)

collection = client.get_or_create_collection("fruit")
collection.upsert(
    documents=["apples", "oranges", "bananas", "pineapples"], ids=["1", "2", "3", "4"]
)

# print(collection.query(query_texts=["hawaii"], n_results=1))

# get the size of the folder called ./chroma

before_size = get_folder_size("./chroma")
print("before", before_size)

client.delete_collection("fruit")
after_size = get_folder_size("./chroma")
print("after", after_size)

# difference
print("diff", before_size - after_size)

client.reset()
after_reset_end = get_folder_size("./chroma")
print("after_reset end", after_reset_end)

Yuhui0620 commented 1 year ago

@jeffchuber @tazarov The problem can be reproduced when the app is restarted.

Based on your test case, I wrote a simple FastAPI demo with upsert and delete endpoints for creating and deleting a collection:


import os

import chromadb
from chromadb.config import Settings
from fastapi import FastAPI

print("chromadb.__version__", chromadb.__version__)

def get_folder_size(start_path: str) -> float:
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)

    return total_size / (1024 * 1024)  # convert bytes to megabytes

app = FastAPI()

@app.get("/upsert")
async def upsert():
    script_entry = get_folder_size("./chroma")
    print("on script run", script_entry)

    client = chromadb.PersistentClient(settings=Settings(allow_reset=True))

    client.reset()

    after_reset = get_folder_size("./chroma")
    print("after_reset", after_reset)

    collection = client.get_or_create_collection("fruit")
    collection.upsert(
        documents=["apples", "oranges", "bananas", "pineapples"], ids=["1", "2", "3", "4"]
    )

    # print(collection.query(query_texts=["hawaii"], n_results=1))

    # get the size of the folder called ./chroma

    before_size = get_folder_size("./chroma")
    print("before", before_size)

@app.get("/delete")
async def delete():
    client = chromadb.PersistentClient(settings=Settings(allow_reset=True))
    client.delete_collection("fruit")
    after_size = get_folder_size("./chroma")
    print("after", after_size)

    # difference
    # print("diff", before_size - after_size)

    client.reset()
    after_reset_end = get_folder_size("./chroma")
    print("after_reset end", after_reset_end)

I started the service with uvicorn test:app --port 8901 --host 0.0.0.0 and accessed localhost:8901/upsert, then I got the expected output:

chromadb.__version__ 0.4.10
INFO:     Started server process [1030359]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
on script run 0.0
after_reset 0.12109375
before 1.7233619689941406
INFO:     127.0.0.1:41438 - "GET /upsert HTTP/1.1" 200 OK

But when I restarted the service and accessed localhost:8901/delete, the output was:

chromadb.__version__ 0.4.10
INFO:     Started server process [1030462]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
after 1.7233619689941406
after_reset end 1.7233619689941406
INFO:     127.0.0.1:58384 - "GET /delete HTTP/1.1" 200 OK

Obviously, the folder of the collection was not deleted.
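
Folder size is an indirect signal. A more direct check is to list which segment folders are still on disk. The sketch below assumes the layout described earlier in this thread (`./chroma/{uuid}`); `list_segment_dirs` is a hypothetical helper and not part of chromadb.

```python
import os
import uuid
from typing import List


def list_segment_dirs(persist_dir: str) -> List[str]:
    """Return names of sub-directories of persist_dir that parse as UUIDs."""
    found: List[str] = []
    if not os.path.isdir(persist_dir):
        return found
    for name in sorted(os.listdir(persist_dir)):
        if not os.path.isdir(os.path.join(persist_dir, name)):
            continue  # skip plain files such as the SQLite database
        try:
            uuid.UUID(name)
        except ValueError:
            continue  # not a UUID-named folder
        found.append(name)
    return found
```

Calling `list_segment_dirs("./chroma")` before and after `/delete` would show exactly which folder, if any, survived.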

jeffchuber commented 1 year ago

@Yuhui0620

I ran your server and got this

~/s/chroma main *38 !1 ?5 > uvicorn server:app --port 8901 --host 0.0.0.0
chromadb.__version__ 0.4.15
INFO:     Started server process [80410]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
on script run 0.140625
after_reset 0.140625
before 1.7428932189941406
INFO:     127.0.0.1:62386 - "GET /upsert HTTP/1.1" 200 OK
INFO:     127.0.0.1:62386 - "GET /favicon.ico HTTP/1.1" 404 Not Found
after 0.140625
after_reset end 0.140625
INFO:     127.0.0.1:62388 - "GET /delete HTTP/1.1" 200 OK

What version of chroma are you running? Can you run the latest, 0.4.15?

Yuhui0620 commented 1 year ago

@jeffchuber 0.4.10. 0.4.15 works fine in this case, but it is still not working in my project.

jeffchuber commented 1 year ago

what version is your project using? chromadb.__version__

Yuhui0620 commented 1 year ago

I upgraded chromadb from 0.4.10 to 0.4.15 just now and still encountered the problem

Yuhui0620 commented 1 year ago

@jeffchuber Could you stop the server with Ctrl+C, restart it, and try again? After I restart the service, the problem still exists in chromadb==0.4.15. The output is:

chromadb.__version__ 0.4.15
INFO:     Started server process [1032507]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
on script run 0.0
after_reset 0.140625
before 1.7428932189941406
INFO:     127.0.0.1:33396 - "GET /upsert HTTP/1.1" 200 OK
^CINFO:     Shutting down
INFO:     Waiting for application shutdown.
INFO:     Application shutdown complete.
INFO:     Finished server process [1032507]

# Restart the Server

chromadb.__version__ 0.4.15
INFO:     Started server process [1032605]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
after 1.7428932189941406
after_reset end 1.7428932189941406
INFO:     127.0.0.1:58222 - "GET /delete HTTP/1.1" 200 OK

jeffchuber commented 12 months ago

@pranauv1 @Yuhui0620

@tazarov is digging into this right now. We can repro thanks to @Yuhui0620's issues https://github.com/chroma-core/chroma/issues/1309

tazarov commented 9 months ago

@pranauv1, @Yuhui0620, this may be a bit late, but it is important to note that when Chroma removes a collection's binary segment, the actual file deletion is delegated to the underlying OS and file storage. We have observed that on Windows, files are sometimes not deleted because other processes (e.g. antivirus) are using them. Users have also reported similar behavior on AWS EFS, which is based on NFS. So, while Chroma does its best to remove the files from the filesystem, there are cases where that is simply impossible and/or the FS does not comply.

We can make Chroma fail delete_collection operations when it fails to remove the files from the filesystem, but this has its own drawbacks.

beggers commented 9 months ago

> We can make Chroma fail delete_collection operations when it fails to remove the files from the filesystem, but this has its own drawbacks.

This seems like the correct behavior to me. If we can't actually delete the collection we should report that to the caller, no?
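
Until something like that exists in Chroma itself, a caller could approximate "report the failure" on their own side. This is only a sketch of the idea, not Chroma's behavior or API: `delete_and_verify` and `SegmentCleanupError` are hypothetical names, and the caller is assumed to already know the on-disk folder to check.

```python
import os
from typing import Callable


class SegmentCleanupError(RuntimeError):
    """Raised when the delete call returned but files remain on disk."""


def delete_and_verify(delete_fn: Callable[[], None], segment_dir: str) -> None:
    """Run delete_fn, then fail loudly if segment_dir still exists."""
    delete_fn()
    if os.path.exists(segment_dir):
        raise SegmentCleanupError(f"{segment_dir} still exists after delete")
```

With Chroma this might be called as `delete_and_verify(lambda: client.delete_collection("fruit"), segment_path)`, where `segment_path` is the folder for that collection. How to derive that path is not covered in this thread. The drawback mentioned above still applies: the collection is already gone logically by the time the check fails.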

jczic commented 2 months ago

Hello all!

I had the same problem in production and it was very serious for our company!

We add collections with many vectors/documents and update them very often. The problem is that, if you take a closer look at the SQLite3 database, all the deleted information with deleted links (foreign keys) keeps adding up, so the DB keeps getting bigger and bigger. In a short space of time we reached over 13 GB in the ChromaDB database folder and the server memory was exploding!

After testing numerous approaches, I found a strange, temporary workaround. Here it is:

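# First delete every record in the collection...
ids = chromaColl.get()['ids']
if ids:
    chromaColl.delete(ids)
# ...then drop the local handle and delete the collection itself.
del chromaColl
_chromadb.delete_collection(collectionName)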

Why are both deletions necessary to empty the data correctly?

Thank you!
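
One thing worth checking for the database growth described above: unless auto-vacuum is enabled, SQLite keeps pages freed by deletes on an internal free list instead of shrinking the file. Whether that explains the 13 GB here is not established in this thread. The sketch below uses only the standard `sqlite3` module; the helper names are hypothetical, and the database path (for example `./chroma/chroma.sqlite3`) is an assumption about the local setup.

```python
import sqlite3
from typing import Tuple


def free_page_stats(db_path: str) -> Tuple[int, int]:
    """Return (page_count, freelist_count) for a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        page_count = conn.execute("PRAGMA page_count").fetchone()[0]
        free_pages = conn.execute("PRAGMA freelist_count").fetchone()[0]
    finally:
        conn.close()
    return page_count, free_pages


def vacuum(db_path: str) -> None:
    """Rebuild the file so free pages are returned to the filesystem."""
    # isolation_level=None puts the connection in autocommit mode, since
    # VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(db_path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
```

A large `freelist_count` relative to `page_count` would suggest the file is mostly reclaimable space. Work on a copy, or run `vacuum` only while no other process has the database open.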