Open pranauv1 opened 1 year ago
Hey @pranauv1 , it sounds like you're using chroma as a library in part of your Flask webapp, is that correct? Meaning, you're not using Chroma in client-server mode.
Yup! That's right, I'm not using Chroma in client-server mode.
I encountered the same problem while using the chromadb library (version 0.4.10) in a FastAPI webapp. The folder ./chroma/{uuid} was not deleted when I called delete_collection(collection_name).
Here is a minimal test case that does not reproduce this. If someone could help me get a reproduction here, that'd be great!
Output:
chromadb.__version__ 0.4.15
on script run 0.140625
after_reset 0.140625
before 1.7428932189941406
after 0.140625
diff 1.6022682189941406
after_reset end 0.140625
import os
import chromadb
from chromadb.config import Settings
print("chromadb.__version__", chromadb.__version__)
def get_folder_size(start_path: str) -> float:
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size / (1024 * 1024)  # convert bytes to megabytes
script_entry = get_folder_size("./chroma")
print("on script run", script_entry)
client = chromadb.PersistentClient(settings=Settings(allow_reset=True))
client.reset()
after_reset = get_folder_size("./chroma")
print("after_reset", after_reset)
collection = client.get_or_create_collection("fruit")
collection.upsert(
    documents=["apples", "oranges", "bananas", "pineapples"], ids=["1", "2", "3", "4"]
)
# print(collection.query(query_texts=["hawaii"], n_results=1))
# get the size of the folder called ./chroma
before_size = get_folder_size("./chroma")
print("before", before_size)
client.delete_collection("fruit")
after_size = get_folder_size("./chroma")
print("after", after_size)
# difference
print("diff", before_size - after_size)
client.reset()
after_reset_end = get_folder_size("./chroma")
print("after_reset end", after_reset_end)
@jeffchuber @tazarov The problem can be reproduced when the app is restarted.
Based on your test case, I wrote a simple FastAPI demo with two endpoints, upsert and delete, for creating and deleting a collection:
import os
import chromadb
from chromadb.config import Settings
from fastapi import FastAPI
print("chromadb.__version__", chromadb.__version__)
def get_folder_size(start_path: str) -> float:
    total_size = 0
    for dirpath, dirnames, filenames in os.walk(start_path):
        for f in filenames:
            fp = os.path.join(dirpath, f)
            # skip if it is symbolic link
            if not os.path.islink(fp):
                total_size += os.path.getsize(fp)
    return total_size / (1024 * 1024)  # convert bytes to megabytes
app = FastAPI()
@app.get("/upsert")
async def upsert():
    script_entry = get_folder_size("./chroma")
    print("on script run", script_entry)
    client = chromadb.PersistentClient(settings=Settings(allow_reset=True))
    client.reset()
    after_reset = get_folder_size("./chroma")
    print("after_reset", after_reset)
    collection = client.get_or_create_collection("fruit")
    collection.upsert(
        documents=["apples", "oranges", "bananas", "pineapples"], ids=["1", "2", "3", "4"]
    )
    # print(collection.query(query_texts=["hawaii"], n_results=1))
    # get the size of the folder called ./chroma
    before_size = get_folder_size("./chroma")
    print("before", before_size)
@app.get("/delete")
async def delete():
    client = chromadb.PersistentClient(settings=Settings(allow_reset=True))
    client.delete_collection("fruit")
    after_size = get_folder_size("./chroma")
    print("after", after_size)
    # difference
    # print("diff", before_size - after_size)
    client.reset()
    after_reset_end = get_folder_size("./chroma")
    print("after_reset end", after_reset_end)
I started the service with uvicorn test:app --port 8901 --host 0.0.0.0 and accessed localhost:8901/upsert, then I got the expected output:
chromadb.__version__ 0.4.10
INFO: Started server process [1030359]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
on script run 0.0
after_reset 0.12109375
before 1.7233619689941406
INFO: 127.0.0.1:41438 - "GET /upsert HTTP/1.1" 200 OK
But when I restarted the service and accessed localhost:8901/delete, the output was:
chromadb.__version__ 0.4.10
INFO: Started server process [1030462]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
after 1.7233619689941406
after_reset end 1.7233619689941406
INFO: 127.0.0.1:58384 - "GET /delete HTTP/1.1" 200 OK
Obviously, the folder of the collection was not deleted.
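The restart dependence described above could also be checked without FastAPI by running the two steps as separate processes. The following is only a sketch: the file name repro.py and the command-line step argument are hypothetical, and whether it shows the same behavior as the FastAPI demo has not been verified in this thread.

```python
import os
import sys


def get_folder_size(start_path: str) -> float:
    """Return the total size of regular files under start_path, in MiB."""
    total_size = 0
    for dirpath, _dirnames, filenames in os.walk(start_path):
        for name in filenames:
            fp = os.path.join(dirpath, name)
            if not os.path.islink(fp):  # skip symbolic links
                total_size += os.path.getsize(fp)
    return total_size / (1024 * 1024)


def main(step: str, path: str = "./chroma") -> None:
    # Imported lazily so the size helper above has no chromadb dependency.
    import chromadb

    client = chromadb.PersistentClient(path=path)
    if step == "upsert":
        collection = client.get_or_create_collection("fruit")
        collection.upsert(documents=["apples", "oranges"], ids=["1", "2"])
        print("after upsert", get_folder_size(path))
    elif step == "delete":
        print("before delete", get_folder_size(path))
        client.delete_collection("fruit")
        print("after delete", get_folder_size(path))


# Hypothetical usage: "python repro.py upsert", then "python repro.py delete".
if len(sys.argv) > 1 and sys.argv[1] in ("upsert", "delete"):
    main(sys.argv[1])
```

Running the upsert step, letting the process exit, and then running the delete step in a fresh process mirrors the restart between the two requests.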
@Yuhui0620
I ran your server and got this
~/s/chroma main *38 !1 ?5 > uvicorn server:app --port 8901 --host 0.0.0.0
chromadb.__version__ 0.4.15
INFO: Started server process [80410]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
on script run 0.140625
after_reset 0.140625
before 1.7428932189941406
INFO: 127.0.0.1:62386 - "GET /upsert HTTP/1.1" 200 OK
INFO: 127.0.0.1:62386 - "GET /favicon.ico HTTP/1.1" 404 Not Found
after 0.140625
after_reset end 0.140625
INFO: 127.0.0.1:62388 - "GET /delete HTTP/1.1" 200 OK
What version of chroma are you running? Can you run the latest, 0.4.15?
@jeffchuber 0.4.10. 0.4.15 works fine in this case, but it is still not working in my project.
What version is your project using? Check chromadb.__version__.
I upgraded chromadb from 0.4.10 to 0.4.15 just now and still encountered the problem.
@jeffchuber Could you restart the server with Ctrl+C and try again? After I restart the service, the problem still exists in chromadb==0.4.15. The output is:
chromadb.__version__ 0.4.15
INFO: Started server process [1032507]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
on script run 0.0
after_reset 0.140625
before 1.7428932189941406
INFO: 127.0.0.1:33396 - "GET /upsert HTTP/1.1" 200 OK
^CINFO: Shutting down
INFO: Waiting for application shutdown.
INFO: Application shutdown complete.
INFO: Finished server process [1032507]
# Restart the Server
chromadb.__version__ 0.4.15
INFO: Started server process [1032605]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8901 (Press CTRL+C to quit)
after 1.7428932189941406
after_reset end 1.7428932189941406
INFO: 127.0.0.1:58222 - "GET /delete HTTP/1.1" 200 OK
@pranauv1 @Yuhui0620
@tazarov is digging into this right now. We can repro thanks to @Yuhui0620's issue https://github.com/chroma-core/chroma/issues/1309
@pranauv1, @Yuhui0620, this may be a bit late. Still, it is important to note that deleting files, as Chroma does when removing a collection's binary segment, is an action that Chroma delegates to the underlying OS and file storage. We have observed that on Windows, files are not deleted when other processes (e.g. antivirus) are using them. In other instances, users have reported similar behavior for AWS EFS, which is based on NFS. So, while Chroma does its best to remove the files from the filesystem, there are cases where that is simply impossible and/or the FS does not comply.
We can make Chroma fail delete_collection operations when it fails to remove the files from the filesystem, but this has its own drawbacks.
"We can make Chroma fail delete_collection operations when it fails to remove the files from the filesystem, but this has its own drawbacks."
This seems like the correct behavior to me. If we can't actually delete the collection we should report that to the caller, no?
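To make the "report it to the caller" option concrete, here is a minimal sketch of a strict directory removal that raises when files cannot be deleted or are still present afterwards. It is not Chroma's implementation; SegmentCleanupError and remove_dir_strict are hypothetical names used only for illustration.

```python
import os
import shutil


class SegmentCleanupError(RuntimeError):
    """Hypothetical error type; not part of chromadb."""


def remove_dir_strict(path: str) -> None:
    """Remove a directory tree and raise if anything is left behind."""
    if not os.path.isdir(path):
        return  # nothing to clean up
    try:
        shutil.rmtree(path)
    except OSError as exc:
        # For example, a file locked by another process on Windows.
        raise SegmentCleanupError(f"could not remove {path}: {exc}") from exc
    if os.path.exists(path):
        # Covers filesystems that accept the call but leave the files in place.
        raise SegmentCleanupError(f"{path} still exists after removal")
```

The drawback mentioned above still applies: a caller that treats this error as fatal may end up with the collection removed from the metadata while the operation is reported as failed.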
Hello all!
I had the same problem in production and it was very serious for our company!
We add collections with many vectors/documents and update them very often. The problem is that if you take a closer look at the SQLite3 database, all the deleted information with deleted links (foreign keys) keeps adding up, and the DB keeps getting bigger and bigger. In a short space of time we reached over 13 GB in the ChromaDB database folder and the server memory was exploding!
After testing numerous approaches, I found a strange and temporary workaround. Here it is:
ids = chromaColl.get()['ids']
if ids:
    chromaColl.delete(ids)
del chromaColl
_chromadb.delete_collection(collectionName)
Why is it absolutely necessary to call these 2 deletions in order to empty the data correctly?
Thank you !
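One general SQLite property may be relevant to the growing database file, although nobody in this thread has confirmed it as the cause: with default settings, deleting rows does not shrink the file, and only VACUUM returns the freed pages to the filesystem. The sketch below shows this on a throwaway database using only the standard library; the table and function names are made up for the demo and have nothing to do with Chroma's schema.

```python
import os
import sqlite3
import tempfile


def sqlite_size_demo(rows: int = 2000) -> dict:
    """Return the file size in bytes after insert, after delete, and after VACUUM."""
    with tempfile.TemporaryDirectory() as tmpdir:
        db_path = os.path.join(tmpdir, "demo.sqlite3")
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
        conn.executemany(
            "INSERT INTO docs (body) VALUES (?)", [("x" * 1000,)] * rows
        )
        conn.commit()
        sizes = {"after_insert": os.path.getsize(db_path)}

        # Deleted pages go to SQLite's free list; the file keeps its size.
        conn.execute("DELETE FROM docs")
        conn.commit()
        sizes["after_delete"] = os.path.getsize(db_path)

        # VACUUM rebuilds the database file and releases the free pages.
        conn.execute("VACUUM")
        sizes["after_vacuum"] = os.path.getsize(db_path)
        conn.close()
    return sizes


print(sqlite_size_demo())
```

Whether running VACUUM against Chroma's own SQLite file is safe while the client is open is not something this thread establishes, so treat this purely as an illustration of the SQLite behavior.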
What happened?
The local segment file is not deleted.
Why? (my guess)
I referred to #1080, but this only happens in a Flask server, as this folder is being accessed by the server itself. I got no logs on the console, but I guess it should throw this error: WinErr32: folder is being used by another process. I cannot even manually delete this folder while the server is running, which is why I guessed the above!
Versions
chromadb 0.4.14, python 3.8, windows 11
Relevant log output
No response