chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: OSError: [Errno 24] Too many open files: #1379

Closed: meteor260 closed this issue 3 months ago

meteor260 commented 1 year ago

What happened?

I use Django and host it as a service. After running for a long time, the following error occurs: OSError: [Errno 24] Too many open files

Is there some close() function that should be called after a query is run or a collection is created?

Versions

chroma-hnswlib==0.7.3
chromadb==0.4.15

Python: 3.11.4

Relevant log output

No response

meteor260 commented 1 year ago

After checking, I'm almost certain the leak is caused by Chroma.

HammadB commented 1 year ago

What is your file limit? Can you run ulimit -n and report back the value if you are on a Unix system?

Chroma requires a high fd limit, and the default is set quite low on most platforms. We have a setting, chroma_server_nofile, that can be used to raise this limit. I suggest something like 65K.
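
For reference, a minimal sketch of raising that limit through client settings, assuming the chroma_server_nofile field named above is exposed on chromadb's Settings object (later in this thread it's questioned whether it applies outside server mode):

import chromadb
from chromadb.config import Settings

# Sketch: raise Chroma's fd limit via the setting named above; 65535
# follows the ~65K suggestion in this comment.
client = chromadb.PersistentClient(
    settings=Settings(chroma_server_nofile=65535)
)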

meteor260 commented 1 year ago

Thanks. The original limit was 1024; last night I raised it to about 1M, so let me observe it...

meteor260 commented 12 months ago

After a day, I ran: lsof | grep 25801 | wc -l

The number is 117040. I think there is a problem inside...

HammadB commented 12 months ago

That is odd. Can you show which files are being opened? What platform are you on?

meteor260 commented 12 months ago

lsof shows many entries for this file: chroma.sqlite3

The OS is Ubuntu.

I used this to initialize the client:

client = chromadb.PersistentClient()

And I use these functions:

collection.add()
collection.get()
collection.query()

meteor260 commented 11 months ago

Any findings on this issue?

chest3x commented 10 months ago

I am facing a similar issue. I am running Chroma DB from LangChain for a question-answering application.

After a certain amount of load, my app ran into "Too many open files".

I started investigating and found that on each request to my app, the number of open descriptors grows by 5-10.

Example of lsof | grep -c chroma results:

Pre-request: 123
During request: 497
Post-request: 128

And it keeps growing like this after each request.

hedleyroos commented 10 months ago

The issue comes from this line: https://github.com/chroma-core/chroma/blob/main/chromadb/api/segment.py#L692

The class hierarchy is tricky to navigate, but in essence loading an index opens a few related files. Crucially, these files are never closed after they've been used. I've hacked around trying to delete Python objects to see if the files will eventually close, but to no avail. I think hnswlib keeps them open until the entire Python process is killed.

What does work is adding vector_reader.close_persistent_index() in the method I've linked, after the results have been fetched. I don't know if it is thread-safe, and I also don't know whether re-using an index with closed files will fail, but in my testing it kept working. My use case does not modify any indexes while the Python process is running, though.
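
For concreteness, a hedged sketch of where that call would sit (paraphrased pseudocode of the linked query path, not the actual Chroma source):

# Inside the linked method in chromadb/api/segment.py, after the vector
# reader has returned its results (pseudocode):
results = vector_reader.query_vectors(query)
vector_reader.close_persistent_index()  # hack: release hnswlib's open files
return results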

Ultimately a better solution is to give your client object a time-to-live: kill and recreate it every X minutes or calls, which should release the open files. I'll try it shortly and leave feedback here. It may not work because of the aforementioned issue where I suspect hnswlib keeps the files open until the entire Python process is killed.
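
A minimal sketch of that time-to-live idea, assuming a single shared client; the TTL value and helper name are illustrative:

import time
import chromadb

TTL_SECONDS = 600  # illustrative; tune to your workload

_client = None
_created_at = 0.0

def get_client():
    # Recreate the client once it is older than TTL_SECONDS so its
    # accumulated file handles can be released.
    global _client, _created_at
    if _client is None or time.time() - _created_at > TTL_SECONDS:
        _client = chromadb.PersistentClient()
        _created_at = time.time()
    return _client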

tazarov commented 10 months ago

@chest3x, @hedleyroos, thank you for this analysis. I will dig deeper and let you know soon.

hedleyroos commented 10 months ago

I've tested a TTL on the client object. If you also forcibly run import gc; gc.collect() after the client object is destroyed, then the file descriptors are released correctly. Forcing garbage collection sometimes feels like cargo culting and I prefer not to do it in a web-server environment, but it seems to be working. A client pool would be a decent solution.
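
A minimal sketch of that sequence, assuming a persistent client (the path is illustrative):

import gc
import chromadb

client = chromadb.PersistentClient(path="my-vectors")  # illustrative path
# ... run queries ...
del client
gc.collect()  # force collection so the file descriptors are released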

hedleyroos commented 10 months ago

Unfortunately, recreating the client object and doing a subsequent garbage collect works on 0.4.13 but not on 0.4.22.

hedleyroos commented 10 months ago

I've gone through the code in detail, but there are a few different styles of caches, and it needs someone more familiar with the codebase to investigate. However, for Django and any other web server the workaround is simple: ensure you create a fresh client object on each request; the overhead is minimal. Subclass Client to create your own client class (note the import):

from chromadb.api.client import Client

class MyClient(Client):

    def __del__(self):
        # Close any persistent hnswlib indexes the segment manager has
        # opened, releasing their file handles.
        for instance in self._server._manager._instances.values():
            getattr(instance, "close_persistent_index", lambda: None)()

At the end of each request this client will close the open files correctly. It's not an awesome solution (a client pool would still be better), but at least it solves the immediate issue.
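
For illustration, a hypothetical Django view using this pattern; the view, collection name, and persist directory are mine, not from this thread:

import chromadb
from django.http import JsonResponse

def search(request):
    # Fresh client per request; MyClient.__del__ closes the index files
    # once the view returns and the object is collected.
    settings = chromadb.Settings(is_persistent=True, persist_directory="vectors")
    client = MyClient(settings=settings)
    collection = client.get_or_create_collection("docs")
    results = collection.query(query_texts=[request.GET.get("q", "")], n_results=5)
    return JsonResponse(results)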

tazarov commented 10 months ago

@hedleyroos, @chest3x, I don't think the issue you are seeing is related to the binary hnsw index (aka vector_reader). Chroma is thread-safe, and to ensure thread safety it creates per-thread connection pools to the metadata/sys DB, which is essentially the SQLite file under the persistence directory. Each thread opens a new file handle to the SQLite file; this is by design for SQLite operating in a multi-threaded environment.
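
To illustrate the behavior described above (plain sqlite3, not Chroma's code): each connection holds its own handle to the database file, so per-thread pools multiply open handles.

import os
import sqlite3
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.sqlite3")
# Five connections to the same file mean five open handles, visible in lsof.
conns = [sqlite3.connect(path, check_same_thread=False) for _ in range(5)]
for conn in conns:
    conn.close()  # handles are released only on close (or process exit)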

Several factors may be at play in the error you are seeing. On the mitigation side, Chroma has introduced a CHROMA_SERVER_NOFILE setting to raise the number of open files allowed.

hedleyroos commented 9 months ago

@tazarov Chroma is threadsafe, but hnswlib is ultimately a shared object accessed through Python's ctypes, CFFI, or an equivalent interface. When hnswlib is imported, it belongs to the running Python process, not to a single Python object like the Chroma Client.

hnswlib opens files when loading vectors, but when the Python Client object that invoked it is deleted, the shared object cannot automatically know that it has to close those files. The SO "belongs" to the running Python process and has no concept of Python objects. Once the Python process is killed in its entirety, all open file handles are released, probably due to some unload code or a destructor somewhere in hnswlib.

I've read through the caching code in Chroma, and while I should really raise this in a new ticket, I don't think any process should read what ulimit is and then assume it can open files until it reaches that value (see https://github.com/chroma-core/chroma/blob/main/chromadb/segment/impl/manager/local.py#L71). There may be other code and libraries that also need to open files, so no single piece of code can assume it may open the maximum number of files. Also, the LRU cache is only ever used if hint_use_collection is somewhere in the calling stack, which is not always the case. I should really create a ticket for this one :)

I don't run Chroma in server mode, but it looks like CHROMA_SERVER_NOFILE only applies to that mode, not when instantiating clients from local storage.

But as I posted earlier, it's a trivial fix if you have a long-running process like Django and only need to read from, not write to, the vector database(s): don't create one global client object; create one per request instead, and subclass Client with a __del__ method.

Apologies for the long post @tazarov .

nucflash commented 8 months ago

I think there is one more file-handle leak besides the one that @hedleyroos covers in his fix above: when I create databases, the SQLite /path/to/database/chroma.sqlite3 file remains open as if the connection never closes, and it never gets garbage collected either.

Here's a proof of concept:

import time
import chromadb
from chromadb.api.client import Client

""" apply @hedleyroos fix """
class MyClient(Client):
    def __del__(self):
        for instance in self._server._manager._instances.values():
            getattr(instance, "close_persistent_index", lambda: None)()

for i in range(10):
    settings = chromadb.Settings()
    settings.is_persistent = True
    settings.persist_directory = f'chroma-file-leak-{i}'
    chroma_client = MyClient(tenant=chromadb.DEFAULT_TENANT, database=chromadb.DEFAULT_DATABASE, settings=settings)
    collection = chroma_client.create_collection(name="no-name")
    collection.add(documents=['doc'], ids=['id1'])
    del collection
    del chroma_client
# give time to look at `lsof` and open files
# in macOS: Activity Monitor > double click on the python process > Tab "Open Files and Ports"
time.sleep(600)

The listed files show open handles to the respective chroma.sqlite3 files even after the client and collection objects are deleted. This is how it looks on mine:

.../chroma-file-leak-0/chroma.sqlite3
.../chroma-file-leak-1/chroma.sqlite3
.../chroma-file-leak-2/chroma.sqlite3
.../chroma-file-leak-3/chroma.sqlite3
.../chroma-file-leak-4/chroma.sqlite3
.../chroma-file-leak-5/chroma.sqlite3
.../chroma-file-leak-6/chroma.sqlite3
.../chroma-file-leak-7/chroma.sqlite3
.../chroma-file-leak-8/chroma.sqlite3
.../chroma-file-leak-9/chroma.sqlite3

When we work with only a handful of databases, like in the example above, this is not much of an issue, but when we have to populate thousands of databases (e.g., one database per use case), it becomes impossible.

I looked in the code for a way to explicitly close the SQLite connection, but I wasn't successful. Any pointers on how we can close it would be much appreciated.

hedleyroos commented 8 months ago

@nucflash, do chroma_client.clear_system_cache() before you delete the instance. Chroma has a class-level cache, and I think it's keeping handles open.
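
Putting the thread's pieces together, a minimal sketch of the combined workaround (the path is illustrative; clear_system_cache comes from the comment above):

import gc
import chromadb

client = chromadb.PersistentClient(path="chroma-demo")  # illustrative path
# ... create collections, add documents ...
client.clear_system_cache()  # drop Chroma's class-level cache
del client
gc.collect()  # release remaining file descriptors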

tazarov commented 8 months ago

Here's PR #1690, which resolves a similar problem, although it is more targeted at the Chroma server.

tazarov commented 8 months ago

Also, there's work on properly closing all open file handles for the persistent client here: https://github.com/amikos-tech/chroma-core/commit/1e8ce2eed03a951dd115971238e89886652f0811

nucflash commented 8 months ago

Thank you for the swift replies, @hedleyroos and @tazarov! I much appreciate the workaround, and I'm happy to confirm that the combined solutions from @hedleyroos completely solve the issue.

KnightAsterial commented 6 months ago

Has the solution to this problem been merged into ChromaDB? I am still running into this bug when running ChromaDB from LangChain, like @chest3x.

bash99 commented 4 months ago

I also get this error in the current version.

tazarov commented 4 months ago

@bash99 we have a long standing PR #2014 for this issue. Let me see if we can rebase and reprioritize it.

bash99 commented 4 months ago

> @bash99 we have a long standing PR #2014 for this issue. Let me see if we can rebase and reprioritize it.

Yes, this patch fixes it for me. Does it have any side effects?

I've run 10k requests with ab, and my Flask server stays fine under a 1024 open-file limit.

pinsisong commented 3 months ago

I'm having the same issue. When can this be closed, please?

tazarov commented 3 months ago

@pinsisong, thanks for reminding us about this. I've rebased the PR on the latest main, and all tests pass. Let me check with @HammadB about merging this.