Open diwu-sf opened 1 month ago
I did some debugging into this, and it appears that many unix sockets are being opened to the docker daemon socket per "click". There appears to be a leak somewhere.
Every time the docker container log is retrieved, a bunch of unix socket FDs are leaked. The issue is the same as what's reported in https://github.com/docker/docker-py/issues/3282
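For anyone reproducing this, the FD growth is easy to observe programmatically too. This is a minimal stdlib sketch (mine, not from the report) that counts a process's open descriptors via `/proc` on Linux; on macOS, where the `lsof` output below was captured, `lsof -p <pid> | wc -l` does the same job.

```python
import os
import socket

def open_fd_count():
    # /proc/self/fd lists this process's open file descriptors (Linux only;
    # on macOS use `lsof -p <pid> | wc -l` instead).
    return len(os.listdir('/proc/self/fd'))

before = open_fd_count()
leaked = socket.socket()   # simulate one leaked socket FD
after = open_fd_count()
print(after - before)      # 1: one new descriptor is now held open
leaked.close()
```

Repeating the "click" action while watching this counter (or `lsof`) makes the per-click accumulation obvious.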
As far as I can tell, the best way to fix this would be to use `container.logs(stream=True, follow=True)`, because the bug is in the generator that fetches `since` -> `until` segments every time `get_and_clear` is called.

The other leak, of TCP sockets, comes from `requests.session()` returning a response that is never explicitly closed, so each one accumulates as a leaked connection. Solutions are:

- call `close()` on each response so the connection is returned to the pool (looks like a lot of refactoring, not worth it IMO)
- set `session.headers['Connection'] = 'close'` to automatically close the connection (this one worked with minimal change; I'd just fix it this way, and it doesn't matter perf-wise since the TCP connection to the sandbox doesn't get re-used)

This issue should now be resolved on main
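The `Connection: close` workaround from the list above amounts to a couple of lines. This is an illustrative sketch (the URL is a placeholder, not OpenHands code):

```python
import requests

# Workaround described above: force "Connection: close" on the session so
# the server tears down each TCP connection after its response, instead of
# the connection lingering (and eventually leaking) in CLOSE_WAIT.
session = requests.Session()
session.headers['Connection'] = 'close'

# Any request prepared through this session now carries the header:
prepared = session.prepare_request(
    requests.Request('GET', 'http://localhost:30147/logs')  # placeholder URL
)
print(prepared.headers['Connection'])  # close
```

Session headers are merged into every request sent through the session, so no per-call changes are needed.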
The bug is less severe, but it's still fundamentally there. Take a look at the output of `lsof -p {pid of the uvicorn process}`:
```
python3.1 12305 diwu 98u IPv6 0xcc4fb2f8cb35a3a9 0t0 TCP localhost:54187->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 99u unix 0xcc4fb3072e10a861 0t0 ->0xcc4fb3072e10a929
python3.1 12305 diwu 100u IPv6 0xcc4fb2f8cb35b1a9 0t0 TCP localhost:54191->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 101u unix 0xcc4fb3072e1083a9 0t0 ->0xcc4fb3072e108471
python3.1 12305 diwu 102u IPv6 0xcc4fb2f8cb35fea9 0t0 TCP localhost:54194->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 103u unix 0xcc4fb3072e1089e9 0t0 ->0xcc4fb3072e108ab1
python3.1 12305 diwu 104u IPv6 0xcc4fb2f8cb3605a9 0t0 TCP localhost:54198->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 105u unix 0xcc4fb3072e108e99 0t0 ->0xcc4fb3072e109411
python3.1 12305 diwu 106u IPv6 0xcc4fb2f8cb360ca9 0t0 TCP localhost:54202->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 107u unix 0xcc4fb3072e108219 0t0 ->0xcc4fb3072e1082e1
python3.1 12305 diwu 108u IPv6 0xcc4fb2f8cb35d4a9 0t0 TCP localhost:54205->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 109u unix 0xcc4fb3072e108859 0t0 ->0xcc4fb3072e108921
python3.1 12305 diwu 110u IPv6 0xcc4fb2f8cb35e2a9 0t0 TCP localhost:54209->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 111u unix 0xcc4fb3072e108b79 0t0 ->0xcc4fb3072e108c41
python3.1 12305 diwu 112u IPv6 0xcc4fb2f8cb35f0a9 0t0 TCP localhost:54213->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 113u unix 0xcc4fb3072e109fc9 0t0 ->0xcc4fb3072e10a091
python3.1 12305 diwu 114u IPv6 0xcc4fb2f8cb3595a9 0t0 TCP localhost:54262->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 115u unix 0xcc4fb3072e1097f9 0t0 ->0xcc4fb3072e109a51
python3.1 12305 diwu 116u IPv6 0xcc4fb2f8cb359ca9 0t0 TCP localhost:54267->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 117u unix 0xcc4fb3072e108791 0t0 ->0xcc4fb3072e108d09
python3.1 12305 diwu 118u IPv6 0xcc4fb2f8cb3579a9 0t0 TCP localhost:54270->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 119u unix 0xcc4fb3072e108601 0t0 ->0xcc4fb3072e1086c9
python3.1 12305 diwu 120u IPv6 0xcc4fb2f8cb3533a9 0t0 TCP localhost:54274->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 121u unix 0xcc4fb3072e10a3b1 0t0 ->0xcc4fb3072e10a479
```
The TCP CLOSE_WAIT leak is the easy one to solve, by setting the connection header to `close`. The other, unix-socket leak is still the log tailing issue.
The TCP sockets aren't being leaked anymore, so the only remaining accumulating leak is the unix sockets to the docker daemon.
You can repro it by repeatedly clicking on a file and then running `lsof -p ... | grep unix`
@diwu-sf is this bug finally resolved? tofarr put in a few fixes.
No, there's still a unix socket leak to the docker socket, due to the log streamer.
Use the same repro and you should see that unix sockets still accumulate per click.
Was able to see the leakage by following `lsof -p` for this process, and clicking on files:

```
~/.cache/pypoetry/virtualenvs/openhands-ai-uYxnY0EM-py3.12/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=5, pipe_handle=7) --multiprocessing-fork
```
Strange that the uvicorn proc itself doesn't own the leak, just this one. Which I guess points to the leak being in a thread?
It’s the docker log streamer thread. It never actually closed the log line generator
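Guaranteeing the generator's `close()` runs can be sketched roughly like this (a hedged sketch, not OpenHands code: `stream_logs` is an illustrative name, and `container` stands in for a docker-py container object):

```python
from contextlib import closing

def stream_logs(container):
    # container.logs(stream=True, follow=True) returns a streaming object
    # in docker-py; abandoning it without close() leaves the underlying
    # unix socket to the docker daemon open. contextlib.closing guarantees
    # close() runs even if the consumer stops iterating early.
    with closing(container.logs(stream=True, follow=True)) as log_generator:
        for raw_line in log_generator:
            yield raw_line.decode('utf-8', errors='replace')
```

Because `stream_logs` is itself a generator, calling `.close()` on it (or letting it be garbage-collected after exhaustion) raises `GeneratorExit` inside the `with` block, which in turn closes the underlying docker stream.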
OK I fixed several leakage issues, but this one persists 🙃
I did start calling `log_generator.close()` but no luck.
For posterity, you can watch the leakage with:

```
while true; do lsof -p $(pgrep -f "tracker_fd") | wc -l; sleep 1; done
```
All the leaks go away if you make LogBuffer a null class (replace all method logic with `pass` or return empty values).
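The null-class experiment can be sketched like this (a hypothetical stand-in: `get_and_clear` mirrors the method named earlier in the thread, the rest is illustrative):

```python
class NullLogBuffer:
    """No-op stand-in for LogBuffer: no docker log generator is ever
    created, so if the FD count stops growing with this class swapped in,
    the leak is isolated to the real LogBuffer's log streaming."""

    def __init__(self, container=None):
        self.container = container

    def get_and_clear(self):
        # The real method drains buffered log lines; here, always empty.
        return []

    def close(self):
        pass

buf = NullLogBuffer()
print(buf.get_and_clear())  # []
```

This is the null-object pattern used purely as a bisection tool: it keeps the rest of the server running while removing the suspect component's side effects.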
I've further confirmed that if you don't instantiate the `log_generator` in LogBuffer, the leak goes away.
I've also confirmed that we're closing every `log_generator` we create.
TBH at this point I'm assuming there's a bug in the docker SDK
There is a bug; see the SDK issue I referenced earlier in this thread.
There's also a workaround fix for the leak in that docker issues thread.
Is there an existing issue for the same bug?
Describe the bug and reproduction steps
Note, I searched for the error message "Too many open files" and didn't see any open issue against this error.
Repro steps:

1. do nothing
2. click on files in `/workspace`, just a few files will do

Eventually, the UI and backend server become broken. In the UI, the message "Failed to fetch file" will show up. In the backend, when running with `DEBUG=1 make run`, the "Too many open files" error message shows up.

OpenHands Installation

Docker command in README
OpenHands Version
main
Operating System
MacOS
Logs, Errors, Screenshots, and Additional Context
No response