All-Hands-AI / OpenHands

🙌 OpenHands: Code Less, Make More
https://all-hands.dev
MIT License
33.25k stars 3.81k forks source link

[Bug]: Repeated clicking on the same file eventually leads to server "Too many open files" error and session becomes unusable #4538

Open diwu-sf opened 2 days ago

diwu-sf commented 2 days ago

Is there an existing issue for the same bug?

Describe the bug and reproduction steps

Note, I searched for the error message "Too many open files" and didn't see any open issue against this error.

Repro steps:

  1. Run with a prompt of do nothing
  2. Add files to the /workspace, just a few files will do
  3. In the Workspace UI, click open one of the files
  4. Repeatedly click on the file quickly over and over again

Eventually, the UI and backend server becomes broken. In the UI, the message "Failed to fetch file" will show up. In the backend, when running with DEBUG=1 make run this error message shows up:

Traceback (most recent call last):
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 401, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/applications.py", line 113, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/errors.py", line 187, in __call__
    raise exc
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/errors.py", line 165, in __call__
    await self.app(scope, receive, _send)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/base.py", line 185, in __call__
    with collapse_excgroups():
  File "/Users/diwu/miniconda3/lib/python3.12/contextlib.py", line 158, in __exit__
    self.gen.throw(value)
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/_utils.py", line 82, in collapse_excgroups
    raise exc
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/starlette/middleware/base.py", line 187, in __call__
    response = await self.dispatch_func(request, call_next)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/server/listen.py", line 217, in attach_session
    request.state.conversation = await call_sync_from_async(
                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/utils/async_utils.py", line 18, in call_sync_from_async
    result = await coro
             ^^^^^^^^^^
  File "/Users/diwu/miniconda3/lib/python3.12/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/utils/async_utils.py", line 17, in <lambda>
    coro = loop.run_in_executor(None, lambda: fn(*args, **kwargs))
                                              ^^^^^^^^^^^^^^^^^^^
  File "openhands/server/session/manager.py", line 52, in attach_to_conversation
    return Conversation(sid, file_store=self.file_store, config=self.config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/server/session/conversation.py", line 31, in __init__
    self.runtime = runtime_cls(
                   ^^^^^^^^^^^^
  File "openhands/runtime/impl/eventstream/eventstream_runtime.py", line 162, in __init__
    self.docker_client: docker.DockerClient = self._init_docker_client()
                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "openhands/runtime/impl/eventstream/eventstream_runtime.py", line 241, in _init_docker_client
    raise ex
  File "openhands/runtime/impl/eventstream/eventstream_runtime.py", line 236, in _init_docker_client
    return docker.from_env()
           ^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/client.py", line 94, in from_env
    return cls(
           ^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/client.py", line 45, in __init__
    self.api = APIClient(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/api/client.py", line 207, in __init__
    self._version = self._retrieve_server_version()
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pypoetry/virtualenvs/openhands-ai-zQh2l52--py3.12/lib/python3.12/site-packages/docker/api/client.py", line 230, in _retrieve_server_version
    raise DockerException(

OpenHands Installation

Docker command in README

OpenHands Version

main

Operating System

MacOS

Logs, Errors, Screenshots, and Additional Context

No response

diwu-sf commented 2 days ago

I did some debugging into this, it appears that there's many unix sockets being opened per "click" to the docker daemon socket. There appear to be a leak somewhere.

diwu-sf commented 2 days ago

Every time the docker container log is retrieved, a bunch of unix socket FDs are leaked. The issue is the same as what's reported in https://github.com/docker/docker-py/issues/3282

As far as I can tell, the best way to fix this would be to:

The other leak of the TCP socket is from requests.session() returning a response but never explicitly closing that response object, so it accumulates as leaked connection. Solutions are:

tofarr commented 1 day ago

This issue should now be resolved on main

diwu-sf commented 1 day ago

Bug is less severe, but it's still fundamentally there, take a look at the output of lsof -p {pid of the uvicorn process}:

python3.1 12305 diwu   98u    IPv6 0xcc4fb2f8cb35a3a9       0t0                 TCP localhost:54187->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu   99u    unix 0xcc4fb3072e10a861       0t0                     ->0xcc4fb3072e10a929
python3.1 12305 diwu  100u    IPv6 0xcc4fb2f8cb35b1a9       0t0                 TCP localhost:54191->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  101u    unix 0xcc4fb3072e1083a9       0t0                     ->0xcc4fb3072e108471
python3.1 12305 diwu  102u    IPv6 0xcc4fb2f8cb35fea9       0t0                 TCP localhost:54194->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  103u    unix 0xcc4fb3072e1089e9       0t0                     ->0xcc4fb3072e108ab1
python3.1 12305 diwu  104u    IPv6 0xcc4fb2f8cb3605a9       0t0                 TCP localhost:54198->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  105u    unix 0xcc4fb3072e108e99       0t0                     ->0xcc4fb3072e109411
python3.1 12305 diwu  106u    IPv6 0xcc4fb2f8cb360ca9       0t0                 TCP localhost:54202->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  107u    unix 0xcc4fb3072e108219       0t0                     ->0xcc4fb3072e1082e1
python3.1 12305 diwu  108u    IPv6 0xcc4fb2f8cb35d4a9       0t0                 TCP localhost:54205->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  109u    unix 0xcc4fb3072e108859       0t0                     ->0xcc4fb3072e108921
python3.1 12305 diwu  110u    IPv6 0xcc4fb2f8cb35e2a9       0t0                 TCP localhost:54209->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  111u    unix 0xcc4fb3072e108b79       0t0                     ->0xcc4fb3072e108c41
python3.1 12305 diwu  112u    IPv6 0xcc4fb2f8cb35f0a9       0t0                 TCP localhost:54213->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu  113u    unix 0xcc4fb3072e109fc9       0t0                     ->0xcc4fb3072e10a091
python3.1 12305 diwu  114u    IPv6 0xcc4fb2f8cb3595a9       0t0                 TCP localhost:54262->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  115u    unix 0xcc4fb3072e1097f9       0t0                     ->0xcc4fb3072e109a51
python3.1 12305 diwu  116u    IPv6 0xcc4fb2f8cb359ca9       0t0                 TCP localhost:54267->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  117u    unix 0xcc4fb3072e108791       0t0                     ->0xcc4fb3072e108d09
python3.1 12305 diwu  118u    IPv6 0xcc4fb2f8cb3579a9       0t0                 TCP localhost:54270->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  119u    unix 0xcc4fb3072e108601       0t0                     ->0xcc4fb3072e1086c9
python3.1 12305 diwu  120u    IPv6 0xcc4fb2f8cb3533a9       0t0                 TCP localhost:54274->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu  121u    unix 0xcc4fb3072e10a3b1       0t0                     ->0xcc4fb3072e10a479

TCP CLOSE_WAIT is the one that's easy to solve, by setting the connection header to CLOSE. the other unix one is still the log tailing issue

diwu-sf commented 13 hours ago

The TCP isn't being leaked, so now the only remaining accumulating leak is the unix docker sockets. You can repro it by repeated clicking on a file and then do lsof -p ... | grep unix