Open diwu-sf opened 1 month ago
I did some debugging into this, and it appears that many unix sockets are being opened to the docker daemon socket per "click". There appears to be a leak somewhere.
Every time the docker container log is retrieved, a bunch of unix socket FDs are leaked. The issue is the same as what's reported in https://github.com/docker/docker-py/issues/3282
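For anyone reproducing this, the FD growth is easy to observe programmatically too. This is a minimal stdlib sketch (mine, not from the report) that counts a process's open descriptors via `/proc` on Linux; on macOS, where the `lsof` output below was captured, `lsof -p <pid> | wc -l` does the same job.

```python
import os
import socket

def open_fd_count():
    # /proc/self/fd lists this process's open file descriptors (Linux only;
    # on macOS use `lsof -p <pid> | wc -l` instead).
    return len(os.listdir('/proc/self/fd'))

before = open_fd_count()
leaked = socket.socket()   # simulate one leaked socket FD
after = open_fd_count()
print(after - before)      # 1: one new descriptor is now held open
leaked.close()
```

Repeating the "click" action while watching this counter (or `lsof`) makes the per-click accumulation obvious.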
As far as I can tell, the best way to fix this would be to use `container.logs(stream=True, follow=True)`, because the bug is in the generator that fetches `since` -> `until` segments every time `get_and_clear` is called.

The other leak, of TCP sockets, comes from `requests.session()` returning a response that is never explicitly closed, so each one accumulates as a leaked connection. Solutions are:

- call `close()` on each response so the connection is returned to the pool (looks like a lot of refactoring, not worth it IMO)
- set `session.headers['Connection'] = 'close'` to automatically close the connection (this one worked with minimal change; I'd just fix it this way, and it doesn't matter perf-wise since the TCP connection to the sandbox doesn't get re-used)

This issue should now be resolved on main
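The `Connection: close` workaround from the list above amounts to a couple of lines. This is an illustrative sketch (the URL is a placeholder, not OpenHands code):

```python
import requests

# Workaround described above: force "Connection: close" on the session so
# the server tears down each TCP connection after its response, instead of
# the connection lingering (and eventually leaking) in CLOSE_WAIT.
session = requests.Session()
session.headers['Connection'] = 'close'

# Any request prepared through this session now carries the header:
prepared = session.prepare_request(
    requests.Request('GET', 'http://localhost:30147/logs')  # placeholder URL
)
print(prepared.headers['Connection'])  # close
```

Session headers are merged into every request sent through the session, so no per-call changes are needed.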
The bug is less severe, but it's still fundamentally there. Take a look at the output of `lsof -p {pid of the uvicorn process}`:
```
python3.1 12305 diwu 98u IPv6 0xcc4fb2f8cb35a3a9 0t0 TCP localhost:54187->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 99u unix 0xcc4fb3072e10a861 0t0 ->0xcc4fb3072e10a929
python3.1 12305 diwu 100u IPv6 0xcc4fb2f8cb35b1a9 0t0 TCP localhost:54191->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 101u unix 0xcc4fb3072e1083a9 0t0 ->0xcc4fb3072e108471
python3.1 12305 diwu 102u IPv6 0xcc4fb2f8cb35fea9 0t0 TCP localhost:54194->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 103u unix 0xcc4fb3072e1089e9 0t0 ->0xcc4fb3072e108ab1
python3.1 12305 diwu 104u IPv6 0xcc4fb2f8cb3605a9 0t0 TCP localhost:54198->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 105u unix 0xcc4fb3072e108e99 0t0 ->0xcc4fb3072e109411
python3.1 12305 diwu 106u IPv6 0xcc4fb2f8cb360ca9 0t0 TCP localhost:54202->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 107u unix 0xcc4fb3072e108219 0t0 ->0xcc4fb3072e1082e1
python3.1 12305 diwu 108u IPv6 0xcc4fb2f8cb35d4a9 0t0 TCP localhost:54205->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 109u unix 0xcc4fb3072e108859 0t0 ->0xcc4fb3072e108921
python3.1 12305 diwu 110u IPv6 0xcc4fb2f8cb35e2a9 0t0 TCP localhost:54209->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 111u unix 0xcc4fb3072e108b79 0t0 ->0xcc4fb3072e108c41
python3.1 12305 diwu 112u IPv6 0xcc4fb2f8cb35f0a9 0t0 TCP localhost:54213->localhost:30147 (CLOSE_WAIT)
python3.1 12305 diwu 113u unix 0xcc4fb3072e109fc9 0t0 ->0xcc4fb3072e10a091
python3.1 12305 diwu 114u IPv6 0xcc4fb2f8cb3595a9 0t0 TCP localhost:54262->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 115u unix 0xcc4fb3072e1097f9 0t0 ->0xcc4fb3072e109a51
python3.1 12305 diwu 116u IPv6 0xcc4fb2f8cb359ca9 0t0 TCP localhost:54267->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 117u unix 0xcc4fb3072e108791 0t0 ->0xcc4fb3072e108d09
python3.1 12305 diwu 118u IPv6 0xcc4fb2f8cb3579a9 0t0 TCP localhost:54270->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 119u unix 0xcc4fb3072e108601 0t0 ->0xcc4fb3072e1086c9
python3.1 12305 diwu 120u IPv6 0xcc4fb2f8cb3533a9 0t0 TCP localhost:54274->localhost:30147 (ESTABLISHED)
python3.1 12305 diwu 121u unix 0xcc4fb3072e10a3b1 0t0 ->0xcc4fb3072e10a479
```
The TCP CLOSE_WAIT leak is the easy one to solve, by setting the connection header to `close`. The other, unix-socket leak is still the log tailing issue.
The TCP sockets aren't being leaked anymore, so the only remaining accumulating leak is the unix sockets to the docker daemon.
You can repro it by repeatedly clicking on a file and then running `lsof -p ... | grep unix`
@diwu-sf is this bug finally resolved? tofarr put in a few fixes.
No, there's still a unix socket leak to the docker socket, due to the log streamer.
Use the same repro and you should see that unix sockets still accumulate per click.
Was able to see the leakage by following `lsof -p` for this process, and clicking on files:

```
~/.cache/pypoetry/virtualenvs/openhands-ai-uYxnY0EM-py3.12/bin/python -c from multiprocessing.spawn import spawn_main; spawn_main(tracker_fd=5, pipe_handle=7) --multiprocessing-fork
```
Strange that the uvicorn proc itself doesn't own the leak, just this one. Which I guess points to the leak being in a thread?
It’s the docker log streamer thread. It never actually closed the log line generator
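Guaranteeing the generator's `close()` runs can be sketched roughly like this (a hedged sketch, not OpenHands code: `stream_logs` is an illustrative name, and `container` stands in for a docker-py container object):

```python
from contextlib import closing

def stream_logs(container):
    # container.logs(stream=True, follow=True) returns a streaming object
    # in docker-py; abandoning it without close() leaves the underlying
    # unix socket to the docker daemon open. contextlib.closing guarantees
    # close() runs even if the consumer stops iterating early.
    with closing(container.logs(stream=True, follow=True)) as log_generator:
        for raw_line in log_generator:
            yield raw_line.decode('utf-8', errors='replace')
```

Because `stream_logs` is itself a generator, calling `.close()` on it (or letting it be garbage-collected after exhaustion) raises `GeneratorExit` inside the `with` block, which in turn closes the underlying docker stream.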
OK I fixed several leakage issues, but this one persists 🙃
I did start calling `log_generator.close()` but no luck.
For posterity, you can watch the leakage with:

```
while true; do lsof -p $(pgrep -f "tracker_fd") | wc -l; sleep 1; done
```
All the leaks go away if you make LogBuffer a null class (replace all method logic with `pass` or return empty values).
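The null-class experiment can be sketched like this (a hypothetical stand-in: `get_and_clear` mirrors the method named earlier in the thread, the rest is illustrative):

```python
class NullLogBuffer:
    """No-op stand-in for LogBuffer: no docker log generator is ever
    created, so if the FD count stops growing with this class swapped in,
    the leak is isolated to the real LogBuffer's log streaming."""

    def __init__(self, container=None):
        self.container = container

    def get_and_clear(self):
        # The real method drains buffered log lines; here, always empty.
        return []

    def close(self):
        pass

buf = NullLogBuffer()
print(buf.get_and_clear())  # []
```

This is the null-object pattern used purely as a bisection tool: it keeps the rest of the server running while removing the suspect component's side effects.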
I've further confirmed that if you don't instantiate the `log_generator` in LogBuffer, the leak goes away.
I've also confirmed that we're closing every `log_generator` we create.
TBH at this point I'm assuming there's a bug in the docker SDK
There is a bug; see the SDK issue I referenced earlier in this thread.
There's also a workaround fix for the leak in that docker issues thread.
Is there an existing issue for the same bug?
Describe the bug and reproduction steps
Note, I searched for the error message "Too many open files" and didn't see any open issue against this error.
Repro steps:

1. do nothing
2. click on files in `/workspace`, just a few files will do

Eventually, the UI and backend server become broken. In the UI, the message "Failed to fetch file" will show up. In the backend, when running with `DEBUG=1 make run`, the "Too many open files" error message shows up.

OpenHands Installation

Docker command in README
OpenHands Version
main
Operating System
MacOS
Logs, Errors, Screenshots, and Additional Context
No response