weber-s opened this issue 5 months ago
That's really interesting feedback, thanks! =) Great graph. We only changed behavior - forcing this shutdownAfterWriting mode - for unix sockets from a 'socketpair' between forked children for background save processes, so it is indeed odd that so many TCP sockets would be in CLOSE_WAIT state; that looks very unexpected.
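For readers who haven't met that mechanism, here is a minimal, self-contained sketch of the general socketpair-between-forked-children, shutdown-after-writing pattern (illustrative only; the names and structure are my assumptions, not coolwsd's actual code):

// Parent and child share a unix-domain socketpair; the child writes its
// result and then shuts down the write side so the parent sees EOF.
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0)
        return 1;

    pid_t pid = fork();
    if (pid == 0) // child: background "save" process
    {
        close(fds[0]);
        const char msg[] = "saved";
        write(fds[1], msg, sizeof(msg));
        shutdown(fds[1], SHUT_WR); // "shutdown after writing": signal EOF
        close(fds[1]);
        _exit(0);
    }

    // parent: read until EOF, then close its end - otherwise the fd leaks
    close(fds[1]);
    char buf[64];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof(buf))) > 0)
        printf("child said: %.*s\n", (int)n, buf);
    close(fds[0]);
    waitpid(pid, nullptr, 0);
    return 0;
}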
All sockets should be associated with a remote user, and we should be polling on them to get an ERR/HUP or whatever and then go on to close that socket - anything else is madness and will burn file-descriptors.
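To make that expected lifecycle concrete, here is a minimal sketch (assumed names; not coolwsd's actual poll loop) of polling a set of accepted sockets and closing any that report ERR/HUP - skipping that close() is precisely what leaves connections parked in CLOSE_WAIT and burns file-descriptors:

#include <poll.h>
#include <unistd.h>
#include <vector>
#include <algorithm>

// fds holds one pollfd per accepted client socket, with events = POLLIN.
void pollAndReap(std::vector<pollfd>& fds)
{
    if (poll(fds.data(), fds.size(), 1000 /* ms */) <= 0)
        return;

    for (pollfd& p : fds)
    {
        if (p.revents & (POLLERR | POLLHUP))
        {
            close(p.fd); // the step that must never be skipped
            p.fd = -1;   // mark for removal from the poll set
        }
        // real code would also handle POLLIN here and close on a 0-byte read
    }
    fds.erase(std::remove_if(fds.begin(), fds.end(),
                             [](const pollfd& p) { return p.fd == -1; }),
              fds.end());
}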
Can you provide some log output? What does
$ kill -USR1 <coolwsd-pid>
give in the logs when in this state? And are there any interesting log fragments we can see? =)
Thanks!
Would be nice to have some $ lsof -p <pid> output.
Please find attached lsof -n & ss -a -t output from when the number of CLOSE_WAIT TCP connections grows.
I forgot to do the kill -USR1 output; I'll do it next time.
Wow that's great data - and to give a flavor:
coolwsd 2155435 2155456 coolwsd cool 3796u IPv6 6636068 0t0 TCP 127.0.0.1:9980->127.0.0.1:33666 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3797u IPv6 6635349 0t0 TCP 127.0.0.1:9980->127.0.0.1:33670 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3798u IPv6 6635350 0t0 TCP 127.0.0.1:9980->127.0.0.1:33684 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3799u sock 0,8 0t0 6635352 protocol: TCPv6
coolwsd 2155435 2155456 coolwsd cool 3800u IPv6 6635353 0t0 TCP 127.0.0.1:9980->127.0.0.1:33698 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3801u IPv6 6635354 0t0 TCP 127.0.0.1:9980->127.0.0.1:33706 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3802u IPv6 6635355 0t0 TCP 127.0.0.1:9980->127.0.0.1:33710 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3803u IPv6 6635356 0t0 TCP 127.0.0.1:9980->127.0.0.1:33726 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3804u IPv6 6636083 0t0 TCP 127.0.0.1:9980->127.0.0.1:33738 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3805u IPv6 6635362 0t0 TCP 127.0.0.1:9980->127.0.0.1:33748 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3806u IPv6 6635363 0t0 TCP 127.0.0.1:9980->127.0.0.1:33756 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3807u IPv6 6636087 0t0 TCP 127.0.0.1:9980->127.0.0.1:33762 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3808u sock 0,8 0t0 6636089 protocol: TCPv6
coolwsd 2155435 2155456 coolwsd cool 3809u IPv6 6636090 0t0 TCP 127.0.0.1:9980->127.0.0.1:33780 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3810u IPv6 6635366 0t0 TCP 127.0.0.1:9980->127.0.0.1:33782 (CLOSE_WAIT)
coolwsd 2155435 2155456 coolwsd cool 3811u IPv6 6635367 0t0 TCP 127.0.0.1:9980->127.0.0.1:33784 (CLOSE_WAIT)
Not good; sub-10 kit processes & open documents - and ~4k file-descriptors in CLOSE_WAIT. (CLOSE_WAIT means the remote end has already closed its side and we simply never called close() on ours, so each of those is a leaked descriptor.)
So a couple of requests.
Can you do a:
ps auwx | grep coolwsd
# get the pid of the coolwsd daemon - usually the 1st hit - and do:
kill -USR1 <that-pid>
And attach some chunks of the journal - there should be a lot of repetition there; and we should be able to see what all those sockets are associated with.
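For context on what that USR1 request relies on: coolwsd dumps diagnostic state to its log when it receives SIGUSR1, and the general shape of such a mechanism is roughly the following (an illustrative sketch only; coolwsd's real handler and dump are far more detailed):

#include <csignal>
#include <cstdio>
#include <atomic>
#include <unistd.h>

static std::atomic<bool> dumpRequested{false};

static void onUsr1(int) { dumpRequested = true; }

int main()
{
    std::signal(SIGUSR1, onUsr1);
    while (true)
    {
        pause(); // the real event loop would be polling sockets here
        if (dumpRequested.exchange(false))
        {
            // a real server would walk its socket/session tables and log each entry
            fprintf(stderr, "USR1: dumping internal socket/session state\n");
        }
    }
}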
Another thing that would be useful would be:
strace -f -tt -s 256 -o /tmp/slog -p <coolwsd-pid>
and leave that to run for a bit - zip up /tmp/slog and mail it to me - michael.meeks@collabora.com =)
If the USR1 dump output doesn't help, the strace will, I think.
Thanks for finding such an interesting one ! =)
Ah - it would also be good to have full version details and hashes from help->about if possible. Thanks =)
Still struggling to find something here - I added https://github.com/CollaboraOnline/online/pull/9497 which unwinds a few somewhat unhelpful behaviors I was seeing but ... unlikely to fix this.
Had some inspiration while on the beach on holiday:
https://github.com/CollaboraOnline/online/pull/9604/commits
"Remove sockets on exception during polling thread"
almost certainly fixes something that might cause this problem.
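Reading just the PR title, the idea is roughly this (an illustrative sketch with assumed names, not the actual patch): if handling a socket inside the polling thread throws, the socket still has to be dropped from the poll set so its descriptor gets closed; otherwise it is never looked at again and sits in CLOSE_WAIT once the peer disconnects.

#include <cstdio>
#include <memory>
#include <stdexcept>
#include <vector>

struct Socket
{
    virtual ~Socket() = default;   // a real Socket would close its fd here
    virtual void handlePoll() = 0; // may throw
};

// Drop any socket whose handler throws, instead of leaving it in the set.
void handleResults(std::vector<std::shared_ptr<Socket>>& sockets)
{
    for (auto it = sockets.begin(); it != sockets.end(); )
    {
        try
        {
            (*it)->handlePoll();
            ++it;
        }
        catch (const std::exception& ex)
        {
            fprintf(stderr, "Exception while polling socket: %s - removing it\n", ex.what());
            it = sockets.erase(it); // dropping the last reference closes the fd
        }
    }
}

struct ThrowingSocket : Socket
{
    void handlePoll() override { throw std::runtime_error("hup while reading"); }
};

int main()
{
    std::vector<std::shared_ptr<Socket>> socks{ std::make_shared<ThrowingSocket>() };
    handleResults(socks);
    printf("sockets left in poll set: %zu\n", socks.size());
    return 0;
}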
Thanks for the feedback. As soon as there is a release, we will test it.
By the way, the bug hasn't happened again since my last message - probably because of the summer holidays.
Describe the Bug
After upgrading from 23.05 to 24.04.2, we sometimes observe a drastic increase of TCP connections in CLOSE_WAIT status - so big that the server slows down drastically and Collabora (and Nextcloud...) become unusable. Here is a graph of our monitoring:
And the violet curve (CLOSE_WAIT) grows... a lot (it goes off the top of this graph).
Steps to Reproduce
Expected Behavior
No increase of CLOSE_WAIT TCP connections.
Server
Additional information
We suspect this commit to be related: https://github.com/CollaboraOnline/online/commit/9e0cef08c223a8022e73f408c9f587b26cca6ded . But this is just from a quick search of the repo (it matches the time frame of our upgrade and the subject).
Maybe related to https://github.com/CollaboraOnline/online/issues/9341