CollaboraOnline / online

Collabora Online is a collaborative online office suite based on LibreOffice technology. This is also the source for the Collabora Office apps for iOS and Android.
https://collaboraonline.com

Large number of CLOSE_WAIT TCP connections #9349

Open weber-s opened 5 months ago

weber-s commented 5 months ago

Describe the Bug

After upgrading from 23.05 to 24.04.2, we sometimes observe a drastic increase in TCP connections in the CLOSE_WAIT state - so large that the server slows drastically and Collabora (and Nextcloud...) become unusable.

Here is a graph of our monitoring:

[monitoring graph: TCP connection counts by state over time]

The violet curve (CLOSE_WAIT) grows a lot (off the top of the graph here).
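For reference, a rough way to reproduce that metric on the server - a sketch, assuming ss from iproute2 and the default coolwsd port 9980:

# count TCP sockets stuck in CLOSE_WAIT on the coolwsd port
ss -t state close-wait '( sport = :9980 )' | tail -n +2 | wc -l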

Steps to Reproduce

  1. Install Collabora
  2. Wait... we are not able to reproduce it on demand right now :/

Expected Behavior

No increase of CLOSE_WAIT tcp connections.

Server

Additional information

We suspect this commit is related: https://github.com/CollaboraOnline/online/commit/9e0cef08c223a8022e73f408c9f587b26cca6ded. This is just from a quick search of the repo (it matches the time frame of our upgrade and the subject).

Maybe related to https://github.com/CollaboraOnline/online/issues/9341

mmeeks commented 4 months ago

That's really interesting feedback; thanks! =) Great graph. We only changed behavior to force this shutdownAfterWriting mode for Unix sockets from a 'socketpair' between forked children, used for background-save processes - so it is indeed odd that so many TCP sockets would be in the CLOSE_WAIT state; that looks very unexpected.

All sockets should be associated with a remote user and we should be polling on them to get an ERR/HUP or whatever and then go on to close that socket - anything else is madness and will burn file-descriptors. Can you provide some log output:

What does $ kill -USR1 <coolwsd-pid> give in the logs when in this state? And are there any interesting log fragments we can see? =)
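Something like this should capture it - a sketch, assuming coolwsd runs under systemd as coolwsd.service; adapt if you log to a file instead:

# send SIGUSR1 to the main coolwsd daemon (pgrep -o picks the oldest match)
kill -USR1 "$(pgrep -o coolwsd)"
# then grab the resulting state dump from the journal (unit name is an assumption)
journalctl -u coolwsd --since '2 min ago' > usr1-dump.txt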

Thanks!

mmeeks commented 4 months ago

It would be nice to have some $ lsof -p <coolwsd-pid> data as well. Also, if we can get $ ss -a -t output as well as the -USR1 output, then perhaps we can see which connections are in what state internally, and which are in the CLOSE_WAIT state! Thanks =)
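Something along these lines - a sketch; substitute the daemon pid as before:

pid=$(pgrep -o coolwsd)
lsof -n -p "$pid" > lsof_p.txt    # open descriptors of the daemon, no DNS lookups
ss -a -t > ss_a_t.txt             # all TCP sockets with their states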

phlb commented 4 months ago

Please find attached lsof -n and ss -a -t output from when the number of CLOSE_WAIT TCP connections grows.

I forgot to capture the kill -USR1 output. I'll do it next time.

ss_a_t.txt.gz lsof_n.txt.gz

mmeeks commented 4 months ago

Wow that's great data - and to give a flavor:

coolwsd   2155435 2155456 coolwsd               cool 3796u     IPv6            6636068       0t0        TCP 127.0.0.1:9980->127.0.0.1:33666 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3797u     IPv6            6635349       0t0        TCP 127.0.0.1:9980->127.0.0.1:33670 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3798u     IPv6            6635350       0t0        TCP 127.0.0.1:9980->127.0.0.1:33684 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3799u     sock                0,8       0t0    6635352 protocol: TCPv6
coolwsd   2155435 2155456 coolwsd               cool 3800u     IPv6            6635353       0t0        TCP 127.0.0.1:9980->127.0.0.1:33698 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3801u     IPv6            6635354       0t0        TCP 127.0.0.1:9980->127.0.0.1:33706 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3802u     IPv6            6635355       0t0        TCP 127.0.0.1:9980->127.0.0.1:33710 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3803u     IPv6            6635356       0t0        TCP 127.0.0.1:9980->127.0.0.1:33726 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3804u     IPv6            6636083       0t0        TCP 127.0.0.1:9980->127.0.0.1:33738 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3805u     IPv6            6635362       0t0        TCP 127.0.0.1:9980->127.0.0.1:33748 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3806u     IPv6            6635363       0t0        TCP 127.0.0.1:9980->127.0.0.1:33756 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3807u     IPv6            6636087       0t0        TCP 127.0.0.1:9980->127.0.0.1:33762 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3808u     sock                0,8       0t0    6636089 protocol: TCPv6
coolwsd   2155435 2155456 coolwsd               cool 3809u     IPv6            6636090       0t0        TCP 127.0.0.1:9980->127.0.0.1:33780 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3810u     IPv6            6635366       0t0        TCP 127.0.0.1:9980->127.0.0.1:33782 (CLOSE_WAIT)
coolwsd   2155435 2155456 coolwsd               cool 3811u     IPv6            6635367       0t0        TCP 127.0.0.1:9980->127.0.0.1:33784 (CLOSE_WAIT)

Not good; fewer than 10 kit processes and open documents - and ~4k file descriptors in CLOSE_WAIT.
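For the record, a quick tally per owning command from the attached lsof capture - a sketch, assuming the gzipped filename above:

zcat lsof_n.txt.gz | grep CLOSE_WAIT | awk '{print $1}' | sort | uniq -c | sort -rn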

So a couple of requests.

Can you do a:

ps auwx | grep coolwsd
# get the pid of the coolwsd daemon - usually the 1st hit - and do:
kill -USR1 <that-pid>

And attach some chunks of the journal - there should be a lot of repetition there; and we should be able to see what all those sockets are associated with.
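For example - a sketch, again assuming a systemd unit named coolwsd.service:

# capture the last few thousand journal lines around the dump and compress them
journalctl -u coolwsd -n 5000 | gzip > coolwsd-journal.txt.gz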

Another thing that would be useful would be:

strace -f -tt -s 256 -o /tmp/slog -p <coolwsd-pid>

and leave that to run for a bit - zip up /tmp/slog and mail it to me - michael.meeks@collabora.com =)
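Roughly, as a sketch (same daemon pid as above):

# attach to the coolwsd daemon, follow forks, timestamp each syscall
strace -f -tt -s 256 -o /tmp/slog -p "$(pgrep -o coolwsd)" &
STRACE_PID=$!
sleep 300 && kill "$STRACE_PID"   # let it run ~5 minutes, then detach
gzip /tmp/slog                    # produces /tmp/slog.gz to mail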

If the USR1 dump output doesn't help, the strace will, I think.

Thanks for finding such an interesting one ! =)

mmeeks commented 4 months ago

Ah - it would also be good to have full version details and hashes from Help->About if possible. Thanks =)

mmeeks commented 4 months ago

Still struggling to find something here - I added https://github.com/CollaboraOnline/online/pull/9497 which unwinds a few somewhat unhelpful behaviors I was seeing but ... unlikely to fix this.

mmeeks commented 4 months ago

Had some inspiration while on the beach on holiday:

https://github.com/CollaboraOnline/online/pull/9604/commits

"Remove sockets on exception during polling thread"

almost certainly fixes something that might cause this problem.

phlb commented 4 months ago

Thanks for the feedback. As soon as there is a release, we will test it.

By the way, the bug hasn't happened again since my last message - probably because of the summer holidays.