CollaboraOnline / online

Collabora Online is a collaborative online office suite based on LibreOffice technology. This is also the source for the Collabora Office apps for iOS and Android.
https://collaboraonline.com
Other
1.76k stars 677 forks source link

coolwsd freezes regularly #9341

Closed doobry-systemli closed 1 month ago

doobry-systemli commented 2 months ago

Describe the Bug

We're running latest Collabora Online CODE edition (24.04.4-2) installed on a Debian Bookworm server and integrated into Nextcloud.

Since several releases (probably more or less since the upgrade 24.04.) coolwsd freezes on a regular basis. The coolwsd webservice no longer responds and requests run into timeouts.

Not sure if it's related, but this log entry seems to show up before the freeze:

coolwsd[479202]: frk-442418-442418 2024-06-24 11:56:39.708686 +0200 [ forkit ] INF  Forkit initialization complete: setting log-level to [warning] as configured.| kit/ForKit.cpp:789
coolwsd[479202]: kit-479202-479202 2024-06-24 18:05:59.423378 +0200 [ kitbroker_0de ] ERR  ToMaster-2479: error: cmd=getclipboard kind=nodocloaded| common/Session.hpp:163
coolwsd[479202]: kit-479202-479202 2024-06-24 18:05:59.423424 +0200 [ kitbroker_0de ] WRN  ToMaster-2479: Skipping unload on incomplete view [ToMaster-2479], viewId: -1| kit/ChildSession.cpp:159
coolwsd[442405]: wsd-442405-479513 2024-06-24 18:05:59.424148 +0200 [ docbroker_0de ] ERR  ToClient-2479: getclipboard error failure: nodocloaded| wsd/ClientSession.cpp:1806

Apart from that, this log appears from time to time, but not shortly before or after the freeze:

coolwsd[442405]: wsd-442405-476075 2024-06-24 18:24:31.027392 +0200 [ docbroker_0cd ] ERR  #67: Read failed, have 0 buffered bytes (ECONNRESET: Connection reset by peer)| net/Socket.hpp:1146

Desktop

(Please complete the following information)

doobry-systemli commented 2 months ago

https://github.com/CollaboraOnline/online/issues/9318 could be the same problem :thinking:

vmiklos commented 2 months ago

Hi @doobry-systemli , do you think you could try to profile with the watchdog enabled to see what's the backtrace of this hang?

Here is how to use the watchdog:

1) You need to set the COOL_WATCHDOG environment variable to something non-empty, like '1'. 2) Use the scripts/watchdog-cool from this repo to run 'perf' with suitable parameters, this will give you a flamegraph that shows what are the places where we hang for too long.

Hopefully that would give us a picture of the situation. Could you please try to run the watch and paste the generated SVG here, so we can improve the situation? Thanks.

doobry-systemli commented 2 months ago

Hey @vmiklos, sure I can do this. Am I supposed to run the watchdog-cool script once coolwsd froze? Or before? And the COOL_WATCHDOG variable probably needs to be set in the environment that starts coolwsd, i.e. in the systemd unit, right?

doobry-systemli commented 2 months ago

@vmiklos also, which time should I give as argument to watchdog-cool?

Or would this need to be called before the freeze happens and run until the freeze happens? Then I'd probably have to pass something like 5h to it as for now I don't have a clear reproducer.

vmiklos commented 2 months ago

watchdog-cool is meant to be long-running, like you run it for 1h and then you can see when the watchdog triggered in that timefame. So you start it proactively, not when you already have a hang.

The COOL_WATCHDOG env var has to be set on the coolwsd process yes, e.g. in the systemd unit, correct.

The example timeout is 1h, so perhaps try with that. That means you start it proactively and you hope you get a hang in the next hour. If it doesn't happen in the next hour, you can run the script (just a wrapper around 'perf', really) again, with an increased timeframe. If something is not clear, just ask. :-) Thanks.

doobry-systemli commented 2 months ago

Thanks @vmiklos. I started it now. One additional question: is it ok to abort the perf process once the freeze happens? Or do I have to wait until the process finishes with the specified timeframe?

vmiklos commented 2 months ago

You can press Ctrl-C any time to stop collecting if you think the interesting hang happened already.

The important part is to look at the generated (interactive) SVG and if it has meaningful function names, then attach it to this issue. Thanks.

doobry-systemli commented 2 months ago

Here's a flamegraph of watchdog-cool running for 5 hours and a freeze happening after +/- 2 hours.

watchdog

vmiklos commented 2 months ago

Hrm, my hope was that the watchdog would show what kind of request causes the hang, but this is not a lot of data.. unless @caolanm thinks otherwise.

bateast commented 2 months ago

Same here. Really problematic in production mode.

Desktop

Collabora version: 24.04.4.2-1
OS and version: Debian Stable (Bookworm)

How can we help in identifying or bypassing the issue?

caolanm commented 2 months ago

I wonder if you have the coolwsd-dbgsym package installed? If not, then installing that may give more fine grained results that might be more helpful

doobry-systemli commented 2 months ago

I wonder if you have the coolwsd-dbgsym package installed? If not, then installing that may give more fine grained results that might be more helpful

Thanks. The package indeed was not installed yet. We installed it now and restarted the coolwsd service. (Un)fortunately the hang/freeze didn't happen within the last days. I wonder why, but after the initial catch with the flamegraph from above it didn't happen again since we started to try to catch it :zany_face:

One change we had to do was to reboot the kernel without lockdown mode, because otherwise /sys/kernel/tracing/ was empty despite tracefs being mounted. Usually we boot our VMs with lockdown=confidentiality as kernel boot parameter. Maybe that's related? :thinking:

bateast commented 1 month ago

One change we had to do was to reboot the kernel without lockdown mode, because otherwise /sys/kernel/tracing/ was empty despite tracefs being mounted. Usually we boot our VMs with lockdown=confidentiality as kernel boot parameter. Maybe that's related? 🤔

Hello, still present here (and again, really problematic in a production mode for small organization). Only quick and efficient solution found up to now: monitoring and fast server reboot on first hang detected.

by the way, our boot parameters (about lockdown): cat /proc/cmdline BOOT_IMAGE=/@rootfs/boot/vmlinuz-6.1.0-22-cloud-amd64 root=UUID=2ff0bc0c-642d-4aeb-985e-e6a9e1cd6856 ro rootflags=subvol=@rootfs console=tty0 console=ttyS0,115200 earlyprintk=ttyS0,115200 consoleblank=0 So, no, lockdown is not activated…

caolanm commented 1 month ago

with the mentioned coolwsd-dbgsym (or the equivalent coolwsd-debuginfo + coolwsd-debugsource rpms for the rpm-using distros) when it hangs if you try:

sudo gdb --q --n --ex 'thread apply all backtrace full' --batch --pid=$(pidof coolwsd) > /tmp/full.backtrace.log and attach the resulting /tmp/full.backtrace.log that might be useful

mmeeks commented 1 month ago

Ah - so ... the problem here seems simple; the new server-side downloading clipboard code (which avoids the browser's various well meaning policy issues forbidding it to connect to other servers) does:

ClientSession.cpp: http::Request httpRequest(Poco::URI(url).getPathAndQuery()); if (!httpSession->asyncRequest(httpRequest, docBroker->getPoll()))

And you'd think that asyncRequest was fully async; but unfortunately the DNS is synchronous; is there a reason why a DNS lookup would take an unreasonably long time on your machine ? Either way I think we end up where it says:

HttpRequest::asyncRequest -> HttpRequest::connect() -> net::connect which does:

struct addrinfo* ainfo = nullptr;
struct addrinfo hints;
std::memset(&hints, 0, sizeof(hints));
const int rc = getaddrinfo(host.c_str(), port.c_str(), &hints, &ainfo);

synchronously.

mmeeks commented 1 month ago

Then again - why that would block the websrv thread instead of the poll of the DocumentBroker is something of an unusual mystery.

mejo- commented 1 month ago

@mmeeks would that DNS lookup happen on the server side, or on the client side? If the second was true, it would explain that the issue gets triggered by some particular clients and if they stop using Collabora the problem doesn't appear any longer.

mmeeks commented 1 month ago

Thanks @mejo- we think that this could be cause by non-asynchrnous DNS during a connect as above - which could block a DocumentBroker thread (but not the whole coolwsd). I believe Caolan is looking at making that fully async - in case it's the issue. We have also found a potentially related threading issue with a similar trace that we have a potential fix (or two) for that may be related. Thinking and testing is ongoing there.

caolanm commented 1 month ago

I feel there is a good chance that https://github.com/CollaboraOnline/online/commit/d6b15b9f2de67de2ae99321e65380495acbebd17 will help here given the similarity of the callstack seen in the motivating case there

mmeeks commented 1 month ago

Possibly; I'm not hyper-convinced that the checkFileInfoSync path is ever taken; but at least if by the time we're unwinding the stack there the CheckFileInfo class is cleaned up - then at least we don't de-reference a freed 'this' =)

doobry-systemli commented 1 month ago

We caught the issue another time, this time with coolwsd-debug being installed. Doesn't look like the flamegraph has more info this time, but here it is:

watchdog

caolanm commented 1 month ago

The originally anonymous [coolwsd] symbol now appears as CheckFileInfo::checkFileInfo which is good to see as the concrete hanging location.

caolanm commented 1 month ago

I think now in 24.04.5-2 that this exact glitch shouldn't happen. Let me know if it reappears in that version though.