bus1 / dbus-broker

Linux D-Bus Message Broker
https://github.com/bus1/dbus-broker/wiki
Apache License 2.0
667 stars 78 forks source link

Timeout on service calls to a systemd unit which fails #304

Closed agners closed 1 year ago

agners commented 1 year ago

A service call to a systemd D-Bus activated service which fails usually returns immediately with an error along these lines:

Call failed: Could not activate remote peer: unit failed.

We found that in a real world scenario, sometimes the method call instead times out.

Call failed: Connection timed out

It seems once dbus-broker is in that state, all service call to that service time out. Restarting the dbus-broker brings back the previous (expected) behavior. It seems in that state systemd is no longer attempting to start the service even.

In our case, we use dbus-fast to talk to the Bluetooth daemon. The Bluetooth daemon does not get started due to a condition (/sys/class/bluetooth missing). It seems that a race condition gets triggered quite easily in our setup (restarting our application Home Assistant Core triggers the problem about every other restart).

I've also successfully reproduced the problem using three instances of busctl while running busctl monitor. This was on a ODROID-N2 ARM board:

while [ true ]; do busctl call org.bluez / org.freedesktop.DBus.ObjectManager GetManagedObjects; done

It seems a race condition within dbus-broker causes all future calls to the method to timeout instead of returning an error.

dvdhrm commented 1 year ago

Thanks a lot for the report! There is certainly a mechanism in dbus-broker that delays service-activation if it thinks an activation is already pending. If that activation somehow gets stale, it will likely cause the behavior you see.

First of all, there are no timeouts in dbus. Any timeout you see is a custom timeout of your client-library. I don't think this matters much here, though.

To debug this further, it would be great to get some more details on what exactly is failing. Can you show relevant log-messages of BlueZ, dbus-broker and systemd in your log? Does BlueZ have a ConditionExists= line on your system, or does the daemon really fail on startup and return an error? What does the systemd-unit look like when BlueZ failed?

Lastly, what is the version of dbus-broker you are running? (and the systemd version, maybe?)

agners commented 1 year ago

First of all, there are no timeouts in dbus. Any timeout you see is a custom timeout of your client-library. I don't think this matters much here, though.

Right, those are client side timeouts. That is actually how we noticed it, as our application would hang forever on that call.

To debug this further, it would be great to get some more details on what exactly is failing. Can you show relevant log-messages of BlueZ, dbus-broker and systemd in your log? Does BlueZ have a ConditionExists= line on your system, or does the daemon really fail on startup and return an error? What does the systemd-unit look like when BlueZ failed?

We use a pretty standard `bluetooth.service file, which has the following condition:

# cat /usr/lib/systemd/system/bluetooth.service
[Unit]
Description=Bluetooth service
Documentation=man:bluetoothd(8)
ConditionPathIsDirectory=/sys/class/bluetooth
...

If Bluetooth is configured as a module in the kernel, and no Bluetooth device is plugged in (hence no driver pulls in the bluetooth kernel module), then this condition fails.

# systemctl status bluetooth
○ bluetooth.service - Bluetooth service
     Loaded: loaded (/usr/lib/systemd/system/bluetooth.service; enabled; vendor preset: enabled)
     Active: inactive (dead)
  Condition: start condition failed at Wed 2022-12-14 14:19:11 UTC; 3s ago
             └─ ConditionPathIsDirectory=/sys/class/bluetooth was not met
       Docs: man:bluetoothd(8)

Dec 14 14:19:11 ha-odroid-n2-one systemd[1]: Bluetooth service was skipped because of a failed condition check (ConditionPathIsDirectory=/sys/class/bluetooth).

Each (properly failing) call leads to an additional start attempt (and another Bluetooth service was skipped line).

Lastly, what is the version of dbus-broker you are running? (and the systemd version, maybe?)

dbus-broker 32, systemd 250

dvdhrm commented 1 year ago

I cannot reproduce this. I run systemd-252, so this might be fixed in a newer systemd version, but I did not find any related commit so far. Can you elaborate on what exactly happens when you hit this? Does this happen when /sys/class/bluetooth becomes available? Does this happen even if /sys/class/bluetooth never occurs? Does this happen with other services as well? What does systemctl report when it is in the failed state?

In particular, I'd be interested in what the state of your system is when you hit this bug. What does systemctl show then? Does the log show anything? What does busctl show for bluez in that case? Does busctl monitor show anything useful?

agners commented 1 year ago

Can you elaborate on what exactly happens when you hit this?

When calling the API in a loop, suddenly busctl returns timeouts instead of unit failed errors.

Does this happen when /sys/class/bluetooth becomes available?

No, this is the case when /sys/class/bluetooth stays unavailable.

Does this happen even if /sys/class/bluetooth never occurs?

Exactly yes. The device in question has no Bluetooth hardware.

Does this happen with other services as well?

I have not tested that, I'd imagine it does not matter what service that is.

What does systemctl report when it is in the failed state?

Essentially what is shown in the above console output. I also can trigger a start attempt with systemctl, and it leads to another entry Bluetooth service was skipped because of a failed condition check (ConditionPathIsDirectory=/sys/class/bluetooth).. From what I can tell this is a dbus-broker problem not even attempting to start the service at that point.

I cannot reproduce this.

Honestly, I feared that much. It seems to be somewhat hard to reproduce at times even on ARM hardware. I am not exactly sure how/why our real application using dbus-fast is so reliable in triggering this. I assume it calls multiple functions in a quite quick succession which makes it trigger reliably.

What sometimes help to trigger the race condition is to also run buctl monitor alongside.

What does busctl show for bluez in that case?

Using what verb? I think pretty much all attempts to access the org.bluez name lead to timeouts.

Does busctl monitor show anything useful?

It seems to just stop suddenly. It is kinda hard to make good sense out of it.

dvdhrm commented 1 year ago

Thanks for the information! I was assuming this happens when the service at one point becomes active. I did not expect this to happen if the service stays inactive.

If your system is in the failed state, can you recover it by running systemctl start bluetooth manually? That is, trigger a manual retry of the unit in question and thus possibly let the broker see again that the unit fails?

I think it would be very helpful to see the bus-traffic of a failed attempt. That is, sudo busctl monitor >traffic.log while the bug appears. This would allow seeing the information the launcher gets, and thus deduce better why it cannot see the unit-failure. Please note that this log can possibly contain credentials (but I am not aware of any service sending credentials via dbus inline). If you rather not send this log, let me know.

Can you tell me a bit more about the machine you trigger this on? Is it a particularly slow machine, and thus much more likely to hit scheduling races? Or is it even a single-core machine?

agners commented 1 year ago

If your system is in the failed state, can you recover it by running systemctl start bluetooth manually? That is, trigger a manual retry of the unit in question and thus possibly let the broker see again that the unit fails?

Indeed, this recovers the state. Running systemctl start bluetooth makes the busctl calls return Call failed: Could not activate remote peer: unit failed. again.

I think it would be very helpful to see the bus-traffic of a failed attempt. That is, sudo busctl monitor >traffic.log while the bug appears.

This log is taken from the system I can reproduce the problem (a ODROID-N2, 6 core, 4xCortex-A73, 2xCortex-A53). I had three instances of the above while [ true ] loop running, and the busctl monitor instance. It took a minute or two for the problem to appear, so that made the log a bit long. Then all three instances start to return Call failed: Connection timed out, and I interrupted the bucstl monitor command with Ctrl+C.

traffic.log.gz

dvdhrm commented 1 year ago

Thanks a lot for the dump! I filed PR #308 to try to fix this race. This is definitely a bug and your dump shows the sequence that triggers this. Lets see whether this fixes your issues.

(I have not tested it, yet, but will try to do so the coming days)

agners commented 1 year ago

Thanks a lot for looking into this! I am traveling for the next 12 days and I won't be able to test this out. I will try it when I am back!

agners commented 1 year ago

Thanks @dvdhrm for addressing this issue and @teg for reviewing and merging! :heart:

Do you happen to plan a release soon? I'd like to include the fix in our next release (probably later February), and debating if I should backport the fix or wait for the release :smile:

dvdhrm commented 1 year ago

I will prepare an upstream release now. Not sure whether I will push updates into distros today, though. Probably next week.

agners commented 1 year ago

No worries, we use Buildroot build system and I'll take care of bumping the package there.