bus1 / dbus-broker

Linux D-Bus Message Broker
https://github.com/bus1/dbus-broker/wiki
Apache License 2.0
675 stars 78 forks source link

launch/service: track systemd unit state #250

Closed dvdhrm closed 3 years ago

dvdhrm commented 3 years ago

Track the ActiveState property of systemd units associated with dbus services. This allows us to better track when a service activation fails beyond just the StartUnit() call.

Until now we only ever tracked failure of the StartUnit() call itself. However, this call does not wait for a service to be fully activated, but merely waits for the start-job to be queued in systemd. This means, we only see failures of the unit-verification and transaction checks, but never get any results of the transaction itself.

In particular, this means that if any unit successfully spawns, but then fails before claiming a bus-name, we never see this failure, because it is reported only late by systemd.

This commit extends the launcher to track all systemd units we activate for as long as a dbus-service is live. We only ever query the ActiveState property, and thus can reliably track when a unit fails. Note that we never track when a unit succeeds, because this information is of no use in the launcher. Instead, we rely on the broker to successfully track acquired names (which it already does). We only ever tell the broker when a unit fails, and we now do that over the entire lifetime of a dbus-service. This way, even late failures are reported to the broker (which will ignore them, if they appeared after a service already activated).

lucab commented 3 years ago

@dvdhrm @teg I'm not sure that it is 100% related to this PR, but we recently started observing some activation races in rpm-ostree CI, see https://github.com/coreos/rpm-ostree/issues/2531.

In particular, the client that is triggering the activation of a systemd service unit seems to receive a Could not activate remote peer error, but systemd correctly manages to start the service. The CI is hitting this case after the service unit has been forcefully SIGKILLed and put in a failed state. It doesn't seem to be deterministically reproducible though.

This happens on a Fedora 33 image with systemd-246.7-1.fc33.x86_64 and dbus-broker-26-1.fc33.x86_64.

cgwalters commented 3 years ago

Offhand I think this is probably correct - we shouldn't mask errors. So the ostree transactionality test should have been using a retry loop from the start, as you added in https://github.com/ostreedev/ostree/pull/2276 .