bus1 / dbus-broker

Linux D-Bus Message Broker
https://github.com/bus1/dbus-broker/wiki
Apache License 2.0
677 stars 80 forks source link

service activation regression, could not activate remote peer #258

Closed lucab closed 3 years ago

lucab commented 3 years ago

There is a regression in service activation that we started observing in rpm-ostree CI after a dbus-broker update. The last known-working version is v24, while both v26 and v27 show the same regression symptoms (though maybe different root causes):

# rpm-ostree status > /dev/null; systemctl kill -s KILL rpm-ostreed; rpm-ostree status > /dev/null && echo ok
error: Could not activate remote peer.

# systemctl status rpm-ostreed
* rpm-ostreed.service - rpm-ostree System Management Daemon
     Loaded: loaded (/usr/lib/systemd/system/rpm-ostreed.service; static)
     Active: activating (start) since Mon 2021-02-15 11:10:58 UTC; 20ms ago
       Docs: man:rpm-ostree(1)
   Main PID: 3271 (rpm-ostree)
      Tasks: 1 (limit: 9444)
     Memory: 1.3M
     CGroup: /system.slice/rpm-ostreed.service
             `-3271 /usr/bin/rpm-ostree start-daemon

Feb 15 11:10:58 local-test systemd[1]: Starting rpm-ostree System Management Daemon...

At least on v27 it doesn't seem to be a race (can be manually reproduce, not improved by adding artificial sleep), but it looks like systemctl reset-failed has some effect on it:

# rpm-ostree status > /dev/null; systemctl kill -s KILL rpm-ostreed; sleep 15; rpm-ostree status > /dev/null && echo ok
error: Could not activate remote peer.

# rpm-ostree status > /dev/null; systemctl kill -s KILL rpm-ostreed; systemctl reset-failed; rpm-ostree status > /dev/null && echo ok
ok

Cross-references:

dvdhrm commented 3 years ago

I did some more investigation, because the bus-log seems really out of order. But reading through the systemd sources, it seems this is rather obscure:

When you send StartUnit to systemd, it will queue a local job to start a unit, then reply with success to StartUnit. However, there is no guarantee this local job was at all scheduled, yet. So even after you got the reply to StartUnit a query for the Unit-state might still return failed (if it already was in failed state). It even sends out PropertiesChanged with ActiveState=failed. Moreover, even if the unit was forked and the properties return the PID of the unit, the state is still failed. Only after some time, the state will change to activating (don't ask me when...). :disappointed:

Systemd does return a job-ID as reply to StartUnit. The problem is, the ID is only known when we receive the reply, so we would install a possible dbus-match too late and miss any signals sent in-between. The alternative would be to listen to ALL job-notifications of systemd, which is just ridiculous. Hence, the Job API is not useful to track a unit-operation. :disappointed:

I don't know what to do. sigh

dvdhrm commented 3 years ago

The reason we want this activation-tracking is to make sure our TransientUnit handling matches how the old-style direct activation in dbus-daemon works (which tracks exit-code failures of the processes it executes).

I now verified and the classic systemd-activation in dbus-daemon actually never tracks service failures. It simply verifies the job was queued successfully (see ActivationRequest signal handling in systemd). So we could technically just enable this tracking for transient units. However, it would be a missed opportunity.

I currently see no other way than watching JobRemoved on systemd for all jobs and then checking whether a queued job failed. This should be safe to do, since it is also what systemd does internally (though it is quite sad that we need such broad matches for it).

dvdhrm commented 3 years ago

@lucab I just pushed dbus-broker-27-2 into rawhide, which includes 2 recent commits I just pushed to main. After discussion with Lennart, I decided to add JobRemoved tracking. A bit unfortunate, but we can reduce the match to a short timeframe just around the StartUnit call.

Anyway, can you give this a try? My local tests now all succeed.

lucab commented 3 years ago

@dvdhrm I tested 27-2.fc35 manually and I'm not seeing the issue anymore!

lucab commented 3 years ago

Not sure if the plan is to have a v28 tagged for this, but in either cases it would be good to propagate back this fix downstream to F33 as the update there introduced the regression.

dvdhrm commented 3 years ago

This is now in F32 up to rawhide as dbus-broker-27-2. Given the other fixes and improvements, I will likely also release v28 into rawhide this week.

Closing this as it is resolved. Thanks a lot for the help!