Closed lucab closed 3 years ago
I did some more investigation, because the bus-log seems really out of order. But reading through the systemd sources, it seems this is rather obscure:
When you send StartUnit
to systemd, it will queue a local job to start a unit, then reply with success to StartUnit
. However, there is no guarantee this local job was at all scheduled, yet. So even after you got the reply to StartUnit
a query for the Unit-state might still return failed
(if it already was in failed
state). It even sends out PropertiesChanged
with ActiveState=failed
. Moreover, even if the unit was forked and the properties return the PID of the unit, the state is still failed
. Only after some time, the state will change to activating
(don't ask me when...). :disappointed:
Systemd does return a job-ID as reply to StartUnit
. The problem is, the ID is only known when we receive the reply, so we would install a possible dbus-match too late and miss any signals sent in-between. The alternative would be to listen to ALL job-notifications of systemd, which is just ridiculous. Hence, the Job API is not useful to track a unit-operation. :disappointed:
I don't know what to do. sigh
The reason we want this activation-tracking is to make sure our TransientUnit handling matches how the old-style direct activation in dbus-daemon works (which tracks exit-code failures of the processes it executes).
I now verified and the classic systemd-activation in dbus-daemon actually never tracks service failures. It simply verifies the job was queued successfully (see ActivationRequest
signal handling in systemd). So we could technically just enable this tracking for transient units. However, it would be a missed opportunity.
I currently see no other way than watching JobRemoved
on systemd for all jobs and then checking whether a queued job failed. This should be safe to do, since it is also what systemd does internally (though it is quite sad that we need such broad matches for it).
@lucab I just pushed dbus-broker-27-2
into rawhide, which includes 2 recent commits I just pushed to main
. After discussion with Lennart, I decided to add JobRemoved
tracking. A bit unfortunate, but we can reduce the match to a short timeframe just around the StartUnit
call.
Anyway, can you give this a try? My local tests now all succeed.
@dvdhrm I tested 27-2.fc35
manually and I'm not seeing the issue anymore!
Not sure if the plan is to have a v28 tagged for this, but in either cases it would be good to propagate back this fix downstream to F33 as the update there introduced the regression.
This is now in F32 up to rawhide as dbus-broker-27-2
. Given the other fixes and improvements, I will likely also release v28 into rawhide this week.
Closing this as it is resolved. Thanks a lot for the help!
There is a regression in service activation that we started observing in
rpm-ostree
CI after a dbus-broker update. The last known-working version isv24
, while bothv26
andv27
show the same regression symptoms (though maybe different root causes):At least on v27 it doesn't seem to be a race (can be manually reproduce, not improved by adding artificial
sleep
), but it looks likesystemctl reset-failed
has some effect on it:Cross-references: