**Open** · **nh2** opened this issue 6 years ago
Did you have

```nix
} // (optionalAttrs (cfg.leaveOnStop) {
  ExecStop = "${cfg.package.bin}/bin/consul leave";
});
```

enabled? Perhaps leaving took too long and systemd bailed out? If you can tell me whether you had this option enabled or not, I can try to reproduce the issue.
I actually think that config is buggy! It should be `ExecStopPre= consul leave`; `ExecStop` is responsible for killing the PID, which `consul leave` doesn't do, IIRC.

Edit: Nope, `consul leave` does exit the unit gracefully.
Related: https://github.com/NixOS/nixpkgs/issues/33172

You also seem to have opened that issue :). I think the `Current command vanished from the unit file` part is very relevant.
@arianvp No, I have no `ExecStop` in `systemctl cat consul`. I have:

```
ExecReload=/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul reload
ExecStart=@/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul consul agent -config-dir /etc/consul.d -config-file /etc/consul.json -conf>
ExecStartPre=/nix/store/453nm1q1r7rzqzh5arx6z04qh1klg7f0-unit-script/bin/consul-pre-start
```

Also, I think that if systemd tried to stop consul at all, even via `ExecStop`, it would have printed `Stopping consul.service...` (which it didn't), wouldn't it?

The difference to #33172 is that there it at least says `Stopping Nginx Web Server...`.
I just noticed:

```
node-2..............> Job for consul.service canceled.
```

This is a string that `systemctl stop` returns when it fails to stop the job (see https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=768208 for an example).
Further research: what is certain is that whatever caused the `Job for consul.service canceled.` message also caused the service to not stop.

Edit: Scrap the quoted part below. I just realised that in the logs that failed to show `Stopping consul...`, consul didn't exit with a non-zero exit code.

> `Job for consul.service canceled.` This is a string that `systemctl stop` returns when it fails to stop the job.
I can actually reproduce this part. If I just type `systemctl stop consul` on my deployed system, I immediately get `Job for consul.service canceled.` Afterwards, `systemctl status consul` shows the service as running.

However, `journalctl` also shows that consul did get a stop command (`agent: Caught signal: terminated`) and exited 1. But then it restarted (with `Starting consul.service...` before it).
```
[root@node-3:~]# systemctl cat consul
# /nix/store/amy5ccykgajhc2q75mkbjv362z599xvr-unit-consul.service/consul.service
[Unit]
After=network.target
X-Restart-Triggers=/nix/store/0aa2x43nwmid5vwcbqxq6a1m5hpck6jh-etc-consul.json /nix/store/fiawx2232c8z11jca84b1ibd7zfjjz9d-etc-server-service.json /nix/store/5pk>

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/v2hvbpzhndgjm04djqql93bmb4w69gql-glibc-locales-2.26-131/lib/locale/locale-archive"
Environment="PATH=/nix/store/2afkicxfldn57gp22l45xavbpi4sa57g-iproute2-4.15.0/bin:/nix/store/iywd02hbirf506q740z0v6zyrvsa9gcf-gnugrep-3.1/bin:/nix/store/crda0ambd6cjfgp>
Environment="TZDIR=/nix/store/k8n4sf0yfas70yfv156knrpv9xfipjwm-tzdata-2017c/share/zoneinfo"
ExecReload=/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul reload
ExecStart=@/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul consul agent -config-dir /etc/consul.d -config-file /etc/consul.json -config-file /et>
ExecStartPre=/nix/store/453nm1q1r7rzqzh5arx6z04qh1klg7f0-unit-script/bin/consul-pre-start
PermissionsStartOnly=true
Restart=on-failure
TimeoutStartSec=0
User=consul
```
From the documentation regarding `Restart=`:

> Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors. For services that shall be able to terminate on their own choice (and avoid immediate restarting), on-abnormal is an alternative choice.

I'm not sure, but maybe setting `Restart=on-abnormal` in the service unit would allow you to stop the service manually? I.e. the stop job would not be canceled immediately.
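For illustration, a minimal sketch of what that could look like in a NixOS configuration (assuming the module's value can simply be forced; untested):

```nix
{ lib, ... }:
{
  # on-abnormal does not restart the unit after clean exits or clean
  # signals, so a manual `systemctl stop` would not be undone by systemd.
  systemd.services.consul.serviceConfig.Restart = lib.mkForce "on-abnormal";
}
```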
Related: the NixOS consul systemd service sets `TimeoutStartSec = "0";`, which is likely wrong (at least neither me nor boucman from IRC `#systemd` can guess what it's for). I asked about that on https://github.com/NixOS/nixpkgs/commit/9ce0c1cb7113#r23138805 but never got an answer from @wkennington.

Another service that has a suspicious `TimeoutStartSec=0`: (I commented). (Update: the answer is that it's still unclear. I filed https://github.com/NixOS/nixpkgs/issues/49700 for it.)
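If that setting is indeed unintended, a hedged sketch of overriding it back to a finite timeout (the `90s` value is an assumption, mirroring systemd's usual `DefaultTimeoutStartSec`):

```nix
{ lib, ... }:
{
  # TimeoutStartSec=0 disables the start timeout entirely; restore a
  # finite one so a hanging start eventually fails instead of blocking.
  systemd.services.consul.serviceConfig.TimeoutStartSec = lib.mkForce "90s";
}
```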
I have filed one systemd issue that seems to be at least related: https://github.com/systemd/systemd/issues/10626

I don't think the timer issue explains the full situation: of the "2 out of 7 machines" from the issue description on which consul didn't get restarted, only 1 had a systemd timer installed.
I am getting more and more convinced that the `Job for ... canceled.` issue and the problem of an old version running are totally separate. In particular, I suspect that the old-version problem has more to do with `Current command vanished from the unit file, execution of the command list won't be resumed.`
I have done the following experiment, which supports this:

I've started modifying https://github.com/NixOS/nixpkgs/blob/c9e13806267f7fd3351d52a19cc6db5fa2985ca9/nixos/modules/services/networking/consul.nix#L181, adding more and more `-ui` switches and deploying with nixops in between. After the deploy is done, I run `ps aux | grep consul | grep agent` to see what command line is running.

After a deploy with `-ui -ui -ui -ui -ui` (5 `-ui`), `ps` shows only 4 `-ui`:

```
Nov 02 17:52:52 node-1 systemd[26291]: consul.service: Executing: consul agent -config-dir /etc/consul.d -ui -ui -ui -ui -config-file /etc/consul.json -config-file /etc/consul-addrs.json
```

That is even though `systemctl cat` shows 5 `-ui`:

```
# systemctl cat consul.service | grep ExecStart
ExecStart=@/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul consul agent -config-dir /etc/consul.d -ui -ui -ui -ui -ui -config-file /etc/consul.json -config-file /etc/consul-addrs.json
```
I had `systemd-analyze log-level debug` on, and in `journalctl -f` I could see that systemd did indeed start consul with the command line showing only 4 `-ui`:

```
Nov 02 17:52:51 node-1 nixos[26199]: switching to system configuration /nix/store/z0c30mny9rmsywvisf318wafqxqcb8ck-nixos-system-node-1-18.03pre-git
Nov 02 17:52:51 node-1 systemd[1]: Stopping consul.service...
Nov 02 17:52:51 node-1 consul[14968]: 2018/11/02 17:52:51 [INFO] agent: Caught signal: terminated
Nov 02 17:52:52 node-1 consul[14968]: 2018/11/02 17:52:52 [INFO] agent: Exit code: 1
Nov 02 17:52:52 node-1 systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Nov 02 17:52:52 node-1 systemd[1]: consul.service: Failed with result 'exit-code'.
Nov 02 17:52:52 node-1 systemd[1]: Starting consul.service...
Nov 02 17:52:52 node-1 systemd[26286]: cephfs-mounted.service: Executing: /nix/store/xwb1wk1vxjvq4vwxfbphggygfqcb15si-unit-script/bin/cephfs-mounted-pre-start
Nov 02 17:52:52 node-1 systemd[26287]: consul.service: Executing: /nix/store/fqk7rpkdf9741blndfkrlxywdyl477j8-unit-script/bin/consul-pre-start
Nov 02 17:52:52 node-1 systemd[1]: Started consul.service.
Nov 02 17:52:52 node-1 systemd[26291]: consul.service: Executing: consul agent -config-dir /etc/consul.d -ui -ui -ui -ui -config-file /etc/consul.json -config-file /etc/consul-addrs.json
Nov 02 17:52:52 node-1 consul[26291]: ==> Consul agent running!
Nov 02 17:52:52 node-1 systemd[1]: Reloading.
Nov 02 17:52:52 node-1 systemd[1]: consul.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Nov 02 17:52:52 node-1 nixos[26199]: finished switching to system configuration /nix/store/z0c30mny9rmsywvisf318wafqxqcb8ck-nixos-system-node-1-18.03pre-git
```
And in this case, no `Job for consul.service canceled` was printed in the nixops output.

Trying this experiment more times, I also observe deploys where `Job for consul.service canceled` is printed, but consul nevertheless starts with the correct number of `-ui`s.
I wrote a script to debug this automatically, but encountered another systemd issue that stopped my script from being effective, because the systemd timer randomly stopped firing at all: https://github.com/systemd/systemd/issues/6680#issuecomment-435597174
After adding the `Requires=` from timer to service (as described in the systemd link; sketched below), with my script I can now reproduce the issue of the old `ExecStart` line still running after a deploy within < 10 deploys on average (so in around < 2 minutes).
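In NixOS terms, that `Requires=` addition is roughly the following (a sketch; `cdn-rsync` is the timer/service pair from my setup described further below):

```nix
{
  # Tie the timer to its service: when cdn-rsync.service is stopped,
  # the timer is stopped as well and no longer re-fires.
  systemd.timers.cdn-rsync.requires = [ "cdn-rsync.service" ];
}
```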
My script does nothing special: it just adds more `-ui`s as described above and checks whether they are reflected in `ps`, in a loop, printing after how many nixops deploys the change didn't get reflected.
OK, so some more info: I have an rsync job with a systemd timer; the rsync job service has `Requires=consul.service`.

Involved are: `consul.service`, `cdn-rsync.service`, `cdn-rsync.timer`. See https://gist.github.com/nh2/be476d7ce3466ea5b92dfbc39b235770 for their contents.

It seems that, at least for one of the "2 out of 7 machines" (the other one doesn't run a timer, so there it's totally unclear to me so far), the deploy not getting the new consul started correctly depends on what those timers do.
@arianvp and I know from observing systemd (https://github.com/systemd/systemd/issues/10626) that a timer with `OnActiveSec=` or `OnUnitActiveSec=` will, contrary to the expectations of people on the `#systemd` IRC channel, immediately restart the service when it goes down.

I suspect that what happens is this: `switch-to-configuration` does `systemctl stop consul.service cdn-rsync.service`; the timer immediately restarts them, and this brings consul back up before the `systemctl daemon-reload` and subsequent `systemctl start consul` that `switch-to-configuration` does.

I haven't been able to reproduce the problem if my timer uses `OnUnitInactiveSec=` instead, e.g. set to `3s` (sketched below). This is probably because, in contrast to the other two options above, this one actually waits the given amount of time after `cdn-rsync.service` gets pulled down by the stop of consul, and waiting 3 seconds is enough for nixops to finish the deploy in that case. Still racy, but an apparent workaround in my situation.
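A minimal sketch of that workaround timer (unit name from my setup; only the relevant settings shown):

```nix
{
  systemd.timers.cdn-rsync = {
    wantedBy = [ "timers.target" ];
    timerConfig = {
      # Unlike OnActiveSec=/OnUnitActiveSec=, this fires only a fixed
      # delay after the unit last became inactive, giving the deploy
      # time to run daemon-reload and start before the timer re-fires.
      OnUnitInactiveSec = "3s";
    };
  };
}
```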
It is still unclear whether this also explains the same issue on the other machine (likely not), and whether #33172 involves a similar situation.
Yeah, it really seems to be a race condition in there. If I set `OnUnitInactiveSec = "500ms"`, I can reproduce the issue super quickly; with `OnUnitInactiveSec = "3s"`, not at all.

This doesn't seem like a great solution though: magic sleeps will start breaking when other participants in the race become slower some day.

Here's my quick script, btw:
```python
#!/usr/bin/env python3
import argparse
import subprocess

parser = argparse.ArgumentParser(
    description='Repeatedly deploy with nixops and check whether the change arrived.')
parser.add_argument('-d', '--deployment', type=str, help='deployment name')
parser.add_argument('--machine-name', type=str, help='nixops machine name to --include')
parser.add_argument('--ssh-host', type=str, help='SSH connection string to connect to for running ps')
args = parser.parse_args()

deployment = args.deployment
machine_name = args.machine_name
ssh_host = args.ssh_host

consul_nix = "../../nix-channel/nixpkgs/nixos/modules/services/networking/consul.nix"
line_no = 182 - 1  # 0-based index of the ExecStart line in consul.nix


def update_consul_file():
    # Append another ' -ui' switch inside the quoted ExecStart line.
    with open(consul_nix) as f:
        consul_nix_lines = f.read().split('\n')
    line = consul_nix_lines[line_no]
    consul_nix_lines[line_no] = line[:len(line)-1] + ' -ui"'
    with open(consul_nix, 'w') as f:
        f.writelines('\n'.join(consul_nix_lines))


def deploy():
    subprocess.check_call(['./ops', 'deploy', '-d', deployment, '--include', machine_name])


def count_ui():
    # Count how many '-ui' switches the running consul agent was started with.
    ps_out = subprocess.check_output(
        ['ssh', ssh_host, 'ps aux | grep "[c]onsul agent"'], universal_newlines=True)
    return len([word for word in ps_out.split(' ') if word == '-ui'])


update_consul_file()
deploy()
initial_count = count = count_ui()
while True:
    update_consul_file()
    deploy()
    new_count = count_ui()
    print(new_count)
    if new_count == count:
        print("-ui count failed to change!")
        print("initial_count:", initial_count)
        print("new_count:", new_count)
        break
    count = new_count
```
I use it like `./consul-debug.py --deployment mydeployment --machine-name node-1 --ssh-host root@node-1.example.com`.
@nh2 so was this an issue with timers, or a problem in the consul module?
This is a problem with timers producing a conflicting job queue on the systemd side of things every once in a while. We opened an upstream issue for this: https://github.com/systemd/systemd/issues/10626

I'm not sure there's anything for us to fix here on the NixOS side of things. I'd be OK with closing.
Given that NixOS forces Consul to be restarted (even if it is a server), we can adapt the consul module to prevent that (sketched below): set `skip_leave_on_interrupt = false`, remove `ExecStop` (so stopping/leaving will be done using SIGINT), and set `systemd.services.consul.stopIfChanged = false;` (so stop & start are combined into a single restart). I assume that a single `restart` won't have such side effects.

But yeah, this is very service-specific and looks quite like a hack.
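A rough, untested sketch of that adaptation (the `extraConfig` attribute path and the empty `ExecStop=` reset are assumptions about how the module's settings can be overridden):

```nix
{ lib, ... }:
{
  # Let consul leave the cluster gracefully when stopped via SIGINT.
  services.consul.extraConfig.skip_leave_on_interrupt = false;

  systemd.services.consul = {
    # Combine stop + start into a single `systemctl restart`.
    stopIfChanged = false;
    # An empty ExecStop= clears any ExecStop the module configured.
    serviceConfig.ExecStop = lib.mkForce "";
  };
}
```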
@arianvp, @danbst, I agree with you that there's nothing really generically wrong on the NixOS side of things. Feel free to open a PR to change some behaviour in the consul module if appropriate.
> Given that NixOS forces Consul to be restarted (even if it is a server), we can adapt the consul module to prevent that.

@danbst I don't quite understand. My original issue was that NixOS does NOT restart Consul when I expect it to, making rolling upgrades unreliable.
By the way, the systemd issue https://github.com/systemd/systemd/issues/10626, where I have commented by now, indicates that even Facebook is hitting this problem in systemd.
@nh2 you are right :man_facepalming:. What would be the best solution, do you think?
@danbst I am not sure, but until anybody from systemd has commented there, I don't think it's clear that this is not a NixOS issue, and we should probably reopen.
We ran into the same problem at Awake Security. I found the root cause and I have a minimal reproduction.

First off, here is the reproducing NixOS configuration (for EC2; you can tweak it to your desired physical configuration):
```nix
let
  # To reproduce the bug:
  #
  # * Deploy this configuration "as is"
  # * Then increment "FOO" to "1"
  # * Then redeploy the configuration
  #
  # If the reproduction succeeds the `restart-child` service should still be
  # erroneously displaying `0` instead of `1`
  FOO = "0";

in
{ imports = [ <nixpkgs/nixos/modules/virtualisation/amazon-image.nix> ];

  ec2.hvm = true;

  # This is the service that the `switch-to-configuration` script will fail to
  # update correctly. This service prints the value of the `FOO` environment
  # variable and will always lag behind by one deploy due to being prematurely
  # restarted in the middle of a `switch-to-configuration` step.
  systemd.services.restart-child = {
    enable = true;

    wantedBy = [ "restart-parent.service" ];

    environment = { inherit FOO; };

    script = ''
      while true; do echo "$FOO"; sleep 1; done
    '';
  };

  # This service is responsible for restarting `restart-child` every second,
  # since the bug is triggered by prematurely restarting a service that is
  # being upgraded by NixOS.
  systemd.services.restart-parent = {
    enable = true;

    wantedBy = [ "multi-user.target" ];

    script = "\n";
  };

  systemd.timers.restart-parent = {
    wantedBy = [ "timers.target" ];

    timerConfig = {
      Type = "oneshot";

      OnUnitInactiveSec = "1s";

      AccuracySec = "1s";
    };
  };

  # This service adds a delay to the `systemctl stop …` step of the
  # `switch-to-configuration` script so that the `restart-parent` service has
  # enough time to restart the `restart-child` service.
  systemd.services.delay-stop = {
    enable = true;

    wantedBy = [ "multi-user.target" ];

    environment = { inherit FOO; };

    script = ''
      while true; do sleep 1; done
    '';

    # The purpose of this `preStop` delay is to give the `restart-parent`
    # service a chance to restart the `restart-child` service before
    # the `switch-to-configuration` script can run `systemctl daemon-reload`
    preStop = ''
      sleep 10
    '';
  };
}
```
Here's the explanation of what triggers the bug:

- You run the `switch-to-configuration switch` script (either directly or indirectly via `nixos-rebuild switch`), which normally performs the following actions:
  - `systemctl stop` all changed services
  - `systemctl daemon-reload` to load the service changes
  - `systemctl start` all changed services
- After the `systemctl stop`, something else on the machine prematurely restarts one of the services being stopped
- The `systemctl daemon-reload` does nothing to interrupt this stale running service
  - This is `systemd` behaving as expected
- The final `systemctl start` does nothing to replace the old running service (because it's already running, albeit stale)
  - This is also `systemd` behaving as expected

This leads me to conclude that this is a NixOS bug and not a `systemd` bug. Specifically, I believe the most appropriate fix is to change the `switch-to-configuration` script to replace `systemctl start` with `systemctl restart`.
@Gabriel439 Great investigation! Restarting makes sense to me. But would we not also have to do the `stop` after the `daemon-reload`?

My understanding is that `stop` is done to get rid of services that were deleted from the configuration. In the current racy setting, it can happen that you remove `a.service`, the `stop` happens, something restarts `a.service`, and then `daemon-reload` and `start` happen, so the old service is still running. Even when changing the last `start` to `restart`, it would still be running.

So it seems that the "real" stop can only come after the `daemon-reload`.
@nh2: Yeah, I like the idea of just moving the `stop` to after the `daemon-reload` instead.
`systemctl restart` has the problem that it is less correct and can lead to subtle bugs; hence we don't default to it. See `systemd.services.<name>.stopIfChanged`:

> If set, a changed unit is restarted by calling systemctl stop in the old configuration, then systemctl start in the new one. Otherwise, it is restarted in a single step using systemctl restart in the new configuration. The latter is less correct because it runs the ExecStop commands from the new configuration.

Moving the stop to after the `daemon-reload` has similar problems, and would be a major breaking change IMO.
@arianvp: Why would moving the `stop` after the `daemon-reload` have the same problem?
After the `daemon-reload`, the unit of the new generation is loaded; hence you would execute the `ExecStop` of the new version of the unit instead of the old one, which is wrong.

The current order is:

1. stop
2. activate configuration
3. (restart, if opted in with `stopIfChanged = false`, instead of stop)
4. daemon-reload
5. start

If you now move the stop after the daemon-reload, the wrong `ExecStop` is executed. To implement the behaviour of `ExecStop` from the old generation and `ExecStart` from the new one, we must do the `stop` before the `daemon-reload`.

If you move it after the daemon-reload, there is no difference between `stop && start` and `restart`, except for transactionality. If you want to "stop and start" the service after the `daemon-reload`, you can do that today by setting `stopIfChanged = false` on your systemd unit (sketched below), and that's probably the fix you're looking for instead.
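A minimal sketch of that per-unit opt-in (using `consul` as the example unit):

```nix
{
  # Restart this unit atomically with `systemctl restart` after
  # daemon-reload, instead of the default `systemctl stop` (old unit)
  # followed by `systemctl start` (new unit).
  systemd.services.consul.stopIfChanged = false;
}
```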
Personally, I wouldn't mind deprecating (or better documenting) this functionality, as its semantics are a bit confusing (https://github.com/NixOS/nixpkgs/issues/49528) and I never found a service where `restart` doesn't just simply work.

Edit: I think a good example of where things break is scripted networking, where it's important that the `ExecStop` of the previous generation is executed before the `ExecStart` of the new generation runs.
@nh2 @Gabriel439 would setting `stopIfChanged = false;` fix this issue for you?

Maybe we close this by documenting it better? "If you want transactional stops & starts, please set `stopIfChanged = false`." "If you want to sacrifice transactionality for the use case where `ExecStop` is in the old generation and `ExecStart` is in the new one, please leave it at the default `true`."

Maybe defaulting to `false` is just the better option here... but I'm not sure.
@arianvp: Individually setting `stopIfChanged = false` would not fix it, because there are too many services we'd have to mark in this way.

Something that would fix things for us would be to add a custom `X-RestartOnReconfiguration` option or something similar. That way we could ensure that any reverse dependency which triggers restarts of other services is always stopped and started alongside the upgraded services, to avoid triggering a premature restart.
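A purely hypothetical sketch of what using such an option might look like (neither the option name nor its placement in `serviceConfig` exists today; both are assumptions):

```nix
{
  # Hypothetical: mark this unit so the activation script stops and
  # starts it alongside the services it would otherwise restart early.
  systemd.services.restart-parent.serviceConfig."X-RestartOnReconfiguration" = "true";
}
```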
> ...to change the `switch-to-configuration` script to replace `systemctl start` with `systemctl restart`.
~And another problem: I found that `restart` does not start a service if it is already stopped (found as part of https://github.com/NixOS/nixpkgs/issues/72394#issuecomment-611499518, where mdadm was in state `failed (Result: exit-code)` and redeploying with nixops did not restart it). Only `start` seems to do that.~

~Does systemd have some restart-or-start-if-not-already-started command?~

Edit: This was probably wrong; I just didn't notice that my service restarted and then failed. It would still be nice for somebody in the know to confirm how it will behave.
@nh2 the restart behaviour comes from `nixos/modules/system/activation/switch-to-configuration.pl`. Unfortunately, it's not really well documented, and it contains some bugs (https://github.com/NixOS/nixpkgs/issues/84633#issuecomment-610927139, https://github.com/NixOS/nixpkgs/issues/73853, https://github.com/NixOS/nixpkgs/pull/73871) and more.
> Something that would fix things for us would be to add a custom X-RestartOnReconfiguration option or something similar. That way we could ensure that any reverse dependency which triggers restarts of other services could always be stopped and started alongside the upgraded services to avoid triggering a premature restart.

I don't fully understand what you are saying here. systemd services restart in isolation unless `PartOf=` is set; that is, the restart of one service should never trigger the restart of another.

What you're suggesting is also exactly how `restartIfChanged = true && stopIfChanged = false` is implemented: they're a property on the unit that gets read during activation of the new configuration, which then decides whether we do `systemctl stop x.service && systemctl start x.service` (which is prone to the race conditions reported here) or an atomic `systemctl restart x.service`:

```perl
if (!boolIsTrue($unitInfo->{'X-StopIfChanged'} // "yes")) {
    # This unit should be restarted instead of
    # stopped and started.
    $unitsToRestart{$unit} = 1;
    recordUnit($restartListFile, $unit);
}
```
I commented in more detail on how to possibly address this in https://github.com/NixOS/nixpkgs/issues/49528#issuecomment-622196884. The issue mentioned here is an instance of the "bug" mentioned there (only reloading, even though a restart would be necessary).

Let's move the discussion over there; closing here.
I keep encountering this; posting a concrete situation again, this time with the `nebula` VPN service (and a timer that starts a service that `requires` `nebula@servers.service`):

```
updating GRUB 2 menu...
stopping the following units: my-server.service, nebula@servers.service
Job for nebula@servers.service canceled.
activating the configuration...
warning: user ‘postgres’ has unknown group ‘postgres’
setting up /etc...
reloading user units for root...
setting up tmpfiles
reloading the following units: dbus.service
starting the following units: my-server.service, nebula@servers.service
activation finished successfully
```

Here we see again: `Job for nebula@servers.service canceled.`
@arianvp

> `systemctl restart` has the problem that it is less correct and can lead to subtle bugs. Hence we don't default to it.
>
> The current order is:
>
> 1. stop
> 2. activate configuration
> 3. (restart, if opted in with `stopIfChanged = false`, instead of stop)
> 4. daemon-reload
> 5. start

Can we not just replace step 5 (`start`) by `restart`? The service would still be `stop`ped in the earlier step. But `restart` would make sure that if a timer triggered a restart during step 1 (`stop`), the unit would still be brought onto the new configuration at step 5, instead of `start` being a no-op on the already-running stale unit.

I believe that's what @Gabriella439 meant in https://github.com/NixOS/nixpkgs/issues/49415#issuecomment-601936292: to replace `1. stop + 5. start` by `1. stop + 5. restart`, not to replace the whole thing by a single `restart`.
So, concretely, I propose:

```diff
-            print STDERR "starting the following units: ", join(", ", @units_to_start_filtered), "\n"
+            print STDERR "starting (via restart, see #49415) the following units: ", join(", ", @units_to_start_filtered), "\n"
         }
-        system("$new_systemd/bin/systemctl", "start", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
+        system("$new_systemd/bin/systemctl", "restart", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
```
Edit: This doesn't work trivially; it makes the system hang because it restarts the network. Why? Because in this commit the logic was introduced that what's started isn't actually what's printed:

```perl
print STDERR "starting the following units: ", join(", ", @units_to_start_filtered), "\n"
#                                                         ^ "_filtered" here

system("$new_systemd/bin/systemctl", "start", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
#                                                              ^ NO "_filtered" here
```
I am reopening this issue to respond to this:

> The issue mentioned here is an instance of the "bug" mentioned [in #49528] (only reloading, even though a restart would be necessary).

@flokli That isn't quite correct, at least not as written. Here, reloading isn't involved at all. This issue tracks the fact that even `stop + start` does not work, because the `start` can be replaced by a too-early `restart` in the presence of timers.

I am re-opening this issue because I believe we can fix this issue (`stop` not happening at all) more easily than #49528 ("Interactions of `restartIfChanged`, `reloadIfChanged`, `stopIfChanged`").
I have created a PR that implements `stop + restart` (instead of `stop + start`) of the stopped units:

Please help test!
The original title of this issue was different; it was renamed after the root cause was identified to be systemd timers in https://github.com/NixOS/nixpkgs/issues/49415#issuecomment-601936292 and after.

**Issue description**
I just deployed the upgrade from Consul 0.9.3 -> 1.3.0 (#49165) to my 18.03 server cluster (in my case, I upgraded from 1.0.6). The machines run identical configurations deployed by `nixops`, and deploying should have stopped all old consul processes and launched the new version. Yet I found that on 2 out of 7 machines, the old consul version was still running.

On a node where the deployment worked, the `journalctl` looks like this:

Note the presence of `Stopping consul.service...`.

On a node where it didn't work, it looks like this:

Note the absence of `Stopping consul.service...`.

This is despite `nixops` having the following output:

Observe how:

- `stopping the following units: ..., consul.service` was printed
- `Stopping consul.service` didn't occur

Possibly related is the systemd warning `consul.service: Current command vanished from the unit file`.