NixOS / nixpkgs

Nix Packages collection & NixOS

systemd timers can result in changed services not being restarted reliably #49415

Open nh2 opened 6 years ago

nh2 commented 6 years ago

The original title of this issue was

systemd services are not restarted reliably when package is upgraded

It was renamed after the root cause was identified to be systemd timers in https://github.com/NixOS/nixpkgs/issues/49415#issuecomment-601936292 and after.

Issue description

I just deployed the upgrade from Consul 0.9.3 -> 1.3.0 (#49165) to my 18.03 server cluster (in my case, I upgraded from 1.0.6).

The machines run identical configurations deployed by nixops, and deploying should have stopped all old consul processes and launched the new version.

Yet I found that on 2 out of 7 machines, the old consul version was still running.

On a node where the deployment worked, the journalctl looks like this:

Oct 29 14:44:42 node-1 nixos[22687]: switching to system configuration /nix/store/fxfk4mfi5b90h1vl0hj99gy57zzpl9hy-nixos-system-node-1-18.03pre-git
Oct 29 14:44:42 node-1 systemd[1]: Stopping rsync CDN files to CDN nodes...
Oct 29 14:44:42 node-1 systemd[1]: Stopping ceph-mount.service...
Oct 29 14:44:42 node-1 systemd[1]: Stopping cephfsRunningCheckWatchdog.service...
Oct 29 14:44:42 node-1 systemd[1]: Stopping Nginx Web Server...
Oct 29 14:44:42 node-1 systemd[1]: Stopping consul-alerts.service...
Oct 29 14:44:42 node-1 systemd[1]: Stopped target Local File Systems.
Oct 29 14:44:42 node-1 systemd[1]: Stopped target All Network Interfaces (deprecated).
Oct 29 14:44:42 node-1 systemd[1]: Stopped target Remote File Systems.
Oct 29 14:44:42 node-1 systemd[1]: Stopped server.service.
Oct 29 14:44:42 node-1 systemd[1]: Stopped consul-alerts.service.
Oct 29 14:44:42 node-1 systemd[1]: Stopping consul.service...
Oct 29 14:44:42 node-1 consul[22232]:     2018/10/29 14:44:42 [INFO] Caught signal:  terminated
Oct 29 14:44:42 node-1 systemd[1]: Stopped cephfsRunningCheckWatchdog.service.
Oct 29 14:44:42 node-1 systemd[1]: Stopped Nginx Web Server.
Oct 29 14:44:43 node-1 systemd[1]: Stopped rsync CDN files to CDN nodes.
Oct 29 14:44:43 node-1 systemd[1]: Reloading.
Oct 29 14:44:43 node-1 systemd[1]: consul.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Oct 29 14:44:43 node-1 systemd[1]: Started Create or wait for ceph..
Oct 29 14:44:43 node-1 systemd[1]: Starting ceph-mount.service...
Oct 29 14:44:43 node-1 systemd[1]: Reloading D-Bus System Message Bus.
Oct 29 14:44:43 node-1 systemd[1]: Reloaded D-Bus System Message Bus.
Oct 29 14:44:43 node-1 systemd[1]: Started ceph-mount.service.
Oct 29 14:44:43 node-1 systemd[1]: Starting ceph-mounted.service...
Oct 29 14:44:43 node-1 consul[22232]:     2018/10/29 14:44:43 [INFO] Exit code: 1
Oct 29 14:44:43 node-1 systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Oct 29 14:44:43 node-1 systemd[1]: consul.service: Failed with result 'exit-code'.
Oct 29 14:44:43 node-1 systemd[1]: Starting consul.service...
Oct 29 14:44:43 node-1 systemd[1]: Started consul.service.
Oct 29 14:44:43 node-1 systemd[1]: Starting consulReady.service...
Oct 29 14:44:43 node-1 systemd[1]: Started ceph-mounted.service.
Oct 29 14:44:43 node-1 systemd[1]: Starting Nginx Web Server...
Oct 29 14:44:43 node-1 systemd[1]: Started Nginx Web Server.
Oct 29 14:44:43 node-1 systemd[1]: Starting ceph-mounted.service...
Oct 29 14:44:43 node-1 systemd[1]: Starting camPostgresRole.service...
Oct 29 14:44:43 node-1 systemd[1]: Reached target Local File Systems.
Oct 29 14:44:43 node-1 systemd[1]: Reached target All Network Interfaces (deprecated).
Oct 29 14:44:43 node-1 systemd[1]: Reached target Remote File Systems.
Oct 29 14:44:43 node-1 systemd[1]: Started ceph-mounted.service.
Oct 29 14:44:43 node-1 systemd[1]: Started camPostgresRole.service.
Oct 29 14:44:43 node-1 systemd[1]: Starting server.service...
Oct 29 14:44:44 node-1 systemd[1]: Started consulReady.service.
Oct 29 14:44:44 node-1 systemd[1]: Started rsync CDN files to CDN nodes.
Oct 29 14:44:44 node-1 systemd[1]: Starting consul-alerts.service...
Oct 29 14:44:44 node-1 systemd[1]: Started server.service.
Oct 29 14:44:44 node-1 systemd[1]: Started consul-alerts.service.
Oct 29 14:44:44 node-1 nixos[22687]: finished switching to system configuration /nix/store/fxfk4mfi5b90h1vl0hj99gy57zzpl9hy-nixos-system-node-1-18.03pre-git

Note the presence of Stopping consul.service....

On a node where it didn't work, it looks like this:

Oct 29 13:52:38 node-2 nixos[27726]: switching to system configuration /nix/store/1x475px1qyk9f9gcdk0pvpbfhip0v1rs-nixos-system-node-2-18.03pre-git
Oct 29 13:52:38 node-2 systemd[1]: Stopping rsync CDN files to CDN nodes...
Oct 29 13:52:38 node-2 systemd[1]: Stopped target Multi-User System.
Oct 29 13:52:38 node-2 systemd[1]: Stopping server.service...
Oct 29 13:52:38 node-2 systemd[1]: Stopping consul-alerts.service...
Oct 29 13:52:38 node-2 systemd[1]: Stopped target Local File Systems.
Oct 29 13:52:38 node-2 systemd[1]: Stopped target All Network Interfaces (deprecated).
Oct 29 13:52:38 node-2 systemd[1]: Stopped target Remote File Systems.
Oct 29 13:52:38 node-2 systemd[1]: Stopped server.service.
Oct 29 13:52:38 node-2 systemd[1]: Stopped rsync CDN files to CDN nodes.
Oct 29 13:52:38 node-2 systemd[1]: Starting ceph-mounted.service...
Oct 29 13:52:38 node-2 systemd[1]: Started ceph-mounted.service.
Oct 29 13:52:51 node-2 systemd[1]: Stopped consul-alerts.service.
Oct 29 13:52:51 node-2 systemd[1]: Starting consulReady.service...
Oct 29 13:52:51 node-2 systemd[1]: Started consulReady.service.
Oct 29 13:52:51 node-2 systemd[1]: Started rsync CDN files to CDN nodes.
Oct 29 13:52:51 node-2 systemd[1]: Reloading.
Oct 29 13:52:51 node-2 systemd[1]: consul.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Oct 29 13:52:51 node-2 systemd[1]: cdn-rsync.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Oct 29 13:52:51 node-2 systemd[1]: nginx.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Oct 29 13:52:51 node-2 systemd[1]: Reloading D-Bus System Message Bus.
Oct 29 13:52:51 node-2 systemd[1]: Reloaded D-Bus System Message Bus.
Oct 29 13:52:51 node-2 systemd[1]: Stopping Nginx Web Server...
Oct 29 13:52:51 node-2 systemd[1]: Stopped Nginx Web Server.
Oct 29 13:52:51 node-2 systemd[1]: Starting ceph-mounted.service...
Oct 29 13:52:51 node-2 systemd[1]: Started ceph-mounted.service.
Oct 29 13:52:51 node-2 systemd[1]: Starting Nginx Web Server...
Oct 29 13:52:51 node-2 systemd[1]: Started Nginx Web Server.
Oct 29 13:52:51 node-2 systemd[1]: Starting ceph-mounted.service...
Oct 29 13:52:51 node-2 systemd[1]: Starting consulReady.service...
Oct 29 13:52:51 node-2 systemd[1]: Starting camPostgresRole.service...
Oct 29 13:52:51 node-2 systemd[1]: Reached target Local File Systems.
Oct 29 13:52:51 node-2 systemd[1]: Reached target Remote File Systems.
Oct 29 13:52:51 node-2 systemd[1]: Starting stolonCreatePostgresDataDir.service...
Oct 29 13:52:51 node-2 systemd[1]: Reached target All Network Interfaces (deprecated).
Oct 29 13:52:51 node-2 systemd[1]: Started ceph-mounted.service.
Oct 29 13:52:51 node-2 systemd[1]: Started stolonCreatePostgresDataDir.service.
Oct 29 13:52:51 node-2 systemd[1]: Started camPostgresRole.service.
Oct 29 13:52:51 node-2 systemd[1]: Starting server.service...
Oct 29 13:52:51 node-2 systemd[1]: Started consulReady.service.
Oct 29 13:52:51 node-2 systemd[1]: Starting stolonClusterDataReady.service...
Oct 29 13:52:51 node-2 systemd[1]: Starting consul-alerts.service...
Oct 29 13:52:51 node-2 systemd[1]: Started stolonClusterDataReady.service.
Oct 29 13:52:51 node-2 systemd[1]: Starting stolonConfigChange.service...
Oct 29 13:52:51 node-2 systemd[1]: Started stolonConfigChange.service.
Oct 29 13:52:52 node-2 systemd[1]: Started consul-alerts.service.
Oct 29 13:52:52 node-2 systemd[1]: Started server.service.
Oct 29 13:52:52 node-2 systemd[1]: Reached target Multi-User System.
Oct 29 13:52:52 node-2 nixos[27726]: finished switching to system configuration /nix/store/1x475px1qyk9f9gcdk0pvpbfhip0v1rs-nixos-system-node-2-18.03pre-git

Note the absence of Stopping consul.service...

This is despite nixops having the following output:

ceph-deployment> closures copied successfully
node-2..............> uploading key ‘secrets’...
node-2..............> uploading key ‘nginxSSLprivkey’...
node-2..............> uploading key ‘cdn-rsync-ssh-privkey’...
node-2..............> updating GRUB 2 menu...
node-2..............> stopping the following units: cdn-rsync.service, server.service, consul-alerts.service, consul.service
node-2..............> Job for consul.service canceled.
node-2..............> activating the configuration...
node-2..............> warning: user ‘postgres’ has unknown group ‘postgres’
node-2..............> setting up /etc...
node-2..............> setting up tmpfiles
node-2..............> reloading the following units: dbus.service
node-2..............> restarting the following units: nginx.service
node-2..............> starting the following units: cdn-rsync.service, server.service, consul-alerts.service, consul.service
node-2..............> activation finished successfully
ceph-deployment> deployment finished successfully

Observe how nixops reports stopping the following units: ... consul.service and later starting the following units: ... consul.service, and even prints Job for consul.service canceled., yet the journal on node-2 contains no Stopping consul.service... line, and the old consul process kept running.

Possibly related is the systemd warning consul.service: Current command vanished from the unit file.

arianvp commented 6 years ago

Did you have

 } // (optionalAttrs (cfg.leaveOnStop) {
          ExecStop = "${cfg.package.bin}/bin/consul leave";
        });

enabled? Perhaps leaving took too long and systemd bailed out?

If you can tell me whether you had this option enabled or not, I can try to reproduce the issue.
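
For reference, this is roughly how that option would be enabled from a NixOS configuration (a minimal sketch; the option name matches the cfg.leaveOnStop in the snippet above):

```nix
{ ... }:
{
  services.consul = {
    enable = true;
    # With leaveOnStop set, the unit gets the ExecStop above and runs
    # `consul leave` when systemd stops the service.
    leaveOnStop = true;
  };
}
```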

arianvp commented 6 years ago

I actually think that config is buggy!

It should be ExecStopPre= consul leave.

ExecStop is responsible for killing the PID, which consul leave doesn't do IIRC.

edit: Nope, consul leave does exit the unit gracefully.

arianvp commented 6 years ago

Related: https://github.com/NixOS/nixpkgs/issues/33172

You also seem to have opened that issue :). I think the Current command vanished from the unit file part is very relevant.

nh2 commented 6 years ago

@arianvp No, I have no ExecStop in systemctl cat consul.

I have:

ExecReload=/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul reload
ExecStart=@/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul consul agent -config-dir /etc/consul.d -config-file /etc/consul.json -conf>
ExecStartPre=/nix/store/453nm1q1r7rzqzh5arx6z04qh1klg7f0-unit-script/bin/consul-pre-start

Also, I think that if systemd tried to stop consul at all, even via ExecStop, it would have printed Stopping consul.service... (which it didn't), wouldn't it?

nh2 commented 6 years ago

The difference to #33172 is that there it at least says Stopping Nginx Web Server....

arianvp commented 6 years ago

I just noticed

node-2..............> Job for consul.service canceled.

This is a string that systemctl stop returns when it fails to stop the job (See https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=768208 for example)

arianvp commented 6 years ago

Further research:

What is certain is that whatever caused the Job canceled for consul.service also caused the service to not stop.

edit: Scrap this part below. I just realised that in the logs that lack Stopping consul..., consul didn't exit with a non-zero exit code.

`SIGTERM` causes a `consul server` to exit with exit code `1`, which (I think?) could trigger the `Restart=`, so a race condition occurs between `systemctl stop` and `Restart=`, causing the stop job to be canceled. Though I am a bit skeptical about this hypothesis, since you don't have `Stopping consul.service ...` in your logs.

```
$ nix run nixpkgs.consul
$ consul agent -data-dir /tmp/consul -server=true &
$ pkill -TERM consul
2018/10/31 16:54:40 [INFO] Exit code: 1
```

Whilst non-servers gracefully exit on `SIGTERM`:

```
$ consul agent -data-dir /tmp/consul -server=false &
$ pkill -TERM consul
2018/10/31 16:54:40 [INFO] Exit code: 0
```

I think the fix here is to make the systemd unit send `SIGINT` instead of `SIGTERM`, as advised here: https://www.consul.io/docs/agent/basics.html#stopping-an-agent. This should prevent the race condition from occurring. I haven't been able to trigger the race condition myself though, so I might be wrong; on my local machine `systemctl stop consul` works fine, even in `-server` mode.

nh2 commented 6 years ago

Job for consul.service canceled.

This is a string that systemctl stop returns when it fails to stop the job

I can actually reproduce this part.

If I just type systemctl stop consul on my deployed system, I immediately get Job for consul.service canceled.

Afterwards systemctl status consul shows the service as running.

However, journalctl also shows that consul did get a stop command (agent: Caught signal: terminated) and exited 1. But then it restarted (with Starting consul.service... before it).

nh2 commented 6 years ago

[root@node-3:~]# systemctl cat consul
# /nix/store/amy5ccykgajhc2q75mkbjv362z599xvr-unit-consul.service/consul.service
[Unit]
After=network.target
X-Restart-Triggers=/nix/store/0aa2x43nwmid5vwcbqxq6a1m5hpck6jh-etc-consul.json /nix/store/fiawx2232c8z11jca84b1ibd7zfjjz9d-etc-server-service.json /nix/store/5pk>

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/v2hvbpzhndgjm04djqql93bmb4w69gql-glibc-locales-2.26-131/lib/locale/locale-archive"
Environment="PATH=/nix/store/2afkicxfldn57gp22l45xavbpi4sa57g-iproute2-4.15.0/bin:/nix/store/iywd02hbirf506q740z0v6zyrvsa9gcf-gnugrep-3.1/bin:/nix/store/crda0ambd6cjfgp>
Environment="TZDIR=/nix/store/k8n4sf0yfas70yfv156knrpv9xfipjwm-tzdata-2017c/share/zoneinfo"

ExecReload=/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul reload
ExecStart=@/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul consul agent -config-dir /etc/consul.d -config-file /etc/consul.json -config-file /et>
ExecStartPre=/nix/store/453nm1q1r7rzqzh5arx6z04qh1klg7f0-unit-script/bin/consul-pre-start
PermissionsStartOnly=true
Restart=on-failure
TimeoutStartSec=0
User=consul

c0bw3b commented 6 years ago

From the doc regarding Restart= :

Setting this to on-failure is the recommended choice for long-running services, in order to increase reliability by attempting automatic recovery from errors. For services that shall be able to terminate on their own choice (and avoid immediate restarting), on-abnormal is an alternative choice.

I'm not sure, but maybe setting Restart=on-abnormal in the service unit would allow you to stop the service manually, i.e. the stop job would not be canceled immediately.
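
For illustration, a minimal sketch (untested) of forcing the suggested policy onto the consul unit from a NixOS configuration; lib.mkForce overrides the module's default Restart=on-failure:

```nix
{ lib, ... }:
{
  # on-abnormal restarts only after unclean signals, timeouts, or
  # watchdog events, not after unclean exit codes like consul's
  # exit 1 on SIGTERM, so the stop job should no longer race.
  systemd.services.consul.serviceConfig.Restart = lib.mkForce "on-abnormal";
}
```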

nh2 commented 6 years ago

Related:

The NixOS consul systemd service sets TimeoutStartSec = "0";, which is likely wrong (at least neither I nor boucman from IRC #systemd can guess what it's for).

I asked about that on https://github.com/NixOS/nixpkgs/commit/9ce0c1cb7113#r23138805 but never got an answer from @wkennington.
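
For anyone wanting to experiment, a hedged sketch (untested) of overriding it back to a finite value; TimeoutStartSec=0 disables the start timeout entirely:

```nix
{ lib, ... }:
{
  # Restore a finite start timeout for the consul unit (90s is
  # systemd's usual default via DefaultTimeoutStartSec).
  systemd.services.consul.serviceConfig.TimeoutStartSec = lib.mkForce "90s";
}
```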

nh2 commented 6 years ago

Another service that has a suspicious TimeoutStartSec=0:

https://github.com/NixOS/nixpkgs/blob/89ede978a0614fd7215e0aa2c483df142cc79af9/nixos/modules/services/monitoring/osquery.nix#L81

(I commented there. Update: the answer is that it's still unclear; I filed https://github.com/NixOS/nixpkgs/issues/49700 for it.)

nh2 commented 6 years ago

I have filed one systemd issue that seems to be at least related: https://github.com/systemd/systemd/issues/10626

I don't think the timer issue explains the full situation: of the "2 out of 7 machines" from the issue description where consul didn't get restarted, only 1 had a systemd timer installed.

nh2 commented 6 years ago

I am getting more and more convinced that the Job for ... canceled. issue and the problem of an old version running are totally separate.

In particular, I suspect that the old-version problem has more to do with Current command vanished from the unit file, execution of the command list won't be resumed..

I have done the following experiment, which supports this:

I've started modifying https://github.com/NixOS/nixpkgs/blob/c9e13806267f7fd3351d52a19cc6db5fa2985ca9/nixos/modules/services/networking/consul.nix#L181

adding more and more -ui switches and deploying with nixops in between. After the deploy is done, I run ps aux | grep consul | grep agent to see what command line is running.

After a deploy with -ui -ui -ui -ui -ui (5 -ui), ps shows only 4 -ui:

Nov 02 17:52:52 node-1 systemd[26291]: consul.service: Executing: consul agent -config-dir /etc/consul.d -ui -ui -ui -ui -config-file /etc/consul.json -config-file /etc/consul-addrs.json

That is even though systemctl cat shows 5 -ui:

# systemctl cat consul.service | grep ExecStart
ExecStart=@/nix/store/kl0vlvk2bd6f1hls53xkvkixlz6vp72r-consul-1.3.0-bin/bin/consul consul agent -config-dir /etc/consul.d -ui -ui -ui -ui -ui   -config-file /etc/consul.json   -config-file /etc/consul-addrs.json

I had systemd-analyze log-level debug on, and in journalctl -f I could see that systemd did indeed start consul with a command line showing only 4 -ui:

Nov 02 17:52:51 node-1 nixos[26199]: switching to system configuration /nix/store/z0c30mny9rmsywvisf318wafqxqcb8ck-nixos-system-node-1-18.03pre-git
Nov 02 17:52:51 node-1 systemd[1]: Stopping consul.service...
Nov 02 17:52:51 node-1 consul[14968]:     2018/11/02 17:52:51 [INFO] agent: Caught signal:  terminated
Nov 02 17:52:52 node-1 consul[14968]:     2018/11/02 17:52:52 [INFO] agent: Exit code: 1
Nov 02 17:52:52 node-1 systemd[1]: consul.service: Main process exited, code=exited, status=1/FAILURE
Nov 02 17:52:52 node-1 systemd[1]: consul.service: Failed with result 'exit-code'.
Nov 02 17:52:52 node-1 systemd[1]: Starting consul.service...
Nov 02 17:52:52 node-1 systemd[26286]: cephfs-mounted.service: Executing: /nix/store/xwb1wk1vxjvq4vwxfbphggygfqcb15si-unit-script/bin/cephfs-mounted-pre-start
Nov 02 17:52:52 node-1 systemd[26287]: consul.service: Executing: /nix/store/fqk7rpkdf9741blndfkrlxywdyl477j8-unit-script/bin/consul-pre-start
Nov 02 17:52:52 node-1 systemd[1]: Started consul.service.
Nov 02 17:52:52 node-1 systemd[26291]: consul.service: Executing: consul agent -config-dir /etc/consul.d -ui -ui -ui -ui -config-file /etc/consul.json -config-file /etc/consul-addrs.json
Nov 02 17:52:52 node-1 consul[26291]: ==> Consul agent running!
Nov 02 17:52:52 node-1 systemd[1]: Reloading.
Nov 02 17:52:52 node-1 systemd[1]: consul.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Nov 02 17:52:52 node-1 nixos[26199]: finished switching to system configuration /nix/store/z0c30mny9rmsywvisf318wafqxqcb8ck-nixos-system-node-1-18.03pre-git

And in this case, no Job for consul.service canceled was printed in nixops.


Trying this experiment more times, I also observe deploys where Job for consul.service canceled is printed, but consul starts with the correct number of -uis nevertheless.

nh2 commented 5 years ago

I wrote a script to ease debugging this automatically, but encountered another systemd issue that stopped my script from being effective, because the systemd timer would randomly stop firing altogether:

https://github.com/systemd/systemd/issues/6680#issuecomment-435597174

nh2 commented 5 years ago

After adding the Requires= from the timer to the service (as described in the systemd link), with my script I can now reproduce the issue of the old ExecStart line still running after a deploy within < 10 deploys on average (so in under ~2 minutes).

My script does nothing special: it just adds more -uis as described above and checks whether they are reflected in ps, in a loop, printing after how many nixops deploys the change failed to be reflected.
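
For reference, a sketch of that Requires= workaround in NixOS terms (unit names as in the setup described in the next comment; hedged, untested):

```nix
{ ... }:
{
  # The workaround from systemd/systemd#6680: tie the timer to its
  # service via Requires=, so the timer does not silently stop firing.
  systemd.timers.cdn-rsync.requires = [ "cdn-rsync.service" ];
}
```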

nh2 commented 5 years ago

OK, so some more info:

I have an rsync job with a systemd timer; the rsync job service Requires = consul.service.

Involved are: consul.service, cdn-rsync.service, cdn-rsync.timer.

See https://gist.github.com/nh2/be476d7ce3466ea5b92dfbc39b235770 for their contents.

It seems like at least for one of the "2 out of 7 machines" (the other one doesn't run a timer, so there it's totally unclear to me so far), whether the deploy gets the new consul started correctly depends on what those timers do.

@arianvp and I know from observing systemd (https://github.com/systemd/systemd/issues/10626) that a timer with OnActiveSec= or OnUnitActiveSec= will, contrary to the expectations of people on the #systemd IRC channel, immediately restart the service when it goes down.

I suspect that what happens is this: the deploy stops consul.service, which also pulls down cdn-rsync.service (it Requires= consul); the timer then immediately re-triggers cdn-rsync.service, which via Requires= starts consul.service again from the old unit file, before the deploy's daemon-reload and start steps have run.

I haven't been able to reproduce the problem if my timer uses OnUnitInactiveSec= instead, e.g. set to 3s. This is probably because, in contrast to the other two options above, this one actually waits the given amount of time after cdn-rsync.service gets pulled down by the stop of consul, and waiting 3 seconds is enough for nixops to finish the deploy in that case.

Still racy, but an apparent workaround in my situation.
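
For illustration, a sketch of the timer schedule used in these experiments (unit names from the gist above; hedged):

```nix
{ ... }:
{
  systemd.timers.cdn-rsync = {
    wantedBy = [ "timers.target" ];
    requires = [ "cdn-rsync.service" ];
    # OnUnitInactiveSec= waits the given time after cdn-rsync.service
    # becomes inactive before triggering it again; "3s" avoids the
    # race here, "500ms" reproduces it quickly (see the next comment).
    timerConfig.OnUnitInactiveSec = "3s";
  };
}
```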

It is still unclear whether this also explains the same issue on the other machine (likely not), and if #33172 has some similar situation.

nh2 commented 5 years ago

Yea, it really seems to be a race condition in there.

If I set OnUnitInactiveSec = "500ms", I can reproduce the issue super quickly, with OnUnitInactiveSec = "3s" not at all.

This doesn't seem like a great solution though: magic sleeps will start breaking when other participants in the race are slower some day.

nh2 commented 5 years ago

Here's my quick script btw:

#!/usr/bin/env python3

import argparse
import sys
import os
import subprocess
import time

parser = argparse.ArgumentParser(description='Repeatedly redeploy and check that consul picks up ExecStart changes.')
parser.add_argument('-d', '--deployment', type=str, help='deployment name')
parser.add_argument('--machine-name', type=str, help='nixops machine name to --include')
parser.add_argument('--ssh-host', type=str, help='SSH connection string to connect to for running ps')
args = parser.parse_args()

deployment = args.deployment
machine_name = args.machine_name
ssh_host = args.ssh_host

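# Path to the consul module in a local nixpkgs checkout; line 182 is the
# ExecStart line that gets another "-ui" appended on each iteration.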
consul_nix = "../../nix-channel/nixpkgs/nixos/modules/services/networking/consul.nix"
line_no = 182 - 1

def update_consul_file():
  with open(consul_nix) as f:
    consul_nix_lines = f.read().split('\n')

  line = consul_nix_lines[line_no]
  consul_nix_lines[line_no] = line[:len(line)-1] + ' -ui"'

  # print(consul_nix_lines[line_no])

  with open(consul_nix, 'w') as f:
    f.writelines('\n'.join(consul_nix_lines))

def deploy():
  subprocess.check_call(['./ops', 'deploy', '-d', deployment, '--include', machine_name])

def count_ui():
  ps_out = subprocess.check_output(['ssh', ssh_host, 'ps aux | grep "[c]onsul agent"'], universal_newlines=True)
  count = len([word for word in ps_out.split(' ') if word == '-ui'])
  return count

update_consul_file()
deploy()
initial_count = count = count_ui()
while True:
  update_consul_file()
  deploy()
  new_count = count_ui()
  print(new_count)
  if new_count == count:
    print("-ui count failed to change!")
    print("initial_count:", initial_count)
    print("new_count:", new_count)
    break
  count = new_count

I use it like ./consul-debug.py --deployment mydeployment --machine-name node-1 --ssh-host root@node-1.example.com

danbst commented 5 years ago

@nh2 so was this an issue with timers, or a problem in the consul module?

arianvp commented 5 years ago

This is a problem with timers creating conflicting jobs in systemd's job queue every once in a while. We opened an upstream issue for this: https://github.com/systemd/systemd/issues/10626

I'm not sure if there's anything for us to fix here on the NixOS side of things. I'd be OK with closing.

danbst commented 5 years ago

Given that NixOS forces Consul to be restarted (even if it is a server), we can adapt the consul module to prevent this. Set skip_leave_on_interrupt = false, remove ExecStop (so stopping/leaving will be done using SIGINT), and set systemd.services.consul.stopIfChanged = false; (so stop & start are combined into a single restart). I assume that a single restart won't have such side effects.

But yeah, this is very service-specific and looks quite like a hack.
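
A hedged sketch of that suggestion (removing ExecStop would additionally require a change to the consul module itself, so only the other two parts are shown):

```nix
{ ... }:
{
  # skip_leave_on_interrupt = false makes the agent gracefully leave
  # the cluster on SIGINT.
  services.consul.extraConfig.skip_leave_on_interrupt = false;
  # Combine stop & start into a single atomic restart during
  # switch-to-configuration.
  systemd.services.consul.stopIfChanged = false;
}
```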

flokli commented 4 years ago

@arianvp, @danbst, I agree with you that there's nothing really generically wrong on the NixOS side of things. Feel free to open a PR to change some behaviour of the consul module if appropriate.

nh2 commented 4 years ago

Given that NixOS forces Consul to be restarted (even if it is server), we can adapt consul module to prevent such.

@danbst I don't quite understand. My original issue was that NixOS does NOT restart Consul when I expect it, making rolling upgrades unreliable.

The systemd issue https://github.com/systemd/systemd/issues/10626 where I commented by now indicates that even Facebook is hitting this problem in systemd btw.

danbst commented 4 years ago

@nh2 you are right :man_facepalming:. What would be the best solution, do you think?

nh2 commented 4 years ago

@danbst I am not sure, but until anybody from systemd has commented there, I don't think it's clear that this is not a NixOS issue, and we should probably reopen.

Gabriella439 commented 4 years ago

We ran into the same problem at Awake Security. I found the root cause here and I have a minimal reproduction.

First off, here is the reproducing NixOS configuration (for EC2, which you can tweak to your desired physical configuration):

let
  # To reproduce the bug:
  #
  # * Deploy this configuration "as is"
  # * Then increment "FOO" to "1"
  # * Then redeploy the configuration
  #
  # If the reproduction succeeds the `restart-child` service should still be
  # erroneously displaying `0` instead of `1`
  FOO = "0";

in
  { imports = [ <nixpkgs/nixos/modules/virtualisation/amazon-image.nix> ];

    ec2.hvm = true;

    # This is the service that the `switch-to-configuration` script will fail to
    # update correctly.  This service prints the value of the `FOO` environment
    # variable and will always lag behind by one deploy due to being prematurely
    # restarted in the middle of a `switch-to-configuration` step.
    systemd.services.restart-child = {
      enable = true;

      wantedBy = [ "restart-parent.service" ];

      environment = { inherit FOO; };

      script = ''
        while true; do echo "$FOO"; sleep 1; done
      '';
    };

    # This service is responsible for restarting `restart-child` every second,
    # since the bug is triggered by prematurely restarting a service that is
    # being upgraded by NixOS.
    systemd.services.restart-parent = {
      enable = true;

      wantedBy = [ "multi-user.target" ];

      script = "\n";
    };

    systemd.timers.restart-parent = {
      wantedBy = [ "timers.target" ];

      timerConfig = {
        Type = "oneshot";

        OnUnitInactiveSec = "1s";

        AccuracySec = "1s";
      };
    };

    # This service adds a delay to the `systemctl stop …` step of the
    # `switch-to-configuration` script so that the `restart-parent` service has
    # enough time to restart the `restart-child` service.
    systemd.services.delay-stop = {
      enable = true;

      wantedBy = [ "multi-user.target" ];

      environment = { inherit FOO; };

      script = ''
        while true; do sleep 1; done
      '';

      # The purpose of this `preStop` delay is to give the `restart-parent`
      # service a chance to restart the `restart-child` service before
      # the `switch-to-configuration` script can run `systemctl daemon-reload`
      preStop = ''
        sleep 10
      '';
    };
  }

Here's the explanation of what triggers the bug: the switch-to-configuration script first runs systemctl stop on the changed units; the delay-stop service's preStop delay holds this step open, during which the restart-parent timer fires and restarts restart-child from the old unit file. After the subsequent daemon-reload, the final systemctl start is a no-op because restart-child is already active, so it keeps running with the old FOO value.

This leads me to conclude that this is a NixOS bug and not a systemd bug. Specifically, I believe the most appropriate fix is to change the switch-to-configuration script to replace systemctl start with systemctl restart.

nh2 commented 4 years ago

@Gabriel439 Great investigation!

Restarting makes sense to me.

But would we not also have to do stop after daemon-reload?

My understanding is that stop is done to get rid of services that were deleted from the configuration. In the current racy setting, it can happen that you remove a.service, stop happens, something restarts a.service, and then daemon-reload and start happen, so the old service is still running. Even when changing the last start to restart, it would still be running.

So it seems that the "real" stop can only be after daemon-reload.

Gabriella439 commented 4 years ago

@nh2: Yeah, I like the idea of just moving the stop after the daemon-reload instead

arianvp commented 4 years ago

systemctl restart has the problem that it is less correct and can lead to subtle bugs. Hence we don't default to it. See

       systemd.services.<name>.stopIfChanged
           If set, a changed unit is restarted by calling systemctl stop in the old
           configuration, then systemctl start in the new one. Otherwise, it is restarted in a
           single step using systemctl restart in the new configuration. The latter is less
           correct because it runs the ExecStop commands from the new configuration.

Moving the stop to after the daemon-reload has similar problems, and would be a major breaking change IMO.

Gabriella439 commented 4 years ago

@arianvp: Why would moving stop after daemon-reload have the same problem?

arianvp commented 4 years ago

After the daemon-reload, the unit of the new generation is loaded, hence you execute the ExecStop of the new version of the unit instead of the old one, which is wrong.

The current order is:

  1. stop
  2. activate configuration
  3. (restart if opted in with stopIfChanged = false; instead of stop)
  4. daemon-reload
  5. start

If you now move the stop after the daemon-reload, the wrong ExecStop is executed. To implement the behaviour of ExecStop from the old generation and ExecStart from the new, we must do stop before daemon-reload.

If you move it after the daemon-reload, there is no difference between stop & start and restart, except for transactionality. If you want to "stop and start" the service after daemon-reload, you can do that today by setting stopIfChanged = false on your systemd unit, and that's probably the fix you're looking for instead.

Personally I wouldn't mind deprecating (or better documenting) this functionality, as its semantics are a bit confusing (https://github.com/NixOS/nixpkgs/issues/49528) and I never found a service where restart doesn't just simply work.

Edit:

I think a good example where things break is scripted networking, where it's important that the ExecStop of the previous generation is executed before the ExecStart of the new generation starts.

arianvp commented 4 years ago

@nh2 @Gabriel439 would setting stopIfChanged = false; fix this issue for you?

Maybe we close this by documenting this better? "If you want transactional stops & starts, please set stopIfChanged = false" .

"If you want to sacrifice transactionality for usecase where ExecStop is in the old generation and ExecStart is in the new, please leave it to the default true"

Maybe also defaulting to false is just the better option here... But I'm not sure.
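
For completeness, opting a unit into the atomic behaviour looks like this (a minimal sketch with a hypothetical my-service unit):

```nix
{ ... }:
{
  # The unit is neither stopped before activation nor started after
  # it, but restarted in a single step after daemon-reload, at the
  # cost of running the new generation's ExecStop.
  systemd.services.my-service.stopIfChanged = false;
}
```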

Gabriella439 commented 4 years ago

@arianvp: Individually setting stopIfChanged = false would not fix it, because there are too many services we'd have to mark in this way.

Something that would fix things for us would be to add a custom X-RestartOnReconfiguration option or something similar. That way we could ensure that any reverse dependency which triggers restarts of other services could always be stopped and started alongside the upgraded services to avoid triggering a premature restart.

nh2 commented 4 years ago

systemctl start with systemctl restart.

~And another problem: I found that restart does not start a service if it is already stopped (found as part of https://github.com/NixOS/nixpkgs/issues/72394#issuecomment-611499518, where mdadm was in state failed (Result: exit-code), and redeploying with nixops did not restart it). Only start seems to do that.~

~Does systemd have some restart-or-start-if-not-already-started command?~

Edit: This was probably wrong, I just didn't notice that my service restarted and failed. Would still be nice for somebody in the know to confirm how it will behave.

flokli commented 4 years ago

@nh2 the restart behaviour comes from nixos/modules/system/activation/switch-to-configuration.pl. Unfortunately, it's not really well documented, contains some bugs (https://github.com/NixOS/nixpkgs/issues/84633#issuecomment-610927139, https://github.com/NixOS/nixpkgs/issues/73853, https://github.com/NixOS/nixpkgs/pull/73871) and more.

arianvp commented 4 years ago

Something that would fix things for us would be to add a custom X-RestartOnReconfiguration option or something similar. That way we could ensure that any reverse dependency which triggers restarts of other services could always be stopped and started alongside the upgraded services to avoid triggering a premature restart.

I don't fully understand what you are saying here. systemd services restart in isolation unless PartOf= is set. That is, the restart of one service should never trigger the restart of another.

What you're suggesting is also exactly how restartIfChanged = true && stopIfChanged = false is implemented: they're properties on the unit that get read during activation of the new configuration, which then decides whether to do systemctl stop x.service && systemctl start x.service (which is prone to the race conditions reported here) or an atomic systemctl restart x.service:

                    if (!boolIsTrue($unitInfo->{'X-StopIfChanged'} // "yes")) {
                        # This unit should be restarted instead of
                        # stopped and started.
                        $unitsToRestart{$unit} = 1;
                        recordUnit($restartListFile, $unit);
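
(As an aside, a minimal sketch of the PartOf= coupling mentioned above, with unit names chosen for illustration:)

```nix
{ ... }:
{
  # PartOf= propagates stop and restart jobs from consul.service to
  # consul-alerts.service, but not the other way around.
  systemd.services.consul-alerts.partOf = [ "consul.service" ];
}
```
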
flokli commented 4 years ago

I commented more in detail on how to possibly adress this in https://github.com/NixOS/nixpkgs/issues/49528#issuecomment-622196884.

The issue mentioned here is an instance of the "bug" mentioned there (only reloading, even though a restart would be necessary).

Let's move the discussion over there, closing here.

nh2 commented 6 months ago

I keep encountering this; posting a concrete situation again, this time with the nebula VPN service (and a timer that starts a service that requires nebula@servers.service):

updating GRUB 2 menu...
stopping the following units: my-server.service, nebula@servers.service
Job for nebula@servers.service canceled.
activating the configuration...
warning: user ‘postgres’ has unknown group ‘postgres’
setting up /etc...
reloading user units for root...
setting up tmpfiles
reloading the following units: dbus.service
starting the following units: my-server.service, nebula@servers.service
activation finished successfully

Here we see again: Job for nebula@servers.service canceled.

nh2 commented 6 months ago

@arianvp

systemctl restart has the problem that it is less correct and can lead to subtle bugs. Hence we don't default to it.

The current order is:

1. stop
2. activate configuration
3. (restart if opted in with `stopIfChanged = false`; instead of stop)
4. daemon-reload
5. start

Can't we just replace 5. start with restart?

The service would still be stopped in the earlier step.

But restart would make sure that if a timer triggered a restart during 1. stop, the service still gets restarted into the new configuration at step 5.

I believe that's what @Gabriella439 meant in https://github.com/NixOS/nixpkgs/issues/49415#issuecomment-601936292: To replace 1. stop + 5. start by 1. stop + 5. restart, not to replace the whole thing by a single restart.

So, concretely, I propose:

https://github.com/NixOS/nixpkgs/blob/f7ae5ead090e8b217e425d0326d83cb857a5c518/nixos/modules/system/activation/switch-to-configuration.pl#L944-L946

-     print STDERR "starting the following units: ", join(", ", @units_to_start_filtered), "\n"
+     print STDERR "starting (via restart, see #49415) the following units: ", join(", ", @units_to_start_filtered), "\n"
 }
-system("$new_systemd/bin/systemctl", "start", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
+system("$new_systemd/bin/systemctl", "restart", "--", sort(keys(%units_to_start))) == 0 or $res = 4;

Edit:

This doesn't work trivially: it makes the system hang, because it restarts the network. Why? Because in this commit, logic was introduced such that what's started isn't actually what's printed:

print STDERR "starting the following units: ", join(", ", @units_to_start_filtered), "\n"
                                                                          ^
                                                                          _filtered here

system("$new_systemd/bin/systemctl", "start", "--", sort(keys(%units_to_start))) == 0 or $res = 4;
                                                                            ^
                                                                            NO _filtered here

nh2 commented 6 months ago

I am reopening this issue to respond to this:

The issue mentioned here is an instance of the "bug" mentioned [in #49528] (only reloading, even though a restart would be necessary).

@flokli That isn't quite correct, at least not as written.

In here, reloading isn't involved at all. This issue tracks the fact that even stop + start does not work, because the start can be replaced by a too-early restart in the presence of timers.

I am re-opening this issue because I believe we can fix this issue -- stop not happening at all -- more easily than #49528 ("Interactions of restartIfChanged, reloadIfChanged, stopIfChanged").

nh2 commented 6 months ago

I have created a PR that implements stop + restart (instead of stop + start) of the stopped units:

#307562

Please help test!