nh2 commented 6 years ago

Similar to NixOS/nixops#640, but this time I didn't add any extra package myself like dnsmasq in that ticket; this time I only updated packages to a newer version on top of release-17.09 (from a2845aa0 to df6f8d4c).

In nixops I see

...
node-3............> updating GRUB 2 menu...
node-3............> stopping the following units: alsa-store.service, audit.service, consul.service, consulAfterTinc.service, cpufreq.service, dnsmasq.service, glusterVolumeRunningCheckWatchdog.service, glusterd.service, glustereventsd.service, kmod-static-nodes.service, network-link-eth0.service, network-local-commands.service, network-setup.service, nix-daemon.service, nix-daemon.socket, nscd.service, rngd.service, stolon-keeper.service, stolon-proxy.service, stolon-sentinel.service, systemd-journal-catalog-update.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service, tinc.myvpn.service, my-gluster-files-mount.service
node-3............> NOT restarting the following changed units: getty@tty1.service, systemd-backlight@backlight:acpi_video0.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-tmpfiles-setup.service, systemd-update-utmp.service, systemd-user-sessions.service, user@0.service
node-3............> activating the configuration...
node-3............> warning: user ‘postgres’ has unknown group ‘postgres’
node-3............> setting up /etc...
node-3............> restarting systemd...
node-3............> setting up tmpfiles
node-3............> reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-fs-fuse-connections.mount, sys-kernel-debug.mount
node-3............> restarting the following units: network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
Timeout, server 1.2.3.4 not responding.

CC @basvandijk, @FPtje, @aszlig, @domenkozar, @fpletz, @groxxda

basvandijk commented 6 years ago

Bummer!

This Monday I performed a 17.09 upgrade on some Hetzner machines from 0bb77b83 to 39fe8d6d and haven't seen this behavior. Note that I'm not using dnsmasq though.

Maybe it has something to do with the recent dnscrypt-proxy patches...

nh2 commented 6 years ago

Hmm, it seems that this still happens reliably for nixpkgs upgrades.

Just got it again, this time updating to commit https://github.com/NixOS/nixpkgs/commit/c2b668ee7264000b10d554754150ea9305b6ef3d:

node-2............> restarting the following units: network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
Timeout, server 1.2.3.4 not responding.

fpletz commented 6 years ago

@nh2 Can you look into the log if there is something suspicious?

I only see one possibility how this could fail: The network-addresses service removes and adds the addresses on restart. This might also trigger the kernel to remove the default route and as network-setup.service is probably not restarted it won't be added again. Thus the machine is not reachable anymore.

I'm currently considering moving to systemd-networkd by default for 18.03 to remove our hacky networking scripts once and for all. These have proven to be very fragile in practice and grokking all the weird systemd unit dependency relations involved is not trivial, at least for me.

basvandijk commented 6 years ago

I actually just run into this again a few minutes ago when updating one of our machines in te field, which is now offline :-(. Fortunately I had a journalctl -f open in my SSH window so we can see that indeed the main IP address got removed:

Jan 30 22:50:27 my-machine network-addresses-eno1-pre-stop[8950]: deleting x.x.x.x/xx...

Previously this only happened to me on Hetzner. This is the first time this happens on our own hardware in the field. Interestingly a few days ago I performed the same upgrade on all our other machines and didn't get this erroneous behaviour.

basvandijk commented 6 years ago

@fpletz +1 for making systemd-networkd the default. We're already using that on our workstations. Now I'm considering switching all our production machines to it ASAP as well.

fpletz commented 6 years ago

Okay, so one of the problems here is that network-setup.service is not reloaded when a network-addresses service is restartet. The root problem we have in NixOS is that the default gateway is not tied to a interface and we don't have a route configuration option in general.

Be aware that the current state of our networkd implementation is a bit odd. For instance the 99-main.network unit will probably make the network-online.target fail because it matches all interfaces. If you override this and tweak the generated network units to your needs, you should be fine though. I know people who use it on NixOS for quite a while.

basvandijk commented 6 years ago

Just for completeness here's my (censored) nixops log:

my-machine.................> copying closure...
my-machine.................> copying 4 missing paths (0.07 MiB) to ‘root@x.x.x.x’...
my-net> closures copied successfully
my-machine.................> stopping the following units: alsa-store.service, audit.service, cpufreq.service, kmod-static-nodes.service, network-link-eno1.service, network-local-commands.service, network-setup.service, nix-daemon.service, nix-daemon.socket, nscd.service, rngd.service, smartd.service, systemd-journal-catalog-update.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service
Connection to x.x.x.x closed by remote host.
my-machine.................> setting up /etc...
my-machine.................> NOT restarting the following changed units: getty@tty1.service, systemd-backlight@backlight:acpi_video0.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-tmpfiles-setup.service, systemd-update-utmp.service, systemd-user-sessions.service, user@0.service
my-machine.................> activating the configuration...
my-machine.................> restarting systemd...
my-machine.................> setting up tmpfiles
my-machine.................> reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount
my-machine.................> restarting the following units: network-addresses-eno1.service, nginx.service, sshd.service, systemd-journald.service
my-machine.................> starting the following units: alsa-store.service, audit.service, cpufreq.service, kmod-static-nodes.service, network-link-eno1.service, network-local-commands.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, smartd.service, systemd-journal-catalog-update.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket
my-machine.................> Job for multi-user.target canceled.
my-machine.................> error: Traceback (most recent call last):
  File "/nix/store/bx6sbpccbs11zwxhm10dshwm15xlp737-nixops-1.5.2/lib/python2.7/site-packages/nixops/deployment.py", line 705, in worker
    raise Exception("unable to activate new configuration")
Exception: unable to activate new configuration
error: activation of 1 of xxxx machines failed (namely on ‘my-machine’)

basvandijk commented 6 years ago

I have to take back my previous comments. After retrieving and investigating my-machine it turned out that the unreachability was caused by a service which rebooted the machine. Sorry for the noise.

nh2 commented 6 years ago

And it happened again to me just now: After some release-18.03 rebase and deploy, I got

restarting the following units: dhcpcd.service, network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service

and the network is gone.

This is very reproducible, I can do it with one server after the other in turn.

domenkozar commented 6 years ago

nix-diff those two nixos drvs?

On Thu, Aug 30, 2018 at 4:11 PM Niklas Hambüchen notifications@github.com wrote:

And it happened again to me: After some release-18.03 rebase and deploy, I got

restarting the following units: dhcpcd.service, network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service

and the network is gone.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/NixOS/nixops/issues/833#issuecomment-417355154, or mute the thread https://github.com/notifications/unsubscribe-auth/AAHtg_yKsGXdCm8sUlt0fhdz8U5kFVwYks5uWACXgaJpZM4Ra4Lq .

nh2 commented 6 years ago

@domenkozar Hmm, it's not too revealing on the nixops-machines:

nix-diff /nix/store/sl78hnxw1b631xn8dmh9jl46pk9m7g51-nixops-machines.drv /nix/store/cq1b7l8xd43j21q3nwdh40cdp1cp68fd-nixops-machines.drv
- /nix/store/sl78hnxw1b631xn8dmh9jl46pk9m7g51-nixops-machines.drv:{out}
+ /nix/store/cq1b7l8xd43j21q3nwdh40cdp1cp68fd-nixops-machines.drv:{out}
• The builders do not match
    - /nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash
    + /nix/store/lw7xaqhakk0i1c631m3cvac3x4lc5gr5-bash-4.4-p12/bin/bash
• The set of input names do not match:
    + nixos-system-mymachine-1-18.03pre-git

aszlig commented 6 years ago

@nh2: Hm, could you look into the journal to see what went wrong? I have a suspicion that it's due to the additional routes we set in localCommands.

aszlig commented 6 years ago

On a side note: I'm going to refactor this now, as we should now be able to do this without localCommands.

nh2 commented 5 years ago

@basvandijk and me are investigating this at NixCon, we managed to reproduce it with this nixops specification:

# To reproduce the Hetzner disconnect problem
#   https://github.com/NixOS/nixops/issues/833#issuecomment-417491015
# deploy first with nixpkgs commit
#   a2845aa0
# and then switch to 
#   df6f8d4c
# and deploy again.

# Network arguments can be set with nixops's `set-args` feature.
{}:
let
  pkgs = import <nixpkgs> {};

  ip = "1.2.3.4";
  robotSubAccount = "#111111+AAAAA";

in
{
  network.enableRollback = true;

  builder = {
    deployment.targetEnv = "hetzner";
    deployment.hetzner.createSubAccount = false; # because we use 2-factor auth
    deployment.hetzner.robotUser = robotSubAccount;
    deployment.hetzner.mainIPv4 = ip;
    deployment.hetzner.partitions = ''
      clearpart --all --initlabel --drives=sda,sdb

      part raid.1a --recommended --label=swap1 --fstype=swap --ondisk=sda
      part raid.1b --recommended --label=swap2 --fstype=swap --ondisk=sdb

      part raid.2a --grow --ondisk=sda
      part raid.2b --grow --ondisk=sdb

      raid swap --level=1 --device=swap --fstype=swap --label=swap raid.1a raid.1b
      raid /    --level=1 --device=root --fstype=ext4 --label=root raid.2a raid.2b
    '';

    deployment.hasFastConnection = true;

    boot.kernelModules = [ "tcp_bbr" ];

    # Enable BBR congestion control
    boot.kernel.sysctl."net.ipv4.tcp_congestion_control" = "bbr";
    boot.kernel.sysctl."net.core.default_qdisc" = "fq"; # see https://news.ycombinator.com/item?id=14814530

    nix.buildCores = 0; # 0 is auto detection

    users.extraUsers.root.openssh.authorizedKeys.keys = (import ./ssh-keys.nix);

    environment.systemPackages = [
      pkgs.git
      pkgs.htop
      pkgs.nload
      pkgs.parted
      pkgs.python3
      pkgs.screen
      pkgs.vim
      pkgs.ncdu
    ];

    systemd.services.myservice = {
      requiredBy = [ "multi-user.target" ];
      after = [
        "network-online.target"
      ];
      script = ''
        set -euo pipefail
        while true; do
          echo hello
          sleep 5
        done
      '';
      serviceConfig = {
        Type = "simple";
        Restart = "always";
      };
      unitConfig = {
        StartLimitIntervalSec = 0; # ensure Restart=always is always honoured
      };
    };

    networking.nameservers = [
      "8.8.4.4"
    ];

    services.nginx = {
      enable = true;
    };

  };

}

This can probably be further minimised but I thought I'd post it for the record.

nh2 commented 5 years ago

Actually, the above doesn't reproduce that the server becomes offline and cannot be reached again:

It prints

builder> reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount
builder> restarting the following units: dhcpcd.service, network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
builder> starting the following units: alsa-store.service, audit.service, cpufreq.service, kmod-static-nodes.service, myservice.service, network-link-eth0.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket

then it hangs 30 seconds (also a separate SSH connection to the server hangs during that time) and then becomes available again.

So this reproduces the eth0.service stuff being restarted but not the server disappearing until hard reboot.

aszlig commented 5 years ago

@nh2: Can you confirm whether this is also the case with https://gist.github.com/aszlig/84e9bf4c8c2a06e77e43d64f89a5c2fc applied?

nh2 commented 5 years ago

Below comes an investigation dump of what I did with @basvandijk and then later continued.

There is a summary at the bottom.

Start of investigation

Checking the network-setup.service ExecStart script:

[root@builder:~]# cat /nix/store/yyryz5nfzjy63991ha4nplkcc2qsljg8-unit-script/bin/network-setup-start
#! /nix/store/lw7xaqhakk0i1c631m3cvac3x4lc5gr5-bash-4.4-p12/bin/bash -e
# Set the static DNS configuration, if given.
/nix/store/v9dgkyj939d4zl0bczn5rhgg8s7205id-openresolv-3.9.0/sbin/resolvconf -m 1 -a static <<EOF

nameserver 213.133.98.98
nameserver 213.133.99.99
nameserver 213.133.100.100
nameserver 2a01:4f8:0:a0a1::add:1010
nameserver 2a01:4f8:0:a102::add:9999
nameserver 2a01:4f8:0:a111::add:9898
nameserver 8.8.4.4

EOF

echo "nh2: resolvconf done"
sleep 2

# Set the default gateway.

ip route replace default  via "46.4.89.1"   proto static

echo "nh2: ip route replace done"

always shows in journalctl:

Oct 27 14:47:39 builder systemd[1]: Starting Networking Setup...
Oct 27 14:47:39 builder network-setup-start[10445]: nh2: resolvconf done
Oct 27 14:47:41 builder network-setup-start[10445]: RTNETLINK answers: Network is unreachable
Oct 27 14:47:41 builder systemd[1]: network-setup.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 27 14:47:41 builder systemd[1]: network-setup.service: Failed with result 'exit-code'.
Oct 27 14:47:41 builder systemd[1]: Failed to start Networking Setup.

running the ip route replace manually also results in

[root@builder:~]# ip route replace default  via "46.4.89.1"   proto static
RTNETLINK answers: Network is unreachable

Investigation with stackoverflow

before:

[root@builder:~]# ip route list
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 via 46.4.89.1 dev eth0

then following https://serverfault.com/questions/581159/unable-to-add-a-static-route-sioaddrt-network-is-unreachable/581162#581162

Instead its declared 192.168.36.0/23 should be passed to the gateway 192.168.36.254 which technically according to the rest of the routing table does not appear to be a neighbour (the kernel has no idea how to get to 192.168.36.254).

[root@builder:~]# ip route del 46.4.89.0/27 via 46.4.89.1 dev eth0

[root@builder:~]# ip route list
default via 46.4.89.1 dev eth0 proto static 

[root@builder:~]# ip route add 46.4.89.0/27 dev eth0

we get

[root@builder:~]# ip route list
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 dev eth0 scope link

the difference:

-46.4.89.0/27 via 46.4.89.1 dev eth0 
+46.4.89.0/27               dev eth0 scope link

Starting from plain again, Franz's suggestion

In the situation

[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 via 46.4.89.1 dev eth0

This doesn't work:

[root@builder:~]# ip route replace default  via "46.4.89.1"   proto static
RTNETLINK answers: Network is unreachable

But it works if we add another route that says "you can reach 46.4.89.1 directly":

[root@builder:~]# ip route add 46.4.89.1 dev eth0

[root@builder:~]# ip route replace default  via "46.4.89.1"   proto static

Final state:

[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 via 46.4.89.1 dev eth0 
46.4.89.1 dev eth0 scope link

TODO write down why aszlig does change instead of replace in local_commands (because he knows that the routes in question already exist)

The above problem not happening on older nixpkgs

On nixpkgs a2845aa the above problem doesn't exist. The route looks the same, but there's more being restarted: https://github.com/NixOS/nixops/issues/833#issuecomment-433639687 Somehow that more-restarting means that the network-setup.service runs through cleanly without the Network is unreachable. Running the ip route replace manually still results in that however.

And at some points the routes change to:

[root@builder:~]#  ip route
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8

So more things to figure out:

What changed between a2845aa and newer (around 18.03) nixpkgs that results in this change?
Why does the old one not trigger the problem?
How do we fix the problem on the new one?

Also when switching from a2845aa to df6f8d4 we get a change from

[root@builder:~]#  ip route
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 via 46.4.89.1 dev eth0

to

[root@builder:~]#  ip route
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8

To reset it, reboot. When switching the other way around, it doesn't happen.

After retrying again, with dnsmasq disabled, I couldn't reproduce this. But it did happen with dnsmasq enabled, but then when both switching from df6f8d4 to a2845aa, and the other way around. It happened every time (with me doing ip route del 46.4.89.0/27 in between to avoid the reboot). I could also reproduce it reliably when doing

ip route del 46.4.89.0/27 && ip route add 46.4.89.1 dev eth0 && ip route add 46.4.89.0/27 via 46.4.89.1 dev eth0 && ip route del 46.4.89.1 dev eth0 scope link

in between switching. Then I commented out dnsmasq, and could continue to reliably reproduce it this way (as opposed to me not being able to reproduce it that way just before, so weird). But after a reboot (with dnsmasq commented out), I could not reproduce it any more. So it seems like turning on dnsmasq does something to the system that persists until reboot. Then, trying the same stuff again after another reboot, it didn't appear again for a while. I'm suspecting there's something racy going on, or some systemd timeout.

I added logging into

https://github.com/NixOS/nixpkgs/blob/a2845aa/nixos/modules/tasks/network-interfaces-scripted.nix#L201

like this:

echo "nh2 ip route before"
ip route
echo "${address}" >> $state
if out=$(ip addr add "${address}" dev "${i.name}" 2>&1); then
  echo "added ip ${address}"
  echo "output: $out"
  echo
  echo "nh2 ip route"
  ip route
elif ! echo "$out" | grep "File exists" >/dev/null 2>&1; then
  echo "failed to add ${address}"
  echo "nh2 ip route"
  ip route
  exit 1
fi

and also put a set +x around the ip command, and found:

Oct 27 23:11:23 builder systemd[1]: Stopping Address configuration of eth0...
Oct 27 23:11:23 builder network-addresses-eth0-pre-stop[23006]: deleting 46.4.89.8/27...
Oct 27 23:11:23 builder systemd[1]: Stopped Address configuration of eth0.
Oct 27 23:11:23 builder systemd[1]: Starting Address configuration of eth0...
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: bringing up interface...
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip link set eth0 up
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: nh2 ip route before
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: added ip 46.4.89.8/27
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: output: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip addr add 46.4.89.8/27 dev eth0
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: nh2 ip route
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: 46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
Oct 27 23:11:23 builder systemd[1]: Started Address configuration of eth0.

So here, there are first no routes at all, and as soon as ip addr add is done, we have the route

46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8

So ip addr add seems to add routes automatically. https://serverfault.com/a/886207/128321 also suggests that this is the case.

Because that entry exists,

ip route replace default via 46.4.89.1 proto static

runs through without Network is unreachable.

After a reboot with dnsmasq disabled, I found this via auditd:

ip route replace default via 46.4.89.1 proto static
ip -6 addr add 2a01:4f8:140:71c5::/64 dev eth0
ip -4 route change 46.4.89.0/27 via 46.4.89.1 dev eth0
ip -6 route add default via fe80::1 dev eth0

I suspect that the

ip -4 route change 46.4.89.0/27 via 46.4.89.1 dev eth0

(which is a nixops-introduced local_commands) is what turns

46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8

into

46.4.89.0/27 via 46.4.89.1 dev eth0

But why does that not happen with dnsmasq enabled?

A normal deploy that changes stuff e.g. by adding a space in echo "nh2 ip route before" looks like this:

Oct 27 23:28:36 builder systemd[1]: Stopped Extra networking commands..
Oct 27 23:28:36 builder systemd[1]: Stopped Networking Setup.
Oct 27 23:28:36 builder systemd[1]: Stopping Address configuration of eth0...
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: bringing up interface...
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip link set eth0 up
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: nh2 ip route before
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: added ip 46.4.89.8/27
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: output: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip addr add 46.4.89.8/27 dev eth0
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: nh2 ip route
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: 46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
Oct 27 23:28:36 builder systemd[1]: Started Address configuration of eth0.
Oct 27 23:28:36 builder systemd[1]: Starting Networking Setup...
Oct 27 23:28:36 builder network-setup-start[3013]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route replace default via 46.4.89.1 proto static
Oct 27 23:28:36 builder systemd[1]: Started Networking Setup.
Oct 27 23:28:36 builder systemd[1]: Starting Extra networking commands....
Oct 27 23:28:36 builder network-local-commands-start[3087]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip -6 addr add 2a01:4f8:140:71c5::/64 dev eth0
Oct 27 23:28:36 builder network-local-commands-start[3087]: RTNETLINK answers: File exists
Oct 27 23:28:36 builder network-local-commands-start[3087]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip -4 route change 46.4.89.0/27 via 46.4.89.1 dev eth0
Oct 27 23:28:36 builder network-local-commands-start[3087]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip -6 route add default via fe80::1 dev eth0
Oct 27 23:28:36 builder network-local-commands-start[3087]: RTNETLINK answers: File exists
Oct 27 23:28:36 builder systemd[1]: Started Extra networking commands..

But after adding dnsmasq and deploying we get:

Oct 27 23:29:34 builder systemd[1]: Stopped Extra networking commands..
Oct 27 23:29:34 builder systemd[1]: Stopped Networking Setup.
Oct 27 23:29:35 builder systemd[1]: Starting Networking Setup...
Oct 27 23:29:35 builder network-setup-start[3486]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route replace default via 46.4.89.1 proto static
Oct 27 23:29:35 builder network-setup-start[3486]: RTNETLINK answers: Network is unreachable
Oct 27 23:29:35 builder systemd[1]: network-setup.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 27 23:29:35 builder systemd[1]: Failed to start Networking Setup.
Oct 27 23:29:35 builder systemd[1]: network-setup.service: Unit entered failed state.
Oct 27 23:29:35 builder systemd[1]: network-setup.service: Failed with result 'exit-code'.

Looks like here the difference is a missing

Stopping Address configuration of eth0...

Probably that's why we get RTNETLINK answers: Network is unreachable, because the IP isn't removed and re-added, so the

46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8

isn't added that way so then the replace fails.

Even after changing nixpkgs and NixOps to insert the required

46.4.89.1 dev eth0 proto static scope link

which makes all errors disappear, there remains some race between [root@builder:~]# ip route default via 46.4.89.1 dev eth0 proto static 46.4.89.0/27 via 46.4.89.1 dev eth0 46.4.89.1 dev eth0 proto static scope link and

[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static 
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8 
46.4.89.1 dev eth0 proto static scope link

In the latter case, network-local-commands-start just doesn't run for some unknown reason. But nixops's output even says so:

builder> starting the following units: alsa-store.service, audit.service, cpufreq.service, dnsmasq.service, kmod-static-nodes.service, myservice.service, network-link-eth0.service, network-local-commands.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket

vs

builder> starting the following units: alsa-store.service, audit.service, cpufreq.service, dnsmasq.service, kmod-static-nodes.service, myservice.service, network-link-eth0.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket

One of them doesn't list network-local-commands.service.

Aha:

[root@builder:~]# systemctl status network-local-commands
● network-local-commands.service - Extra networking commands.
   Loaded: loaded (/nix/store/2z1m96z6mxx886h1070sl1ps30iivfja-unit-network-local-c
   Active: inactive (dead) since Sun 2018-10-28 00:17:22 UTC; 19min ago
 Main PID: 2112 (code=exited, status=0/SUCCESS)

Oct 28 00:17:09 builder systemd[1]: Starting Extra networking commands....
Oct 28 00:17:09 builder network-local-commands-start[2112]: RTNETLINK answers: File
Oct 28 00:17:09 builder network-local-commands-start[2112]: RTNETLINK answers: File
Oct 28 00:17:09 builder systemd[1]: Started Extra networking commands..
Oct 28 00:17:22 builder systemd[1]: Stopped Extra networking commands..

So shows it as loaded / inactive (dead). After I do systemctl start network-local-commands.service, it starts running again on deploys folloing that, and is shown as:

[root@builder:~]# systemctl status network-local-commands
● network-local-commands.service - Extra networking commands.
   Loaded: loaded (/nix/store/zcbhw5nn2x9yaigrmwccx19h02i2mxm1-unit-network-local-commands.service/network-local-commands.service; enabled; vendor preset: enabled)
   Active: active (exited) since Sun 2018-10-28 00:37:11 UTC; 2min 8s ago
  Process: 5082 ExecStart=/nix/store/fpb2f3jl521qdm89cmg3m6cknv852y8y-unit-script/bin/network-local-commands-start (code=exited, status=0/SUCCESS)
 Main PID: 5082 (code=exited, status=0/SUCCESS)

So now, that's loaded / active (exited)

How does that come? It seems to happen when I comment out dnsmasq and deploy. Then nixops prints:

hetzner-reboot-test> closures copied successfully
builder> updating GRUB 2 menu...
builder> stopping the following units: dnsmasq.service, network-setup.service
builder> activating the configuration...
builder> removing user ‘dnsmasq’
builder> setting up /etc...
builder> setting up tmpfiles
builder> reloading the following units: dbus.service
builder> starting the following units: network-setup.service
builder> activation finished successfully

After I do that, I get

[root@builder:~]# systemctl status network-local-commands
● network-local-commands.service - Extra networking commands.
   Loaded: loaded (/nix/store/zcbhw5nn2x9yaigrmwccx19h02i2mxm1-unit-network-local-c
   Active: inactive (dead) since Sun 2018-10-28 00:42:06 UTC; 5s ago
...
Oct 28 00:42:06 builder systemd[1]: Stopped Extra networking commands..

So here, as opposed to the usual, it doesn't get started again. Commenting dnsmasq back in doesn't result in network-local-commands.service to be started again. But a reboot does. It looks like I can ensure that network-local-commands.service always runs by changing

https://github.com/NixOS/nixpkgs/blob/a2845aa0327713f5ca1a0631241559fe7950b203/nixos/modules/tasks/network-interfaces.nix#L1025

from

wantedBy = [ "network.target" ];

to

wantedBy = [ "network.target" "multi-user.target" ];

Not sure if that's the correct thing to do, but right now there doesn't seem to be anything that enforces that network-local-commands.service is run in general, which seems wrong.

Summary

So overall, the fixes we probably need to make are:

Ensure that Hetzner machines always have a link-local route entry to the gateway address
Ensure that network-local-commands.service gets started when it should (this is beyond Hetzner)

Also found that:

assertions in network-interfaces.nix doens't seem to be evaluated at all, this is probably bad

FPtje commented 4 years ago

To be honest I haven't seen this issue in quite some time. @nh2 have you?

Perhaps with the network changes in later NixOS versions this issue was fixed?

nh2 commented 4 years ago

Yes, the NixOps commit https://github.com/NixOS/nixops/commit/bb9fe125c78770351bd1abed54e3af0fa85df268 says:

Should help #833.

which is https://github.com/NixOS/nixops/issues/833 which is now this issue. So indeed, it probably did help with it.

Let's close until anybody sees it again.

NixOS / nixops-hetzner

Deploy disconnected my Hetzner machine again #13

Start of investigation

Investigation with stackoverflow

Starting from plain again, Franz's suggestion

The above problem not happening on older nixpkgs

Summary