nh2 closed this issue 4 years ago.
Bummer!
This Monday I performed a 17.09 upgrade on some Hetzner machines from 0bb77b83 to 39fe8d6d and haven't seen this behavior. Note that I'm not using dnsmasq, though.
Maybe it has something to do with the recent dnscrypt-proxy patches...
Hmm, it seems that this still happens reliably for nixpkgs upgrades.
Just got it again, this time updating to commit https://github.com/NixOS/nixpkgs/commit/c2b668ee7264000b10d554754150ea9305b6ef3d:
node-2............> restarting the following units: network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
Timeout, server 1.2.3.4 not responding.
@nh2 Can you look into the log if there is something suspicious?
I only see one possibility for how this could fail: the network-addresses service removes and re-adds the addresses on restart. This might also trigger the kernel to remove the default route, and since network-setup.service is probably not restarted, the route won't be added again. Thus the machine is no longer reachable.
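The hypothesized sequence can be replayed by hand from a rescue console (needs root; 192.0.2.8/27 and eth0 are placeholders for the machine's real static address and interface):

```shell
# Hypothetical replay of the suspected failure mode (run as root):
ip addr del 192.0.2.8/27 dev eth0   # what network-addresses-eth0 does on stop;
                                    # the kernel also drops routes using that
                                    # address, including the default route
ip addr add 192.0.2.8/27 dev eth0   # what it does on start; the connected /27
                                    # route comes back, the default route does not
ip route                            # no default route -> machine unreachable
```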
I'm currently considering moving to systemd-networkd by default for 18.03 to remove our hacky networking scripts once and for all. These have proven to be very fragile in practice and grokking all the weird systemd unit dependency relations involved is not trivial, at least for me.
I actually just ran into this again a few minutes ago when updating one of our machines in the field, which is now offline :-(. Fortunately I had a journalctl -f open in my SSH window, so we can see that indeed the main IP address got removed:
Jan 30 22:50:27 my-machine network-addresses-eno1-pre-stop[8950]: deleting x.x.x.x/xx...
Previously this only happened to me on Hetzner. This is the first time this happens on our own hardware in the field. Interestingly a few days ago I performed the same upgrade on all our other machines and didn't get this erroneous behaviour.
@fpletz +1 for making systemd-networkd the default. We're already using that on our workstations. Now I'm considering switching all our production machines to it ASAP as well.
Okay, so one of the problems here is that network-setup.service is not reloaded when a network-addresses service is restarted. The root problem we have in NixOS is that the default gateway is not tied to an interface and we don't have a route configuration option in general.
Be aware that the current state of our networkd implementation is a bit odd. For instance, the 99-main.network unit will probably make the network-online.target fail because it matches all interfaces. If you override this and tweak the generated network units to your needs, you should be fine, though. I know people who have been using it on NixOS for quite a while.
Just for completeness here's my (censored) nixops log:
my-machine.................> copying closure...
my-machine.................> copying 4 missing paths (0.07 MiB) to ‘root@x.x.x.x’...
my-net> closures copied successfully
my-machine.................> stopping the following units: alsa-store.service, audit.service, cpufreq.service, kmod-static-nodes.service, network-link-eno1.service, network-local-commands.service, network-setup.service, nix-daemon.service, nix-daemon.socket, nscd.service, rngd.service, smartd.service, systemd-journal-catalog-update.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-udevd.service
Connection to x.x.x.x closed by remote host.
my-machine.................> setting up /etc...
my-machine.................> NOT restarting the following changed units: getty@tty1.service, systemd-backlight@backlight:acpi_video0.service, systemd-journal-flush.service, systemd-logind.service, systemd-random-seed.service, systemd-remount-fs.service, systemd-tmpfiles-setup.service, systemd-update-utmp.service, systemd-user-sessions.service, user@0.service
my-machine.................> activating the configuration...
my-machine.................> restarting systemd...
my-machine.................> setting up tmpfiles
my-machine.................> reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount
my-machine.................> restarting the following units: network-addresses-eno1.service, nginx.service, sshd.service, systemd-journald.service
my-machine.................> starting the following units: alsa-store.service, audit.service, cpufreq.service, kmod-static-nodes.service, network-link-eno1.service, network-local-commands.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, smartd.service, systemd-journal-catalog-update.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket
my-machine.................> Job for multi-user.target canceled.
my-machine.................> error: Traceback (most recent call last):
File "/nix/store/bx6sbpccbs11zwxhm10dshwm15xlp737-nixops-1.5.2/lib/python2.7/site-packages/nixops/deployment.py", line 705, in worker
raise Exception("unable to activate new configuration")
Exception: unable to activate new configuration
error: activation of 1 of xxxx machines failed (namely on ‘my-machine’)
I have to take back my previous comments. After retrieving and investigating my-machine, it turned out that the unreachability was caused by a service which rebooted the machine. Sorry for the noise.
And it happened again to me just now: after a release-18.03 rebase and deploy, I got
restarting the following units: dhcpcd.service, network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
and the network is gone.
This is very reproducible, I can do it with one server after the other in turn.
nix-diff those two nixos drvs?
On Thu, Aug 30, 2018 at 4:11 PM Niklas Hambüchen notifications@github.com wrote:
And it happened again to me: After some release-18.03 rebase and deploy, I got
restarting the following units: dhcpcd.service, network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
and the network is gone.
@domenkozar Hmm, it's not too revealing on the nixops-machines:
nix-diff /nix/store/sl78hnxw1b631xn8dmh9jl46pk9m7g51-nixops-machines.drv /nix/store/cq1b7l8xd43j21q3nwdh40cdp1cp68fd-nixops-machines.drv
- /nix/store/sl78hnxw1b631xn8dmh9jl46pk9m7g51-nixops-machines.drv:{out}
+ /nix/store/cq1b7l8xd43j21q3nwdh40cdp1cp68fd-nixops-machines.drv:{out}
• The builders do not match
- /nix/store/zqh3l3lyw32q1ayb15bnvg9f24j5v2p0-bash-4.4-p12/bin/bash
+ /nix/store/lw7xaqhakk0i1c631m3cvac3x4lc5gr5-bash-4.4-p12/bin/bash
• The set of input names do not match:
+ nixos-system-mymachine-1-18.03pre-git
@nh2: Hm, could you look into the journal to see what went wrong? I have a suspicion that it's due to the additional routes we set in localCommands.
On a side note: I'm going to refactor this now, as we should now be able to do this without localCommands.
@basvandijk and I are investigating this at NixCon; we managed to reproduce it with this nixops specification:
# To reproduce the Hetzner disconnect problem
# https://github.com/NixOS/nixops/issues/833#issuecomment-417491015
# deploy first with nixpkgs commit
# a2845aa0
# and then switch to
# df6f8d4c
# and deploy again.
# Network arguments can be set with nixops's `set-args` feature.
{}:
let
pkgs = import <nixpkgs> {};
ip = "1.2.3.4";
robotSubAccount = "#111111+AAAAA";
in
{
network.enableRollback = true;
builder = {
deployment.targetEnv = "hetzner";
deployment.hetzner.createSubAccount = false; # because we use 2-factor auth
deployment.hetzner.robotUser = robotSubAccount;
deployment.hetzner.mainIPv4 = ip;
deployment.hetzner.partitions = ''
clearpart --all --initlabel --drives=sda,sdb
part raid.1a --recommended --label=swap1 --fstype=swap --ondisk=sda
part raid.1b --recommended --label=swap2 --fstype=swap --ondisk=sdb
part raid.2a --grow --ondisk=sda
part raid.2b --grow --ondisk=sdb
raid swap --level=1 --device=swap --fstype=swap --label=swap raid.1a raid.1b
raid / --level=1 --device=root --fstype=ext4 --label=root raid.2a raid.2b
'';
deployment.hasFastConnection = true;
boot.kernelModules = [ "tcp_bbr" ];
# Enable BBR congestion control
boot.kernel.sysctl."net.ipv4.tcp_congestion_control" = "bbr";
boot.kernel.sysctl."net.core.default_qdisc" = "fq"; # see https://news.ycombinator.com/item?id=14814530
nix.buildCores = 0; # 0 is auto detection
users.extraUsers.root.openssh.authorizedKeys.keys = (import ./ssh-keys.nix);
environment.systemPackages = [
pkgs.git
pkgs.htop
pkgs.nload
pkgs.parted
pkgs.python3
pkgs.screen
pkgs.vim
pkgs.ncdu
];
systemd.services.myservice = {
requiredBy = [ "multi-user.target" ];
after = [
"network-online.target"
];
script = ''
set -euo pipefail
while true; do
echo hello
sleep 5
done
'';
serviceConfig = {
Type = "simple";
Restart = "always";
};
unitConfig = {
StartLimitIntervalSec = 0; # ensure Restart=always is always honoured
};
};
networking.nameservers = [
"8.8.4.4"
];
services.nginx = {
enable = true;
};
};
}
This can probably be further minimised but I thought I'd post it for the record.
Actually, the above doesn't reproduce the server becoming offline and unreachable:
It prints
builder> reloading the following units: dbus.service, dev-hugepages.mount, dev-mqueue.mount, firewall.service, sys-kernel-debug.mount
builder> restarting the following units: dhcpcd.service, network-addresses-eth0.service, nginx.service, sshd.service, systemd-journald.service
builder> starting the following units: alsa-store.service, audit.service, cpufreq.service, kmod-static-nodes.service, myservice.service, network-link-eth0.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket
then it hangs for 30 seconds (a separate SSH connection to the server also hangs during that time) and then the server becomes available again.
So this reproduces the eth0.service stuff being restarted, but not the server disappearing until a hard reboot.
@nh2: Can you confirm whether this is also the case with https://gist.github.com/aszlig/84e9bf4c8c2a06e77e43d64f89a5c2fc applied?
Below comes an investigation dump of what I did with @basvandijk and then later continued.
There is a summary at the bottom.
Checking the network-setup.service ExecStart script:
[root@builder:~]# cat /nix/store/yyryz5nfzjy63991ha4nplkcc2qsljg8-unit-script/bin/network-setup-start
#! /nix/store/lw7xaqhakk0i1c631m3cvac3x4lc5gr5-bash-4.4-p12/bin/bash -e
# Set the static DNS configuration, if given.
/nix/store/v9dgkyj939d4zl0bczn5rhgg8s7205id-openresolv-3.9.0/sbin/resolvconf -m 1 -a static <<EOF
nameserver 213.133.98.98
nameserver 213.133.99.99
nameserver 213.133.100.100
nameserver 2a01:4f8:0:a0a1::add:1010
nameserver 2a01:4f8:0:a102::add:9999
nameserver 2a01:4f8:0:a111::add:9898
nameserver 8.8.4.4
EOF
echo "nh2: resolvconf done"
sleep 2
# Set the default gateway.
ip route replace default via "46.4.89.1" proto static
echo "nh2: ip route replace done"
always shows this in journalctl:
Oct 27 14:47:39 builder systemd[1]: Starting Networking Setup...
Oct 27 14:47:39 builder network-setup-start[10445]: nh2: resolvconf done
Oct 27 14:47:41 builder network-setup-start[10445]: RTNETLINK answers: Network is unreachable
Oct 27 14:47:41 builder systemd[1]: network-setup.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 27 14:47:41 builder systemd[1]: network-setup.service: Failed with result 'exit-code'.
Oct 27 14:47:41 builder systemd[1]: Failed to start Networking Setup.
Running the ip route replace manually also results in
[root@builder:~]# ip route replace default via "46.4.89.1" proto static
RTNETLINK answers: Network is unreachable
before:
[root@builder:~]# ip route list
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 via 46.4.89.1 dev eth0
Then, following https://serverfault.com/questions/581159/unable-to-add-a-static-route-sioaddrt-network-is-unreachable/581162#581162, which says:
Instead its declared 192.168.36.0/23 should be passed to the gateway 192.168.36.254 which technically according to the rest of the routing table does not appear to be a neighbour (the kernel has no idea how to get to 192.168.36.254).
[root@builder:~]# ip route del 46.4.89.0/27 via 46.4.89.1 dev eth0
[root@builder:~]# ip route list
default via 46.4.89.1 dev eth0 proto static
[root@builder:~]# ip route add 46.4.89.0/27 dev eth0
we get
[root@builder:~]# ip route list
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 dev eth0 scope link
the difference:
-46.4.89.0/27 via 46.4.89.1 dev eth0
+46.4.89.0/27 dev eth0 scope link
In the situation
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 via 46.4.89.1 dev eth0
This doesn't work:
[root@builder:~]# ip route replace default via "46.4.89.1" proto static
RTNETLINK answers: Network is unreachable
But it works if we add another route that says "you can reach 46.4.89.1 directly":
[root@builder:~]# ip route add 46.4.89.1 dev eth0
[root@builder:~]# ip route replace default via "46.4.89.1" proto static
Final state:
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 via 46.4.89.1 dev eth0
46.4.89.1 dev eth0 scope link
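The kernel's complaint boils down to: the nexthop of a `via` route must fall inside some directly-connected prefix. That containment check can be sketched in plain shell (IPv4 only; `ip_to_int` and `on_link` are hypothetical helper names illustrating the rule, not what the kernel literally runs):

```shell
# Convert dotted-quad IPv4 to an integer.
ip_to_int() {
  IFS=. read -r a b c d <<EOF
$1
EOF
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

# on_link GATEWAY PREFIX/LEN: succeed iff the gateway lies inside the prefix,
# i.e. the kernel would consider it directly reachable via that route.
on_link() {
  gw=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  len=${2#*/}
  mask=$(( 0xFFFFFFFF ^ ((1 << (32 - len)) - 1) ))
  [ $(( gw & mask )) -eq $(( net & mask )) ]
}

on_link 46.4.89.1 46.4.89.0/27 && echo "on-link: replace succeeds"
on_link 46.4.89.1 10.0.0.0/8 || echo "not on-link: Network is unreachable"
```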
TODO: write down why aszlig does change instead of replace in local_commands (because he knows that the routes in question already exist).
On nixpkgs a2845aa the above problem doesn't exist.
The route looks the same, but there's more being restarted:
https://github.com/NixOS/nixops/issues/833#issuecomment-433639687
Somehow that more-restarting means that network-setup.service runs through cleanly without the Network is unreachable.
Running the ip route replace manually still results in that, however.
And at some point the routes change to:
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
So, more things to figure out: what is the difference between a2845aa and newer (around 18.03) nixpkgs that results in this change?
Also, when switching from a2845aa to df6f8d4 we get a change from
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 via 46.4.89.1 dev eth0
to
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
To reset it, reboot. When switching the other way around, it doesn't happen.
After retrying again, with dnsmasq disabled, I couldn't reproduce this.
But it did happen with dnsmasq enabled, both when switching from df6f8d4 to a2845aa and the other way around.
It happened every time (with me doing ip route del 46.4.89.0/27 in between to avoid the reboot).
I could also reproduce it reliably when doing
ip route del 46.4.89.0/27 && ip route add 46.4.89.1 dev eth0 && ip route add 46.4.89.0/27 via 46.4.89.1 dev eth0 && ip route del 46.4.89.1 dev eth0 scope link
in between switching. Then I commented out dnsmasq, and could continue to reliably reproduce it this way (as opposed to me not being able to reproduce it that way just before, so weird). But after a reboot (with dnsmasq commented out), I could not reproduce it any more. So it seems like turning on dnsmasq does something to the system that persists until reboot. Then, trying the same stuff again after another reboot, it didn't appear again for a while. I'm suspecting there's something racy going on, or some systemd timeout.
I added logging into
https://github.com/NixOS/nixpkgs/blob/a2845aa/nixos/modules/tasks/network-interfaces-scripted.nix#L201
like this:
echo "nh2 ip route before"
ip route
echo "${address}" >> $state
if out=$(ip addr add "${address}" dev "${i.name}" 2>&1); then
echo "added ip ${address}"
echo "output: $out"
echo
echo "nh2 ip route"
ip route
elif ! echo "$out" | grep "File exists" >/dev/null 2>&1; then
echo "failed to add ${address}"
echo "nh2 ip route"
ip route
exit 1
fi
and also put a set -x around the ip command, and found:
Oct 27 23:11:23 builder systemd[1]: Stopping Address configuration of eth0...
Oct 27 23:11:23 builder network-addresses-eth0-pre-stop[23006]: deleting 46.4.89.8/27...
Oct 27 23:11:23 builder systemd[1]: Stopped Address configuration of eth0.
Oct 27 23:11:23 builder systemd[1]: Starting Address configuration of eth0...
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: bringing up interface...
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip link set eth0 up
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: nh2 ip route before
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: added ip 46.4.89.8/27
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: output: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip addr add 46.4.89.8/27 dev eth0
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: nh2 ip route
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:11:23 builder network-addresses-eth0-start[23009]: 46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
Oct 27 23:11:23 builder systemd[1]: Started Address configuration of eth0.
So here, there are at first no routes at all, and as soon as ip addr add is done, we have the route
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
So ip addr add seems to add routes automatically.
https://serverfault.com/a/886207/128321 also suggests that this is the case.
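This can be seen in isolation in a throwaway network namespace (a sketch; requires unprivileged user namespaces, and a dummy interface stands in for eth0):

```shell
# Scratch-netns demo: `ip addr add` installs the connected-prefix route by
# itself, which is exactly what later lets `ip route replace default` succeed.
unshare -rn sh -c '
  ip link add dummy0 type dummy
  ip link set dummy0 up
  ip route                                # nothing yet
  ip addr add 46.4.89.8/27 dev dummy0
  ip route                                # 46.4.89.0/27 dev dummy0 proto kernel scope link src 46.4.89.8
  ip route replace default via 46.4.89.1 && echo "default route installed"
'
```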
Because that entry exists,
ip route replace default via 46.4.89.1 proto static
runs through without Network is unreachable.
After a reboot with dnsmasq disabled, I found this via auditd:
ip route replace default via 46.4.89.1 proto static
ip -6 addr add 2a01:4f8:140:71c5::/64 dev eth0
ip -4 route change 46.4.89.0/27 via 46.4.89.1 dev eth0
ip -6 route add default via fe80::1 dev eth0
I suspect that the
ip -4 route change 46.4.89.0/27 via 46.4.89.1 dev eth0
(which is one of the nixops-introduced local_commands) is what turns
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
into
46.4.89.0/27 via 46.4.89.1 dev eth0
But why does that not happen with dnsmasq enabled?
A normal deploy that changes stuff (e.g. by adding a space in echo "nh2 ip route before") looks like this:
Oct 27 23:28:36 builder systemd[1]: Stopped Extra networking commands..
Oct 27 23:28:36 builder systemd[1]: Stopped Networking Setup.
Oct 27 23:28:36 builder systemd[1]: Stopping Address configuration of eth0...
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: bringing up interface...
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip link set eth0 up
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: nh2 ip route before
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: added ip 46.4.89.8/27
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: output: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip addr add 46.4.89.8/27 dev eth0
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: nh2 ip route
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route
Oct 27 23:28:36 builder network-addresses-eth0-start[2974]: 46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
Oct 27 23:28:36 builder systemd[1]: Started Address configuration of eth0.
Oct 27 23:28:36 builder systemd[1]: Starting Networking Setup...
Oct 27 23:28:36 builder network-setup-start[3013]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route replace default via 46.4.89.1 proto static
Oct 27 23:28:36 builder systemd[1]: Started Networking Setup.
Oct 27 23:28:36 builder systemd[1]: Starting Extra networking commands....
Oct 27 23:28:36 builder network-local-commands-start[3087]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip -6 addr add 2a01:4f8:140:71c5::/64 dev eth0
Oct 27 23:28:36 builder network-local-commands-start[3087]: RTNETLINK answers: File exists
Oct 27 23:28:36 builder network-local-commands-start[3087]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip -4 route change 46.4.89.0/27 via 46.4.89.1 dev eth0
Oct 27 23:28:36 builder network-local-commands-start[3087]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip -6 route add default via fe80::1 dev eth0
Oct 27 23:28:36 builder network-local-commands-start[3087]: RTNETLINK answers: File exists
Oct 27 23:28:36 builder systemd[1]: Started Extra networking commands..
But after adding dnsmasq and deploying we get:
Oct 27 23:29:34 builder systemd[1]: Stopped Extra networking commands..
Oct 27 23:29:34 builder systemd[1]: Stopped Networking Setup.
Oct 27 23:29:35 builder systemd[1]: Starting Networking Setup...
Oct 27 23:29:35 builder network-setup-start[3486]: + exec /nix/store/yjd47a4rv7mbpvj6x75hpyssi60hhcn0-iproute2-4.13.0/bin/orig/ip route replace default via 46.4.89.1 proto static
Oct 27 23:29:35 builder network-setup-start[3486]: RTNETLINK answers: Network is unreachable
Oct 27 23:29:35 builder systemd[1]: network-setup.service: Main process exited, code=exited, status=2/INVALIDARGUMENT
Oct 27 23:29:35 builder systemd[1]: Failed to start Networking Setup.
Oct 27 23:29:35 builder systemd[1]: network-setup.service: Unit entered failed state.
Oct 27 23:29:35 builder systemd[1]: network-setup.service: Failed with result 'exit-code'.
Looks like the difference here is a missing
Stopping Address configuration of eth0...
Probably that's why we get RTNETLINK answers: Network is unreachable: because the IP isn't removed and re-added, the
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
route isn't added that way, so then the replace fails.
Even after changing nixpkgs and NixOps to insert the required
46.4.89.1 dev eth0 proto static scope link
which makes all errors disappear, there remains some race between
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 via 46.4.89.1 dev eth0
46.4.89.1 dev eth0 proto static scope link
and
[root@builder:~]# ip route
default via 46.4.89.1 dev eth0 proto static
46.4.89.0/27 dev eth0 proto kernel scope link src 46.4.89.8
46.4.89.1 dev eth0 proto static scope link
In the latter case, network-local-commands-start just doesn't run for some unknown reason. But nixops's output even says so:
builder> starting the following units: alsa-store.service, audit.service, cpufreq.service, dnsmasq.service, kmod-static-nodes.service, myservice.service, network-link-eth0.service, network-local-commands.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket
vs
builder> starting the following units: alsa-store.service, audit.service, cpufreq.service, dnsmasq.service, kmod-static-nodes.service, myservice.service, network-link-eth0.service, network-setup.service, nix-daemon.socket, nscd.service, rngd.service, systemd-modules-load.service, systemd-sysctl.service, systemd-timesyncd.service, systemd-tmpfiles-clean.timer, systemd-tmpfiles-setup-dev.service, systemd-udev-trigger.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket
One of them doesn't list network-local-commands.service.
Aha:
[root@builder:~]# systemctl status network-local-commands
● network-local-commands.service - Extra networking commands.
Loaded: loaded (/nix/store/2z1m96z6mxx886h1070sl1ps30iivfja-unit-network-local-c
Active: inactive (dead) since Sun 2018-10-28 00:17:22 UTC; 19min ago
Main PID: 2112 (code=exited, status=0/SUCCESS)
Oct 28 00:17:09 builder systemd[1]: Starting Extra networking commands....
Oct 28 00:17:09 builder network-local-commands-start[2112]: RTNETLINK answers: File
Oct 28 00:17:09 builder network-local-commands-start[2112]: RTNETLINK answers: File
Oct 28 00:17:09 builder systemd[1]: Started Extra networking commands..
Oct 28 00:17:22 builder systemd[1]: Stopped Extra networking commands..
So it shows it as loaded / inactive (dead).
After I do systemctl start network-local-commands.service, it starts running again on deploys following that, and is shown as:
[root@builder:~]# systemctl status network-local-commands
● network-local-commands.service - Extra networking commands.
Loaded: loaded (/nix/store/zcbhw5nn2x9yaigrmwccx19h02i2mxm1-unit-network-local-commands.service/network-local-commands.service; enabled; vendor preset: enabled)
Active: active (exited) since Sun 2018-10-28 00:37:11 UTC; 2min 8s ago
Process: 5082 ExecStart=/nix/store/fpb2f3jl521qdm89cmg3m6cknv852y8y-unit-script/bin/network-local-commands-start (code=exited, status=0/SUCCESS)
Main PID: 5082 (code=exited, status=0/SUCCESS)
So now, that's loaded / active (exited)
How does that happen? It seems to occur when I comment out dnsmasq and deploy. Then nixops prints:
hetzner-reboot-test> closures copied successfully
builder> updating GRUB 2 menu...
builder> stopping the following units: dnsmasq.service, network-setup.service
builder> activating the configuration...
builder> removing user ‘dnsmasq’
builder> setting up /etc...
builder> setting up tmpfiles
builder> reloading the following units: dbus.service
builder> starting the following units: network-setup.service
builder> activation finished successfully
After I do that, I get
[root@builder:~]# systemctl status network-local-commands
● network-local-commands.service - Extra networking commands.
Loaded: loaded (/nix/store/zcbhw5nn2x9yaigrmwccx19h02i2mxm1-unit-network-local-c
Active: inactive (dead) since Sun 2018-10-28 00:42:06 UTC; 5s ago
...
Oct 28 00:42:06 builder systemd[1]: Stopped Extra networking commands..
So here, as opposed to the usual, it doesn't get started again.
Commenting dnsmasq back in doesn't result in network-local-commands.service being started again.
But a reboot does.
It looks like I can ensure that network-local-commands.service always runs by changing
https://github.com/NixOS/nixpkgs/blob/a2845aa0327713f5ca1a0631241559fe7950b203/nixos/modules/tasks/network-interfaces.nix#L1025
from
wantedBy = [ "network.target" ];
to
wantedBy = [ "network.target" "multi-user.target" ];
Not sure if that's the correct thing to do, but right now there doesn't seem to be anything that enforces that network-local-commands.service is run in general, which seems wrong.
So overall, the fixes we probably need to make are:
Also found that: the assertions in network-interfaces.nix don't seem to be evaluated at all, which is probably bad.
To be honest, I haven't seen this issue in quite some time. @nh2, have you?
Perhaps with the network changes in later NixOS versions this issue was fixed?
Yes, the NixOps commit https://github.com/NixOS/nixops/commit/bb9fe125c78770351bd1abed54e3af0fa85df268 says:
Should help #833.
which is https://github.com/NixOS/nixops/issues/833 which is now this issue. So indeed, it probably did help with it.
Let's close until anybody sees it again.
Similar to NixOS/nixops#640, but this time I didn't add any extra package myself like dnsmasq as in that ticket; this time I only updated packages to a newer version on top of release-17.09 (from a2845aa0 to df6f8d4c). In nixops I see
CC @basvandijk, @FPtje, @aszlig, @domenkozar, @fpletz, @groxxda