Open nh2 opened 6 years ago
Perhaps some race in nixos-container?
Alternatively, maybe an off-by-one error in nixops, triggered by some concurrent operation?
From some more inspection, the nixops DB and the nix store seem to contain only correct information, e.g. cat /nix/store/ca1jmqbqnq217rmb7ci76p2w3vfiiyfb-nixops-machines/sample-node-3/etc/hostname prints sample-node-3, but /etc/hostname on the deployed machine 3 contains sample-node-2.
So I suspect that it's an off-by-one in nixops somewhere (namely, applying the contents of machine N to machine N+1; in my case container --45 is the real node 2 and --46 should be node 3).
No longer sure it's an off-by-one: Right now I got 3 machines all being the same host.
OK, more info (with stress-ng --io 4 running in parallel to nixops to create some IO load):
[root@machine:~]# ls -lah /nix/var/nix/profiles/per-container/myproject--81/
total 28K
drwxr-xr-x 3 root root 4.0K Mar 11 01:27 .
drwx------ 87 root root 12K Mar 11 01:18 ..
drwxrwxrwt 3 root root 4.0K Mar 11 01:19 per-user
lrwxrwxrwx 1 root root 13 Mar 11 01:27 system -> system-2-link
lrwxrwxrwx 1 root root 83 Mar 11 01:18 system-1-link -> /nix/store/10sqnsw5w17jv2zj9xs63iy5nxxn0hnj-nixos-system-myproject-node-2-17.09pre-git
lrwxrwxrwx 1 root root 83 Mar 11 01:19 system-2-link -> /nix/store/jyrldwdzckwsmpn6x1r4nwrrhii8dh8a-nixos-system-myproject-node-3-17.09pre-git
[root@machine:~]# ls -lah /nix/var/nix/profiles/per-container/myproject--80
total 28K
drwxr-xr-x 3 root root 4.0K Mar 11 01:27 .
drwx------ 87 root root 12K Mar 11 01:18 ..
drwxrwxrwt 3 root root 4.0K Mar 11 01:19 per-user
lrwxrwxrwx 1 root root 13 Mar 11 01:27 system -> system-2-link
lrwxrwxrwx 1 root root 83 Mar 11 01:18 system-1-link -> /nix/store/10sqnsw5w17jv2zj9xs63iy5nxxn0hnj-nixos-system-myproject-node-2-17.09pre-git
lrwxrwxrwx 1 root root 83 Mar 11 01:19 system-2-link -> /nix/store/9fanrj37wrfmaq20hskm50cygh3hzhhm-nixos-system-myproject-node-2-17.09pre-git
Notice how in the bad machine above (the first listing, myproject--81) system-2-link is correct but system-1-link is set to the wrong hostname.
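The comparison above can be automated. The following is a minimal sketch, not part of nixops: it assumes system store paths look like `<32-char hash>-nixos-system-<hostname>-<version>` (as in the listings above), and the helper names `hostname_of`, `generation_hostnames` and `is_consistent` are hypothetical.

```python
import os
import re

# Heuristic (an assumption based on the listings above): the basename is
# "<32-char hash>-nixos-system-<hostname>-<version>", and the version starts
# with digits, a dot, and digits (e.g. "17.09pre-git").
SYSTEM_PATH = re.compile(r"^[0-9a-z]{32}-nixos-system-(?P<host>.+)-\d+\.\d+.*$")


def hostname_of(store_path):
    """Return the hostname embedded in a NixOS system store path, or None."""
    match = SYSTEM_PATH.match(os.path.basename(store_path))
    return match.group("host") if match else None


def generation_hostnames(profile_dir):
    """Map each system-N-link in profile_dir to the hostname it points at."""
    result = {}
    for entry in sorted(os.listdir(profile_dir)):
        if re.fullmatch(r"system-\d+-link", entry):
            # readlink does not require the target to exist.
            target = os.readlink(os.path.join(profile_dir, entry))
            result[entry] = hostname_of(target)
    return result


def is_consistent(profile_dir):
    """True if every generation in the profile names the same host."""
    return len(set(generation_hostnames(profile_dir).values())) <= 1
```

Run against /nix/var/nix/profiles/per-container/myproject--81, `is_consistent` would return False for the listing shown above, since its two generations name different hosts.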
Using printf debugging I could determine that in this line https://github.com/NixOS/nixops/blob/b267e2ba592d97d55296e8183689652dc4637416/nixops/backends/container.py#L149 the path is already wrong, and duplicated across machines, when the issue occurs.
The path is returned incorrectly via https://github.com/NixOS/nixops/blob/b267e2ba592d97d55296e8183689652dc4637416/nixops/backends/container.py#L139-L142 even when expr_file has correct contents, such as:
[root@machine:~]# cat /run/user/2000/nixops-tmpmjSJjW/myproject-node-2-initial.nix
{ imports = [ <nixops/container-base.nix> ]; boot.isContainer = true; networking.hostName = "myproject-node-2"; users.extraUsers.root.openssh.authorizedKeys.keys = [ "ssh-ed25519 ... NixOps auto-generated key" ]; }
[root@machine:~]# cat /run/user/2000/nixops-tmpmjSJjW/myproject-node-3-initial.nix
{ imports = [ <nixops/container-base.nix> ]; boot.isContainer = true; networking.hostName = "myproject-node-3"; users.extraUsers.root.openssh.authorizedKeys.keys = [ "ssh-ed25519 ... NixOps auto-generated key" ]; }
Looks like something is wrong with nix-build. Starting 2 nix-build invocations like the one linked above in parallel, twice:
root@machine % NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-2-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix' &; NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-3-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix'
[1] 30338
/nix/store/7w19g3wyfhnlb6zgnibwq0vix2258fj8-nixos-system-myproject-node-2-17.09pre-git
[1] + 30338 done NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I
/nix/store/bnrlaid83sxzdn1sv365syrprs8fpw20-nixos-system-myproject-node-3-17.09pre-git
root@machine % NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-2-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix' &; NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-3-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix'
[1] 30369
/nix/store/bnrlaid83sxzdn1sv365syrprs8fpw20-nixos-system-myproject-node-3-17.09pre-git
/nix/store/bnrlaid83sxzdn1sv365syrprs8fpw20-nixos-system-myproject-node-3-17.09pre-git
[1] + 30369 done NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I
The first time they print the expected outputs; the second time both print ...myproject-node-3-17.09pre-git.
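To repeat this experiment without retyping the shell line, a small harness can start the builds concurrently and flag identical output paths. This is a sketch under assumptions: `nix_build_argv`, `run_parallel` and `duplicated` are hypothetical helpers, only flags that appear in the commands above are used, and NIX_PATH is assumed to be inherited from the environment.

```python
import subprocess
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def nix_build_argv(config_path):
    """argv for building one system, using only flags from the commands above."""
    return ["nix-build", "<nixpkgs/nixos>", "-A", "system",
            "-I", f"nixos-config={config_path}"]


def run_parallel(commands):
    """Run all commands concurrently; return the last stdout line of each, in order."""
    def last_line(argv):
        done = subprocess.run(argv, check=True, capture_output=True, text=True)
        lines = done.stdout.strip().splitlines()
        return lines[-1] if lines else ""

    # One worker per command, so none waits for another to finish.
    with ThreadPoolExecutor(max_workers=max(1, len(commands))) as pool:
        return list(pool.map(last_line, commands))


def duplicated(outputs):
    """Return the outputs that were printed by more than one command."""
    return sorted(out for out, count in Counter(outputs).items() if count > 1)
```

Calling `run_parallel` with `nix_build_argv` for the node-2 and node-3 config files and passing the result to `duplicated` should return an empty list on a good run, and the shared store path on a run like the second one above.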
I ran into an issue with hostname resolution. It seems like it has to do with the snippet linked above:
What happened was that my network definition looks something like this:
{
  network.description = "Example network";

  reverseproxy = { ... }:
    let
      myServiceHost = "my-service";
    in
    {
      deployment.targetEnv = "virtualbox";
      # nginx service definition
      # Uses myServiceHost to route traffic
      # ..
    };

  my-service = { resources, ... }:
    {
      deployment.targetEnv = "container";
      deployment.container.host = resources.machines.reverseproxy;
      # ...
    };
}
I was expecting that the /etc/hosts rules would be set up to use the machine name from the network definition, but it gets truncated to 7 characters, so for example from inside reverseproxy I can resolve my-serv but not my-service.
I see that there is a restriction in nixos-container.pl:
# Due to interface name length restrictions, container names must
# be restricted too.
die "$0: container name ‘$containerName’ is too long\n" if length $containerName > 11;
So I see why it is done, but is it correct for the hostname rules to use the truncated name? Am I just doing this wrong, since the hostname is hardcoded and not referenced via resources or similar (I couldn't find any documentation for this)?
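To make the two limits concrete, here is a small illustration. It is a sketch only: the 11-character check mirrors the Perl snippet quoted above, while the 7-character prefix is just the behaviour observed in this comment, not a documented constant, and both function names are hypothetical.

```python
MAX_CONTAINER_NAME = 11  # limit enforced by the nixos-container.pl check quoted above
OBSERVED_PREFIX = 7      # truncation observed in /etc/hosts here; an assumption, not documented


def check_container_name(name):
    """Mirror of the quoted Perl check: reject names longer than 11 characters."""
    if len(name) > MAX_CONTAINER_NAME:
        raise ValueError(f"container name '{name}' is too long")
    return name


def resolvable_name(machine_name, prefix=OBSERVED_PREFIX):
    """Name that ends up resolvable if the machine name is cut to a fixed prefix."""
    return machine_name[:prefix]
```

Note that my-service (10 characters) passes the 11-character check on its own, yet `resolvable_name("my-service")` gives my-serv, which matches what resolves from inside reverseproxy.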
This seems to happen very rarely.
With the container backend, nixops info shows:
So now it claims it's node-2 when in the table it shows node-3. I have two node-2s now in my machinectl.
As a result, I cannot nixops ssh sample-node-3, because the privkey it's offering doesn't match the pubkey on the machine (it offers the one for node-3 but the pubkey in the container is the one for node-2).