NixOS / nixops

NixOps is a tool for deploying to NixOS machines in a network or cloud.
https://nixos.org/nixops
GNU Lesser General Public License v3.0

containers backend sometimes assigns wrong host(name) #889

Open nh2 opened 6 years ago

nh2 commented 6 years ago

This seems to happen very rarely.

With the container backend, nixops info shows:

+-----------------+-----------------+-----------+-------------+-------------+
| Name            |      Status     | Type      | Resource Id | IP address  |
+-----------------+-----------------+-----------+-------------+-------------+
| sample-node-4   |  Up / Outdated  | container | sample--44  | 10.233.26.2 |
| sample-logging  | Up / Up-to-date | container | sample--47  | 10.233.29.2 |
| sample-node-1   |  Up / Outdated  | container | sample--48  | 10.233.30.2 |
| sample-node-2   |  Up / Outdated  | container | sample--45  | 10.233.27.2 |
| sample-node-3   |  Up / Outdated  | container | sample--46  | 10.233.28.2 |
+-----------------+-----------------+-----------+-------------+-------------+
# machinectl | grep 10.233.28.2
sample--46 container systemd-nspawn nixos 17.09pre-git 10.233.28.2...
# machinectl shell sample--46
Connected to machine sample--46. Press ^] three times within 1s to exit session.

[root@sample-node-2:~]# 

So the container now claims to be node-2, while the table lists it as node-3. machinectl now shows two node-2s.

As a result, I cannot nixops ssh sample-node-3, because the privkey it's offering doesn't match the pubkey on the machine (it offers the one for node-3 but the pubkey in the container is the one for node-2).

Perhaps some race in nixos-container?

nh2 commented 6 years ago

> Perhaps some race in nixos-container?

Alternatively, maybe an off-by-one error in nixops, triggered by some concurrent operation?

nh2 commented 6 years ago

From some more inspection, the nixops DB and the nix store seem to contain only correct information, e.g. cat /nix/store/ca1jmqbqnq217rmb7ci76p2w3vfiiyfb-nixops-machines/sample-node-3/etc/hostname contains sample-node-3, but /etc/hostname on the deployed machine 3 contains sample-node-2.

So I suspect that it's an off-by-one in nixops somewhere (namely, applying the contents of machine N to machine N+1, in my case container --45 is the real node 2 and --46 should be node 3).

nh2 commented 6 years ago

I'm no longer sure it's an off-by-one: right now I have 3 machines that all report the same host.
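
A minimal sketch of how the symptom could be checked for mechanically, assuming the expected and the reported hostnames have already been collected per container (for example from nixops info and from /etc/hostname inside each container). The function name find_hostname_problems is hypothetical and not part of nixops; the sample values are the ones from the first report in this thread.

```python
from collections import defaultdict


def find_hostname_problems(expected, reported):
    """Compare intended hostnames with what each container reports.

    expected: dict mapping container id -> hostname nixops intended
    reported: dict mapping container id -> hostname read from the container
    Returns (mismatches, duplicates).
    """
    # Containers whose reported hostname differs from the intended one.
    mismatches = {
        cid: (want, reported.get(cid))
        for cid, want in expected.items()
        if reported.get(cid) != want
    }
    # Hostnames claimed by more than one container.
    by_host = defaultdict(list)
    for cid, host in reported.items():
        by_host[host].append(cid)
    duplicates = {h: sorted(c) for h, c in by_host.items() if len(c) > 1}
    return mismatches, duplicates


# Values from the first report: --45 is the real node 2, --46 should be node 3.
expected = {"sample--45": "sample-node-2", "sample--46": "sample-node-3"}
reported = {"sample--45": "sample-node-2", "sample--46": "sample-node-2"}
mismatches, duplicates = find_hostname_problems(expected, reported)
print(mismatches)
print(duplicates)
```

Reporting duplicates separately from mismatches helps tell a shifted assignment (off-by-one) apart from several machines collapsing onto one host.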

nh2 commented 6 years ago

OK, more info:

[root@machine:~]# ls -lah /nix/var/nix/profiles/per-container/myproject--81/
total 28K
drwxr-xr-x  3 root root 4.0K Mar 11 01:27 .
drwx------ 87 root root  12K Mar 11 01:18 ..
drwxrwxrwt  3 root root 4.0K Mar 11 01:19 per-user
lrwxrwxrwx  1 root root   13 Mar 11 01:27 system -> system-2-link
lrwxrwxrwx  1 root root   83 Mar 11 01:18 system-1-link -> /nix/store/10sqnsw5w17jv2zj9xs63iy5nxxn0hnj-nixos-system-myproject-node-2-17.09pre-git
lrwxrwxrwx  1 root root   83 Mar 11 01:19 system-2-link -> /nix/store/jyrldwdzckwsmpn6x1r4nwrrhii8dh8a-nixos-system-myproject-node-3-17.09pre-git

[root@machine:~]# ls -lah /nix/var/nix/profiles/per-container/myproject--80
total 28K
drwxr-xr-x  3 root root 4.0K Mar 11 01:27 .
drwx------ 87 root root  12K Mar 11 01:18 ..
drwxrwxrwt  3 root root 4.0K Mar 11 01:19 per-user
lrwxrwxrwx  1 root root   13 Mar 11 01:27 system -> system-2-link
lrwxrwxrwx  1 root root   83 Mar 11 01:18 system-1-link -> /nix/store/10sqnsw5w17jv2zj9xs63iy5nxxn0hnj-nixos-system-myproject-node-2-17.09pre-git
lrwxrwxrwx  1 root root   83 Mar 11 01:19 system-2-link -> /nix/store/9fanrj37wrfmaq20hskm50cygh3hzhhm-nixos-system-myproject-node-2-17.09pre-git

Notice how, in the bad machine (myproject--81, the first listing above), system-2-link is correct but system-1-link points to a system built for the wrong hostname (node-2).
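
The inspection above could be scripted roughly as follows. This is only a sketch: wrong_generations is a hypothetical helper, and it assumes store paths follow the <hash>-nixos-system-<hostname>-<version> pattern visible in the listings. The demo recreates the symlinks of myproject--81 in a temporary directory instead of touching /nix/var/nix/profiles.

```python
import os
import re
import tempfile

# Generation links are named system-N-link in the listings above.
LINK_RE = re.compile(r"^system-(\d+)-link$")


def wrong_generations(profile_dir, hostname):
    """Return [(link name, target)] for generations not built for hostname.

    Uses a plain substring match on the store path, so a hostname that is a
    prefix of another followed by "-" could slip through; good enough for a
    quick check.
    """
    marker = "-nixos-system-%s-" % hostname
    bad = []
    for name in sorted(os.listdir(profile_dir)):
        if not LINK_RE.match(name):
            continue  # skips "system" and "per-user"
        target = os.readlink(os.path.join(profile_dir, name))
        if marker not in target:
            bad.append((name, target))
    return bad


# Demo: rebuild the layout of myproject--81 with dangling symlinks.
with tempfile.TemporaryDirectory() as d:
    os.symlink(
        "/nix/store/10sqnsw5w17jv2zj9xs63iy5nxxn0hnj-nixos-system-myproject-node-2-17.09pre-git",
        os.path.join(d, "system-1-link"),
    )
    os.symlink(
        "/nix/store/jyrldwdzckwsmpn6x1r4nwrrhii8dh8a-nixos-system-myproject-node-3-17.09pre-git",
        os.path.join(d, "system-2-link"),
    )
    os.symlink("system-2-link", os.path.join(d, "system"))
    bad = wrong_generations(d, "myproject-node-3")

print(bad)
```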

nh2 commented 6 years ago

Using printf debugging I could determine that in this line https://github.com/NixOS/nixops/blob/b267e2ba592d97d55296e8183689652dc4637416/nixops/backends/container.py#L149

the path is already wrong, and duplicated across machines, when the issue occurs.

The path is returned incorrectly via https://github.com/NixOS/nixops/blob/b267e2ba592d97d55296e8183689652dc4637416/nixops/backends/container.py#L139-L142

even when expr_file has correct contents, such as:

[root@machine:~]# cat /run/user/2000/nixops-tmpmjSJjW/myproject-node-2-initial.nix
{ imports = [ <nixops/container-base.nix> ];   boot.isContainer = true;   networking.hostName = "myproject-node-2";   users.extraUsers.root.openssh.authorizedKeys.keys = [ "ssh-ed25519 ... NixOps auto-generated key" ]; }

[root@machine:~]# cat /run/user/2000/nixops-tmpmjSJjW/myproject-node-3-initial.nix
{ imports = [ <nixops/container-base.nix> ];   boot.isContainer = true;   networking.hostName = "myproject-node-3";   users.extraUsers.root.openssh.authorizedKeys.keys = [ "ssh-ed25519 ... NixOps auto-generated key" ]; }

nh2 commented 6 years ago

Looks like something is wrong with nix-build. I started two nix-build invocations like the one linked above in parallel, and did that twice:

root@machine % NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-2-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix' &; NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-3-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix'
[1] 30338
/nix/store/7w19g3wyfhnlb6zgnibwq0vix2258fj8-nixos-system-myproject-node-2-17.09pre-git
[1]  + 30338 done       NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I
/nix/store/bnrlaid83sxzdn1sv365syrprs8fpw20-nixos-system-myproject-node-3-17.09pre-git

root@machine % NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-2-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix' &; NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I' 'nixos-config=/root/myproject-node-3-initial.nix' '--option' 'ssh-substituter-hosts' 'nix-store-user@machine.example.com' '-I' 'nixops=/nix/store/z6adnmb4l7l49nybqw3z3slx7g1zsaqa-nixops/nixops/../nix'
[1] 30369
/nix/store/bnrlaid83sxzdn1sv365syrprs8fpw20-nixos-system-myproject-node-3-17.09pre-git
/nix/store/bnrlaid83sxzdn1sv365syrprs8fpw20-nixos-system-myproject-node-3-17.09pre-git
[1]  + 30369 done       NIX_PATH=$PWD/../../nix-channel nix-build '<nixpkgs/nixos>' '-A' 'system' '-I

The first time they print the expected outputs; the second time both print ...myproject-node-3-17.09pre-git.
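
Since the problem only shows up on some runs, a small harness that repeats the parallel invocation and stops at the first duplicated output may help. The sketch below is an assumption about how one might do that, not something from nixops: run_parallel and first_collision are hypothetical names, and the demo uses harmless stand-in commands. To reproduce the issue, each stand-in would be replaced by the argv list of one of the nix-build calls shown above.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_parallel(commands):
    """Run each argv list concurrently; return the last stdout line of each."""
    def last_line(argv):
        out = subprocess.run(
            argv, check=True, capture_output=True, text=True
        ).stdout
        lines = out.strip().splitlines()
        return lines[-1] if lines else ""

    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        # map() keeps the results in the same order as the commands.
        return list(pool.map(last_line, commands))


def first_collision(commands, rounds):
    """Repeat the parallel run; return (round, outputs) at the first duplicate."""
    for i in range(1, rounds + 1):
        outputs = run_parallel(commands)
        if len(set(outputs)) != len(outputs):
            return i, outputs
    return None


# Stand-in commands that print distinct lines, in place of the nix-build calls.
demo = [[sys.executable, "-c", "print('%s')" % n] for n in ("node-2", "node-3")]
print(run_parallel(demo))
print(first_collision(demo, rounds=2))
```

Distinct configurations should always yield distinct store paths, so any collision reported by first_collision would point at the build step rather than at the container setup.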

matt-snider commented 4 years ago

I ran into an issue with hostname resolution. It seems like it has to do with the snippet linked above:

https://github.com/NixOS/nixops/blob/b267e2ba592d97d55296e8183689652dc4637416/nixops/backends/container.py#L149

My network definition looks something like this:

{ 
  network.description = "Example network";

  reverseproxy  = { ... }:
  let
    myServiceHost = "my-service";
  in 
  {
     deployment.targetEnv = "virtualbox";
     # nginx service definition
     # Uses myServiceHost to route traffic
     # ..
  };

  my-service = { resources, ... }:
  { 
     deployment.targetEnv = "container";
     deployment.container.host = resources.machines.reverseproxy;
     # ...
  };
}

I was expecting the /etc/hosts rules to be set up using the machine name from the network definition, but the name gets truncated to 7 characters. For example, from inside reverseproxy I can resolve my-serv but not my-service.

I see that there is a restriction in nixos-container.pl:

    # Due to interface name length restrictions, container names must
    # be restricted too.
    die "$0: container name ‘$containerName’ is too long\n" if length $containerName > 11;

So I see why it is done, but is it correct for the hostname rules to use the truncated name? Or am I just doing this wrong, since the hostname is hardcoded rather than referenced via resources or similar? (I couldn't find any documentation for this.)
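
To make the two limits concrete, here is a small illustration. It only mirrors what is described in this comment: the 11-character check is the one quoted from nixos-container.pl, and the 7-character cut is what I observed in /etc/hosts, not something I have traced in the nixops source. Both function names are hypothetical.

```python
MAX_CONTAINER_NAME = 11    # limit quoted from nixos-container.pl above
OBSERVED_HOSTS_PREFIX = 7  # truncation seen in /etc/hosts; an observation only


def check_container_name(name):
    """Mirror the length check from nixos-container.pl."""
    if len(name) > MAX_CONTAINER_NAME:
        raise ValueError("container name %r is too long" % name)
    return name


def observed_hosts_name(machine_name):
    """Name that ended up resolvable in my setup (first 7 characters)."""
    return machine_name[:OBSERVED_HOSTS_PREFIX]


print(observed_hosts_name("my-service"))  # prints "my-serv"
```

Note that my-service is only 10 characters long, so it would pass the 11-character check; the shorter cut seems to come from somewhere else.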