NixOS / nixpkgs


Networkd cannot match interface types #49534

Closed · Mic92 closed this issue 5 years ago

Mic92 commented 6 years ago

Issue description

We still have custom rules (https://github.com/NixOS/nixpkgs/blob/master/nixos/modules/services/hardware/udev.nix#L120) that were copied from systemd at some point. However, our rules are outdated and break the network units from systemd-networkd that detect container interfaces (the new rules provide more information about the interface type). The fix would be to use the new rules when networkd is enabled and keep the old rules otherwise; the new rules depend on networkd to do the actual rename. An alternative would be to always rename interfaces with networkd instead. There is also a test, nixos/tests/predictable-interface-names.nix, that ensures we make all users happy.

cc @arianvp @fpletz @flokli

arianvp commented 6 years ago

FWIW, the updated udev rule files are shipped within the systemd package:

${udev}/lib/udev/rules.d/80-net-setup-link.rules

should do the trick
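
In case it is not already picked up, a minimal sketch of one way to ship just that rules file via the NixOS udev module might look like this (the runCommand wrapper and its name are purely illustrative, not the actual nixpkgs change):

services.udev.packages = [
  # Hypothetical wrapper package exposing only systemd's link-setup rule;
  # services.udev.packages picks up anything under lib/udev/rules.d.
  (pkgs.runCommand "net-setup-link-rules" { } ''
    mkdir -p $out/lib/udev/rules.d
    cp ${pkgs.systemd}/lib/udev/rules.d/80-net-setup-link.rules $out/lib/udev/rules.d/
  '')
];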

fpletz commented 6 years ago

It looks like that udev rules file does not rename the network interfaces anymore, and neither does any other rules file shipped with systemd.

It looks like this functionality was moved into networkd through https://github.com/NixOS/systemd/blob/nixos-v239/network/99-default.link. Not sure if we want to enable networkd by default just for predictable interface names. I've just tested it and it would work fine with our scripted networking though.

Mic92 commented 6 years ago

There are two options:

  1. Always do the renaming with networkd, or
  2. Only replace the udev rules when networkd is enabled.

Networkd used to terminate itself when it had finished the static network configuration and nothing was left to do. I don't know if this is still the case; I remember they disabled something like that, but I cannot tell whether it was only temporary.

andir commented 6 years ago

I haven't seen networkd exit in a long time. It always waits for new interfaces and configures them according to the configuration.

arianvp commented 6 years ago

(Related: https://github.com/NixOS/nixpkgs/commit/788c5195f36fe101ecbf016137e017655063bc6b, by the way)

@Mic92 I don't see how the new shipped udev rules would aid in container detection. Afaik that has nothing to do with udev but instead with the ConditionVirtualization stuff in systemd.

man systemd.network
       Virtualization=
           Checks whether the system is executed in a virtualized environment and optionally test whether it is a specific implementation. See
           "ConditionVirtualization=" in systemd.unit(5) for details.
man systemd.unit

           ConditionVirtualization= may be used to check whether the system is executed in a virtualized environment and optionally test whether it
           is a specific implementation. Takes either boolean value to check if being executed in any virtualized environment, or one of vm and
           container to test against a generic type of virtualization solution, or one of qemu, kvm, zvm, vmware, microsoft, oracle, xen, bochs,
           uml, bhyve, qnx, openvz, lxc, lxc-libvirt, systemd-nspawn, docker, rkt to test against a specific implementation, or private-users to
           check whether we are running in a user namespace. See systemd-detect-virt(1) for a full list of known virtualization technologies and
           their identifiers. If multiple virtualization technologies are nested, only the innermost is considered. The test may be negated by
           prepending an exclamation mark.
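
For illustration, a network unit restricted to containers via that match could be expressed through the NixOS networkd options roughly like this (a hypothetical unit; the option names are assumed to be accepted by the module's checks):

systemd.network.networks."80-container-example" = {
  # Hypothetical example: only match host0 when running inside a container,
  # using the Virtualization= match quoted above.
  matchConfig = {
    Name = "host0";
    Virtualization = "container";
  };
  networkConfig.DHCP = "yes";
};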

What exactly is currently broken in matching on containers in networkd and what makes you think it has to do with udev?

Mic92 commented 6 years ago

No, it is not about detecting containers, but about container network interfaces:

# lib/systemd/network/80-container-ve.network
[Match]
Name=ve-*
Driver=veth # <-- This driver is not detected with our udev rules

[Network]
# Default to using a /28 prefix, giving up to 13 addresses per container.
Address=0.0.0.0/28
LinkLocalAddressing=yes
DHCPServer=yes
IPMasquerade=yes
LLDP=yes
EmitLLDP=customer-bridge

arianvp commented 5 years ago

Given that networkd doesn't touch any interfaces it doesn't explicitly manage, I think it's harmless to enable it by default, @fpletz.
By "enable" I mean running networkd regardless of whether networking.useNetworkd = true is set.
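
For reference, a minimal sketch of what that would mean in a configuration, assuming systemd.network.enable can be toggled independently of networking.useNetworkd:

# Run systemd-networkd itself without switching scripted networking over to it.
systemd.network.enable = true;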

flokli commented 5 years ago

Having networkd manage interfaces created by nspawn and friends would be nice. When networkd is enabled, it honors the following files:

${pkgs.systemd}/lib/systemd/network/80-container-host0.network
${pkgs.systemd}/lib/systemd/network/80-container-ve.network
${pkgs.systemd}/lib/systemd/network/80-container-vz.network
${pkgs.systemd}/lib/systemd/network/99-default.link

On top of that, there's a /etc/systemd/network/99-main.network created by nixos/modules/tasks/network-interfaces-systemd.nix, and a /etc/systemd/network/40-vboxnet0.network created by nixos/modules/virtualisation/virtualbox-host.nix

I'm not sure if simply enabling networkd breaks some scenarios.

Things like nixos/modules/virtualisation/containers.nix do some shell-based network interface setup. We might need to change some of the logic in there to make use of the native networkd-provided networking - or provide some more explicit configuration in the module if needed.
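
As a sketch of what such explicit configuration could look like, here is the upstream 80-container-ve.network restated through the NixOS networkd options (hypothetical unit name, not the actual module change; option names assumed to pass the module's checks):

systemd.network.networks."80-container-ve" = {
  # Match the host-side veth halves created for containers, like upstream systemd.
  matchConfig = {
    Name = "ve-*";
    Driver = "veth";
  };
  # Mirrors the defaults from systemd's shipped 80-container-ve.network.
  networkConfig = {
    Address = "0.0.0.0/28";
    LinkLocalAddressing = "yes";
    DHCPServer = "yes";
    IPMasquerade = "yes";
    LLDP = "yes";
    EmitLLDP = "customer-bridge";
  };
};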

On top of that, when switching my configuration, systemd-networkd-wait-online.service waited a long time for some interfaces to become online - in my case vboxnet0 - until it timed out (the link has no carrier). virbr0, virbr0-nic and docker0 are also in state configuring.

We might need to add some exclusion rules for things like that too - setting RequiredForOnline=no in a specific .network file might do the trick.
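
For example, a small unit along these lines (hypothetical name, expressed via the NixOS networkd options and assuming the module exposes linkConfig for the [Link] section of .network units) should keep wait-online from blocking on vboxnet0:

systemd.network.networks."40-vboxnet0-not-required" = {
  matchConfig.Name = "vboxnet0";
  # Tell systemd-networkd-wait-online not to wait for this interface.
  linkConfig.RequiredForOnline = "no";
};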

arianvp commented 5 years ago

Note that

nixos/modules/tasks/network-interfaces-systemd.nix

only generates its configuration when useNetworkd = true;

and apparently the 99-main.network that it generates wildcard-matches all interfaces so as not to break the semantics of networking.useDHCP, which will break the container stuff (also see https://github.com/NixOS/nixpkgs/issues/18962)

So managing container networking with networkd will work when networking.useNetworkd = false; but will probably break when networking.useNetworkd = true; - just to make things more complicated :)

arianvp commented 5 years ago

@andir seems to have recently removed the renaming rule in question (https://github.com/NixOS/nixpkgs/commit/1f03f6fc43a6f71b8204adf6cd02fb3685261add#diff-c1c886b16586c62e53e0d38c07f9bb6d) and lets the kernel rename the network interface instead.

This means we can just ship ${udev}/lib/udev/rules.d/80-net-setup-link.rules and stuff should work. I'll go create a PR.

Also shipping ${pkgs.systemd}/lib/systemd/network/99-default.link will not hurt as the NamePolicy will check whether the kernel did the renaming already:

NamePolicy=keep kernel database onboard slot path
MACAddressPolicy=persistent

Conclusion: network interface renaming now works both with and without networkd enabled.

We can freely include the ${udev}/lib/udev/rules.d/80-net-setup-link.rules rule to fix @Mic92's issue.

And we can include these network rules in our systemd module to make systemd-nspawn networking work as expected (and later use that as a base to get rid of scripted networking inside nixos-container):

${pkgs.systemd}/lib/systemd/network/80-container-host0.network
${pkgs.systemd}/lib/systemd/network/80-container-ve.network
${pkgs.systemd}/lib/systemd/network/80-container-vz.network
${pkgs.systemd}/lib/systemd/network/99-default.link

and be done with it and everything should work as far as I can see.
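
A rough sketch of what including those units could look like, assuming nothing else already claims /etc/systemd/network (networkd reads that directory in addition to the systemd package's own lib/systemd/network):

# Hypothetical illustration only; on a real system this may conflict with
# units generated by the networkd module.
environment.etc."systemd/network/80-container-host0.network".source =
  "${pkgs.systemd}/lib/systemd/network/80-container-host0.network";
environment.etc."systemd/network/80-container-ve.network".source =
  "${pkgs.systemd}/lib/systemd/network/80-container-ve.network";
environment.etc."systemd/network/80-container-vz.network".source =
  "${pkgs.systemd}/lib/systemd/network/80-container-vz.network";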

andir commented 5 years ago

@arianvp did you get around to creating that PR yet?

johnalotoski commented 5 years ago

@andir @arianvp, note that commit https://github.com/NixOS/nixpkgs/commit/1f03f6fc43a6f71b8204adf6cd02fb3685261add introduces ARP networking problems with bonded NICs, at least as far as I've tested on packet.net. For instance, spinning up a c2.medium.x86 (AMD) or c1.small.x86 (Intel) server on master nixpkgs with packet.net and the default bonded bond0 NIC using 802.3ad LACP (set up automatically by the provisioning script) results in no network connectivity for 15-30 minutes. During this time, tcpdump (observed via sos-console) shows ARP who-has requests for the gateway, and the gateway responds with is-at MAC addresses, but the ARP table continues to show incomplete for the gateway. Rebooting, or stopping and re-raising the bond interface, causes the loss of connectivity again for an extended period.

cc: @disassembler @cleverca22

flokli commented 5 years ago

@johnalotoski can you provide the networkctl status output for both the bond interface and the underlying NIC interfaces?

johnalotoski commented 5 years ago

Hi @flokli, networkd is not in use, but here is the output. A bisect led to this particular commit, and it reliably reproduces the ARP problem. I can provide more info if that would be helpful (cat /proc/net/bonding/bond0, ethtool, etc.). IPs/MACs below are masked for privacy.

[root@c2ipxe:~]# networkctl status bond0
WARNING: systemd-networkd is not running, output will be incomplete.

● 4: bond0
       Link File: /nix/store/gaz60mpylxry2qskvw045h803lv5lil6-systemd-242/lib/systemd/network/99-default.link
    Network File: n/a
            Type: bond
           State: n/a (unmanaged)
          Driver: bonding
      HW Address: xx:yy:zz:c0:ef:35
         Address: PrivIPv4
                  PubIPv4
                  PrivIPv6
                  PubIPv6
         Gateway: GatewayIPv4
                  GatewayIPv6

[root@c2ipxe:~]# networkctl status enp1s0f0
WARNING: systemd-networkd is not running, output will be incomplete.

● 2: enp1s0f0
       Link File: /nix/store/gaz60mpylxry2qskvw045h803lv5lil6-systemd-242/lib/systemd/network/99-default.link
    Network File: n/a
            Type: ether
           State: n/a (unmanaged)
            Path: pci-0000:01:00.0
          Driver: mlx5_core
          Vendor: Mellanox Technologies
           Model: MT27710 Family [ConnectX-4 Lx] (Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT)
      HW Address: xx:yy:zz:c0:ef:35

[root@c2ipxe:~]# networkctl status enp1s0f1
WARNING: systemd-networkd is not running, output will be incomplete.

● 3: enp1s0f1
       Link File: /nix/store/gaz60mpylxry2qskvw045h803lv5lil6-systemd-242/lib/systemd/network/99-default.link
    Network File: n/a
            Type: ether
           State: n/a (unmanaged)
            Path: pci-0000:01:00.1
          Driver: mlx5_core
          Vendor: Mellanox Technologies
           Model: MT27710 Family [ConnectX-4 Lx] (Stand-up ConnectX-4 Lx EN, 25GbE dual-port SFP28, PCIe3.0 x8, MCX4121A-ACAT)
      HW Address: xx:yy:zz:c0:ef:35

flokli commented 5 years ago

@johnalotoski while networkd is not in use, the .link files are still honored by systemd. Can you point me to this system's configuration, so I can see how the bonds are being set up?

Also cc @grahamc

johnalotoski commented 5 years ago

Hi @flokli, the instances are provisioned by iPXE images from the packet-nixos repo used by packet.net. During provisioning, a nix configuration snippet for networking and bonding is generated which looks like the following, where the parameters in angle brackets are populated from the instance metadata:

{ 
  networking.hostName = "<hostname>";
  networking.dhcpcd.enable = false;
  networking.defaultGateway = {
    address =  "<IPv4>";
    interface = "bond0";
  };
  networking.defaultGateway6 = {
    address = "<IPv6>";
    interface = "bond0";
  };
  networking.nameservers = [
    "<packetDnsIpv4>"
    "<packetDnsIpv4>"
  ];

  networking.bonds.bond0 = {
    driverOptions = {
      mode = "802.3ad";
      xmit_hash_policy = "layer3+4";
      lacp_rate = "fast";
      downdelay = "200";
      miimon = "100";
      updelay = "200";
    };

    interfaces = [
      "enp1s0f0" "enp1s0f1"
    ];
  };

  networking.interfaces.bond0 = {
    useDHCP = false;

    ipv4 = {
      routes = [
        {
          address = "10.0.0.0";
          prefixLength = 8;
          via = "<IPv4>";
        }
      ];
      addresses = [
        {
          address = "<pubIPv4>";
          prefixLength = <pubCIDR>;
        }
        {
          address = "<privIPv4>";
          prefixLength = <privCIDR>;
        }
      ];
    };

    ipv6 = {
      addresses = [
        {
          address = "<IPv6>";
          prefixLength = <CIDR>;
        }
      ];
    };
  };
}

We are using a nixops packet plugin to pass this nix networking snippet (the same nix snippet as that generated by the packet provisioning script) with appropriate metadata populated for use in nixops deployments. See, for example, the c2.medium.x86 nix network configuration used by the packet nixops plugin.

andir commented 5 years ago

@johnalotoski Can you tell me which systemd services are running during those 15-30min? Maybe a process tree could also be helpful? (systemctl status > file might be able to provide both at the same time.)

There is one while …; do sleep 0.1; done snippet in the scripted networking that could be spinning during that period of time.

johnalotoski commented 5 years ago

Hi @andir, here is an attachment that contains a sequence of commands taken during the outage (arp-incomplete) and after normal networking resumed (arp-complete). There is also a diff between them. Each of the arp complete and incomplete files captured the following output:

systemctl list-units --all
systemctl status
ps -ejH
ps axjf
pstree -w -l 100

This was taken from a machine deployed at the commit in question. comparison.zip

arianvp commented 5 years ago

A few other diagnostics that might help:

systemd-analyze blame > blame.txt
systemd-analyze critical-chain > critical-chain.txt
systemd-analyze plot > plot.svg

Especially the last one will show you a detailed account of the system starting up and where it might be stuck. (Though I don't know if this will be of much help; just thought it might be useful.)

andir commented 5 years ago

> Hi @andir, here is an attachment that contains a sequence of commands taken during the outage (arp-incomplete) and after normal networking resumed (arp-complete). There is also a diff between them. Each of the arp complete and incomplete files captured the following output:
>
> systemctl list-units --all
> systemctl status
> ps -ejH
> ps axjf
> pstree -w -l 100
>
> This was taken from a machine deployed at the commit in question. comparison.zip

Thanks! Could you also verify if it happens on master and/or release-19.09? We had another systemd bump there that might have already fixed it, and I would like to avoid wasting time chasing old bugs.

johnalotoski commented 5 years ago

Hi @andir, yes, this happens on master and I believe on release-19.09 also. I don't blame you for not wanting to waste time; ditto. @arianvp, thanks for the tip! I've included the output of those commands in the diagnostic zip file below. This seemed like a good opportunity to use asciinema to convey the issue in a more tangible way. I recorded two console sessions of the problem, taken in parallel, which illustrate the issue: one from the nixops deployer side and one from the server, where some debugging is done both during and after the network outage. Diagnostic/debug files collected during those videos are attached in the diagnostic zip file below. The reverse patch of the commit in question is applied to the head of master nixpkgs and shown to resolve the issue. The asciinema videos have a maximum console idle time of 2 seconds to keep them short.

Asciinema video 1: Nixops bonding nic debugging deploy
Asciinema video 2: Packet server bonding nic debugging session
Files collected during the video: diagnostic.zip

For the files, the naming is:

arianvp commented 5 years ago

I just confirmed that this issue is fixed on 19.09:

Host network:

networkctl status vz-nixos
● 8: vz-nixos                                                                                                                
               Link File: /nix/store/gg0ppshg45gksxsq2jbjbhvm3mk70vq9-systemd-243/lib/systemd/network/99-default.link        
            Network File: /nix/store/gg0ppshg45gksxsq2jbjbhvm3mk70vq9-systemd-243/lib/systemd/network/80-container-vz.network
                    Type: bridge                                                                                             
                   State: routable (configured)                                                    
                  Driver: bridge                                                                                             
              HW Address: fa:4b:0c:87:73:6b                                                                                  
                     MTU: 1500 (min: 68, max: 65535)                                                                         
           Forward Delay: 15s                                                                                                
              Hello Time: 2s                                                                                                 
                 Max Age: 20s                                                                                                
             Ageing Time: 5min                                                                                               
                Priority: 32768                                                                                              
                     STP: no                                                                                                 
  Multicast IGMP Version: 2                                                                                                  
    Queue Length (Tx/Rx): 1/1                                                                                                
                 Address: 192.168.210.1                                                                                      
                          169.254.244.253                                                                                    
                          fe80::f84b:cff:fe87:736b 

Two nspawn containers (created with nixos-install) both get an IP:

[root@arianvp:~]# machinectl list
MACHINE CLASS     SERVICE        OS    VERSION           ADDRESSES       
test1   container systemd-nspawn nixos 20.03.git.0092f2e 192.168.210.32… 
test2   container systemd-nspawn nixos 20.03.git.0092f2e 192.168.210.184…

The container side gets configured correctly too:

[root@test1:~]# networkctl status -a
● 1: lo                                    
             Link File: n/a                
          Network File: n/a                
                  Type: loopback           
                 State: carrier (unmanaged)
                   MTU: 65536              
  Queue Length (Tx/Rx): 1/1                
               Address: 127.0.0.1          
                        ::1                

● 2: host0                                                                                                                    
             Link File: n/a                                                                                                   
          Network File: /nix/store/gg0ppshg45gksxsq2jbjbhvm3mk70vq9-systemd-243/lib/systemd/network/80-container-host0.network
                  Type: ether                                                                                                 
                 State: routable (configured)                                                       
            HW Address: 42:ef:e5:e2:77:59                                                                                     
                   MTU: 1500 (min: 68, max: 65535)                                                                            
  Queue Length (Tx/Rx): 1/1                                                                                                   
      Auto negotiation: no                                                                                                    
                 Speed: 10Gbps                                                                                                
                Duplex: full                                                                                                  
                  Port: tp                                                                                                    
               Address: 192.168.210.32                                                                                        
                        169.254.22.221                                                                                        
                        fe80::40ef:e5ff:fee2:7759                                                                             
               Gateway: 192.168.210.1                                                                                         
             Time Zone: Europe/Amsterdam                                                                                      
          Connected To: test2 on port host0                                                                                   
                        arianvp.me on port vz-nixos    

Play around with this yourself with my systemd-nspawn module (which I eventually want to use as a base for nixos-container in 20.03):

https://github.com/arianvp/nixos-stuff/blob/master/modules/containers-v2.nix
https://github.com/arianvp/nixos-stuff/blob/master/configs/arianvp.me/default.nix#L28-L35

Host network config: https://github.com/arianvp/nixos-stuff/blob/master/configs/arianvp.me/network.nix
Container network config: https://github.com/arianvp/nixos-stuff/blob/master/modules/containers-v2.nix#L33-L36

fpletz commented 5 years ago

@arianvp For completeness I also tested this with plain ve- interfaces on the host with networkd:

● 29: ve-foo
             Link File: /nix/store/ag67dibj50z39rw1sr39zjd0dx6zcf2d-systemd-243/lib/systemd/network/99-default.link
          Network File: /nix/store/ag67dibj50z39rw1sr39zjd0dx6zcf2d-systemd-243/lib/systemd/network/80-container-ve.network
                  Type: ether
                 State: routable (configured)
                Driver: veth
            HW Address: 02:2d:19:52:30:a4
                   MTU: 1500 (min: 68, max: 65535)
  Queue Length (Tx/Rx): 1/1
      Auto negotiation: no
                 Speed: 10Gbps
                Duplex: full
                  Port: tp
               Address: 192.168.7.177
                        169.254.95.164
                        fe80::2d:19ff:fe52:30a4
          Connected To: foo on port host0

Yet this does not completely solve this whole mess, in particular not @johnalotoski's problem, which is clearly related.

@johnalotoski After reviewing your extensive debug logs (thanks a lot!), I'm hoping that we just have to ship the updated 80-net-setup-link.rules to let udev do its magic again without networkd enabled (which is exactly your case, not what @arianvp did above). I'm not yet sure what that magic might be and what exactly changed in udev.

I'll try to reproduce that in a NixOS test and will open a PR with the change for you to test. This is clearly something we have to fix for 19.09.

@arianvp This is BTW also somewhat related to our predictable-ifnames-in-initrd fix, where we had to resort to including all udev rules in the initrd (and 80-net-setup-link.rules in particular). That commit is not on master yet: 7da962d31b9113f16161510909a66a397dad91fc.

arianvp commented 5 years ago

The udev rule is included by default even if networkd is disabled. It is the same udev rule that enables our interface renaming, which is working! If you enable debug logging on udev, you'll see that it is indeed already loaded during boot. There's no need to include it separately.

johnalotoski commented 5 years ago

Hi @fpletz, @arianvp, happy to test out any potential fixes against the packet.net infra, thanks much!

arianvp commented 5 years ago

@fpletz you can verify that it is indeed running by doing this:

[root@arianvp:~]# udevadm -d test-builtin net_setup_link /sys/class/net/ens3 
Trying to open "/etc/udev/hwdb.bin"...
=== trie on-disk ===
tool version:          243
file size:         8269771 bytes
header size             80 bytes
strings            2110315 bytes
nodes              6159376 bytes
Load module index
Found container virtualization none.
timestamp of '/etc/systemd/network' changed
Parsed configuration file /nix/store/gg0ppshg45gksxsq2jbjbhvm3mk70vq9-systemd-243/lib/systemd/network/99-default.link
Created link configuration context.
ID_NET_DRIVER=virtio_net
ens3: Config file /nix/store/gg0ppshg45gksxsq2jbjbhvm3mk70vq9-systemd-243/lib/systemd/network/99-default.link is applied
ethtool: autonegotiation is unset or enabled, the speed and duplex are not writable.
ens3: Device has name_assign_type=4
Using default interface naming scheme 'v243'.
ens3: Policy *keep*: keeping existing userspace name
ens3: Device has addr_assign_type=0
ens3: MAC on the device already matches policy *persistent*
ID_NET_LINK_FILE=/nix/store/gg0ppshg45gksxsq2jbjbhvm3mk70vq9-systemd-243/lib/systemd/network/99-default.link
Unload module index
Unloaded link configuration context.

Can we close this issue and move the packet-network stuff to the new issue I opened for that specific problem?

flokli commented 5 years ago

Can we open a new issue for the packet-specific problem? Already moved to https://github.com/NixOS/nixpkgs/issues/69360. I think this is mostly a documentation issue (plus some follow-up fixes for packet), and the documentation issue was fixed in https://github.com/NixOS/nixpkgs/pull/71456.