NixOS / nixpkgs

Nix Packages collection & NixOS

Wireguard doesn't bring up peers #63869

Open sjau opened 5 years ago

sjau commented 5 years ago

Issue description

When using WireGuard on a server with several peers configured, the peers do not appear to be brought up properly at boot, and because the peer unit's type is set to "oneshot" it never retries. The problem seems to be that DNS is not yet available when the peer unit runs.

Steps to reproduce

  1. Set up a WG server somewhere

  2. Set up WireGuard on NixOS with the server as a peer, using a domain name instead of an IP address for the endpoint (I never tried with an IP alone, though), like

        wg_jl = {
            ips = [ "10.8.0.10/32" ];
            privateKey = "abc";
            peers = [ {
                allowedIPs = [ "10.8.0.0/24" ];
                endpoint = "vpn.domain.tld:51820";
                publicKey = "xyz";
                persistentKeepalive = 25;
            } ];
        };
  3. Rebuild

  4. Reboot

  5. Try to ping wg server at 10.8.0.1

--> 100% packet loss
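
Since the endpoint above is a host name, one way to check whether DNS is involved is to give the endpoint as an IP literal, which wg can parse without a resolver. A minimal sketch, assuming the same interface as above; 192.0.2.1 is a documentation placeholder address, not the real server:

```nix
{
  networking.wireguard.interfaces.wg_jl = {
    ips = [ "10.8.0.10/32" ];
    privateKey = "abc";
    peers = [ {
      allowedIPs = [ "10.8.0.0/24" ];
      # Placeholder IP (assumption): replace with the server's real address.
      endpoint = "192.0.2.1:51820";
      publicKey = "xyz";
      persistentKeepalive = 25;
    } ];
  };
}
```

If the peer comes up after a reboot with this variant, name resolution at boot is the likely cause.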

For some reason systemctl does show it as started:

● wireguard-wg_jl.service - WireGuard Tunnel - wg_jl
   Loaded: loaded (/nix/store/1jhgf70mi82wv9r8xzi7dgsmhp4kbjrd-unit-wireguard-wg_jl.service/wireguard-wg_jl.service; enabled; vendor preset: enabled)
   Active: active (exited) since Thu 2019-06-27 21:06:58 CEST; 1min 5s ago
  Process: 3089 ExecStart=/nix/store/d3ngi8wjx92d6wsvmw0ln6sg29mxn4qn-unit-script-wireguard-wg_jl-start (code=exited, status=0/SUCCESS)
 Main PID: 3089 (code=exited, status=0/SUCCESS)

Jun 27 21:06:57 servi systemd[1]: Starting WireGuard Tunnel - wg_jl...
Jun 27 21:06:58 servi systemd[1]: Started WireGuard Tunnel - wg_jl.

The start script of that unit has this content:

#! /nix/store/w9ngash2dw3pvl98ysd79qy2rkkmc8my-bash-4.4-p23/bin/bash -e
modprobe wireguard || true

ip link add dev wg_jl type wireguard

ip address add 10.8.0.10/32 dev wg_jl

wg set wg_jl private-key /nix/store/....-wg-key 

ip link set up dev wg_jl

and ip addr show also lists the ip:

6: wg_jl: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none 
    inet 10.8.0.10/32 scope global wg_jl
       valid_lft forever preferred_lft forever

Listing the unit files with systemctl list-unit-files | grep wireguard, this pops up:

wireguard-wg_jl-peer-…e66I18brnJTKye\x2bYDXS8iQF-zDc\x3d.service enabled        
wireguard-wg_jl.service                                          enabled 

(How to make list-unit-files provide the full name of the file?)

Looking at the status of the peer unit file systemctl status wireguard-wg_jl-peer-{public key}.service, this is returned:

systemctl status wireguard-wg_jl-peer-{public key}.service 
● wireguard-wg_jl-peer-{public key}.service - WireGuard Peer - wg_jl - {public key}
   Loaded: loaded (/nix/store/f5kp5jammckjcpgjd7r8fa8gh0y5kzrj-unit-wireguard-wg_jl-peer-{public key}.service/wireguard-wg_jl-peer-{public key}.service; enabled>
   Active: failed (Result: exit-code) since Thu 2019-06-27 21:06:59 CEST; 15min ago
  Process: 3206 ExecStart=/nix/store/08ilzr9yicqvb5wz70c3przyl235vy9w-unit-script-wireguard-wg_jl-peer-{public key}-start (code=exited, status=1/FAILURE)
  Process: 3246 ExecStopPost=/nix/store/1j8smpfa9sgib8j4as6lpbsmpz4fcq9k-unit-script-wireguard-wg_jl-peer-{public key}-post-stop (code=exited, status=1/FAILURE)
 Main PID: 3206 (code=exited, status=1/FAILURE)

Jun 27 21:06:59 servi 08ilzr9yicqvb5wz70c3przyl235vy9w-unit-script-wireguard-wg_jl-peer-{public key}-start[3206]: Name or service not known: `vpn.jus-law.ch:51820'
Jun 27 21:06:59 servi 1j8smpfa9sgib8j4as6lpbsmpz4fcq9k-unit-script-wireguard-wg_jl-peer-{public key}-post-stop[3246]: Unable to modify interface: No such device
Jun 27 21:06:58 servi systemd[1]: Starting WireGuard Peer - wg_jl - {public key}...
Jun 27 21:06:59 servi systemd[1]: wireguard-wg_jl-peer-{public key}.service: Main process exited, code=exited, status=1/FAILURE
Jun 27 21:06:59 servi systemd[1]: wireguard-wg_jl-peer-{public key}.service: Control process exited, code=exited, status=1/FAILURE
Jun 27 21:06:59 servi systemd[1]: wireguard-wg_jl-peer-{public key}.service: Failed with result 'exit-code'.
Jun 27 21:06:59 servi systemd[1]: Failed to start WireGuard Peer - wg_jl - {public key}.

The unit file /nix/store/f5kp5jammckjcpgjd7r8fa8gh0y5kzrj-unit-wireguard-wg_jl-peer-{public key}.service/wireguard-wg_jl-peer-{public key}.service contains:

[Unit]
After=wireguard-wg_jl.service
Description=WireGuard Peer - wg_jl - {public key}
Requires=wireguard-wg_jl.service

[Service]
Environment="DEVICE=wg_jl"
Environment="LOCALE_ARCHIVE=/nix/store/fnr9mys8l5224sqc7vgisgb66s5429f4-glibc-locales-2.27/lib/locale/locale-archive"
Environment="PATH=/nix/store/0dlzvjdi3hg1wkqb6bwv1aihrxnkcdv8-iproute2-5.1.0/bin:/nix/store/9nss714lk1g40swq51115a4xnxpzndb5-wireguard-tools-0.0.20190601/bin:/nix/store/i4r7xx7sj1bjgvj9p6dv59f1mb329ivw-coreutils-8.31/bin:/nix/store/dbbv$
Environment="TZDIR=/nix/store/9lyxln5y04lj5vg7npzvid0f8ampva1s-tzdata-2019a/share/zoneinfo"
Environment="WG_ENDPOINT_RESOLUTION_RETRIES=infinity"

ExecStart=/nix/store/08ilzr9yicqvb5wz70c3przyl235vy9w-unit-script-wireguard-wg_jl-peer-{public key}-start
ExecStopPost=/nix/store/1j8smpfa9sgib8j4as6lpbsmpz4fcq9k-unit-script-wireguard-wg_jl-peer-{public key}-post-stop
RemainAfterExit=true
Type=oneshot

And the ExecStart /nix/store/08ilzr9yicqvb5wz70c3przyl235vy9w-unit-script-wireguard-wg_jl-peer-{public key}-start contains:

#! /nix/store/w9ngash2dw3pvl98ysd79qy2rkkmc8my-bash-4.4-p23/bin/bash -e
wg set wg_jl peer {public key} endpoint vpn.domain.tld:51820 persistent-keepalive 25 allowed-ips 10.8.0.0/24
ip route replace 10.5.0.0/24 dev wg_jl table main

So, once the system is up and running, I have to re-issue systemctl restart 'wireguard-wg_jl-peer-{public key}.service' and then it works.

For some reason those peer unit files aren't properly executed - likely because dns isn't available at that point.

Changing them from oneshot to simple with a retry on failure could improve the situation.

Technical details

Please run nix-shell -p nix-info --run "nix-info -m" and paste the results.

Shados commented 5 years ago

I am experiencing the same issue, since #62325.

shados@dreamlogic[~] λ sudo systemctl status wireguard-wg0-peer-<peerkey>\\x3d.service
[sudo] password for shados:
● wireguard-wg0-peer-<peerkey>\x3d.service - WireGuard Peer - wg0 - <peerkey>=
   Loaded: loaded (/nix/store/h57211bakrr87xyga7ks45a8jl40dp14-unit-wireguard-wg0-peer-<peerkey>-x3d.service/wireguard-wg0-peer-<peerkey>\x3d.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Wed 2019-07-24 23:45:57 AEST; 46s ago
  Process: 9126 ExecStart=/nix/store/m6ik3sfnlk998vss02l473zgn9fs6zv6-unit-script-wireguard-wg0-peer-<peerkey>--x3d-start (code=exited, status=1/FAILURE)
  Process: 9162 ExecStopPost=/nix/store/x5xw5wfj8rj2nkly70l18vhgyyjgn3bg-unit-script-wireguard-wg0-peer-<peerkey>--x3d-post-stop (code=exited, status=1/FAILURE)
 Main PID: 9126 (code=exited, status=1/FAILURE)

Jul 24 23:45:57 dreamlogic systemd[1]: Starting WireGuard Peer - wg0 - <peerkey>=...
Jul 24 23:45:57 dreamlogic m6ik3sfnlk998vss02l473zgn9fs6zv6-unit-script-wireguard-wg0-peer-<peerkey>--x3d-start[9126]: Name or service not known: `<fqdn:port>'
Jul 24 23:45:57 dreamlogic systemd[1]: wireguard-wg0-peer-<peerkey>\x3d.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 23:45:57 dreamlogic x5xw5wfj8rj2nkly70l18vhgyyjgn3bg-unit-script-wireguard-wg0-peer-<peerkey>--x3d-post-stop[9162]: Unable to modify interface: No such device
Jul 24 23:45:57 dreamlogic systemd[1]: wireguard-wg0-peer-<peerkey>\x3d.service: Control process exited, code=exited, status=1/FAILURE
Jul 24 23:45:57 dreamlogic systemd[1]: wireguard-wg0-peer-<peerkey>\x3d.service: Failed with result 'exit-code'.
Jul 24 23:45:57 dreamlogic systemd[1]: Failed to start WireGuard Peer - wg0 - <peerkey>=.

shados@dreamlogic[~] λ sudo systemctl --no-pager cat wireguard-wg0-peer-<peerkey>\\x3d.service
# /nix/store/h57211bakrr87xyga7ks45a8jl40dp14-unit-wireguard-wg0-peer-<peerkey>-x3d.service/wireguard-wg0-peer-<peerkey>\x3d.service
[Unit]
After=wireguard-wg0.service
Description=WireGuard Peer - wg0 - <peerkey>=
Requires=wireguard-wg0.service

[Service]
Environment="DEVICE=wg0"
Environment="LOCALE_ARCHIVE=/nix/store/p7yq8mx8k2mhismy0a10kacn6w7k4r9c-glibc-locales-2.27/lib/locale/locale-archive"
Environment="PATH=/nix/store/9gs2k6xcfddnpff0yraz0l8wz1ijammg-iproute2-5.1.0/bin:/nix/store/7i5yrx5v5dhmglac8hm32cik3jbajsb1-wireguard-tools-0.0.20190702/bin:/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31/bin:/nix/store/gjh3a8hqic3bqc2xzj8g2qxwz81wfjxx-findutils-4.6.0/bin:/nix/store/agcay3wmf74qinwshnjqy73w8rxf82hs-gnugrep-3.3/bin:/nix/store/vnyd3wh5i5kj66n9c5b8shzxjjrw22cn-gnused-4.7/bin:/nix/store/wzrglgf7i2zajmm5k53hypgnv3n0z3v6-systemd-242/bin:/nix/store/9gs2k6xcfddnpff0yraz0l8wz1ijammg-iproute2-5.1.0/sbin:/nix/store/7i5yrx5v5dhmglac8hm32cik3jbajsb1-wireguard-tools-0.0.20190702/sbin:/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31/sbin:/nix/store/gjh3a8hqic3bqc2xzj8g2qxwz81wfjxx-findutils-4.6.0/sbin:/nix/store/agcay3wmf74qinwshnjqy73w8rxf82hs-gnugrep-3.3/sbin:/nix/store/vnyd3wh5i5kj66n9c5b8shzxjjrw22cn-gnused-4.7/sbin:/nix/store/wzrglgf7i2zajmm5k53hypgnv3n0z3v6-systemd-242/sbin"
Environment="TZDIR=/nix/store/5fj4d8mp2qfdpa86sgfjmjyrm2mfssz7-tzdata-2019a/share/zoneinfo"
Environment="WG_ENDPOINT_RESOLUTION_RETRIES=infinity"

ExecStart=/nix/store/m6ik3sfnlk998vss02l473zgn9fs6zv6-unit-script-wireguard-wg0-peer-<peerkey>--x3d-start
ExecStopPost=/nix/store/x5xw5wfj8rj2nkly70l18vhgyyjgn3bg-unit-script-wireguard-wg0-peer-<peerkey>--x3d-post-stop
RemainAfterExit=true
Type=oneshot

Based upon this comment, I suspected this to be the culprit, but a rebuild with that changed to NOTFOUND=return did not resolve the issue.


@zx2c4: From reading #61971, I understand why you wished to improve the retry logic within wg instead of relying on external restart logic (especially given the variety of contexts wg is used in), but I am not certain that that approach is the best option for NixOS.

My primary concerns are:

  1. "permanent" resolution failures aren't necessarily permanent in the broader sense; whether or not to accept them as a dead-end failure condition is a system/context-specific administrative decision
  2. Relying on application-specific restart/retry logic instead of the general systemd service management logic makes the service and module implementation less comprehensible, because NixOS hackers are far less likely to already be familiar with the WG_ENDPOINT_RESOLUTION_RETRIES environment variable than they are systemd's restart logic

Instead, given wg exits with a non-zero error code for the permanent resolution failure codes, the NixOS wireguard peer services could be configured like:

Type=simple
Restart=on-failure

Can you foresee any issues with this?
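
The two directives above could be applied from a NixOS configuration as a per-unit override. This is only a sketch: the unit name below is a hypothetical placeholder (the real one contains the escaped public key, as listed by systemctl list-unit-files), and RestartSec is an assumed value that is not part of the proposal.

```nix
{ lib, ... }:
{
  # Hypothetical unit name: substitute the real escaped peer unit name.
  systemd.services."wireguard-wg0-peer-EXAMPLEKEY\\x3d".serviceConfig = {
    # The generated peer unit shown above has Type=oneshot, so the override is forced.
    Type = lib.mkForce "simple";
    # Let systemd rerun the peer script whenever it exits non-zero.
    Restart = "on-failure";
    # Assumed delay between attempts, in seconds.
    RestartSec = "5";
  };
}
```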

sjau commented 5 years ago

I think re-adding

Type=simple
Restart=on-failure

should be done as I can't see any problem with it.

dR3b commented 4 years ago

What's the status here? Wireguard still seems to be broken. I have the same bug and can't use it.

 - system: `"x86_64-linux"`
 - host os: `Linux 4.19.101, NixOS, 19.09.2008.ea553d8c67c (Loris)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.3.2`
 - channels(root): `"nixos-19.09.2008.ea553d8c67c, nixos-unstable-20.03pre208413.e1eedf29e5d"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

zx2c4 commented 4 years ago

> because NixOS hackers are far less likely to already be familiar with the WG_ENDPOINT_RESOLUTION_RETRIES

This is not a reason to script things improperly. Familiarize yourself and use the right tool for the task.

> "permanent" resolution failures aren't necessarily permanent in the broader sense;

If your resolver is configured correctly, you should be able to distinguish between "domain record doesn't exist on the internet" and "don't have a responding DNS resolver yet".

stale[bot] commented 4 years ago

Hello, I'm a bot and I thank you in the name of the community for opening this issue.

To help our human contributors focus on the most-relevant reports, I check up on old issues to see if they're still relevant. This issue has had no activity for 180 days, and so I marked it as stale, but you can rest assured it will never be closed by a non-human.

The community would appreciate your effort in checking if the issue is still valid. If it isn't, please close it.

If the issue persists, and you'd like to remove the stale label, you simply need to leave a comment. Your comment can be as simple as "still important to me". If you'd like it to get more attention, you can ask for help by searching for maintainers and people that previously touched related code and @ mention them in a comment. You can use Git blame or GitHub's web interface on the relevant files to find them.

Lastly, you can always ask for help at our Discourse Forum or at #nixos' IRC channel.

jian-lin commented 3 years ago

I still have this problem.

kravemir commented 2 years ago

> "permanent" resolution failures aren't necessarily permanent in the broader sense;

> If your resolver is configured correctly, you should be able to distinguish between "domain record doesn't exist on the internet" and "don't have a responding DNS resolver yet".

On the same machine I've got Tailscale enabled with Magic DNS. However, in my case the issue was dhcpcd announcing "Started DHCP Client" before it had obtained any real lease or set any real default route. By default, it looks like IPv4LL is enabled:

$ journalctl -u wireguard-wg-* -u dhcpcd.service
...
-- Boot 3146884528d2413889e2dd737a596d5a --
Feb 24 06:48:16 cubie systemd[1]: Starting DHCP Client...
Feb 24 06:48:16 cubie dhcpcd[1061]: dev: loaded udev
Feb 24 06:48:16 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE - Key Generator...
Feb 24 06:48:16 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE - Key Generator.
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: waiting for carrier
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: carrier acquired
Feb 24 06:48:17 cubie dhcpcd[1061]: DUID 00:04:b3:0e:32:40:72:ba:11:e3:90:da:38:d5:47:de:e8:31
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: IAID 47:de:e8:31
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: carrier lost
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: deleting address fe80::3ad5:47ff:fede:e831
Feb 24 06:48:17 cubie dhcpcd[1061]: br-grafprom: new hardware address: 4a:eb:b6:12:a5:b1
Feb 24 06:48:17 cubie dhcpcd[1061]: br-grafprom: new hardware address: 9a:b2:37:06:ed:56
Feb 24 06:48:20 cubie dhcpcd[1061]: enp1s0: carrier acquired
Feb 24 06:48:20 cubie dhcpcd[1061]: enp1s0: IAID 47:de:e8:31
Feb 24 06:48:20 cubie dhcpcd[1061]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 06:48:21 cubie dhcpcd[1061]: enp1s0: soliciting an IPv6 router
Feb 24 06:48:21 cubie dhcpcd[1061]: enp1s0: soliciting a DHCP lease
Feb 24 06:48:26 cubie dhcpcd[1061]: enp1s0: probing for an IPv4LL address
Feb 24 06:48:31 cubie dhcpcd[1061]: enp1s0: using IPv4LL address 169.254.101.247
Feb 24 06:48:31 cubie dhcpcd[1061]: enp1s0: adding route to 169.254.0.0/16
Feb 24 06:48:31 cubie dhcpcd[1061]: enp1s0: adding default route
Feb 24 06:48:31 cubie dhcpcd[1061]: forked to background, child pid 2607
Feb 24 06:48:31 cubie systemd[1]: Started DHCP Client.
Feb 24 06:48:31 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE...
Feb 24 06:48:31 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE.
Feb 24 06:48:31 cubie systemd[1]: Started WireGuard Peer - wg-NET-SERVICE - TfEUzOWqz/pGWz7b87jPCNigVYUktXo042w06dIgp1M=.
Feb 24 06:48:31 cubie wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M--x3d-refresh-start[2628]: Name or service not known: `NET-SERVICE.sk:55251'
Feb 24 06:48:31 cubie systemd[1]: wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M\x3d-refresh.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 06:48:31 cubie wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M--x3d-refresh-post-stop[2631]: RTNETLINK answers: No such process
Feb 24 06:48:31 cubie systemd[1]: wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M\x3d-refresh.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Feb 24 06:48:31 cubie systemd[1]: wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M\x3d-refresh.service: Failed with result 'exit-code'.
Feb 24 06:48:33 cubie dhcpcd[2607]: enp1s0: offered 192.168.88.154 from 192.168.88.1
Feb 24 06:48:33 cubie dhcpcd[2607]: enp1s0: probing address 192.168.88.154/24
Feb 24 06:48:33 cubie dhcpcd[2607]: enp1s0: no IPv6 Routers available
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: leased 192.168.88.154 for 600 seconds
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: adding route to 192.168.88.0/24
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: changing default route via 192.168.88.1
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: deleting route to 169.254.0.0/16
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: pid 2607 deleted default route via 192.168.88.1
Feb 24 06:53:38 cubie dhcpcd[2607]: enp1s0: adding default route via 192.168.88.1

(In output above the real name was replaced with NET-SERVICE for privacy reasons)

This specific issue was FIXED BY disallowing IPv4LL for dhcpcd:

  networking.dhcpcd = {
    wait = "ipv4";
    extraConfig = "noipv4ll";
  };

Now, dhcpcd is considered started only after it has obtained a real IPv4 lease and set the default route:

-- Boot cad41d179fcd49169c774be69b119c5a --
Feb 24 07:11:23 cubie systemd[1]: Starting DHCP Client...
Feb 24 07:11:23 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE - Key Generator...
Feb 24 07:11:23 cubie dhcpcd[1060]: dev: loaded udev
Feb 24 07:11:23 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE - Key Generator.
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: waiting for carrier
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: carrier acquired
Feb 24 07:11:24 cubie dhcpcd[1060]: DUID 00:04:b3:0e:32:40:72:ba:11:e3:90:da:38:d5:47:de:e8:31
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: IAID 47:de:e8:31
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: carrier lost
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: deleting address fe80::3ad5:47ff:fede:e831
Feb 24 07:11:24 cubie dhcpcd[1060]: br-grafprom: new hardware address: 6e:b1:57:7f:31:bf
Feb 24 07:11:24 cubie dhcpcd[1060]: br-grafprom: new hardware address: 9a:b2:37:06:ed:56
Feb 24 07:11:26 cubie dhcpcd[1060]: enp1s0: carrier acquired
Feb 24 07:11:26 cubie dhcpcd[1060]: enp1s0: IAID 47:de:e8:31
Feb 24 07:11:26 cubie dhcpcd[1060]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 07:11:27 cubie dhcpcd[1060]: enp1s0: soliciting an IPv6 router
Feb 24 07:11:27 cubie dhcpcd[1060]: enp1s0: rebinding lease of 192.168.88.154
Feb 24 07:11:32 cubie dhcpcd[1060]: enp1s0: probing address 192.168.88.154/24
Feb 24 07:11:37 cubie dhcpcd[1060]: enp1s0: leased 192.168.88.154 for 600 seconds
Feb 24 07:11:37 cubie dhcpcd[1060]: enp1s0: adding route to 192.168.88.0/24
Feb 24 07:11:37 cubie dhcpcd[1060]: enp1s0: adding default route via 192.168.88.1
Feb 24 07:11:37 cubie dhcpcd[2655]: Failed to reload-or-try-restart ntpd.service: Unit ntpd.service not found.
Feb 24 07:11:37 cubie dhcpcd[2655]: Failed to reload-or-try-restart openntpd.service: Unit openntpd.service not found.
Feb 24 07:11:37 cubie dhcpcd[2655]: Failed to reload-or-try-restart chronyd.service: Unit chronyd.service not found.
Feb 24 07:11:37 cubie dhcpcd[1060]: forked to background, child pid 2657
Feb 24 07:11:37 cubie systemd[1]: Started DHCP Client.
Feb 24 07:11:37 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE...
Feb 24 07:11:37 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE.
Feb 24 07:11:37 cubie systemd[1]: Started WireGuard Peer - wg-NET-SERVICE - TfEUzOWqz/pGWz7b87jPCNigVYUktXo042w06dIgp1M=.
Feb 24 07:11:40 cubie dhcpcd[2657]: enp1s0: no IPv6 Routers available

Although I fixed it for myself by addressing the dhcpcd IPv4LL culprit, I believe the wireguard setup should be more robust and resilient against unstable network configuration, mainly because the bug is present in the default "out of the box" configuration and the user must do some amount of research to mitigate it. Also, this will break again in the future when something else breaks network-online.target or some other network instability occurs.

Majiir commented 2 years ago

I'm also having this issue (NixOS 22.05 on a Surface Pro 3). I worked around it by forcing the old behavior (Type = simple and Restart = on-failure) with this hacky module:

{ config, lib, ... }:

# Workaround for an issue where the Wireguard module doesn't bring up peers 
# when the peer unit fails, often because of DNS not being available at
# system startup. See: https://github.com/NixOS/nixpkgs/issues/63869
#
# Also watch: https://github.com/NixOS/nixpkgs/pull/140890

with lib;
let
  peerUnitServiceName = peer:
    let
      dynamicRefreshEnabled = peer.peer.dynamicEndpointRefreshSeconds != 0;
      # replaceChars is a deprecated alias of replaceStrings; the behavior is the same.
      keyToUnitName = replaceStrings
        [ "/" "-"     " "     "+"     "="     ]
        [ "-" "\\x2d" "\\x20" "\\x2b" "\\x3d" ];
      unitName = keyToUnitName peer.peer.publicKey;
      refreshSuffix = optionalString dynamicRefreshEnabled "-refresh";
    in
      "wireguard-${peer.interfaceName}-peer-${unitName}${refreshSuffix}";
  cfg = config.networking.wireguard;
  allPeers = flatten
    (mapAttrsToList (interfaceName: interfaceCfg:
      map (peer: { inherit interfaceName peer;}) interfaceCfg.peers
    ) cfg.interfaces);
  peerServiceNames = map peerUnitServiceName allPeers;
  serviceOverride = serviceName:
    nameValuePair serviceName {
      serviceConfig = {
        Type = mkForce "simple";
        Restart = "on-failure";
        RestartSec = "5";
      };
    };
in {
  systemd.services = listToAttrs (map serviceOverride peerServiceNames);
}

> I believe the wireguard setup should be more robust and resilient against unstable network configuration, mainly because the bug is present in the default "out of the box" configuration and the user must do some amount of research to mitigate it.

+1, a minimal Wireguard setup should not require comparatively arcane DHCP tweaks.

zarelit commented 2 years ago

@Majiir you might want to use this PR: https://github.com/NixOS/nixpkgs/pull/140890. You can download the file from here: https://github.com/NixOS/nixpkgs/pull/140890/files and then put something like this in your configuration.nix:

{ ... }:
{
  disabledModules = [ "services/networking/wireguard.nix" ];
  imports = [
    # rest of the imports
    ./path/to/downloaded-pr-wireguard-module.nix
  ];
}

pwaller commented 1 year ago

Unfortunately #140890, which landed, did not appear to resolve this for me. On boot, the wireguard peer still fails to establish despite having WG_ENDPOINT_RESOLUTION_RETRIES=infinity in the peer unit.

Did anyone else see this fixed or otherwise find a workaround?

zarelit commented 1 year ago

@pwaller the backport (https://github.com/NixOS/nixpkgs/pull/204134) hasn't been merged. Are you building from unstable? Just checking.

pwaller commented 1 year ago

Aha. I am on unstable. But what I didn't realise is that it's necessary to add some additional configuration (dynamicEndpointRefreshRestartSeconds or dynamicEndpointRefreshSeconds) in order for it to retry.
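
The two options named above are set per peer. A minimal sketch, reusing the placeholder values from the original report; the interval values are arbitrary assumptions, and the comments reflect how I understand the option descriptions rather than tested behaviour:

```nix
{
  networking.wireguard.interfaces.wg_jl = {
    ips = [ "10.8.0.10/32" ];
    privateKey = "abc";
    peers = [ {
      allowedIPs = [ "10.8.0.0/24" ];
      endpoint = "vpn.domain.tld:51820";
      publicKey = "xyz";
      persistentKeepalive = 25;
      # Re-run wg for this peer periodically so the endpoint is re-resolved (0 disables this).
      dynamicEndpointRefreshSeconds = 30;
      # Delay before restarting the refresh service after it exits, e.g. on a failed lookup.
      dynamicEndpointRefreshRestartSeconds = 5;
    } ];
  };
}
```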

pwaller commented 1 year ago

Given that this can result in a lockout, would it be better to default dynamicEndpointRefreshRestartSeconds to some non-zero value, so that peers retry on some timescale by default?

zarelit commented 1 year ago

I suppose the rationale behind it is to respect the wireguard default (which is: try to resolve once at setup time), at least that's what I get from the networking.wireguard.interfaces.<name>.peers.*.endpoint option documentation.

I agree that "please ignore the changed address" should be opt-in.

Also, we could treat all the hosts as dynamic and remove one branch IMHO