Open sjau opened 5 years ago
I am experiencing the same issue, since #62325.
shados@dreamlogic[~] λ sudo systemctl status wireguard-wg0-peer-<peerkey>\\x3d.service
[sudo] password for shados:
● wireguard-wg0-peer-<peerkey>\x3d.service - WireGuard Peer - wg0 - <peerkey>=
Loaded: loaded (/nix/store/h57211bakrr87xyga7ks45a8jl40dp14-unit-wireguard-wg0-peer-<peerkey>-x3d.service/wireguard-wg0-peer-<peerkey>\x3d.service; enabled; vendor preset: enabled)
Active: failed (Result: exit-code) since Wed 2019-07-24 23:45:57 AEST; 46s ago
Process: 9126 ExecStart=/nix/store/m6ik3sfnlk998vss02l473zgn9fs6zv6-unit-script-wireguard-wg0-peer-<peerkey>--x3d-start (code=exited, status=1/FAILURE)
Process: 9162 ExecStopPost=/nix/store/x5xw5wfj8rj2nkly70l18vhgyyjgn3bg-unit-script-wireguard-wg0-peer-<peerkey>--x3d-post-stop (code=exited, status=1/FAILURE)
Main PID: 9126 (code=exited, status=1/FAILURE)
Jul 24 23:45:57 dreamlogic systemd[1]: Starting WireGuard Peer - wg0 - <peerkey>=...
Jul 24 23:45:57 dreamlogic m6ik3sfnlk998vss02l473zgn9fs6zv6-unit-script-wireguard-wg0-peer-<peerkey>--x3d-start[9126]: Name or service not known: `<fqdn:port>'
Jul 24 23:45:57 dreamlogic systemd[1]: wireguard-wg0-peer-<peerkey>\x3d.service: Main process exited, code=exited, status=1/FAILURE
Jul 24 23:45:57 dreamlogic x5xw5wfj8rj2nkly70l18vhgyyjgn3bg-unit-script-wireguard-wg0-peer-<peerkey>--x3d-post-stop[9162]: Unable to modify interface: No such device
Jul 24 23:45:57 dreamlogic systemd[1]: wireguard-wg0-peer-<peerkey>\x3d.service: Control process exited, code=exited, status=1/FAILURE
Jul 24 23:45:57 dreamlogic systemd[1]: wireguard-wg0-peer-<peerkey>\x3d.service: Failed with result 'exit-code'.
Jul 24 23:45:57 dreamlogic systemd[1]: Failed to start WireGuard Peer - wg0 - <peerkey>=.
shados@dreamlogic[~] λ sudo systemctl --no-pager cat wireguard-wg0-peer-<peerkey>\\x3d.service
# /nix/store/h57211bakrr87xyga7ks45a8jl40dp14-unit-wireguard-wg0-peer-<peerkey>-x3d.service/wireguard-wg0-peer-<peerkey>\x3d.service
[Unit]
After=wireguard-wg0.service
Description=WireGuard Peer - wg0 - <peerkey>=
Requires=wireguard-wg0.service
[Service]
Environment="DEVICE=wg0"
Environment="LOCALE_ARCHIVE=/nix/store/p7yq8mx8k2mhismy0a10kacn6w7k4r9c-glibc-locales-2.27/lib/locale/locale-archive"
Environment="PATH=/nix/store/9gs2k6xcfddnpff0yraz0l8wz1ijammg-iproute2-5.1.0/bin:/nix/store/7i5yrx5v5dhmglac8hm32cik3jbajsb1-wireguard-tools-0.0.20190702/bin:/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31/bin:/nix/store/gjh3a8hqic3bqc2xzj8g2qxwz81wfjxx-findutils-4.6.0/bin:/nix/store/agcay3wmf74qinwshnjqy73w8rxf82hs-gnugrep-3.3/bin:/nix/store/vnyd3wh5i5kj66n9c5b8shzxjjrw22cn-gnused-4.7/bin:/nix/store/wzrglgf7i2zajmm5k53hypgnv3n0z3v6-systemd-242/bin:/nix/store/9gs2k6xcfddnpff0yraz0l8wz1ijammg-iproute2-5.1.0/sbin:/nix/store/7i5yrx5v5dhmglac8hm32cik3jbajsb1-wireguard-tools-0.0.20190702/sbin:/nix/store/k8lhqzpaaymshchz8ky3z4653h4kln9d-coreutils-8.31/sbin:/nix/store/gjh3a8hqic3bqc2xzj8g2qxwz81wfjxx-findutils-4.6.0/sbin:/nix/store/agcay3wmf74qinwshnjqy73w8rxf82hs-gnugrep-3.3/sbin:/nix/store/vnyd3wh5i5kj66n9c5b8shzxjjrw22cn-gnused-4.7/sbin:/nix/store/wzrglgf7i2zajmm5k53hypgnv3n0z3v6-systemd-242/sbin"
Environment="TZDIR=/nix/store/5fj4d8mp2qfdpa86sgfjmjyrm2mfssz7-tzdata-2019a/share/zoneinfo"
Environment="WG_ENDPOINT_RESOLUTION_RETRIES=infinity"
ExecStart=/nix/store/m6ik3sfnlk998vss02l473zgn9fs6zv6-unit-script-wireguard-wg0-peer-<peerkey>--x3d-start
ExecStopPost=/nix/store/x5xw5wfj8rj2nkly70l18vhgyyjgn3bg-unit-script-wireguard-wg0-peer-<peerkey>--x3d-post-stop
RemainAfterExit=true
Type=oneshot
Based upon this comment, I suspected this to be the culprit, but a rebuild with that changed to NOTFOUND=return did not resolve the issue.
@zx2c4: From reading #61971, I understand why you wished to improve the retry logic within wg instead of relying on external restart logic (especially given the variety of contexts wg is used in), but I am not certain that approach is the best option for NixOS.
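My primary concerns are:
- NixOS hackers are far less likely to already be familiar with the WG_ENDPOINT_RESOLUTION_RETRIES environment variable than they are with systemd's restart logic
- "permanent" resolution failures aren't necessarily permanent in the broader sense

Instead, given that wg exits with a non-zero error code for the permanent resolution failure codes, the NixOS wireguard peer services could be configured like: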
Type=simple
Restart=on-failure
Can you foresee any issues with this?
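For a single peer, the proposal above could be expressed in NixOS roughly as follows. This is only a sketch under assumptions: the unit name reuses the wg0 interface and the `<peerkey>` placeholder from the status output above, `RestartSec` is an illustrative value that is not part of the proposal, and `mkForce` is assumed to be necessary because the wireguard module already sets `Type=oneshot`.

```nix
{ lib, ... }:
{
  # Hypothetical single-peer override. "<peerkey>" stands in for the escaped
  # public key; "\\x3d" is the escaped trailing "=" seen in the unit name above.
  systemd.services."wireguard-wg0-peer-<peerkey>\\x3d".serviceConfig = {
    # Assumed to need mkForce, since the module defines Type=oneshot itself.
    Type = lib.mkForce "simple";
    # Let systemd retry when the start script exits non-zero.
    Restart = "on-failure";
    # Illustrative delay between attempts (an assumption, not from the proposal).
    RestartSec = "5";
  };
}
```

With an override like this, a failed endpoint resolution at boot would lead systemd to re-run the peer script instead of leaving the unit in the failed state.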
I think re-adding
Type=simple
Restart=on-failure
should be done as I can't see any problem with it.
What's the status here? Wireguard still seems to be broken. I have the same bug and can't use it.
- system: `"x86_64-linux"`
- host os: `Linux 4.19.101, NixOS, 19.09.2008.ea553d8c67c (Loris)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.3.2`
- channels(root): `"nixos-19.09.2008.ea553d8c67c, nixos-unstable-20.03pre208413.e1eedf29e5d"`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`
> because NixOS hackers are far less likely to already be familiar with the WG_ENDPOINT_RESOLUTION_RETRIES environment variable than they are systemd's restart logic

This is not a reason to script things improperly. Familiarize yourself and use the right tool for the task.

> "permanent" resolution failures aren't necessarily permanent in the broader sense;

If your resolver is configured correctly, you should be able to distinguish between "domain record doesn't exist on the internet" and "don't have a responding DNS resolver yet".
I still have this problem.

> "permanent" resolution failures aren't necessarily permanent in the broader sense;
>
> If your resolver is configured correctly, you should be able to distinguish between "domain record doesn't exist on the internet" and "don't have a responding DNS resolver yet".

On the same machine I have tailscale enabled with Magic DNS. However, in my case the issue was dhcpcd announcing "Started DHCP Client" before it had obtained a real lease and before a real default route was set. It looks like IPv4LL is enabled by default:
$ journalctl -u wireguard-wg-* -u dhcpcd.service
...
-- Boot 3146884528d2413889e2dd737a596d5a --
Feb 24 06:48:16 cubie systemd[1]: Starting DHCP Client...
Feb 24 06:48:16 cubie dhcpcd[1061]: dev: loaded udev
Feb 24 06:48:16 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE - Key Generator...
Feb 24 06:48:16 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE - Key Generator.
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: waiting for carrier
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: carrier acquired
Feb 24 06:48:17 cubie dhcpcd[1061]: DUID 00:04:b3:0e:32:40:72:ba:11:e3:90:da:38:d5:47:de:e8:31
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: IAID 47:de:e8:31
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: carrier lost
Feb 24 06:48:17 cubie dhcpcd[1061]: enp1s0: deleting address fe80::3ad5:47ff:fede:e831
Feb 24 06:48:17 cubie dhcpcd[1061]: br-grafprom: new hardware address: 4a:eb:b6:12:a5:b1
Feb 24 06:48:17 cubie dhcpcd[1061]: br-grafprom: new hardware address: 9a:b2:37:06:ed:56
Feb 24 06:48:20 cubie dhcpcd[1061]: enp1s0: carrier acquired
Feb 24 06:48:20 cubie dhcpcd[1061]: enp1s0: IAID 47:de:e8:31
Feb 24 06:48:20 cubie dhcpcd[1061]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 06:48:21 cubie dhcpcd[1061]: enp1s0: soliciting an IPv6 router
Feb 24 06:48:21 cubie dhcpcd[1061]: enp1s0: soliciting a DHCP lease
Feb 24 06:48:26 cubie dhcpcd[1061]: enp1s0: probing for an IPv4LL address
Feb 24 06:48:31 cubie dhcpcd[1061]: enp1s0: using IPv4LL address 169.254.101.247
Feb 24 06:48:31 cubie dhcpcd[1061]: enp1s0: adding route to 169.254.0.0/16
Feb 24 06:48:31 cubie dhcpcd[1061]: enp1s0: adding default route
Feb 24 06:48:31 cubie dhcpcd[1061]: forked to background, child pid 2607
Feb 24 06:48:31 cubie systemd[1]: Started DHCP Client.
Feb 24 06:48:31 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE...
Feb 24 06:48:31 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE.
Feb 24 06:48:31 cubie systemd[1]: Started WireGuard Peer - wg-NET-SERVICE - TfEUzOWqz/pGWz7b87jPCNigVYUktXo042w06dIgp1M=.
Feb 24 06:48:31 cubie wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M--x3d-refresh-start[2628]: Name or service not known: `NET-SERVICE.sk:55251'
Feb 24 06:48:31 cubie systemd[1]: wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M\x3d-refresh.service: Main process exited, code=exited, status=1/FAILURE
Feb 24 06:48:31 cubie wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M--x3d-refresh-post-stop[2631]: RTNETLINK answers: No such process
Feb 24 06:48:31 cubie systemd[1]: wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M\x3d-refresh.service: Control process exited, code=exited, status=2/INVALIDARGUMENT
Feb 24 06:48:31 cubie systemd[1]: wireguard-wg-NET-SERVICE-peer-TfEUzOWqz-pGWz7b87jPCNigVYUktXo042w06dIgp1M\x3d-refresh.service: Failed with result 'exit-code'.
Feb 24 06:48:33 cubie dhcpcd[2607]: enp1s0: offered 192.168.88.154 from 192.168.88.1
Feb 24 06:48:33 cubie dhcpcd[2607]: enp1s0: probing address 192.168.88.154/24
Feb 24 06:48:33 cubie dhcpcd[2607]: enp1s0: no IPv6 Routers available
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: leased 192.168.88.154 for 600 seconds
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: adding route to 192.168.88.0/24
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: changing default route via 192.168.88.1
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: deleting route to 169.254.0.0/16
Feb 24 06:48:38 cubie dhcpcd[2607]: enp1s0: pid 2607 deleted default route via 192.168.88.1
Feb 24 06:53:38 cubie dhcpcd[2607]: enp1s0: adding default route via 192.168.88.1
(In the output above, the real name was replaced with NET-SERVICE for privacy reasons.)
This specific issue was fixed by disallowing IPv4LL for dhcpcd:
networking.dhcpcd = {
  wait = "ipv4";
  extraConfig = "noipv4ll";
};
Now dhcpcd is considered started only after it has obtained a real IPv4 lease and set the default route:
-- Boot cad41d179fcd49169c774be69b119c5a --
Feb 24 07:11:23 cubie systemd[1]: Starting DHCP Client...
Feb 24 07:11:23 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE - Key Generator...
Feb 24 07:11:23 cubie dhcpcd[1060]: dev: loaded udev
Feb 24 07:11:23 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE - Key Generator.
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: waiting for carrier
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: carrier acquired
Feb 24 07:11:24 cubie dhcpcd[1060]: DUID 00:04:b3:0e:32:40:72:ba:11:e3:90:da:38:d5:47:de:e8:31
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: IAID 47:de:e8:31
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: carrier lost
Feb 24 07:11:24 cubie dhcpcd[1060]: enp1s0: deleting address fe80::3ad5:47ff:fede:e831
Feb 24 07:11:24 cubie dhcpcd[1060]: br-grafprom: new hardware address: 6e:b1:57:7f:31:bf
Feb 24 07:11:24 cubie dhcpcd[1060]: br-grafprom: new hardware address: 9a:b2:37:06:ed:56
Feb 24 07:11:26 cubie dhcpcd[1060]: enp1s0: carrier acquired
Feb 24 07:11:26 cubie dhcpcd[1060]: enp1s0: IAID 47:de:e8:31
Feb 24 07:11:26 cubie dhcpcd[1060]: enp1s0: adding address fe80::3ad5:47ff:fede:e831
Feb 24 07:11:27 cubie dhcpcd[1060]: enp1s0: soliciting an IPv6 router
Feb 24 07:11:27 cubie dhcpcd[1060]: enp1s0: rebinding lease of 192.168.88.154
Feb 24 07:11:32 cubie dhcpcd[1060]: enp1s0: probing address 192.168.88.154/24
Feb 24 07:11:37 cubie dhcpcd[1060]: enp1s0: leased 192.168.88.154 for 600 seconds
Feb 24 07:11:37 cubie dhcpcd[1060]: enp1s0: adding route to 192.168.88.0/24
Feb 24 07:11:37 cubie dhcpcd[1060]: enp1s0: adding default route via 192.168.88.1
Feb 24 07:11:37 cubie dhcpcd[2655]: Failed to reload-or-try-restart ntpd.service: Unit ntpd.service not found.
Feb 24 07:11:37 cubie dhcpcd[2655]: Failed to reload-or-try-restart openntpd.service: Unit openntpd.service not found.
Feb 24 07:11:37 cubie dhcpcd[2655]: Failed to reload-or-try-restart chronyd.service: Unit chronyd.service not found.
Feb 24 07:11:37 cubie dhcpcd[1060]: forked to background, child pid 2657
Feb 24 07:11:37 cubie systemd[1]: Started DHCP Client.
Feb 24 07:11:37 cubie systemd[1]: Starting WireGuard Tunnel - wg-NET-SERVICE...
Feb 24 07:11:37 cubie systemd[1]: Finished WireGuard Tunnel - wg-NET-SERVICE.
Feb 24 07:11:37 cubie systemd[1]: Started WireGuard Peer - wg-NET-SERVICE - TfEUzOWqz/pGWz7b87jPCNigVYUktXo042w06dIgp1M=.
Feb 24 07:11:40 cubie dhcpcd[2657]: enp1s0: no IPv6 Routers available
Although I fixed it for myself by addressing the dhcpcd IPv4LL culprit, I believe the wireguard setup should be more robust and resilient against unstable network configuration, mainly because the bug is present in the default "out of the box" configuration and the user must do some amount of research to mitigate the issue. Also, this will break again in the future when something else breaks network-online.target or some other network instability occurs.
I'm also having this issue (NixOS 22.05 on a Surface Pro 3). I worked around it by forcing the old behavior (Type = simple and Restart = on-failure) with this hacky module:
{ config, lib, ... }:
# Workaround for an issue where the Wireguard module doesn't bring up peers
# when the peer unit fails, often because of DNS not being available at
# system startup. See: https://github.com/NixOS/nixpkgs/issues/63869
#
# Also watch: https://github.com/NixOS/nixpkgs/pull/140890
with lib;
let
  # Compute the systemd unit name that the Wireguard module generates for a peer.
  peerUnitServiceName = peer:
    let
      dynamicRefreshEnabled = peer.peer.dynamicEndpointRefreshSeconds != 0;
      # Escape public-key characters that cannot appear verbatim in unit names.
      keyToUnitName = replaceStrings
        [ "/" "-" " " "+" "=" ]
        [ "-" "\\x2d" "\\x20" "\\x2b" "\\x3d" ];
      unitName = keyToUnitName peer.peer.publicKey;
      refreshSuffix = optionalString dynamicRefreshEnabled "-refresh";
    in
    "wireguard-${peer.interfaceName}-peer-${unitName}${refreshSuffix}";
  cfg = config.networking.wireguard;
  # Flatten the peers of all interfaces into { interfaceName, peer } records.
  allPeers = flatten
    (mapAttrsToList (interfaceName: interfaceCfg:
      map (peer: { inherit interfaceName peer; }) interfaceCfg.peers
    ) cfg.interfaces);
  peerServiceNames = map peerUnitServiceName allPeers;
  # Force the old behavior: a restartable "simple" service instead of "oneshot".
  serviceOverride = serviceName:
    nameValuePair serviceName {
      serviceConfig = {
        Type = mkForce "simple";
        Restart = "on-failure";
        RestartSec = "5";
      };
    };
in {
  systemd.services = listToAttrs (map serviceOverride peerServiceNames);
}
> I believe wireguard setup should be more robust and resilient against unstable network configuration. Mainly because bug is present in the default "out of the box" configuration, and user must do some amount of research to mitigate the issue.

+1, a minimal Wireguard setup should not require comparatively arcane DHCP tweaks.
@Majiir you might want to use this PR: https://github.com/NixOS/nixpkgs/pull/140890 (you can download the file from https://github.com/NixOS/nixpkgs/pull/140890/files), then put something like this in your configuration.nix:
{ ... }: {
  disabledModules = [ "services/networking/wireguard.nix" ];
  imports = [
    # rest of the imports
    ./path/to/downloaded-pr-wireguard-module.nix
  ];
}
Unfortunately #140890, which landed, did not appear to resolve this for me. On boot, the wireguard peer still fails to establish despite having WG_ENDPOINT_RESOLUTION_RETRIES=infinity in the peer unit.
Did anyone else see this fixed or otherwise find a workaround?
@pwaller the backport (https://github.com/NixOS/nixpkgs/pull/204134) hasn't been merged; are you building from unstable? Just checking.
Aha. I am on unstable. But what I didn't realise is that it's necessary to add some additional configuration (dynamicEndpointRefreshRestartSeconds or dynamicEndpointRefreshSeconds) in order for it to retry.
Given that this can result in a lockout, would it be better to default dynamicEndpointRefreshRestartSeconds to some non-zero value, so that peers will retry on some timescale by default?
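For reference, a minimal sketch of a peer that opts into the retry behaviour discussed here. The interface name, addresses, key path, endpoint, and interval values are placeholders chosen for illustration. Only the two dynamicEndpointRefresh* option names come from this thread; the remaining attributes are the usual networking.wireguard options.

```nix
{
  networking.wireguard.interfaces.wg0 = {
    # Placeholder tunnel address and key location (assumptions).
    ips = [ "10.8.0.2/24" ];
    privateKeyFile = "/path/to/private.key";
    peers = [
      {
        # Placeholder public key and DNS-based endpoint (assumptions).
        publicKey = "<peerkey>=";
        endpoint = "vpn.example.org:51820";
        allowedIPs = [ "10.8.0.0/24" ];
        # Re-resolve the endpoint periodically; a non-zero value enables the
        # "-refresh" variant of the peer unit. 300 is an illustrative value.
        dynamicEndpointRefreshSeconds = 300;
        # Delay before the refresh unit is restarted after it exits.
        # 5 is an illustrative value.
        dynamicEndpointRefreshRestartSeconds = 5;
      }
    ];
  };
}
```

Short restart intervals recover faster after boot but re-query DNS more often, so the values should be chosen to suit the network in question.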
I suppose the rationale behind it is to respect the wireguard default (which is: try to resolve once at setup time); at least that's what I get from the networking.wireguard.interfaces.<name>.peers.*.endpoint option documentation.
I agree that "please ignore the changed address" should be opt-in.
Also, we could treat all the hosts as dynamic and remove one branch IMHO
Issue description
When using wireguard on a server with several peers added, it looks like the peers aren't brought up properly, and since the service type is set to "oneshot" it won't even retry. The problem seems to be related to DNS not being available.
Steps to reproduce
Set up a WG server somewhere
Set up wg on NixOS where the server is set as a peer and a domain name is used instead of an IP address (never tried with an IP alone though), like
Rebuild
Reboot
Try to ping the wg server at 10.8.0.1
--> 100% packet loss
For some reason systemctl does show it as started:
Looking at that start unit file it has this content:
and
ip addr show
also lists the IP:
Looking at the unit files with
systemctl list-unit-files | grep wireguard
this pops up:
(How to make list-unit-files provide the full name of the file?)
Looking at the status of the peer unit file systemctl status wireguard-wg_jl-peer-{public key}.service, this is returned:
The unit file /nix/store/f5kp5jammckjcpgjd7r8fa8gh0y5kzrj-unit-wireguard-wg_jl-peer-{public key}.service/wireguard-wg_jl-peer-{public key}.service contains:
And the ExecStart /nix/store/08ilzr9yicqvb5wz70c3przyl235vy9w-unit-script-wireguard-wg_jl-peer-{public key}-start contains:
So, once the system is up and running, I have to re-issue
systemctl restart 'wireguard-wg_jl-peer-{public key}.service'
and then it works.
For some reason those peer unit files aren't properly executed, likely because DNS isn't available at that point.
Changing them from oneshot to simple with retry on failure could improve the situation.
Technical details
Please run
nix-shell -p nix-info --run "nix-info -m"
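and paste the results.
- system: `"x86_64-linux"`
- host os: `Linux 4.19.55, NixOS, 19.09pre183832.20b993ef2c9 (Loris)`
- multi-user?: `yes`
- sandbox: `yes`
- version: `nix-env (Nix) 2.2.2`
- channels(user): `""`
- channels(root): `"nixos-19.09pre183832.20b993ef2c9"`
- nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`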