Open tbidne opened 2 years ago
Possibly it relates to this? https://github.com/NixOS/nixpkgs/pull/178046
Edit: Nevermind, that commit is not in nixos-22.05.
This is related to udev not initializing devices. NetworkManager never completes startup because a WireGuard interface is never initialized by udev. A workaround is just putting the affected device into networking.networkmanager.unmanaged.
I'm wondering if there is any solution to this, since this always triggers for me. So, I'm not even sure if my nixos-rebuild switch does actually complete.
As mentioned in the initial description of the issue:
This corresponds to the moment when NetworkManager logs "startup complete". This mode is generally only useful at boot time.
This never seems to be the case, since the actual message is "Started Network Manager.".
Apart from that I'm wondering why it relies on a log message string, when the status of NetworkManager.service would be way less prone to errors.
EDIT: no idea why, but it went away ... Everything works as expected again.
I think I'm running into this.
systemd-networkd-wait-online.service
Aug 22 08:38:49 erin-laptop systemd[1]: Starting Wait for Network to be Configured...
Aug 22 08:40:49 erin-laptop systemd-networkd-wait-online[26249]: Timeout occurred while waiting for network connectivity.
Aug 22 08:40:49 erin-laptop systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 08:40:49 erin-laptop systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Aug 22 08:40:49 erin-laptop systemd[1]: Failed to start Wait for Network to be Configured.
It only happens after I've connected my USB-C dock with an ethernet connection at least once after boot. (Note: I'm running tmpfs on root, so my system should "forget" everything about my dock on reboot.)
I'm also running systemd-networkd, not NetworkManager.
@stale don't you dare!
@ncfavier Any chance you can help with this?
I don't use NetworkManager so I wouldn't know, but in the case of systemd-networkd there are relevant options under systemd.network.wait-online: anyInterface and ignoredInterfaces. I recommend at least setting the former to true on laptops.
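For readers on systemd-networkd, the two options named above could be set roughly as follows. This is only a sketch: it assumes a NixOS release that provides systemd.network.wait-online, and the interface name tailscale0 is just an example borrowed from elsewhere in this thread, not something this comment recommends ignoring.

```nix
{ ... }:
{
  # Treat the network as "online" as soon as any single interface is
  # configured, instead of waiting for every managed interface.
  systemd.network.wait-online.anyInterface = true;

  # Interfaces that systemd-networkd-wait-online should not wait for.
  # "tailscale0" is an illustrative name; substitute your own.
  systemd.network.wait-online.ignoredInterfaces = [ "tailscale0" ];
}
```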
So, I'm not even sure if my nixos-rebuild switch does actually complete.
Warning about failed units is pretty much the last thing that the activation script does, so it's probably fine (but the failure should be fixed, of course).
BTW I've "fixed" this by setting
# udev 250 doesn't reliably reinitialize devices after restart
systemd.services.systemd-udevd.restartIfChanged = false;
But this is really an upstream systemd bug.
My Ubuntu has:
❯ systemctl cat NetworkManager-wait-online.service
# /lib/systemd/system/NetworkManager-wait-online.service
[Unit]
Description=Network Manager Wait Online
Documentation=man:nm-online(1)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target
[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.
Type=oneshot
ExecStart=/usr/bin/nm-online -s -q
RemainAfterExit=yes
# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit NetworkManager-wait-online`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60
[Install]
My latest NixOS (22.05) config has:
> nix-repl> c.config.systemd.services.NetworkManager-wait-online
{ after = [ ... ]; aliases = [ ... ]; before = [ ... ]; bindsTo = [ ... ]; confinement = { ... }; conflicts = [ ... ]; description = ""; documentation = [ ... ]; enable = false; environment = { ... }; jobScripts = [ ... ]; onFailure = [ ... ]; partOf = [ ... ]; path = [ ... ]; postStart = ""; postStop = ""; preStart = ""; preStop = ""; reload = ""; reloadIfChanged = false; reloadTriggers = [ ... ]; requiredBy = [ ... ]; requires = [ ... ]; requisite = [ ... ]; restartIfChanged = true; restartTriggers = [ ... ]; runner = error: attribute 'ExecStart' missing
at /nix/store/6dgpkrc0gxlndr4j2524ihlsr8209ph7-source/nixos/modules/testing/service-runner.nix:65:9:
64| my $cmd = <<END_CMD;
65| ${service.serviceConfig.ExecStart}
| ^
66| END_CMD
«derivation
based on this definition:
systemd.services.NetworkManager-wait-online = {
  wantedBy = [ "network-online.target" ];
};
Where is this ExecStart coming from in your configurations?
Also:
nixpkgs on fix/teamviewer-service-deps [$]
❯ rg 'nm-online'
[ nothing ]
@blaggacao: The reference to nm-online comes from upstream service unit NetworkManager-wait-online.service, not from nixpkgs itself.
I'd vote for disabling this service until we can make it reliable. It's doing no good currently.
I've been tripping over this bug for quite some time now and it is annoying for users. As mentioned above, the error can be worked around with:
systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
systemd.services.systemd-networkd-wait-online.enable = lib.mkForce false;
I was concerned that other dependencies or services might require this to be enabled, so I grepped through nixpkgs for both. These are the mentions:
At first glance the usage of these two services seems minimal to me, and they are causing more problems than they solve. Agreeing with @domenkozar's proposal, I'd vote to disable them by default. If this is agreed upon, I can submit a PR.
I've been running with that service disabled for 6 months now and have not experienced a single issue. Don't count my voice too heavily, though :wink: ! :+1:
If we're going to work around this I'd still prefer systemd.services.systemd-udevd.restartIfChanged = false; as the other workaround just masks the issue while udev's still half-broken.
For some reason, that didn't work for me. On rebuild it said "not restarting service"
Yep confirming what @pinpox said:
updating GRUB 2 menu...
NOT restarting the following changed units: systemd-udevd.service
Still seeing the issue on HEAD. Disabling wait-online like mentioned previously fixes nixos-rebuild.
systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
systemd.services.systemd-networkd-wait-online.enable = lib.mkForce false;
NetworkManager-wait-online.service is a dependency of network-online.target:
$ systemctl list-dependencies --reverse NetworkManager-wait-online.service
NetworkManager-wait-online.service
● └─network-online.target
● ├─dnscrypt-proxy2.service
● ├─wireguard-wg-intra-peer-[...].service
● └─multi-user.target
$ systemctl cat NetworkManager-wait-online.service | grep network-online.target
Before=network-online.target
WantedBy=network-online.target
As others pointed out, the -s (aka. --wait-for-startup) flag is causing some deadlock:
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.
I guess using -s is a more accurate notification than simply waiting for NetworkManager.service to be up, and I don't know why it sometimes fails to notice that the startup is complete, nor whether removing it has any significant drawbacks, but keeping it causes switch-to-configuration test to fail often enough to bother me, so until a more correct fix is available, I'm personally removing it:
{ pkgs, ... }: {
  systemd.services.NetworkManager-wait-online = {
    serviceConfig = {
      ExecStart = [ "" "${pkgs.networkmanager}/bin/nm-online -q" ];
      Restart = "on-failure";
      RestartSec = 1;
    };
    unitConfig.StartLimitIntervalSec = 0;
  };
}
Also running into these deadlocks on nixpkgs-unstable.
edit: see mentioned below that this might be related to wireguard. I use tailscale - might try to play around with whether removing that setup makes any difference on this issue for me.
This has been happening to me consistently for over a year. The cause seems to be due to what I've described in https://github.com/NixOS/nixpkgs/issues/182449#issuecomment-1585296876.
I used (a simplified version of) @ju1m's workaround, but using the workaround from https://github.com/NixOS/nixpkgs/issues/195777#issuecomment-1324378856 seems to be even better. However, I don't know how to include that in my system configuration rather than running it manually when needed.
@lorenz:
This is related to udev not initializing devices. NetworkManager never completes startup because a WireGuard interface is never initialized by udev. A workaround is just putting the affected device into networking.networkmanager.unmanaged.
My instance of this issue is indeed caused by a wireguard interface, but adding the interface to networking.networkmanager.unmanaged (via name or type:wireguard) did not change a thing. Having done some more digging online, I believe I know why: prior to udev initialization completing for a link, NM cannot be sure what the name or type of the link will be (as udev may make changes), and so it is not possible to filter a device prior to udev marking it as configured.
I'm still looking for a workaround.
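For reference, the attempted workaround described above would look roughly like this in a NixOS configuration. This is a sketch of what was tried, not a fix: the interface name wg0 is hypothetical, and as reported in this comment neither form resolved the problem.

```nix
{ ... }:
{
  networking.networkmanager.unmanaged = [
    # Match by interface name. "wg0" is a hypothetical name.
    "interface-name:wg0"
    # Match every device of type wireguard.
    "type:wireguard"
  ];
}
```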
I have the issue on the MacBook Pro 2019 16,1 with t2 patches after switching to iwd, because of network issues.
systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
worked for me.
I'm new to NixOS, I have this error as well:
jul 20 19:19:13 nixos systemd[1]: Starting Network Manager Wait Online...
jul 20 19:20:13 nixos systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
jul 20 19:20:13 nixos systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
jul 20 19:20:13 nixos systemd[1]: Failed to start Network Manager Wait Online.
warning: error(s) occurred while switching to the new configuration
I put this line in my configuration.nix (systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;), but then I get this error:
$ sudo nixos-rebuild switch --upgrade
unpacking channels...
error: undefined variable 'lib'
at /etc/nixos/configuration.nix:16:56:
15| boot.loader.systemd-boot.enable = true;
16| systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
| ^
17| boot.loader.efi.canTouchEfiVariables = true;
(use '--show-trace' to show detailed location information)
building Nix...
Right now I can't update anymore, so I am stuck on a fairly recent clean NixOS install. Is there a way to force the update for now? What is the best structural solution?
Thank you!
I put this line in my configuration.nix (systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;), but then I get this error:
Looks like nixos-generate-config adds this in the beginning of /etc/nixos/configuration.nix:
[...snip comment block...]
{ config, pkgs, ... }:
whereas most people eventually modify that to { config, pkgs, lib, ... }:, providing configuration.nix with the missing lib argument.
I think nixos-generate-config should be updated to include lib by default. Although the default config doesn't need/use lib, it quickly becomes something new NixOS users want to use.
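A minimal sketch of a configuration.nix with lib added to the module arguments, so that lib.mkForce resolves. Everything other than the function header and the workaround line is elided here and would remain whatever your existing file contains.

```nix
# lib is added to the argument set so that lib.mkForce is in scope.
{ config, pkgs, lib, ... }:
{
  # Workaround discussed in this thread: disable the wait-online unit.
  systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
}
```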
Thanx for the quick reply.
I have added { config, pkgs, lib, ... }: in the first line of my configuration.nix, but I get the same error...
Then please open a new issue or post on https://discourse.nixos.org/ asking for help.
Ok, this solved it for me: https://discourse.nixos.org/t/nixos-rebuild-switch-upgrade-networkmanager-wait-online-service-failure/30746/2
Just did sudo nixos-rebuild boot --upgrade and rebooted. I left the lib in the { config, pkgs, lib, ... }:.
Thanx for the support!
I think nixos-generate-config should be updated to include lib by default.
So, to summarise, the actual error seems to be that NM waits for startup of some interfaces it thinks it should manage, but they're never communicated as up to NM because NM was never supposed to manage those interfaces?
When I run nmcli, NM thinks it does not manage my tailscale0 interface:
tailscale0: unmanaged
"tailscale0"
tun, sw, mtu 1280
Why would it be waiting on it then? The other interfaces are up and I'm sending you this message via one of them, so it can't be waiting on those.
I'll try to explicitly add it to networking.networkmanager.unmanaged and report back after a few rebuilds in a few weeks.
No, nm-online is just weirdly broken and locks up if you try to do it on an already running system. It's probably an easy code fix but I'm not familiar with the codebase at all.
Is that something that has to be fixed upstream? If so, could you create an issue for it?
There's already an issue (kind of): https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1220 https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1034
Reading through those. Are we using the mentioned -s flag? If so, maybe we can remove it? Running nm-online -s manually hangs for me, while nm-online (without -s) seems to work correctly.
We are, and we have to, at least on initial boot.
Alright, I'll stick with my workaround then for now. I would like to contribute a real fix for this or sort it out properly, but don't know how to continue.
I had this problem for well over a year, but since I added this to my configuration.nix the problem has not reoccurred:
systemd.services.NetworkManager-wait-online = {
  serviceConfig = {
    ExecStart = [ "" "${pkgs.networkmanager}/bin/nm-online -q" ];
  };
};
Nor have I had any other network problems.
Also,
$ nm-online -s
Connecting............... 30s [started]
$ journalctl -b -u NetworkManager-dispatcher.service
Jul 10 16:24:41 melchizedek systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 10 16:24:41 melchizedek systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 10 16:24:51 melchizedek systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Jul 11 17:43:59 melchizedek systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 11 17:43:59 melchizedek systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 11 17:44:09 melchizedek systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Jul 14 18:41:15 melchizedek systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 14 18:41:15 melchizedek systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 14 18:41:25 melchizedek systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
...
$ last reboot -1
reboot system boot 5.15.119 Mon Jul 10 16:24 still running
NetworkManager-dispatcher.service seems to run at various times that aren't related to reboots or nixos-upgrade switch
@neilmayhew what does that exactly do? If I understand the man page correctly, -q only omits the output, without changing any functionality
@pinpox The -q was already there, and I'm just removing the -s. Sorry I didn't make that clearer.
With the extra configuration:
$ systemctl cat NetworkManager-wait-online.service
# /etc/systemd/system/NetworkManager-wait-online.service
[Unit]
Description=Network Manager Wait Online
Documentation=man:nm-online(1)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target
[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.
Type=oneshot
ExecStart=/nix/store/r7fwxkxbgmwyrff5n03r9nchzskj59n9-networkmanager-1.40.16/bin/nm-online -s -q
RemainAfterExit=yes
# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit NetworkManager-wait-online`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60
[Install]
WantedBy=network-online.target
# /nix/store/2ma3yvlqmj94dr6v24nkwmspgxfpfnmq-system-units/NetworkManager-wait-online.service.d/overrides.conf
[Unit]
[Service]
Environment="LOCALE_ARCHIVE=/nix/store/g135laybfk1c7bm2zhrh7x6dv884qxal-glibc-locales-2.35-224/lib/locale/locale-archive"
Environment="PATH=/nix/store/l6jgwxkc3jhr029vfzfwzcy28iyckwsj-coreutils-9.1/bin:/nix/store/gn1s1s5z19cf0wiir2cd38jckcjc6kn6-findutils-4.9.0/bin:/nix/store/pvb117r7fhwb08717ks21a6y>
Environment="TZDIR=/nix/store/z0kg1c0f8fx6r4rgg5bdy01lb2b9izqg-tzdata-2023a/share/zoneinfo"
ExecStart=
ExecStart=/nix/store/r7fwxkxbgmwyrff5n03r9nchzskj59n9-networkmanager-1.40.16/bin/nm-online -q
Adding tailscale0 to networking.networkmanager.unmanaged on my system that's currently experiencing this issue did not resolve the problem.
This issue has been mentioned on NixOS Discourse. There might be relevant details there:
https://discourse.nixos.org/t/systemd-networkd-documentation-experience-feedback-notes/38660/5
Is there any way we can properly fix this issue?
Not on the NixOS side; we'd have to fix it upstream I imagine.
I'll try and see how I can raise this issue upstream. We have to be careful to explain how this issue impacts NixOS users, and also how other distros might benefit from a fix (if any would).
Is the problem that nm-online with -s on an already-up system hangs (what this thread seems to think)?
Or is the problem that NetworkManager-wait-online.service should not have -s (what upstream seems to think)?
It depends on what you think "wait for X" should mean when "X" has already happened. Should it wait for the next X, or should it return immediately (no waiting necessary)?
I do not see any confusion about this from the upstream side; they did not claim that distros should not use -s at any point in the thread you linked.
We want a flag that does what -s is currently documented to supposedly do:
-s | --wait-for-startup
Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or
attempted to activate) every auto-activate connection which is available given the current network state. This corresponds to the moment when NetworkManager logs "startup
complete". This mode is generally only useful at boot time. After startup has completed, nm-online -s will just return immediately, regardless of the current network state.
There are various ways to affect when startup complete is reached. For details see NetworkManager-wait-online.service(8).
The manual even explicitly clarifies behaviour after bootup is complete.
This is precisely what we want but nm-online simply does not work as documented. Instead of returning immediately, it blocks indefinitely under certain conditions.
Is this related to kernel/firmware upgrade? This problem just occurred on my machine after a nixos-rebuild switch, manually invoking nm-online -s also waited 30s and timed out, but after a reboot I can no longer reproduce this, even with another nixos-rebuild switch. I updated my kernel and firmware in the first nixos-rebuild switch, updating from linux 6.9.1 to 6.9.2.
Oh, NetworkManager itself is updated and restarted.
No, it's not.
I noted that tailscaled.service is restarted before NetworkManager.service and NetworkManager-wait-online.service.
restarting the following units: home-manager-inme.service, home-manager-root.service, nix-daemon.service, polkit.service, systemd-journald.service, tailscaled.service
starting the following units: NetworkManager-wait-online.service, NetworkManager.service, audit.service, bluetooth.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-local-commands.service, network-setup.service, persist-\x27-nix-persist-etc-machine\x2did\x27.service, persist-\x27-nix-persist-home-inme-.bash_history\x27.service, persist-\x27-nix-persist-home-inme-.gitconfig\x27.service, persist-\x27-nix-persist-home-inme-.repo_.gitconfig.json\x27.service, power-profiles-daemon.service, resolvconf.service, rtkit-daemon.service, systemd-hostnamed.service, systemd-modules-load.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-update-done.service, systemd-vconsole-setup.service, waydroid-container.service, wpa_supplicant.service
Also confirmed in journalctl:
And if I systemctl stop tailscaled before nixos-rebuild switch, tailscaled.service will be started after NetworkManager.service is started, and then NetworkManager-wait-online.service will exit gracefully.
Big if true.
That feels like something that should be solvable using a few systemd Before/After directives in the right places.
Though tailscaled.service already contains
After=network-pre.target NetworkManager.service systemd-resolved.service
IIRC After should be applied in reverse when stopping/restarting.
Describe the bug
The systemd service NetworkManager-wait-online.service can prevent nixos-rebuild from succeeding:
This service runs nm-online -s -q, and the nm-online man page says:
I am not familiar with this tool, but my experience is that after my laptop has been up for some time (e.g. days), nm-online will often return an error code rather than correctly determine the network is up, thus killing any future nixos-rebuild commands.
Steps To Reproduce
Steps to reproduce the behavior:
nm-online -s -q does not return success (not sure how to do this on demand).
nixos-rebuild failure.
Expected behavior
nixos-rebuild should not fail due to an erroneous network check.
Additional context
This is tricky as it is not a nix issue per se but rather an issue with a presumably flaky systemd service. It is easy enough to disable this service manually:
And perhaps this is the best solution. But a number of my coworkers all ran into this issue independently, so I thought it merited an issue for discoverability, if nothing else. My gut reaction is that a flaky check should probably not be required by default, but I don't know enough about this service's importance/fragility to say.
This issue was noticed only recently, both on nixos-unstable and nixos-22.05.
Notify maintainers
Metadata
Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.
Thanks!