NetworkManager's nm-online kills nixos-rebuild

tbidne commented 2 years ago

Describe the bug

The systemd service NetworkManager-wait-online.service can prevent nixos-rebuild from succeeding:

warning: the following units failed: NetworkManager-wait-online.service

× NetworkManager-wait-online.service - Network Manager Wait Online
     Loaded: loaded (/etc/systemd/system/NetworkManager-wait-online.service; enabled; vendor preset: enabled)
    Drop-In: /nix/store/k5yq51spcggip2h6aq1y0bydkpr4zahc-system-units/NetworkManager-wait-online.service.d
             └─overrides.conf
     Active: failed (Result: exit-code) since Tue 2022-07-05 10:18:52 NZST; 36ms ago
       Docs: man:nm-online(1)
    Process: 1258376 ExecStart=/nix/store/b4yhg54s70i0v0k1qnnv8vnja6018yrh-networkmanager-1.38.2/bin/nm-online -s -q (code=exited, status=1/FAILURE)
   Main PID: 1258376 (code=exited, status=1/FAILURE)
         IP: 0B in, 0B out
        CPU: 22ms

Jul 05 10:17:52 nixos systemd[1]: Starting Network Manager Wait Online...
Jul 05 10:18:52 nixos systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Jul 05 10:18:52 nixos systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.

This service runs nm-online -s -q, and the nm-online man page says:

-s | --wait-for-startup
           Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is
           considered complete once NetworkManager has activated (or attempted to activate) every auto-activate connection
           which is available given the current network state. This corresponds to the moment when NetworkManager logs "startup
           complete". This mode is generally only useful at boot time. After startup has completed, nm-online -s will just
           return immediately, regardless of the current network state.

           There are various ways to affect when startup complete is reached. For details see NetworkManager-wait-
           online.service(8).

This corresponds to the moment when NetworkManager logs "startup complete". This mode is generally only useful at boot time.

I am not familiar with this tool, but my experience is that after my laptop has been up for some time (e.g. days), nm-online will often return an error code rather than correctly determine the network is up, thus killing any future nixos-rebuild commands.

Steps To Reproduce

Steps to reproduce the behavior:

1. Get your machine in a state where nm-online -s -q does not return success (not sure how to do this on demand).
2. Witness nixos-rebuild failure.

Expected behavior

nixos-rebuild should not fail due to an erroneous network check.

Additional context

This is tricky as it is not a nix issue per se but rather an issue with a presumably flaky systemd service. It is easy enough to disable this service manually:

systemd.services.NetworkManager-wait-online.enable = false;

And perhaps this is the best solution. But a number of my coworkers all ran into this issue independently, so I thought it merited an issue for discoverability, if nothing else. My gut reaction is that a flaky check should probably not be required by default, but I don't know enough about this service's importance/fragility to say.

This issue was noticed only recently, both on nixos-unstable and nixos-22.05.

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[user@system:~]$ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.47, NixOS, 22.05 (Quokka)`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.8.1`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

Thanks!

tbidne commented 2 years ago

Possibly it relates to this? https://github.com/NixOS/nixpkgs/pull/178046

Edit: Never mind, that commit is not in nixos-22.05.

nixos-discourse commented 2 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-to-disable-networkmanager-wait-online-service-in-the-configuration-file/19963/4

lorenz commented 2 years ago

This is related to udev not initializing devices. NetworkManager never completes startup because a WireGuard interface is never initialized by udev. A workaround is just putting the affected device into networking.networkmanager.unmanaged.
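For readers who want to try this workaround, a minimal sketch of what it could look like in configuration.nix. The interface name wg0 is a placeholder (it is not taken from this thread); substitute whichever device never finishes initializing on your machine. Whether this is enough to let startup complete depends on the setup.

```nix
{
  # "wg0" is a hypothetical interface name used only for illustration.
  # The "interface-name:" prefix follows NetworkManager's device list format.
  networking.networkmanager.unmanaged = [ "interface-name:wg0" ];
}
```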

DeskworkTrickster commented 2 years ago

I'm wondering if there is any solution to this, since it always triggers for me. So, I'm not even sure if my nixos-rebuild switch does actually complete.

As mentioned in the initial description of the issue:

This corresponds to the moment when NetworkManager logs "startup complete". This mode is generally only useful at boot time.

This never seems to be the case, since the actual message is "Started Network Manager.". Apart from that I'm wondering why it relies on a log message string, when the status of NetworkManager.service would be way less prone to errors.

EDIT: no idea why, but it went away ... Everything works as expected again.

oati commented 2 years ago

I think I'm running into this.

systemd-networkd-wait-online.service

Aug 22 08:38:49 erin-laptop systemd[1]: Starting Wait for Network to be Configured...
Aug 22 08:40:49 erin-laptop systemd-networkd-wait-online[26249]: Timeout occurred while waiting for network connectivity.
Aug 22 08:40:49 erin-laptop systemd[1]: systemd-networkd-wait-online.service: Main process exited, code=exited, status=1/FAILURE
Aug 22 08:40:49 erin-laptop systemd[1]: systemd-networkd-wait-online.service: Failed with result 'exit-code'.
Aug 22 08:40:49 erin-laptop systemd[1]: Failed to start Wait for Network to be Configured.

It only happens after I've connected my USB-C dock with an ethernet connection at least once after boot. (Note: I'm running tmpfs on root, so my system should "forget" everything about my dock on reboot.)

I'm also running systemd-networkd, not NetworkManager.

NorfairKing commented 2 years ago

@stale don't you dare!

pjones commented 2 years ago

@ncfavier Any chance you can help with this?

ncfavier commented 2 years ago

I don't use NetworkManager so I wouldn't know, but in the case of systemd-networkd there are relevant options under systemd.network.wait-online: anyInterface and ignoredInterfaces. I recommend at least setting the former to true on laptops.
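As a rough sketch, the two systemd-networkd options named above could be set like this. The interface listed under ignoredInterfaces is only an example name, and none of this applies to NetworkManager setups.

```nix
{
  systemd.network.wait-online = {
    # Consider the network online as soon as any one interface is up,
    # instead of waiting for every managed interface.
    anyInterface = true;

    # Interfaces that should never hold up the wait.
    # "tailscale0" is an illustrative name; adjust to your system.
    ignoredInterfaces = [ "tailscale0" ];
  };
}
```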

So, I'm not even sure if my nixos-rebuild switch does actually complete.

Warning about failed units is pretty much the last thing that the activation script does, so it's probably fine (but the failure should be fixed, of course).

lorenz commented 2 years ago

BTW I've "fixed" this by setting

# udev 250 doesn't reliably reinitialize devices after restart
systemd.services.systemd-udevd.restartIfChanged = false;

But this is really an upstream systemd bug.

blaggacao commented 2 years ago

My Ubuntu has:

❯ systemctl cat NetworkManager-wait-online.service
# /lib/systemd/system/NetworkManager-wait-online.service
[Unit]
Description=Network Manager Wait Online
Documentation=man:nm-online(1)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target

[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.

Type=oneshot
ExecStart=/usr/bin/nm-online -s -q
RemainAfterExit=yes

# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit NetworkManager-wait-online`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60

[Install]

My latest NixOS (22.05) config has:

> nix-repl> c.config.systemd.services.NetworkManager-wait-online
{ after = [ ... ]; aliases = [ ... ]; before = [ ... ]; bindsTo = [ ... ]; confinement = { ... }; conflicts = [ ... ]; description = ""; documentation = [ ... ]; enable = false; environment = { ... }; jobScripts = [ ... ]; onFailure = [ ... ]; partOf = [ ... ]; path = [ ... ]; postStart = ""; postStop = ""; preStart = ""; preStop = ""; reload = ""; reloadIfChanged = false; reloadTriggers = [ ... ]; requiredBy = [ ... ]; requires = [ ... ]; requisite = [ ... ]; restartIfChanged = true; restartTriggers = [ ... ]; runner = error: attribute 'ExecStart' missing

       at /nix/store/6dgpkrc0gxlndr4j2524ihlsr8209ph7-source/nixos/modules/testing/service-runner.nix:65:9:

           64|       my $cmd = <<END_CMD;
           65|       ${service.serviceConfig.ExecStart}
             |         ^
           66|       END_CMD
«derivation

based on this definition:

    systemd.services.NetworkManager-wait-online = {
      wantedBy = [ "network-online.target" ];
    };

Where is this ExecStart coming from in your configurations?

Also:

nixpkgs on  fix/teamviewer-service-deps [$]
❯ rg 'nm-online'
[ nothing ]

bjornfor commented 2 years ago

@blaggacao: The reference to nm-online comes from the upstream service unit NetworkManager-wait-online.service, not from nixpkgs itself.
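To illustrate the mechanism, here is a sketch of how a NixOS module can pick up a unit file shipped by a package and only add to it. This is an assumption about the general shape, not a copy of the actual networkmanager module, and it is not something to add to your own configuration.

```nix
{ pkgs, ... }:
{
  # systemd.packages installs the unit files shipped inside a package,
  # so the upstream NetworkManager-wait-online.service (including its
  # ExecStart line) arrives without nixpkgs ever spelling out "nm-online".
  systemd.packages = [ pkgs.networkmanager ];

  # The NixOS-side definition then only adds what the upstream unit
  # does not enable by itself, matching the definition quoted above.
  systemd.services.NetworkManager-wait-online = {
    wantedBy = [ "network-online.target" ];
  };
}
```

This would also explain why evaluating serviceConfig.ExecStart in the repl fails: the attribute is never set on the Nix side.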

domenkozar commented 1 year ago

I'd vote for disabling this service until we can make it reliable. It's doing no good currently.

pinpox commented 1 year ago

I've been tripping over this bug for quite some time now and it is annoying for users. As mentioned above, the error can be worked around with:

systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
systemd.services.systemd-networkd-wait-online.enable = lib.mkForce false;

I was concerned that there might be other dependencies or services that require this to be enabled, so I grepped through nixpkgs for both. These are the mentions:

NetworkManager-wait-online.service

- modules/networkmanager: the actual definition of this service
- pkgs.hqplayerd: can probably be removed

systemd-networkd-wait-online.service

- modules/system/boot/networkd: service definition
- Used in various nixos tests, which should not be relevant to normal operation of the system (?)
  - nixos/tests/systemd-networkd-dhcpserver-static-leases.nix
  - nixos/tests/kea.nix
  - nixos/tests/systemd-networkd.nix
  - nixos/tests/systemd-bpf.nix
  - nixos/tests/systemd-networkd-dhcpserver.nix

TL;DR

At first look, the usage of these two services seems minimal to me, and they are causing more problems than they solve. Agreeing with @domenkozar's proposal, I'd vote to disable them by default. If this is agreed upon, I can submit a PR.

matthiasbeyer commented 1 year ago

I've been running with that service disabled for 6 months now and have not experienced a single issue. Don't count my voice too heavily, though :wink: ! :+1:

lorenz commented 1 year ago

If we're going to work around this I'd still prefer systemd.services.systemd-udevd.restartIfChanged = false; as the other workaround just masks the issue while udev's still half-broken.

pinpox commented 1 year ago

If we're going to work around this I'd still prefer systemd.services.systemd-udevd.restartIfChanged = false; as the other workaround just masks the issue while udev's still half-broken.

For some reason, that didn't work for me. On rebuild it said "not restarting service"

supermarin commented 1 year ago

Yep confirming what @pinpox said:

updating GRUB 2 menu...
NOT restarting the following changed units: systemd-udevd.service

Still seeing the issue on HEAD. Disabling wait-online as mentioned previously fixes nixos-rebuild.

systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
systemd.services.systemd-networkd-wait-online.enable = lib.mkForce false;

ju1m commented 1 year ago

NetworkManager-wait-online.service is a dependency of network-online.target:

$ systemctl list-dependencies --reverse NetworkManager-wait-online.service
NetworkManager-wait-online.service
● └─network-online.target
●   ├─dnscrypt-proxy2.service
●   ├─wireguard-wg-intra-peer-[...].service
●   └─multi-user.target

$ systemctl cat NetworkManager-wait-online.service | grep network-online.target
Before=network-online.target
WantedBy=network-online.target

As others pointed out, the -s (aka. --wait-for-startup) flag is causing some deadlock:

# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.

I guess using -s gives a more accurate notification than simply waiting for NetworkManager.service to be up. I don't know why it sometimes fails to notice that startup is complete, nor whether removing it has any significant drawbacks. But keeping it causes switch-to-configuration test to fail often enough to bother me, so until a more correct fix is available, I'm personally removing it:

{ pkgs, ... }: {
  systemd.services.NetworkManager-wait-online = {
    serviceConfig = {
      ExecStart = [ "" "${pkgs.networkmanager}/bin/nm-online -q" ];
      Restart = "on-failure";
      RestartSec = 1;
    };
    unitConfig.StartLimitIntervalSec = 0;
  };
}

darkone23 commented 1 year ago

Also running into these deadlocks on nixpkgs-unstable.

edit: I see it mentioned below that this might be related to wireguard. I use tailscale, so I might play around with whether removing that setup makes any difference on this issue for me.

neilmayhew commented 1 year ago

This has been happening to me consistently for over a year. The cause seems to be what I've described in https://github.com/NixOS/nixpkgs/issues/182449#issuecomment-1585296876.

I used (a simplified version of) @ju1m's workaround, but using the workaround from https://github.com/NixOS/nixpkgs/issues/195777#issuecomment-1324378856 seems to be even better. However, I don't know how to include that in my system configuration rather than running it manually when needed.

Shados commented 1 year ago

@lorenz:

This is related to udev not initializing devices. NetworkManager never completes startup because a WireGuard interface is never initialized by udev. A workaround is just putting the affected device into networking.networkmanager.unmanaged.

My instance of this issue is indeed caused by a wireguard interface, but adding the interface to networking.networkmanager.unmanaged (via name or type:wireguard) did not change a thing. Having done some more digging online, I believe I know why: prior to udev initialization completing for a link, NM cannot be sure what the name or type of the link will be (as udev may make changes), and so it is not possible to filter a device prior to udev marking it as configured.
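For reference, the two forms described here (matching by name and by device type) would look roughly like the following. The interface name wg0 is a placeholder rather than the actual device, and as reported above, neither form made a difference in this case.

```nix
{
  networking.networkmanager.unmanaged = [
    "wg0"             # match by interface name (hypothetical name)
    "type:wireguard"  # match by device type
  ];
}
```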

I'm still looking for a workaround.

martin-braun commented 1 year ago

I have the issue on the MacBook Pro 2019 16,1 with t2 patches after switching to iwd, because of network issues.

systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;

worked for me.

freekvh commented 1 year ago

I have the issue on the MacBook Pro 2019 16,1 with t2 patches after switching to iwd, because of network issues.

systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;

worked for me.

I'm new to NixOS, I have this error as well:

jul 20 19:19:13 nixos systemd[1]: Starting Network Manager Wait Online...
jul 20 19:20:13 nixos systemd[1]: NetworkManager-wait-online.service: Main process exited, code=exited, status=1/FAILURE
jul 20 19:20:13 nixos systemd[1]: NetworkManager-wait-online.service: Failed with result 'exit-code'.
jul 20 19:20:13 nixos systemd[1]: Failed to start Network Manager Wait Online.
warning: error(s) occurred while switching to the new configuration

I put this line in my configuration.nix (systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;), but then I get this error:

$ sudo nixos-rebuild switch --upgrade
unpacking channels...
error: undefined variable 'lib'

       at /etc/nixos/configuration.nix:16:56:

           15|   boot.loader.systemd-boot.enable = true;
           16|   systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
             |                                                        ^
           17|   boot.loader.efi.canTouchEfiVariables = true;
(use '--show-trace' to show detailed location information)
building Nix...

Right now I can't update anymore, so I am stuck on a fairly recent clean NixOS install. Is there a way to force the update for now? What is the best structural solution?

Thank you!

bjornfor commented 1 year ago

I put this line in my configuration.nix (systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;), but then I get this error:

Looks like nixos-generate-config adds this in the beginning of /etc/nixos/configuration.nix:

[...snip comment block...]

{ config, pkgs, ... }:

whereas most people eventually modify that to { config, pkgs, lib, ... }:, providing configuration.nix with the missing lib argument.
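A minimal sketch of the resulting file, with everything unrelated omitted. The only change that matters is lib appearing in the argument set on the first line, which is what makes lib.mkForce resolvable further down.

```nix
# /etc/nixos/configuration.nix (abridged sketch; your other options stay as they are)
{ config, pkgs, lib, ... }:

{
  # lib is now in scope, so lib.mkForce no longer raises
  # "undefined variable 'lib'".
  systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;
}
```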

I think nixos-generate-config should be updated to include lib by default. Although the default config doesn't need/use lib, it quickly becomes something new NixOS users want to use.

freekvh commented 1 year ago

I put this line in my configuration.nix (systemd.services.NetworkManager-wait-online.enable = lib.mkForce false;), but then I get this error:

Looks like nixos-generate-config adds this in the beginning of /etc/nixos/configuration.nix:
[...snip comment block...]

{ config, pkgs, ... }:
whereas most people eventually modify that to { config, pkgs, lib, ... }:, providing configuration.nix with the missing lib argument.

I think nixos-generate-config should be updated to include lib by default. Although the default config doesn't need/use lib, it quickly becomes something new NixOS users want to use.

Thanx for the quick reply.

I have added { config, pkgs, lib, ... }:

In the first line of my configuration.nix, but I get the same error...

bjornfor commented 1 year ago

I have added { config, pkgs, lib, ... }:

In the first line of my configuration.nix, but I get the same error...

Then please open a new issue or post on https://discourse.nixos.org/ asking for help.

nixos-discourse commented 1 year ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/nixos-rebuild-switch-upgrade-networkmanager-wait-online-service-failure/30746/1

freekvh commented 1 year ago

Ok, this solved it for me: https://discourse.nixos.org/t/nixos-rebuild-switch-upgrade-networkmanager-wait-online-service-failure/30746/2

Just did sudo nixos-rebuild boot --upgrade and rebooted. I left the lib in the { config, pkgs, lib, ... }:.

Thanx for the support!

bjornfor commented 1 year ago

I think nixos-generate-config should be updated to include lib by default.

https://github.com/NixOS/nixpkgs/pull/244653

Atemu commented 1 year ago

So, to summarise, the actual error seems to be that NM waits for startup of some interfaces it thinks it should manage but they're never communicated as up to NM because NM was never supposed to manage that interface?

When I run nmcli, NM thinks it does not manage my tailscale0 interface:

tailscale0: unmanaged
        "tailscale0"
        tun, sw, mtu 1280

Why would it be waiting on it then? The other interfaces are up and I'm sending you this message via one of them, so it can't be waiting on those.

I'll try to explicitly add it to networking.networkmanager.unmanaged and report back after a few rebuilds in a few weeks.

K900 commented 1 year ago

No, nm-online is just weirdly broken and locks up if you try to do it on an already running system. It's probably an easy code fix but I'm not familiar with the codebase at all.

pinpox commented 1 year ago

No, nm-online is just weirdly broken and locks up if you try to do it on an already running system. It's probably an easy code fix but I'm not familiar with the codebase at all.

Is that something that has to be fixed upstream? If so, could you create an issue for it?

K900 commented 1 year ago

There's already an issue (kind of): https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1220 https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1034

pinpox commented 1 year ago

There's already an issue (kind of): https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1220 https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/issues/1034

Reading through those. Are we using the mentioned -s flag? If so, maybe we can remove it? Running nm-online -s manually hangs for me, while nm-online (without -s) seems to work correctly.

K900 commented 1 year ago

We are, and we have to, at least on initial boot.

pinpox commented 1 year ago

Alright, I'll stick with my workaround then for now. I would like to contribute a real fix for this or sort it out properly, but don't know how to continue.

neilmayhew commented 1 year ago

I had this problem for well over a year, but since I added this to my configuration.nix the problem has not reoccurred:

systemd.services.NetworkManager-wait-online = {
  serviceConfig = {
    ExecStart = [ "" "${pkgs.networkmanager}/bin/nm-online -q" ];
  };
};

Nor have I had any other network problems.

Also,

$ nm-online -s
Connecting...............   30s [started]
$ journalctl -b -u NetworkManager-dispatcher.service
Jul 10 16:24:41 melchizedek systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 10 16:24:41 melchizedek systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 10 16:24:51 melchizedek systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Jul 11 17:43:59 melchizedek systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 11 17:43:59 melchizedek systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 11 17:44:09 melchizedek systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
Jul 14 18:41:15 melchizedek systemd[1]: Starting Network Manager Script Dispatcher Service...
Jul 14 18:41:15 melchizedek systemd[1]: Started Network Manager Script Dispatcher Service.
Jul 14 18:41:25 melchizedek systemd[1]: NetworkManager-dispatcher.service: Deactivated successfully.
...
$ last reboot -1
reboot   system boot  5.15.119         Mon Jul 10 16:24   still running

NetworkManager-dispatcher.service seems to run at various times that aren't related to reboots or nixos-rebuild switch.

pinpox commented 1 year ago

@neilmayhew what exactly does that do? If I understand the man page correctly, -q only omits the output, without changing any functionality.

neilmayhew commented 1 year ago

@pinpox The -q was already there, and I'm just removing the -s. Sorry I didn't make that clearer.

With the extra configuration:

$ systemctl cat NetworkManager-wait-online.service
# /etc/systemd/system/NetworkManager-wait-online.service
[Unit]
Description=Network Manager Wait Online
Documentation=man:nm-online(1)
Requires=NetworkManager.service
After=NetworkManager.service
Before=network-online.target

[Service]
# `nm-online -s` waits until the point when NetworkManager logs
# "startup complete". That is when startup actions are settled and
# devices and profiles reached a conclusive activated or deactivated
# state. It depends on which profiles are configured to autoconnect and
# also depends on profile settings like ipv4.may-fail/ipv6.may-fail,
# which affect when a profile is considered fully activated.
# Check NetworkManager logs to find out why wait-online takes a certain
# time.

Type=oneshot
ExecStart=/nix/store/r7fwxkxbgmwyrff5n03r9nchzskj59n9-networkmanager-1.40.16/bin/nm-online -s -q
RemainAfterExit=yes

# Set $NM_ONLINE_TIMEOUT variable for timeout in seconds.
# Edit with `systemctl edit NetworkManager-wait-online`.
#
# Note, this timeout should commonly not be reached. If your boot
# gets delayed too long, then the solution is usually not to decrease
# the timeout, but to fix your setup so that the connected state
# gets reached earlier.
Environment=NM_ONLINE_TIMEOUT=60

[Install]
WantedBy=network-online.target

# /nix/store/2ma3yvlqmj94dr6v24nkwmspgxfpfnmq-system-units/NetworkManager-wait-online.service.d/overrides.conf
[Unit]

[Service]
Environment="LOCALE_ARCHIVE=/nix/store/g135laybfk1c7bm2zhrh7x6dv884qxal-glibc-locales-2.35-224/lib/locale/locale-archive"
Environment="PATH=/nix/store/l6jgwxkc3jhr029vfzfwzcy28iyckwsj-coreutils-9.1/bin:/nix/store/gn1s1s5z19cf0wiir2cd38jckcjc6kn6-findutils-4.9.0/bin:/nix/store/pvb117r7fhwb08717ks21a6y>
Environment="TZDIR=/nix/store/z0kg1c0f8fx6r4rgg5bdy01lb2b9izqg-tzdata-2023a/share/zoneinfo"

ExecStart=
ExecStart=/nix/store/r7fwxkxbgmwyrff5n03r9nchzskj59n9-networkmanager-1.40.16/bin/nm-online -q
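The two ExecStart lines at the end of that drop-in map directly onto the two list elements in the Nix override. An annotated sketch of the same override, with comments added for explanation only:

```nix
{ pkgs, ... }:
{
  systemd.services.NetworkManager-wait-online.serviceConfig.ExecStart = [
    # Rendered as a bare "ExecStart=" line. For systemd, an empty
    # assignment clears the command inherited from the upstream unit.
    ""
    # Rendered as the second "ExecStart=" line: the upstream command
    # with -s dropped and -q kept.
    "${pkgs.networkmanager}/bin/nm-online -q"
  ];
}
```

Without the leading empty string, the oneshot unit would end up with both commands and would still run nm-online -s -q first.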

lilyball commented 1 year ago

Adding tailscale0 to networking.networkmanager.unmanaged on my system that's currently experiencing this issue did not resolve the problem.

nixos-discourse commented 10 months ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/systemd-networkd-documentation-experience-feedback-notes/38660/5

AkechiShiro commented 7 months ago

Is there any way we can properly fix this issue?

Atemu commented 7 months ago

Not on the NixOS side; we'd have to fix it upstream I imagine.

AkechiShiro commented 7 months ago

I'll try and see how I can raise this issue upstream. We have to be careful to explain how this issue impacts NixOS users, but also how other distros might benefit from a fix (if any would).

charmoniumQ commented 6 months ago

Is the problem that nm-online with -s on an already-up system hangs (what this thread seems to think)?

Or is the problem that NetworkManager-wait-online.service should not have -s (what upstream seems to think)?

It depends on what you think "wait for X" should mean when "X" has already happened. Should it wait for the next X, or should it return immediately (no waiting necessary)?

Atemu commented 6 months ago

I do not see any confusion about this from the upstream side; they did not claim that distros should not use -s at any point in the thread you linked.

We want a flag that does what -s is currently documented to supposedly do:

 -s | --wait-for-startup
     Wait for NetworkManager startup to complete, rather than waiting for network connectivity specifically. Startup is considered complete once NetworkManager has activated (or
     attempted to activate) every auto-activate connection which is available given the current network state. This corresponds to the moment when NetworkManager logs "startup
     complete". This mode is generally only useful at boot time. After startup has completed, nm-online -s will just return immediately, regardless of the current network state.

     There are various ways to affect when startup complete is reached. For details see NetworkManager-wait-online.service(8).

The manual even explicitly clarifies behaviour after bootup is complete.

This is precisely what we want but nm-online simply does not work as documented. Instead of returning immediately, it blocks indefinitely under certain conditions.

inmaldrerah commented 6 months ago

Is this related to a kernel/firmware upgrade? This problem just occurred on my machine after a nixos-rebuild switch, and manually invoking nm-online -s also waited 30s and timed out. After a reboot, however, I can no longer reproduce it, even with another nixos-rebuild switch. The first nixos-rebuild switch updated my kernel and firmware, going from linux 6.9.1 to 6.9.2.

Oh, NetworkManager itself is updated and restarted.

K900 commented 6 months ago

No, it's not.

inmaldrerah commented 6 months ago

I noted that tailscaled.service is restarted before NetworkManager.service and NetworkManager-wait-online.service.

restarting the following units: home-manager-inme.service, home-manager-root.service, nix-daemon.service, polkit.service, systemd-journald.service, tailscaled.service
starting the following units: NetworkManager-wait-online.service, NetworkManager.service, audit.service, bluetooth.service, kmod-static-nodes.service, logrotate-checkconf.service, mount-pstore.service, network-local-commands.service, network-setup.service, persist-\x27-nix-persist-etc-machine\x2did\x27.service, persist-\x27-nix-persist-home-inme-.bash_history\x27.service, persist-\x27-nix-persist-home-inme-.gitconfig\x27.service, persist-\x27-nix-persist-home-inme-.repo_.gitconfig.json\x27.service, power-profiles-daemon.service, resolvconf.service, rtkit-daemon.service, systemd-hostnamed.service, systemd-modules-load.service, systemd-oomd.socket, systemd-sysctl.service, systemd-timesyncd.service, systemd-udevd-control.socket, systemd-udevd-kernel.socket, systemd-update-done.service, systemd-vconsole-setup.service, waydroid-container.service, wpa_supplicant.service

Also confirmed in journalctl:

```
May 28 07:26:58 thinkbook-16-plus-nixos systemd[1]: Started Tailscale node agent.
░░ Subject: A start job for unit tailscaled.service has finished successfully
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit tailscaled.service has finished successfully.
░░
░░ The job identifier is 3587.
...
May 28 07:26:58 thinkbook-16-plus-nixos systemd[1]: Starting Network Manager...
░░ Subject: A start job for unit NetworkManager.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit NetworkManager.service has begun execution.
░░
░░ The job identifier is 3744.
...
May 28 07:26:58 thinkbook-16-plus-nixos systemd[1]: Starting Network Manager Wait Online...
░░ Subject: A start job for unit NetworkManager-wait-online.service has begun execution
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░
░░ A start job for unit NetworkManager-wait-online.service has begun execution.
░░
░░ The job identifier is 3743.
```

And if I systemctl stop tailscaled before nixos-rebuild switch, tailscaled.service will be started after NetworkManager.service is started, and then NetworkManager-wait-online.service will exit gracefully.

Atemu commented 6 months ago

Big if true.

That feels like something that should be solvable using a few systemd Before/After directives in the right places.

Though tailscaled.service already contains

After=network-pre.target NetworkManager.service systemd-resolved.service

IIRC After should be applied in reverse when stopping/restarting.
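For anyone who wants to experiment, this is roughly how an extra ordering directive is expressed on the NixOS side. It is an untested sketch, not a confirmed fix: the choice of NetworkManager-wait-online.service as the unit to order against is purely an assumption for illustration, and since the log above shows restarts and starts listed as separate steps, it is unclear whether unit ordering alone changes anything.

```nix
{
  systemd.services.tailscaled = {
    # List-valued options merge with the module's own definition, so this
    # adds to the existing After= entries rather than replacing them.
    # Hypothetical experiment only; not verified to resolve the failure.
    after = [ "NetworkManager-wait-online.service" ];
  };
}
```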
