Closed dustymabe closed 2 years ago
From the console log, it appears that coreos-test-nm-keyfile.service
failed.
For this issue we pinned the kernel for a period of time to 5.19.0-0.rc8.20220727git39c3c396f813.60.fc37
. Later it was tested and found to be working so it was unpinned in https://github.com/coreos/fedora-coreos-config/commit/8563fe68e881e6128afe8124d2494dabc898231b.
This is still an issue. It's a race condition. I was able to get some debug output after doing a local build with:
$ git diff
diff --git a/image-base.yaml b/image-base.yaml
index 98c324df..d9b5f32b 100644
--- a/image-base.yaml
+++ b/image-base.yaml
@@ -9,6 +9,10 @@ size: 10
extra-kargs:
# Disable SMT on systems vulnerable to MDS or any similar future issue.
- mitigations=auto,nosmt
+ - systemd.log_level=debug
+ - systemd.log_color=no
+ - systemd.log_target=console
+ - debug
# Disable networking by default on firstboot. We can drop this once cosa stops
# defaulting to `ip=auto rd.neednet=1` when it doesn't see this key.
It turns out this is a legitimate failure. The coreos-test-nm-keyfile.service
runs three commands:
ExecStart=/usr/bin/journalctl -u nm-initrd --no-pager --grep "policy: set '%[1]s' (.*) as default .* routing and DNS"
ExecStart=/usr/bin/journalctl -u NetworkManager --no-pager --grep "policy: set '%[1]s' (.*) as default .* routing and DNS"
ExecStart=/usr/bin/grep "%[1]s" /etc/NetworkManager/system-connections/%[2]s
This service also runs After=network-online.target
but for some reason in this case NetworkManager-wait-online.service
runs and exits before enp0s5
appears to be there. So NetworkManager-wait-online.service
runs and exits and then coreos-test-nm-keyfile.service
runs and fails. All of this happens directly before some messages about enp0s5
in the log:
Received SIGCHLD from PID 2342 (nm-online).^M
Child 2342 (nm-online) died (code=exited, status=0/SUCCESS)^M
NetworkManager-wait-online.service: Child 2342 belongs to NetworkManager-wait-online.service.^M
NetworkManager-wait-online.service: Main process exited, code=exited, status=0/SUCCESS (success)^M
NetworkManager-wait-online.service: bpf-lsm: Failed to delete cgroup entry from LSM BPF map: No such file or directory^M
NetworkManager-wait-online.service: Changed start -> exited^M
NetworkManager-wait-online.service: Job 617 NetworkManager-wait-online.service/start finished, result=done^M
[ OK ] Finished NetworkManager-wait-onlineâ<80>¦ice - Network Manager Wait Online.^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/NetworkManager_2dwait_2donline_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=517 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/NetworkManager_2dwait_2donline_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=518 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=519 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/NetworkManager_2dwait_2donline_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=520 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/NetworkManager_2dwait_2donline_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=521 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
systemd-journald.service: Received EPOLLHUP on stored fd 86 (stored), closing.^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/network_2donline_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=522 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
network-online.target changed dead -> active^M
network-online.target: Job 616 network-online.target/start finished, result=done^M
[ OK ] Reached target network-online.target - Network is Online.^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/network_2donline_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=523 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=524 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a^M
coreos-test-nm-keyfile.service: Will spawn child (service_enter_start): /usr/bin/journalctl^M
[ 27.728662] kauditd_printk_skb: 121 callbacks suppressed^M
[ 27.728664] audit: type=1130 audit(1661821762.270:297): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=NetworkManager-wait-online comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'^M
Failed to read pids.max attribute of root cgroup, ignoring: No data available^M
coreos-test-nm-keyfile.service: Passing 0 fds to service^M
coreos-test-nm-keyfile.service: About to execute /usr/bin/journalctl -u nm-initrd --no-pager --grep "policy: set 'CoreOS DHCP' (.*) as default .* routing and DNS"^M
coreos-test-nm-keyfile.service: Forked /usr/bin/journalctl as 2438^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/coreos_2dtest_2dnm_2dkeyfile_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=525 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/coreos_2dtest_2dnm_2dkeyfile_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=526 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
coreos-test-nm-keyfile.service: Changed dead -> start^M
Starting coreos-test-nm-keyfile.serâ<80>¦O Verify NM Keyfile Propagation...^M
iscsi.service: ConditionDirectoryNotEmpty=/var/lib/iscsi/nodes failed.^M
iscsi.service: Starting requested but condition failed. Not starting unit.^M
iscsi.service: Job 627 iscsi.service/start finished, result=done^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/iscsi_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=527 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/iscsi_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=528 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=529 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dfs_2dpre_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=530 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
remote-fs-pre.target changed dead -> active^M
remote-fs-pre.target: Job 628 remote-fs-pre.target/start finished, result=done^M
[ OK ] Reached target remote-fs-pre.targetâ<80>¦eparation for Remote File Systems.^M
coreos-test-nm-keyfile.service: Executing: /usr/bin/journalctl -u nm-initrd --no-pager --grep "policy: set 'CoreOS DHCP' (.*) as default .* routing and DNS"^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dfs_2dpre_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=531 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=532 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dcryptsetup_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=533 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
remote-cryptsetup.target changed dead -> active^M
remote-cryptsetup.target: Job 630 remote-cryptsetup.target/start finished, result=done^M
[ OK ] Reached target remote-cryptsetup.target - Remote Encrypted Volumes.^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dcryptsetup_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=534 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=535 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dfs_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=536 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
remote-fs.target changed dead -> active^M
remote-fs.target: Job 626 remote-fs.target/start finished, result=done^M
[ OK ] Reached target remote-fs.target - Remote File Systems.^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dfs_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=537 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=JobRemoved cookie=538 reply_cookie=0 signature=uoss error-name=n/a error-message=n/a^M
systemd-user-sessions.service: starting held back, waiting for: console-login-helper-messages-gensnippet-ssh-keys.service^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dfs_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=539 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dcryptsetup_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=540 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/remote_2dfs_2dpre_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=541 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/iscsi_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=542 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/iscsi_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=543 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/coreos_2dtest_2dnm_2dkeyfile_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=544 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/coreos_2dtest_2dnm_2dkeyfile_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=545 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/network_2donline_2etarget interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=546 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/job/615 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=547 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
systemd-journald.service: Got notification message from PID 2183 (FDSTORE=1)^M
systemd-journald.service: Added fd 83 (n/a) to fd store.^M
Received SIGCHLD from PID 2438 (journalctl).^M
Child 2438 (journalctl) died (code=exited, status=0/SUCCESS)^M
coreos-test-nm-keyfile.service: Child 2438 belongs to coreos-test-nm-keyfile.service.^M
coreos-test-nm-keyfile.service: Main process exited, code=exited, status=0/SUCCESS (success)^M
coreos-test-nm-keyfile.service: Running next main command for state start.^M
coreos-test-nm-keyfile.service: Will spawn child (service_run_next_main): /usr/bin/journalctl^M
coreos-test-nm-keyfile.service: Passing 0 fds to service^M
coreos-test-nm-keyfile.service: About to execute /usr/bin/journalctl -u NetworkManager --no-pager --grep "policy: set 'CoreOS DHCP' (.*) as default .* routing and DNS"^M
coreos-test-nm-keyfile.service: Forked /usr/bin/journalctl as 2443^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/coreos_2dtest_2dnm_2dkeyfile_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=548 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/coreos_2dtest_2dnm_2dkeyfile_2eservice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=549 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
enp0s5: Processing udev action (SEQNUM=2133, ACTION=add)^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=550 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=551 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
sys-subsystem-net-devices-enp0s5.device: Changed tentative -> plugged^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2ddevices_2dpci0000_3a00_2d0000_3a00_3a05_2e0_2dvirtio4_2dnet_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=552 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
coreos-test-nm-keyfile.service: Executing: /usr/bin/journalctl -u NetworkManager --no-pager --grep "policy: set 'CoreOS DHCP' (.*) as default .* routing and DNS"^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2ddevices_2dpci0000_3a00_2d0000_3a00_3a05_2e0_2dvirtio4_2dnet_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=553 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
sys-devices-pci0000:00-0000:00:05.0-virtio4-net-enp0s5.device: Changed tentative -> plugged^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2ddevices_2dpci0000_3a00_2d0000_3a00_3a05_2e0_2dvirtio4_2dnet_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=554 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2ddevices_2dpci0000_3a00_2d0000_3a00_3a05_2e0_2dvirtio4_2dnet_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=555 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=556 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/systemd1/unit/sys_2dsubsystem_2dnet_2ddevices_2denp0s5_2edevice interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=557 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a^M
systemd-journald.service: Got notification message from PID 2183 (FDSTORE=1)^M
systemd-journald.service: Added fd 86 (n/a) to fd store.^M
Received SIGCHLD from PID 2443 (journalctl).^M
Child 2443 (journalctl) died (code=exited, status=1/FAILURE)^M
coreos-test-nm-keyfile.service: Child 2443 belongs to coreos-test-nm-keyfile.service.^M
coreos-test-nm-keyfile.service: Main process exited, code=exited, status=1/FAILURE^M
coreos-test-nm-keyfile.service: Failed with result 'exit-code'.^M
coreos-test-nm-keyfile.service: Service will not restart (restart setting)^M
coreos-test-nm-keyfile.service: Changed start -> failed
You can see that coreos-test-nm-keyfile.service
is executing journalctl -u NetworkManager
right before enp0s5: Processing udev action (SEQNUM=2133, ACTION=add)
.
Full log: console.txt
btw, the NetworkManager-wait-online manual ([1] explains how this is supposed to work).
In the attached log, you always see startup complete
line after device (enp0s5): state change: secondaries -> activated
.
The logs are confusing to me, but I would not assume that NetworkManager-wait-online.service
completes before startup complete
line. Do you think that is the problem?
Otherwise, the assumption would be that NM thinks enp0s5
is fully activated, when in fact it is not.
runs and exits before enp0s5 appears to be there.
I don't see that. Why do you get to that conclusion?
btw, the NetworkManager-wait-online manual ([1] explains how this is supposed to work).
Thanks for the link.
In the attached log, you always see
startup complete
line afterdevice (enp0s5): state change: secondaries -> activated
.The logs are confusing to me, but I would not assume that
NetworkManager-wait-online.service
completes beforestartup complete
line. Do you think that is the problem?
I think maybe it's confusing because this test tests a full install, so there are actually two boots in there: the install boot and the Ignition boot. Can you look at just the second boot?
Otherwise, the assumption would be that NM thinks
enp0s5
is fully activated, when in fact it is not.
Or it's not seeing it at all because it's not activated yet. I need to investigate further but the page you linked to mentions:
The property connection.wait-device-timeout of the connection profiles waits until
the waited devices appear. This is useful if the driver takes a longer time to
detect the networking interfaces. Similar with the connection.gateway-ping-timeout property.
runs and exits before enp0s5 appears to be there.
I don't see that. Why do you get to that conclusion?
See the logs I pasted in my previous comment where you see [ OK ] Finished NetworkManager-wait-online
before network-online.target: Job 616 network-online.target/start finished
before enp0s5: Processing udev action (SEQNUM=2133, ACTION=add)
.
ok using wait-device-timeout
seems to have helped. Before we had:
[connection]
id=<id>
type=ethernet
[ipv4]
method=auto
Now with this it's working better:
[connection]
id=<id>
type=ethernet
wait-device-timeout=20000
[ipv4]
method=auto
I guess everything appears to be working as expected for NetworkManager-wait-online.service
. We did see a change in behavior in Rawhide/F37 with respect to this but I'm not sure whose fault it is (it's a race condition after all).
What I dont understand is why NetworkManager-wait-online.service
would exit early (again this is by design) if there aren't any connections that have been established. I guess I can take that discussion to the NM issue tracker or to IRC.
See the logs I pasted in my previous comment where you see [ OK ] Finished NetworkManager-wait-online before network-online.target: Job 616 network-online.target/start finished before enp0s5: Processing udev action (SEQNUM=2133, ACTION=add).
Looking at the second boot, in the console log from there, we see also
enp0s5: Processing udev action (SEQNUM=1768, ACTION=add)
in line 22928, before NetworkManager is starting. Also, NetworkManager logs about seeing that interface, for example in line 26958:
[ 7.517981] NetworkManager[1383]: <info> [1661821741.6415] manager: (enp0s5): new Ethernet device (/org/freedesktop/NetworkManager/Devices/2)
meaning, the device was there, and according to the log, NetworkManager activated the device before startup complete
was reached.
I would not think that wait-device-timeout
makes a difference in that example.
It can be right to set the property. For example, nm-initrd-generator
likes to set wait-device-timeout
, it's just not clear to me that this was the problem/solution in the provided log.
All of the above happens in initrd (of the second boot). Later, after switch root we see the mentioned
enp0s5: Processing udev action (SEQNUM=2133, ACTION=add)
in line 43061. I guess, you are talking about real-root of the second boot? There we don't see any logs from NetworkManager, so it's hard to tell what happens. Surely, the device did exist, because it already existed in initrd, and wait-device-timeout
would again not matter.
I guess, you are talking about real-root of the second boot? There we don't see any logs from NetworkManager, so it's hard to tell what happens. Surely, the device did exist, because it already existed in initrd, and
wait-device-timeout
would again not matter.
Correct. We are talking about the real-root of the second boot. In Fedora CoreOS we tear down any networking when we switch from the initrd to the real root. So Networking gets brought back up again in the real root when NM starts again.
Do you know why NetworkManager-wait-online.service
would behave differently there?
Note that the fix that we implemented set wait-device-timeout
and now our tests are passing. That indicates to me that it does matter.
The
miniso-install-nm
testiso scenario is failing in latest rawhide:I don't see anything immediately obvious in the logs (but I could be missing something):
The diff here looks something like:
I believe I've narrowed it down to the kernel update.