Open nilo85 opened 4 months ago
Hmm at first thought - I have the below systemd override because cloud-init hangs for 120 seconds at boot if there is no network found, I lower the timeout to 10 seconds. Maybe this is the culprit?
https://github.com/Joshua-Riek/ubuntu-rockchip-settings/blob/noble/data/server/override.conf
Full path of the override is /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
It renders itself as /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf in the end? I can try a new flash and hot patch the file before first boot =)
Yeah, give that a try. I'd change the value to something more reasonable like 60 seconds; either way, I now think 10 seconds may be too aggressive.
cat /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
# Remove 120 second network delay
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=60
Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 30 Apr 2024 23:20:01 +0000. Up 10.54 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False | . | . | . | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False | . | . | . | c0:74:2b:fe:97:ec |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 23:20:02 +0000. Up 11.20 seconds.
2024-04-30 23:20:02,892 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-04-30 23:20:03,000 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-04-30 23:20:03,002 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-04-30 23:20:03,004 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Tue, 30 Apr 2024 23:20:11 +0000. Up 20.14 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Tue, 30 Apr 2024 23:20:11 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local]. Up 20.35 seconds
Looks like it has no effect, and judging by the uptime in the log it does not wait at all. Maybe systemd-networkd-wait-online considers loopback enough?
I think it's because systemd handles the DNS (this has been a pain to deal with tbh).
For example, if you flash the OS to an SD Card and chroot into it, you won't be able to run apt update because DNS is expected to be handled by systemd.
Try modifying /etc/resolv.conf to contain nameserver 8.8.8.8.
I don't think it's running the command at all. Running it as configured, it blocks until the timeout, as it seems to wait for all devices to be up, and I only have 1 of 2 ports connected.
# ubuntu@k3s-1:~$ time /lib/systemd/systemd-networkd-wait-online --timeout 5
Timeout occurred while waiting for network connectivity.
real 0m5.054s
user 0m0.004s
sys 0m0.012s
# ubuntu@k3s-1:~$ time /lib/systemd/systemd-networkd-wait-online -i enP4p65s0 --timeout 5
real 0m0.007s
user 0m0.007s
sys 0m0.000s
# ubuntu@k3s-1:~$ time /lib/systemd/systemd-networkd-wait-online -i -i enP3p49s0 --timeout 5
Timeout occurred while waiting for network connectivity.
real 0m5.185s
user 0m0.004s
sys 0m0.004s
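One way to see which ports could satisfy the wait is to read each interface's operational state before choosing what to pass to -i. This is only a minimal sketch and not something taken from this thread: list_link_states and its optional directory argument are hypothetical names, and it assumes the kernel exposes /sys/class/net/<name>/operstate as on typical Linux systems.

```shell
# Sketch: print "<interface> <operstate>" for every network device.
# The optional first argument (hypothetical, for this example) lets you
# point the function at another directory instead of /sys/class/net.
list_link_states() {
    net_root="${1:-/sys/class/net}"
    for dev in "$net_root"/*; do
        # Skip entries that do not expose an operstate file.
        [ -r "$dev/operstate" ] || continue
        printf '%s %s\n' "$(basename "$dev")" "$(cat "$dev/operstate")"
    done
}

list_link_states
```

An interface reporting up has carrier and is a candidate for -i; loopback normally reports unknown. A link being up does not by itself mean networkd considers it configured, so treat this as a first check only.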
Maybe we want to run
/lib/systemd/systemd-networkd-wait-online --any --ignore=lo --timeout 60
Could the double ExecStart= assignment be an issue?
Double ExecStart= is required for overrides, IIRC. Try to modify /etc/resolv.conf?
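For reference, the reason for the two assignments is that an empty ExecStart= clears the command inherited from the packaged unit, so the next line replaces it rather than being added to it. Below is a minimal sketch of such a drop-in, written into a scratch directory so it can be tried without touching the live /etc. STAGE_ROOT is a hypothetical variable introduced for this example, and the flag combination is just the one floated above, not a confirmed fix.

```shell
# Sketch: stage the drop-in under a scratch root instead of the live /etc.
# STAGE_ROOT is a hypothetical name; it defaults to a fresh temp directory.
STAGE_ROOT="${STAGE_ROOT:-$(mktemp -d)}"
dropin_dir="$STAGE_ROOT/etc/systemd/system/systemd-networkd-wait-online.service.d"
mkdir -p "$dropin_dir"

cat > "$dropin_dir/override.conf" <<'EOF'
[Service]
# The empty assignment resets the inherited command list; without it the
# new command would be added alongside the original instead of replacing it.
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --any --ignore=lo --timeout=60
EOF

echo "wrote $dropin_dir/override.conf"
```

On a real system the file would go under /etc, followed by systemctl daemon-reload; systemctl cat systemd-networkd-wait-online.service then shows the packaged unit together with the drop-in.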
if you flash the OS to an SD Card and chroot into it, you won't be able to run apt update because DNS is expected to be handled by systemd.
I didn't chroot into it; I flashed it, then mounted the ext4 partition and modified the file, unmounted, powered off, pulled out the SD card and booted.
Sure, I can give resolv.conf a try. However, looking at the logs, only loopback is up, and the error comes very shortly after that output. Resolving hosts etc. works fine from the machine, and when I run the command post-boot it does not return before the timeout, because it seems to expect all interfaces to be up.
resolv.conf seems to be a link to a non-existent file:
ubuntu@opi-installer:~/opi-flasher$ ls mount/etc/resolv.conf -al
lrwxrwxrwx 1 root root 39 Apr 30 04:56 mount/etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf
ubuntu@opi-installer:~/opi-flasher$ ls mount/run
blkid mount needrestart reboot-required
I'll remove the link and replace it with a file with this content:
nameserver 8.8.8.8
Still not working.
ubuntu@k3s-1:~$ cat /etc/resolv.conf
nameserver 8.8.8.8
ubuntu@k3s-1:~$ cat /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
# Remove 120 second network delay
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=120
ubuntu@k3s-1:~$ cat /var/log/cloud-init-output.log
.......
Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 30 Apr 2024 23:47:49 +0000. Up 11.58 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False | . | . | . | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False | . | . | . | c0:74:2b:fe:97:ec |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 23:47:50 +0000. Up 12.27 seconds.
sudo: unable to resolve host k3s-1: Temporary failure in name resolution
2024-04-30 23:47:50,998 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-04-30 23:47:51,110 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-04-30 23:47:51,113 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-04-30 23:47:51,114 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Tue, 30 Apr 2024 23:47:59 +0000. Up 21.10 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Tue, 30 Apr 2024 23:47:59 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local]. Up 21.32 seconds
I really think this wait thing is not running at all, or maybe it's running, but not as a prerequisite to "modules:config".
Sidenote: Found this in the log too
2024-04-30 23:47:44,677 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
sudo cloud-init schema --system
sudo: unable to resolve host k3s-1: Name or service not known
Found cloud-config data types: user-data, network-config
1. user-data at /var/lib/cloud/instances/cloud-image/cloud-config.txt:
Invalid user-data /var/lib/cloud/instances/cloud-image/cloud-config.txt
Error: Cloud config schema errors: chpasswd.users.0: Additional properties are not allowed ('groups', 'password' were unexpected), chpasswd.users.0: {'groups': ['video'], 'name': 'ubuntu', 'password': 'ubuntu', 'type': 'text'} is not valid under any of the given schemas
2. network-config at /var/lib/cloud/instances/cloud-image/network-config.json:
Valid schema network-config
Error: Invalid schema: user-data
I think this part I didn't modify =) EDIT: link: https://github.com/Joshua-Riek/ubuntu-rockchip/blob/main/overlay/boot/firmware/user-data#L28
# On first boot, set the (default) ubuntu user's password to "ubuntu" and
# expire user passwords
chpasswd:
  expire: true
  users:
    - name: ubuntu
      password: ubuntu
      type: text
      groups:
        - video
The "unable to resolve host" error is not too unexpected, as we use Google DNS now (and probably no hosts file "hack") =)
But it seems it is not happy about groups under chpasswd. Compare with the example here: https://cloudinit.readthedocs.io/en/latest/reference/modules.html#set-passwords
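To illustrate what the schema message is complaining about, here is a minimal sketch of a chpasswd block that keeps only name, password and type for the user entry. It is written to a temporary file so it can be validated separately. USERDATA_FILE is a hypothetical variable for this example, and where the video group membership should go instead is not settled in this thread, so it is simply left out here.

```shell
# Sketch: write a user-data fragment whose chpasswd entry has no "groups" key.
# USERDATA_FILE is a hypothetical name; it defaults to a fresh temp file.
userdata="${USERDATA_FILE:-$(mktemp)}"

cat > "$userdata" <<'EOF'
#cloud-config
chpasswd:
  expire: true
  users:
    - name: ubuntu
      password: ubuntu
      type: text
EOF

echo "wrote $userdata"
```

On a machine with cloud-init installed, cloud-init schema --config-file "$userdata" should report whether the fragment passes validation.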
I have no idea what I am looking at, but in a Google result where someone mentioned a similar issue I found an interesting command to run. I don't know what it does exactly, but it looks like "network-pre.target" ran before "cloud-init-local.service"; maybe that wait thing needs to be in "network-pre.target"?
https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1636912
ubuntu@k3s-1:~$ systemd-analyze critical-chain systemd-networkd.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
systemd-networkd.service +49ms
└─network-pre.target @7.613s
  └─cloud-init-local.service @930ms +6.682s
    └─systemd-remount-fs.service @571ms +32ms
      └─systemd-journald.socket @518ms
        └─system.slice @486ms
          └─-.slice @486ms
Full output
ubuntu@k3s-1:~$ systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.
graphical.target @17.188s
└─multi-user.target @17.188s
  └─snapd.seeded.service @8.482s +8.704s
    └─basic.target @8.384s
      └─sockets.target @8.384s
        └─snap.lxd.user-daemon.unix.socket @10.569s
          └─sysinit.target @8.321s
            └─cloud-init.service @7.712s +604ms
              └─cloud-init-local.service @930ms +6.682s
                └─systemd-remount-fs.service @571ms +32ms
                  └─systemd-journald.socket @518ms
                    └─system.slice @486ms
                      └─-.slice @486ms
Very new to cloud-init, but maybe some of the "wants" are wrong in /etc/systemd/system/cloud-init.target.wants
In "cloud-init.service" I find this:
After=networking.service
Before=network-online.target
Maybe whatever wait command we want to override should be listed in After=, and we are not in the Before= (so it runs after)?
Found a verbose cloud-init log file =) https://gist.github.com/nilo85/16224aa34b79998a328bd0a5ddc888c0
It seems there is less than a second between cloud-init:init and cloud-init:modules:config, and I found this in the cloud-init docs:
Cloud-init then exits and expects for the continued boot of the operating system to bring network configuration up as configured. https://cloudinit.readthedocs.io/en/latest/explanation/boot.html#local
So I suspect whatever is supposed to wait for the network doesn't really work (and I bet the command always times out, since without --any it seems to wait for all interfaces to come online). If this --timeout 120 ran before the modules stage, I would expect to see 120s pass between the stages.
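The gap between stages can be read straight out of the output log rather than by eye. This is a minimal sketch under the assumption that the stage lines look like the ones pasted above ("running '<stage>' at ... Up N seconds."); stage_gaps is a hypothetical helper name, and the sample lines are copied from the log in this thread.

```shell
# Sketch: print each cloud-init stage, its uptime, and the gap to the
# previous stage. stage_gaps is a hypothetical helper for this example.
stage_gaps() {
    sed -n "s/.*running '\([^']*\)' at .*Up \([0-9.]*\) seconds.*/\1 \2/p" "$1" |
        LC_ALL=C awk '{
            if (NR > 1) printf "%s %.2f (+%.2f)\n", $1, $2, $2 - prev
            else        printf "%s %.2f\n", $1, $2
            prev = $2
        }'
}

# Sample input: three stage lines taken from the log above.
sample="$(mktemp)"
cat > "$sample" <<'EOF'
Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 30 Apr 2024 23:47:49 +0000. Up 11.58 seconds.
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 23:47:50 +0000. Up 12.27 seconds.
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Tue, 30 Apr 2024 23:47:59 +0000. Up 21.10 seconds.
EOF

stage_gaps "$sample"
# init 11.58
# modules:config 12.27 (+0.69)
# modules:final 21.10 (+8.83)
```

A wait of anything like 120 seconds ahead of modules:config would show up as a large gap on the second line; here it is under a second. On a live system the same function can be pointed at /var/log/cloud-init-output.log.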
Time to go to bed =)
Found out you could get a plotted image of the whole process with "systemd-analyze plot > something.svg", maybe this shows what went on https://gist.github.com/nilo85/9f963029f9235b580fd37482fff6d7ed
EDIT: Looks like the wait script runs concurrently with cloud init config (and wrong order)
EDIT2: I now see there is both a .target and a .service, .service is probably the one we want and it is run a bit further down and after network-online.target
I just flashed 24.04 on an RPi4, and this is how the paths differ:
OrangePi5 / RaspberryPi4 (comparison attachments not preserved)
It seems clear to me that on the official RPi Ubuntu image, cloud-config is supposed to run after network-online.target.
Checking paths for network-online.target: OrangePi5 / RaspberryPi4 (comparison attachments not preserved)
Content of this file is identical, and both seem to be after network-online.target
cat /etc/systemd/system/cloud-init.target.wants/cloud-config.service
[Unit]
Description=Apply the settings specified in cloud-config
After=network-online.target cloud-config.target
Before=systemd-user-sessions.service
Wants=network-online.target cloud-config.target
ConditionPathExists=!/etc/cloud/cloud-init.disabled
ConditionKernelCommandLine=!cloud-init=disabled
ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled
[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init modules --mode=config
RemainAfterExit=yes
TimeoutSec=0
# Output needs to appear in instance console output
StandardOutput=journal+console
[Install]
WantedBy=cloud-init.target
EDIT: The network also looks uninitialized on the RPi in the output here, so maybe you are onto something about DNS etc.? RPi cloud-init output log:
Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 23 Apr 2024 14:02:17 +0000. Up 29.30 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | eth0 | False | . | . | . | dc:a6:32:23:68:29 |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: | wlan0 | False | . | . | . | dc:a6:32:23:68:2c |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 23 Apr 2024 14:02:20 +0000. Up 31.97 seconds.
2024-05-01 07:24:15,583 INFO Authorized key ['256', 'SHA256:3bve13Z3I9fHInXNHQgPIv1RvXUBtE22MQDhpNnoxFU', 'nilo85@github/87004172', '(ED25519)']
2024-05-01 07:24:15,599 INFO Authorized key ['256', 'SHA256:+F9Uod43e1EEiWdUzfWHeu1982KwOzt+EjQ4bu1arZs', 'nilo85@github/95981694', '(ED25519)']
2024-05-01 07:24:15,615 INFO Authorized key ['3072', 'SHA256:JIEAP83yRzEgsKkJdjRmdG0ny5xzxWmTFsWc/44s7cQ', 'nilo85@github/96126473', '(RSA)']
2024-05-01 07:24:15,616 INFO [3] SSH keys [Authorized]
EDIT: Now that I think about it, none of the plots contained the wait-online.service we overrode, and on the RPi no such file exists, so I'm not sure what we are overriding =)
EDIT: I patched the ssh-import-id script to run "ip a" and "nslookup google.com", this is the output:
Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Wed, 01 May 2024 08:08:32 +0000. Up 11.05 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device | Up | Address | Mask | Scope | Hw-Address |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False | . | . | . | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False | . | . | . | c0:74:2b:fe:97:ec |
ci-info: | lo | True | 127.0.0.1 | 255.0.0.0 | host | . |
ci-info: | lo | True | ::1/128 | . | host | . |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Wed, 01 May 2024 08:08:33 +0000. Up 11.74 seconds.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host noprefixroute
valid_lft forever preferred_lft forever
2: enP3p49s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
link/ether c0:74:2b:fe:97:ed brd ff:ff:ff:ff:ff:ff
3: enP4p65s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
link/ether c0:74:2b:fe:97:ec brd ff:ff:ff:ff:ff:ff
Server: 127.0.0.53
Address: 127.0.0.53#53
** server can't find google.com: SERVFAIL
2024-05-01 08:08:33,480 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-05-01 08:08:33,589 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-05-01 08:08:33,592 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-05-01 08:08:33,592 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Wed, 01 May 2024 08:08:41 +0000. Up 20.66 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Wed, 01 May 2024 08:08:42 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local]. Up 20.86 seconds
So seems indeed the network is not up at the point it tries to run it
EDIT: Wait.. this looks weird? https://github.com/Joshua-Riek/ubuntu-rockchip-settings/blob/noble/data/meson.build#L8 Are we overwriting the wrong file? getty-wait written to cloud-config?
EDIT: I noticed that the override was going to /lib/... and not /usr/lib like the default one; however, it didn't make a difference.
Now editing the cloud-config, it looks like maybe this getty thing might be the issue after all? It overrides Before, maybe =)
EDIT: No, I guess not, because After is still network-online.target... but why didn't it show up in the graph? This is a crash course in systemd =D
EDIT: Comparing the journal with "journalctl -b", I see that on the OPi the network is considered up immediately, while on the RPi that happens much further down, so I wonder if there might be some issue with the netplan config.
EDIT: netplan looks identical except en* vs eth0. However, running "sudo journalctl -xeu systemd-networkd.service" I can see that
at 08:45:13 it gets an IP from DHCP, but this comes a bit too early:
May 01 08:45:07 k3s-1 systemd[1]: Reached target network-online.target - Network is Online.
EDIT: Found something!!!
sudo journalctl -xeu systemd-networkd-wait-online.service
May 01 08:45:06 k3s-1 systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service).
Based on my latest finding I have a strong suspicion this is the issue
May 01 08:45:06 k3s-1 systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service).
someone else had similar issue updating to 24.04 from 23.XX here https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/2063973/comments/3
Due to some name mismatch in initramfs vs some udev renaming, could we have a similar issue? =)
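The skipped-condition message can be checked directly: ConditionPathIsSymbolicLink= only lets the unit start when the given path is a symlink. Here is a minimal sketch of that check. check_wait_online_link is a hypothetical helper name, the default path is the one from the journal line above, and which generator is expected to create the link is not established in this thread.

```shell
# Sketch: succeed only when the path is a symbolic link, which is what
# ConditionPathIsSymbolicLink= tests before the unit is allowed to start.
# check_wait_online_link is a hypothetical helper for this example.
check_wait_online_link() {
    link="${1:-/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service}"
    if [ -L "$link" ]; then
        echo "present: $link -> $(readlink "$link")"
    else
        echo "missing: $link"
        return 1
    fi
}

check_wait_online_link || echo "wait-online would be skipped by the condition check"
```

If the link is missing right after boot, the wait service is skipped regardless of what any ExecStart= override says, which would match the behaviour seen in the logs.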
RPi netplan config:
sudo cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
  ethernets:
    eth0:
      dhcp4: true
      optional: true
  version: 2
OPi netplan config
sudo cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource. Changes
# to it will not persist across an instance reboot. To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
  ethernets:
    zz-all-en:
      dhcp4: true
      match:
        name: en*
      optional: true
    zz-all-eth:
      dhcp4: true
      match:
        name: eth*
      optional: true
  version: 2
If I get this right, it probably does not like "zz-all-en" and expects "enP4p65s0"
EDIT: I patched the cloud-init provided network file to use the explicit interface name:
network:
  ethernets:
    enP4p65s0:
      dhcp4: true
      optional: true
  version: 2
But still the same issue. I think I am close to figuring it out, but maybe the next step is the udev rules magic I am missing. I have my hopes up that @Joshua-Riek can interpret this =)
FINAL EDIT:
I got it working, this is the content of my network-config file:
# This file contains a netplan-compatible configuration which cloud-init will
# apply on first-boot (note: it will *not* update the config after the first
# boot). Please refer to the cloud-init documentation and the netplan reference
# for full details:
#
# https://netplan.io/reference
# https://cloudinit.readthedocs.io/en/latest/topics/network-config.html
# https://cloudinit.readthedocs.io/en/latest/topics/network-config-format-v2.html
#
# Please note that the YAML format employed by this file is sensitive to
# differences in whitespace; if you are editing this file in an editor (like
# Notepad) which uses literal tabs, take care to only use spaces for
# indentation. See the following link for more details:
#
# https://en.wikipedia.org/wiki/YAML
# Some additional examples are commented out below
network:
  renderer: networkd
  ethernets:
    enP4p65s0:
      dhcp4: true
  version: 2
not sure if its renderer and / or the removal of optional that did the trick in the end. However RPi config had optional so not sure.. but now it works for my setup 🥳
Maybe this is the root cause to your issue where you wanted to shorten the timeout in the first place?
Sorry just catching up on this now, because there are so many devices supported with different network interfaces, I tried to use a generic config to match all ethernet interfaces (hence the "zz-all-en" and "zz-all-eth").
I think that I will need to keep track of the networking interfaces for each board and set them accordingly during the image creation process. Thanks for the testing and looking into this a whole bunch :)
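As a rough illustration of setting the interface per board at image creation time, the sketch below renders a network-config with an explicit interface name, mirroring the working config posted above. render_network_config is a hypothetical helper, and how the image build would look up the right name for each board is an assumption left open here.

```shell
# Sketch: emit a netplan-style network-config for one named interface.
# render_network_config is a hypothetical helper for this example.
render_network_config() {
    iface="$1"
    if [ -z "$iface" ]; then
        echo "usage: render_network_config <interface>" >&2
        return 1
    fi
    cat <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    ${iface}:
      dhcp4: true
EOF
}

# Example: the interface name that worked on the Orange Pi 5 above.
render_network_config enP4p65s0
```

The output would be redirected into the network-config file on the boot partition during image creation; a board with a different interface name only needs a different argument.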
Was a nice journey to the modern linux userspace =D last time I touched something like this, init.d was the newest coolest thing =D
I think what I discovered here works well enough as a workaround for now. The users who depend on cloud-init also have control over network-config via the FAT partition, so it shouldn't be a biggie =)
I just started investing enough time into my little setup that I really don't want to end up wanting to add another node after having manually run a lot of setup commands, so for me it was critical that I got cloud-init working =D
@nilo85 nice work debugging this
Hmm at first thought - I have the below systemd override because cloud-init hangs for 120 seconds at boot if there is no network found, I lower the timeout to 10 seconds. Maybe this is the culprit?
@Joshua-Riek Please don't do this. Overriding systemd-networkd-wait-online.service as a workaround for that timeout might allow boot to proceed, but doing this is going to break a lot more than just cloud-init. All of the DNS symptoms reported by @nilo85 related to ssh_import_id are a byproduct of bypassing network online, meaning that cloud-init is working in an environment where it expects the network to be online, yet it is not. If there is a bug in netplan, please report it.
someone else had similar issue updating to 24.04 from 23.XX here https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/2063973/comments/3
Due to some name mismatch in initramfs vs some udev renaming, could we have a similar issue? =)
Sounds likely.
not sure if its renderer and / or the removal of optional that did the trick in the end. However RPi config had optional so not sure.. but now it works for my setup 🥳
I would guess the former. I'd be surprised if removing optional fixed your issue, and if it did that should probably be reported to netplan since that is the opposite of what I would expect. @Joshua-Riek Any ideas why netplan wouldn't detect the right backend automatically? Maybe Rockchip has NetworkManager installed as well?
I will take a closer look into cloud-init over the weekend and see how this should be properly addressed. I've been taking a little break the past few days with the release of Ubuntu 24.04.
Ok I have this fixed now and found two problems:
@holmanb, I have a large portion of users who may not have ethernet and will not configure cloud-init for their use case. Because of this some users can experience a two-minute boot delay due to systemd-networkd-wait-online. Is there a way to properly adjust the timeout for systemd-networkd-wait-online or is it imperative that the service hangs for the full two minutes?
Is there a way to properly adjust the timeout for systemd-networkd-wait-online or is it imperative that the service hangs for the full two minutes?
Adjusting timeouts is the wrong approach to solving this problem.
From @nilo85's comment about renderer: networkd above (and the NetworkManager configurations I see in the source tree), I suspect that NetworkManager is also enabled on the system and therefore netplan is rendering a NetworkManager config rather than a networkd configuration, which is why your optional: true setting did nothing before. IIRC the default networkd configuration causes systemd-networkd-wait-online.service to wait for one interface, which would explain why you are seeing this timeout.
If you are using NetworkManager, then systemd-networkd.service (and associated units like systemd-networkd-wait-online.service) should NOT be enabled. I think that this is the real fix to this issue - pick one or the other.
@Joshua-Riek If you actually do want to use systemd-networkd rather than NetworkManager, then I suggest that you revert this change:
Network interfaces cannot be optional (fixed with https://github.com/Joshua-Riek/ubuntu-rockchip/commit/1580bfabf120cd768eaa892162d833cc3a766144)
Per the docs:
An optional device is not required for booting. Normally, networkd will wait some time for device to become configured before proceeding with booting. However, if a device is marked as optional, networkd will not wait for it. This is only supported by networkd, and the default is false.
You actually do want optional: true for instances that have no interfaces if you are using systemd.
However, if you decide to use only NetworkManager, then it really doesn't matter since this config only affects systemd-networkd - just make sure that you have systemd-networkd disabled.
Thanks @holmanb for the detailed information, this is very insightful. I double checked and NetworkManager is not installed on the system, networkd is being used.
That being the case, when using optional: true I'm able to reproduce @nilo85's initial report. Could we have a bug where the network is not connected to even when there is a visible network interface?
Apr 30 05:14:17 ubuntu cloud-init[999]: Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 05:14:16 +0000. Up 14.27 seconds.
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,308 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,368 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,372 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['lp:jjriek']
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,373 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
May 03 20:37:43 ubuntu systemd[1]: cloud-config.service: Main process exited, code=exited, status=1/FAILURE
May 03 20:37:43 ubuntu systemd[1]: cloud-config.service: Failed with result 'exit-code'.
I double checked and NetworkManager is not installed on the system, networkd is being used.
Thanks for checking, nevermind then on that.
That being the case, when using optional: true I'm able to reproduce @nilo85's initial report.
Thanks for checking. This sounds like a possible repeat of LP: #2039083, however I think that the version of systemd shipped in 24.04 should have fixed this?
Perhaps you could try the suggestion proposed in that bug:
what happens if you add the --any flag to systemd-networkd-wait-online.service (best to do this with an override config), e.g.
/etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --any
That should make it so that it does not wait on all the other unmanaged interfaces.
Workaround
It turns out something changed on Ubuntu that doesn't like interfaces not having strict matches or something. In the post below I linked to an issue with a description, and I also explained how I worked around it by providing my own network-config file for cloud-init. https://github.com/Joshua-Riek/ubuntu-rockchip/issues/757#issuecomment-2088207668
Issue
Congrats on releasing 24.04! You worked long and hard for this.
I finally got around to playing with cloud-init, and I am currently having issues with importing SSH keys via GitHub.
I have the following snippet in user-data
cloud init logs look like this
It looks like it is trying too early, before the network is ready. It could very well be me doing something wrong, but all the other examples of ssh-import-id I have seen do not seem to do anything special, so maybe we need to tweak something here.