Joshua-Riek / ubuntu-rockchip

Ubuntu for Rockchip RK35XX Devices
https://joshua-riek.github.io/ubuntu-rockchip-download/
GNU General Public License v3.0

24.04: server: cloud-init: ssh_import_id not working due to network not yet ready #757

Open nilo85 opened 4 months ago

nilo85 commented 4 months ago

Workaround

It turns out something changed in Ubuntu that does not seem to like interfaces without strict matches. In the comment linked below I point to an issue that describes it, and explain how I worked around it by providing my own network-config file for cloud-init: https://github.com/Joshua-Riek/ubuntu-rockchip/issues/757#issuecomment-2088207668

Issue

Congrats on releasing 24.04! You worked long and hard for this.

I finally got around to playing with cloud-init and am currently having issues importing SSH keys from GitHub.

I have the following snippet in user-data:

## On first boot, use ssh-import-id to give the specific users SSH access to
## the default user
ssh_import_id:
  - gh:nilo85

The cloud-init logs look like this:

Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 30 Apr 2024 22:49:20 +0000. Up 11.70 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: |   Device  |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ec |
ci-info: |     lo    |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |     lo    |  True |  ::1/128  |     .     |  host |         .         |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 22:49:21 +0000. Up 12.38 seconds.
2024-04-30 22:49:22,093 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-04-30 22:49:22,184 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-04-30 22:49:22,186 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-04-30 22:49:22,187 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Tue, 30 Apr 2024 22:49:30 +0000. Up 21.19 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Tue, 30 Apr 2024 22:49:30 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local].  Up 21.38 seconds

It looks like it is trying too early, before the network is ready. It could very well be me doing something wrong, but none of the other ssh-import-id examples I have seen do anything special, so maybe we need to tweak something here.

Joshua-Riek commented 4 months ago

Hmm at first thought - I have the below systemd override because cloud-init hangs for 120 seconds at boot if there is no network found, I lower the timeout to 10 seconds. Maybe this is the culprit?

https://github.com/Joshua-Riek/ubuntu-rockchip-settings/blob/noble/data/server/override.conf

Full path of the override is /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf

nilo85 commented 4 months ago

It renders itself as /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf in the end? I can try a new flash and hot patch the file before first boot =)

Joshua-Riek commented 4 months ago

Yeah, give that a try. I'd change the value to something more reasonable like 60 seconds; either way, I now think 10 seconds may be too aggressive.

nilo85 commented 4 months ago

cat /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
# Remove 120 second network delay

[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=60

Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 30 Apr 2024 23:20:01 +0000. Up 10.54 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: |   Device  |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ec |
ci-info: |     lo    |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |     lo    |  True |  ::1/128  |     .     |  host |         .         |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 23:20:02 +0000. Up 11.20 seconds.
2024-04-30 23:20:02,892 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-04-30 23:20:03,000 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-04-30 23:20:03,002 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-04-30 23:20:03,004 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Tue, 30 Apr 2024 23:20:11 +0000. Up 20.14 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Tue, 30 Apr 2024 23:20:11 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local].  Up 20.35 seconds

Looks like it has no effect, and judging by the uptime in the log it does not wait at all. Maybe systemd-networkd-wait-online considers loopback enough?
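One way to check whether anything actually waited between stages is to pull the stage name and uptime out of /var/log/cloud-init-output.log. This is a minimal sketch, assuming the banner format shown in the logs above; `stage_uptimes` is a hypothetical helper name, not part of cloud-init:

```shell
# Print "<stage> <uptime-in-seconds>" for every cloud-init stage banner.
# Reads cloud-init-output.log text on stdin.
# stage_uptimes is a hypothetical helper, not a cloud-init command.
stage_uptimes() {
    sed -n "s/^Cloud-init v\. .* running '\([^']*\)' at .* Up \([0-9.]*\) seconds.*/\1 \2/p"
}
```

Run it as `stage_uptimes < /var/log/cloud-init-output.log`. In the log above, 'init' and 'modules:config' start less than a second apart, so a 60 second wait did not happen between them.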

Joshua-Riek commented 4 months ago

I think it's because systemd handles the DNS (this has been a pain to deal with tbh).

For example, if you flash the OS to an SD Card and chroot into it, you won't be able to run apt update because DNS is expected to be handled by systemd.

Try to modify the file /etc/resolv.conf to be nameserver 8.8.8.8.

nilo85 commented 4 months ago

I don't think it's running the command at all. Run as configured, it blocks until it times out, as it seems to wait for all devices to be up, and I only have 1 of 2 ports connected.

# ubuntu@k3s-1:~$ time /lib/systemd/systemd-networkd-wait-online  --timeout 5
Timeout occurred while waiting for network connectivity.

real    0m5.054s
user    0m0.004s
sys     0m0.012s

# ubuntu@k3s-1:~$ time /lib/systemd/systemd-networkd-wait-online -i enP4p65s0  --timeout 5

real    0m0.007s
user    0m0.007s
sys     0m0.000s

# ubuntu@k3s-1:~$ time /lib/systemd/systemd-networkd-wait-online -i -i enP3p49s0  --timeout 5
Timeout occurred while waiting for network connectivity.

real    0m5.185s
user    0m0.004s
sys     0m0.004s

nilo85 commented 4 months ago

Maybe we want to run

/lib/systemd/systemd-networkd-wait-online --any --ignore=lo --timeout 60

Could the double ExecStart= assignment be an issue?

Joshua-Riek commented 4 months ago

Maybe we want to run

/lib/systemd/systemd-networkd-wait-online --any --ignore=lo --timeout 60

Could the double ExecStart= assignment be an issue?

Double ExecStart= is required for overrides iirc. Try to modify /etc/resolv.conf?
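For reference, the reset-then-set pattern can be sketched as below. ExecStart= is a list setting in systemd, so the empty assignment clears the command inherited from the packaged unit and the following line becomes the only command. `write_wait_online_override` and its root-directory argument are hypothetical names for illustration; the command line is the one tried earlier in this thread, not a recommended fix:

```shell
# Write the drop-in under a chosen root directory.
# $1 = root to write under: "/" on a live system, or the mount point of a
#      flashed image. Function name and argument are hypothetical.
write_wait_online_override() {
    root=$1
    dir="$root/etc/systemd/system/systemd-networkd-wait-online.service.d"
    mkdir -p "$dir"
    cat > "$dir/override.conf" <<'EOF'
[Service]
# Empty assignment: reset the ExecStart list from the packaged unit.
ExecStart=
# Now the only command the unit will run (value taken from this thread).
ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=60
EOF
}
```

On a running system, `systemctl cat systemd-networkd-wait-online.service` should show the packaged unit followed by the drop-in, which is a quick way to confirm the override is being picked up.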

nilo85 commented 4 months ago

if you flash the OS to an SD Card and chroot into it, you won't be able to run apt update because DNS is expected to be handled by systemd.

I didn't chroot into it. I flashed it, then mounted the ext4 partition and modified the file, unmounted, powered off, pulled out the SD card and booted.

Sure, I can give resolv.conf a try. However, looking at the logs, only loopback is up, and the error appears very close in time to that output. Resolving hosts works fine from the machine after boot, and when I run the wait command post-boot it does not return before the timeout, because it seems to expect all interfaces to be up.

nilo85 commented 4 months ago

resolv.conf seems to be a link to a nonexistent file:

ubuntu@opi-installer:~/opi-flasher$ ls mount/etc/resolv.conf -al
lrwxrwxrwx 1 root root 39 Apr 30 04:56 mount/etc/resolv.conf -> ../run/systemd/resolve/stub-resolv.conf

ubuntu@opi-installer:~/opi-flasher$ ls mount/run
blkid  mount  needrestart  reboot-required

I'll remove the link and replace it with a file containing:

nameserver 8.8.8.8

Still not working.
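The dangling link is expected on an offline image: /run is populated at boot, and stub-resolv.conf is written by systemd-resolved once it starts. A small sketch for checking this on a mounted image (`is_dangling_symlink` is a hypothetical helper name):

```shell
# Succeed if $1 is a symlink whose target does not exist.
# $1 = path to check, e.g. mount/etc/resolv.conf on a mounted image.
# is_dangling_symlink is a hypothetical helper name.
is_dangling_symlink() {
    # -L is true for any symlink; -e follows the link and fails if the
    # target is missing.
    [ -L "$1" ] && [ ! -e "$1" ]
}
```

For example, `is_dangling_symlink mount/etc/resolv.conf && echo "target missing"`. On the booted system the same path should resolve normally.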


ubuntu@k3s-1:~$ cat /etc/resolv.conf
nameserver 8.8.8.8

ubuntu@k3s-1:~$ cat /etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf
# Remove 120 second network delay

[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --timeout=120

ubuntu@k3s-1:~$ cat /var/log/cloud-init-output.log
.......
Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 30 Apr 2024 23:47:49 +0000. Up 11.58 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: |   Device  |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ec |
ci-info: |     lo    |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |     lo    |  True |  ::1/128  |     .     |  host |         .         |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 23:47:50 +0000. Up 12.27 seconds.
sudo: unable to resolve host k3s-1: Temporary failure in name resolution
2024-04-30 23:47:50,998 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-04-30 23:47:51,110 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-04-30 23:47:51,113 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-04-30 23:47:51,114 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Tue, 30 Apr 2024 23:47:59 +0000. Up 21.10 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Tue, 30 Apr 2024 23:47:59 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local].  Up 21.32 seconds

I really think this wait thing is not running at all, or maybe it's running, but not as a prerequisite to "modules:config".

nilo85 commented 4 months ago

Sidenote: Found this in the log too

2024-04-30 23:47:44,677 - schema.py[WARNING]: Invalid cloud-config provided: Please run 'sudo cloud-init schema --system' to see the schema errors.
sudo cloud-init schema --system
sudo: unable to resolve host k3s-1: Name or service not known
Found cloud-config data types: user-data, network-config

1. user-data at /var/lib/cloud/instances/cloud-image/cloud-config.txt:
  Invalid user-data /var/lib/cloud/instances/cloud-image/cloud-config.txt
  Error: Cloud config schema errors: chpasswd.users.0: Additional properties are not allowed ('groups', 'password' were unexpected), chpasswd.users.0: {'groups': ['video'], 'name': 'ubuntu', 'password': 'ubuntu', 'type': 'text'} is not valid under any of the given schemas

2. network-config at /var/lib/cloud/instances/cloud-image/network-config.json:
  Valid schema network-config
Error: Invalid schema: user-data

I think this part I didn't modify =) EDIT: link: https://github.com/Joshua-Riek/ubuntu-rockchip/blob/main/overlay/boot/firmware/user-data#L28

# On first boot, set the (default) ubuntu user's password to "ubuntu" and
# expire user passwords
chpasswd:
  expire: true
  users:
  - name: ubuntu
    password: ubuntu
    type: text
    groups:
      - video

The failure to resolve the hostname is not too unexpected, as we now use Google DNS (and probably have no hosts file "hack") =)

But it seems it is not happy about groups under chpasswd, judging by the example here: https://cloudinit.readthedocs.io/en/latest/reference/modules.html#set-passwords

nilo85 commented 4 months ago

No idea what I am looking at, but a Google result from someone mentioning a similar issue led me to an interesting command. I have no idea what it does, but it looks like "network-pre.target" ran before "cloud-init-local.service"; maybe that wait thing needs to be in "network-pre.target"?

https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1636912

ubuntu@k3s-1:~$ systemd-analyze critical-chain systemd-networkd.service
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

systemd-networkd.service +49ms
└─network-pre.target @7.613s
  └─cloud-init-local.service @930ms +6.682s
    └─systemd-remount-fs.service @571ms +32ms
      └─systemd-journald.socket @518ms
        └─system.slice @486ms
          └─-.slice @486ms

Full output

ubuntu@k3s-1:~$ systemd-analyze critical-chain
The time when unit became active or started is printed after the "@" character.
The time the unit took to start is printed after the "+" character.

graphical.target @17.188s
└─multi-user.target @17.188s
  └─snapd.seeded.service @8.482s +8.704s
    └─basic.target @8.384s
      └─sockets.target @8.384s
        └─snap.lxd.user-daemon.unix.socket @10.569s
          └─sysinit.target @8.321s
            └─cloud-init.service @7.712s +604ms
              └─cloud-init-local.service @930ms +6.682s
                └─systemd-remount-fs.service @571ms +32ms
                  └─systemd-journald.socket @518ms
                    └─system.slice @486ms
                      └─-.slice @486ms

nilo85 commented 4 months ago

Very new to cloud-init, but maybe some of the "wants" are wrong in /etc/systemd/system/cloud-init.target.wants

In "cloud-init.service" I find this:

After=networking.service
Before=network-online.target

Maybe whatever wait command we want to override should be in the After script, and we are not in the Before? (so runs after?)

nilo85 commented 4 months ago

Found a verbose cloud-init log file =) https://gist.github.com/nilo85/16224aa34b79998a328bd0a5ddc888c0

nilo85 commented 4 months ago

It seems there is just under a second between cloud-init:init and cloud-init:modules:config, and I found this in the cloud-init docs:

Cloud-init then exits and expects for the continued boot of the operating system to bring network configuration up as configured. https://cloudinit.readthedocs.io/en/latest/explanation/boot.html#local

So I suspect whatever is supposed to wait for the network doesn't really work (and I bet the command always times out, since without --any it seems to wait for all interfaces to come online). If this --timeout 120 ran before the modules stage, I would expect to see 120s pass between the stages.

Time to go to bed =)

nilo85 commented 4 months ago

Found out you can get a plotted image of the whole boot process with "systemd-analyze plot > something.svg"; maybe this shows what went on: https://gist.github.com/nilo85/9f963029f9235b580fd37482fff6d7ed

EDIT: Looks like the wait script runs concurrently with the cloud-init config stage (and in the wrong order)

image

EDIT2: I now see there is both a .target and a .service, .service is probably the one we want and it is run a bit further down and after network-online.target

nilo85 commented 4 months ago

I just flashed a 24.04 on a RPi4 and this is how the paths differ

OrangePi5

image

RaspberryPi4

image

It seems clear to me that on the official RPi Ubuntu image, cloud-config is supposed to run after network-online.target

Checking paths for network-online.target:

OrangePi5

image

RaspberryPi4

image

The content of this file is identical on both boards, and both are ordered after network-online.target:

cat /etc/systemd/system/cloud-init.target.wants/cloud-config.service
[Unit]
Description=Apply the settings specified in cloud-config
After=network-online.target cloud-config.target
Before=systemd-user-sessions.service
Wants=network-online.target cloud-config.target
ConditionPathExists=!/etc/cloud/cloud-init.disabled
ConditionKernelCommandLine=!cloud-init=disabled
ConditionEnvironment=!KERNEL_CMDLINE=cloud-init=disabled

[Service]
Type=oneshot
ExecStart=/usr/bin/cloud-init modules --mode=config
RemainAfterExit=yes
TimeoutSec=0

# Output needs to appear in instance console output
StandardOutput=journal+console

[Install]
WantedBy=cloud-init.target
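
To compare ordering between the two boards without reading whole unit files, the After= and Before= lines can be filtered out. A minimal sketch; `unit_ordering` is a hypothetical helper name:

```shell
# Print only the ordering directives (After= / Before=) of a unit file
# read from stdin. unit_ordering is a hypothetical helper name.
unit_ordering() {
    grep -E '^(After|Before)='
}
```

For example, `unit_ordering < /etc/systemd/system/cloud-init.target.wants/cloud-config.service`. On a running system, `systemctl show -p After -p Before cloud-config.service` should give the merged view, including any drop-ins.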

EDIT: The network also does not look initialized on the RPi in the output here, so maybe you are onto something about DNS? RPi cloud-init output log:

Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Tue, 23 Apr 2024 14:02:17 +0000. Up 29.30 seconds.
ci-info: +++++++++++++++++++++++++++Net device info++++++++++++++++++++++++++++
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: | Device |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: |  eth0  | False |     .     |     .     |   .   | dc:a6:32:23:68:29 |
ci-info: |   lo   |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |   lo   |  True |  ::1/128  |     .     |  host |         .         |
ci-info: | wlan0  | False |     .     |     .     |   .   | dc:a6:32:23:68:2c |
ci-info: +--------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 23 Apr 2024 14:02:20 +0000. Up 31.97 seconds.
2024-05-01 07:24:15,583 INFO Authorized key ['256', 'SHA256:3bve13Z3I9fHInXNHQgPIv1RvXUBtE22MQDhpNnoxFU', 'nilo85@github/87004172', '(ED25519)']
2024-05-01 07:24:15,599 INFO Authorized key ['256', 'SHA256:+F9Uod43e1EEiWdUzfWHeu1982KwOzt+EjQ4bu1arZs', 'nilo85@github/95981694', '(ED25519)']
2024-05-01 07:24:15,615 INFO Authorized key ['3072', 'SHA256:JIEAP83yRzEgsKkJdjRmdG0ny5xzxWmTFsWc/44s7cQ', 'nilo85@github/96126473', '(RSA)']
2024-05-01 07:24:15,616 INFO [3] SSH keys [Authorized]

EDIT: Now that I think about it, none of the plots contained the wait-online.service we overrode, and on the RPi no such file exists, so I'm not sure what we are overriding =)

EDIT: I patched the ssh-import-id script to run "ip a" and "nslookup google.com", this is the output:

Cloud-init v. 24.1.3-0ubuntu3 running 'init' at Wed, 01 May 2024 08:08:32 +0000. Up 11.05 seconds.
ci-info: +++++++++++++++++++++++++++++Net device info+++++++++++++++++++++++++++++
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: |   Device  |   Up  |  Address  |    Mask   | Scope |     Hw-Address    |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: | enP3p49s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ed |
ci-info: | enP4p65s0 | False |     .     |     .     |   .   | c0:74:2b:fe:97:ec |
ci-info: |     lo    |  True | 127.0.0.1 | 255.0.0.0 |  host |         .         |
ci-info: |     lo    |  True |  ::1/128  |     .     |  host |         .         |
ci-info: +-----------+-------+-----------+-----------+-------+-------------------+
ci-info: +++++++++++++++++++Route IPv6 info+++++++++++++++++++
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: | Route | Destination | Gateway | Interface | Flags |
ci-info: +-------+-------------+---------+-----------+-------+
ci-info: +-------+-------------+---------+-----------+-------+
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Wed, 01 May 2024 08:08:33 +0000. Up 11.74 seconds.
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: enP3p49s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether c0:74:2b:fe:97:ed brd ff:ff:ff:ff:ff:ff
3: enP4p65s0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether c0:74:2b:fe:97:ec brd ff:ff:ff:ff:ff:ff
Server:     127.0.0.53
Address:    127.0.0.53#53

** server can't find google.com: SERVFAIL

2024-05-01 08:08:33,480 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
2024-05-01 08:08:33,589 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
2024-05-01 08:08:33,592 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['gh:nilo85']
2024-05-01 08:08:33,592 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
Cloud-init v. 24.1.3-0ubuntu3 running 'modules:final' at Wed, 01 May 2024 08:08:41 +0000. Up 20.66 seconds.
Cloud-init v. 24.1.3-0ubuntu3 finished at Wed, 01 May 2024 08:08:42 +0000. Datasource DataSourceNoCloud [seed=/dev/mmcblk0p1][dsmode=local].  Up 20.86 seconds

So it seems the network is indeed not up at the point it tries to run it.

EDIT: Wait.. this looks weird? https://github.com/Joshua-Riek/ubuntu-rockchip-settings/blob/noble/data/meson.build#L8 Are we overwriting the wrong file? getty-wait written to cloud-config?

EDIT: I noticed that the override points to /lib/... and not /usr/lib/... like the default one; however, changing it didn't make a difference.

Now editing the cloud-config, it looks like maybe this getty thing might be the issue after all? It overrides Before, maybe =)

image

EDIT: No, I guess not, because After is still network-online.target... but why didn't it show up in the graph? This is a crash course in systemd =D

EDIT: Comparing the journal with "journalctl -b", I see that on the OPi the network is considered up immediately, while on the RPi it happens much further down, so I wonder if there might be some issue with the netplan config.

EDIT: netplan looks identical except en* vs eth0. However, running "sudo journalctl -xeu systemd-networkd.service" I can see:

At 08:45:13 it gets an IP from DHCP, but this comes a bit too early:

May 01 08:45:07 k3s-1 systemd[1]: Reached target network-online.target - Network is Online.

EDIT: Found something!!!

sudo journalctl -xeu systemd-networkd-wait-online.service

May 01 08:45:06 k3s-1 systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service).

nilo85 commented 4 months ago

Based on my latest finding I have a strong suspicion this is the issue

May 01 08:45:06 k3s-1 systemd[1]: systemd-networkd-wait-online.service - Wait for Network to be Configured was skipped because of an unmet condition check (ConditionPathIsSymbolicLink=/run/systemd/generator/network-online.target.wants/systemd-networkd-wait-online.service).
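
Skipped units are easy to miss in a long journal. Here is a sketch that filters them out of `journalctl -b` output, assuming the message format shown above; `skipped_units` is a hypothetical helper name:

```shell
# Print "<unit> <unmet condition>" for every unit systemd skipped.
# Reads journal text on stdin; assumes the message wording seen in this
# thread. skipped_units is a hypothetical helper name.
skipped_units() {
    sed -n 's/.*systemd\[1\]: \([^ ]*\) - .* was skipped because of an unmet condition check (\(.*\))\..*/\1 \2/p'
}
```

Run it as `journalctl -b | skipped_units`. For a single unit, `systemctl show -p ConditionResult systemd-networkd-wait-online.service` should report the same outcome.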

Someone else had a similar issue updating to 24.04 from 23.XX here: https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/2063973/comments/3

Due to some name mismatch in initramfs vs some udev renaming, could we have a similar issue? =)

RPi netplan config:

sudo cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        eth0:
            dhcp4: true
            optional: true
    version: 2

OPi netplan config

sudo cat /etc/netplan/50-cloud-init.yaml
# This file is generated from information provided by the datasource.  Changes
# to it will not persist across an instance reboot.  To disable cloud-init's
# network configuration capabilities, write a file
# /etc/cloud/cloud.cfg.d/99-disable-network-config.cfg with the following:
# network: {config: disabled}
network:
    ethernets:
        zz-all-en:
            dhcp4: true
            match:
                name: en*
            optional: true
        zz-all-eth:
            dhcp4: true
            match:
                name: eth*
            optional: true
    version: 2

If I get this right, it probably does not like "zz-all-en" and expects "enP4p65s0"

EDIT: I patched the cloud-init provided network file to use the explicit interface name:

network:
    ethernets:
        enP4p65s0:
            dhcp4: true
            optional: true
    version: 2

But still the same issue. I think I am close to figuring it out, but the next step is probably udev rules magic that I am missing. I have my hopes up that @Joshua-Riek can interpret this =)

FINAL EDIT:

I got it working, this is the content of my network-config file:

# This file contains a netplan-compatible configuration which cloud-init will
# apply on first-boot (note: it will *not* update the config after the first
# boot). Please refer to the cloud-init documentation and the netplan reference
# for full details:
#
# https://netplan.io/reference
# https://cloudinit.readthedocs.io/en/latest/topics/network-config.html
# https://cloudinit.readthedocs.io/en/latest/topics/network-config-format-v2.html
#
# Please note that the YAML format employed by this file is sensitive to
# differences in whitespace; if you are editing this file in an editor (like
# Notepad) which uses literal tabs, take care to only use spaces for
# indentation. See the following link for more details:
#
# https://en.wikipedia.org/wiki/YAML

# Some additional examples are commented out below

network:
  renderer: networkd
  ethernets:
        enP4p65s0:
            dhcp4: true
  version: 2

Not sure if it's the renderer and/or the removal of optional that did the trick in the end. The RPi config had optional too, so I'm not sure, but now it works for my setup 🥳
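
For anyone applying the same workaround on another board, the file can be generated for whichever interface is actually cabled. This is a minimal sketch, assuming a single DHCP Ethernet interface; `render_network_config` is a hypothetical helper, and the interface name must be replaced with your own (check `ip a`):

```shell
# Print a netplan v2 network-config for one explicitly named interface.
# $1 = interface name (enP4p65s0 in this thread; yours will likely differ).
# render_network_config is a hypothetical helper name.
render_network_config() {
    cat <<EOF
network:
  version: 2
  renderer: networkd
  ethernets:
    $1:
      dhcp4: true
EOF
}
```

For example, `render_network_config enP4p65s0 > network-config`, then place the file next to user-data on the boot partition. Whether the explicit renderer or dropping optional made the difference was not settled at this point in the thread, so treat this as a workaround rather than a root-cause fix.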

Maybe this is the root cause of the issue that made you want to shorten the timeout in the first place?

Joshua-Riek commented 4 months ago

Sorry, just catching up on this now. Because there are so many devices supported with different network interfaces, I tried to use a generic config to match all Ethernet interfaces (hence the "zz-all-en" and "zz-all-eth").

I think that I will need to keep track of the networking interfaces for each board and set them accordingly during the image creation process. Thanks for the testing and looking into this a whole bunch :)

nilo85 commented 4 months ago

Was a nice journey to the modern Linux userspace =D Last time I touched something like this, init.d was the newest, coolest thing =D

I think what I discovered here works well enough as a workaround for now. Users who depend on cloud-init also have control over network-config via the FAT partition, so it shouldn't be a biggie =)

I have just started investing enough time into my little setup that I really don't want to end up adding another node and having to run a lot of setup commands manually, so for me it was critical to get cloud-init working =D

holmanb commented 4 months ago

@nilo85 nice work debugging this

Hmm at first thought - I have the below systemd override because cloud-init hangs for 120 seconds at boot if there is no network found, I lower the timeout to 10 seconds. Maybe this is the culprit?

@Joshua-Riek Please don't do this. Overriding systemd-networkd-wait-online.service as a workaround for that timeout might allow boot to proceed, but it is going to break a lot more than just cloud-init. All of the DNS symptoms reported by @nilo85 related to ssh_import_id are a byproduct of bypassing network online, meaning that cloud-init is working in an environment where it expects the network to be online, yet it is not. If there is a bug in netplan, please report it.

Someone else had a similar issue updating to 24.04 from 23.XX here: https://bugs.launchpad.net/ubuntu/+source/netplan.io/+bug/2063973/comments/3

Due to some name mismatch in initramfs vs some udev renaming, could we have a similar issue? =)

Sounds likely.

Not sure if it's the renderer and/or the removal of optional that did the trick in the end. The RPi config had optional too, so I'm not sure, but now it works for my setup 🥳

I would guess the former. I'd be surprised if removing optional fixed your issue, and if it did that should probably be reported to netplan since that is the opposite of what I would expect. @Joshua-Riek Any ideas why netplan wouldn't detect the right backend automatically? Maybe Rockchip has NetworkManager installed as well?

Joshua-Riek commented 4 months ago

I will take a closer look into cloud-init over the weekend and see how this should be properly addressed. I've been taking a little break the past few days with the release of Ubuntu 24.04.

Joshua-Riek commented 4 months ago

Ok I have this fixed now and found two problems:

  1. Syntax error in the user-data config (fixed with https://github.com/Joshua-Riek/ubuntu-rockchip/commit/387bc22283aac833dca5b069f7752185c7739877)
  2. Network interfaces cannot be optional (fixed with https://github.com/Joshua-Riek/ubuntu-rockchip/commit/1580bfabf120cd768eaa892162d833cc3a766144)

Joshua-Riek commented 4 months ago

@holmanb, I have a large portion of users who may not have ethernet and will not configure cloud-init for their use case. Because of this some users can experience a two-minute boot delay due to systemd-networkd-wait-online. Is there a way to properly adjust the timeout for systemd-networkd-wait-online or is it imperative that the service hangs for the full two minutes?

holmanb commented 4 months ago

Is there a way to properly adjust the timeout for systemd-networkd-wait-online or is it imperative that the service hangs for the full two minutes?

Adjusting timeouts is the wrong approach to solving this problem.

From @nilo85's comment about renderer: networkd above (and the NetworkManager configurations I see in the source tree), I suspect that NetworkManager is also enabled on the system and therefore netplan is rendering a NetworkManager config rather than a networkd configuration, which is why your optional: true setting did nothing before. IIRC the default networkd configuration causes systemd-networkd-wait-online.service to wait for one interface, which would explain why you are seeing this timeout.

If you are using NetworkManager, then systemd-networkd.service (and associated units like systemd-networkd-wait-online.service) should NOT be enabled. I think that this is the real fix to this issue - pick one or the other.

holmanb commented 4 months ago

@Joshua-Riek If you actually do want to use systemd-networkd rather than NetworkManager, then I suggest that you revert this change:

Network interfaces cannot be optional (fixed with https://github.com/Joshua-Riek/ubuntu-rockchip/commit/1580bfabf120cd768eaa892162d833cc3a766144)

Per the docs:

An optional device is not required for booting. Normally, networkd will wait some time for device to become configured before proceeding with booting. However, if a device is marked as optional, networkd will not wait for it. This is only supported by networkd, and the default is false.

You actually do want optional: true for instances that have no interfaces if you are using systemd.

However, if you decide to use only NetworkManager, then it really doesn't matter since this config only affects systemd-networkd - just make sure that you have systemd-networkd disabled.

Joshua-Riek commented 4 months ago

Thanks @holmanb for the detailed information, this is very insightful. I double checked and NetworkManager is not installed on the system, networkd is being used.

That being the case, when using optional: true I'm able to reproduce @nilo85's initial report. Could we have a bug where the network is not connected to even when there is a visible network interface?

Apr 30 05:14:17 ubuntu cloud-init[999]: Cloud-init v. 24.1.3-0ubuntu3 running 'modules:config' at Tue, 30 Apr 2024 05:14:16 +0000. Up 14.27 seconds.
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,308 ERROR <urlopen error [Errno -3] Temporary failure in name resolution>
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,368 - util.py[WARNING]: Failed to run command to import ubuntu SSH ids
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,372 - util.py[WARNING]: ssh-import-id failed for: ubuntu ['lp:jjriek']
Apr 30 05:14:18 ubuntu cloud-init[999]: 2024-04-30 05:14:18,373 - util.py[WARNING]: Running module ssh_import_id (<module 'cloudinit.config.cc_ssh_import_id' from '/usr/lib/python3/dist-packages/cloudinit/config/cc_ssh_import_id.py'>) failed
May 03 20:37:43 ubuntu systemd[1]: cloud-config.service: Main process exited, code=exited, status=1/FAILURE
May 03 20:37:43 ubuntu systemd[1]: cloud-config.service: Failed with result 'exit-code'.

holmanb commented 4 months ago

I double checked and NetworkManager is not installed on the system, networkd is being used.

Thanks for checking, nevermind then on that.

That being the case, when using optional: true I'm able to reproduce @nilo85's initial report.

Thanks for checking. This sounds like a possible repeat of LP: #2039083, however I think that the version of systemd shipped in 24.04 should have fixed this?

Perhaps you could try the suggestion proposed in that bug:

what happens if you add the --any flag to systemd-networkd-wait-online.service (best to do this with an override config), e.g.

/etc/systemd/system/systemd-networkd-wait-online.service.d/override.conf

[Service]
ExecStart=
ExecStart=/lib/systemd/systemd-networkd-wait-online --any

That should make it so that it does not wait on all the other unmanaged interfaces.