NanoPi R5S | eth0 not configured on reboot

StephanStS commented 8 months ago

Creating a bug report/issue

[x] I have searched the existing open and closed issues

Required Information

DietPi version |

G_DIETPI_VERSION_CORE=9
G_DIETPI_VERSION_SUB=1
G_DIETPI_VERSION_RC=1
G_GITBRANCH='master'
G_GITOWNER='MichaIng'
G_LIVE_PATCH_STATUS[0]='not applicable'

Distro version | bookworm
Kernel version | Linux r5s 6.6.16-current-rockchip64 #1 SMP PREEMPT Fri Feb 23 08:25:28 UTC 2024 aarch64 GNU/Linux
SBC model | NanoPi R5S/R5C (aarch64) R5S
Power supply used | USB Hub power supply
SD card used | Several disks used, Samsung, Sandisk, no name

Additional Information (if applicable)

Software title | N.A.
Was the software title installed freshly or updated/migrated? --> fresh installation
Can this issue be replicated on a fresh installation of DietPi? -> can be replicated

Steps to reproduce

Fresh installation of DietPi on R5S, LAN cable connected to middle ethernet connector (eth0, LAN1)
During installation resp. on reboot or power off/on, the eth0 connection gets lost, i.e. no ethernet connection is established. The power on display shows that a new connection needs to be established (e.g. via dietpi-config)
If dietpi-config is started and the eth0 is applied again, all works fine
On reboot or power off/on the eth0 is lost again

Expected behaviour

eth0 should be stable connectable

Actual behaviour

eth0 gets lost.
Behaviour is identical for right LAN port (eth2, WAN)

Extra details

ip a in case of running system:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    link/ether 82:e1:1d:46:25:c8 brd ff:ff:ff:ff:ff:ff
    inet 192.168.178.112/24 brd 192.168.178.255 scope global dynamic eth0
       valid_lft 86035sec preferred_lft 86035sec
    inet6 fd00::80e1:1dff:fe46:25c8/64 scope global dynamic mngtmpaddr
       valid_lft 6831sec preferred_lft 3231sec
    inet6 2a02:810d:b1c0:db0:80e1:1dff:fe46:25c8/64 scope global dynamic mngtmpaddr
       valid_lft 4775sec preferred_lft 2075sec
    inet6 fe80::80e1:1dff:fe46:25c8/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 96:ba:bd:bc:ea:2d brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether c2:3b:04:cb:97:8b brd ff:ff:ff:ff:ff:ff

ip a in case of problem state:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host noprefixroute
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 7e:48:64:f6:b5:a5 brd ff:ff:ff:ff:ff:ff
3: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc pfifo_fast state DOWN group default qlen 1000
    link/ether 1e:e8:8d:18:dc:fe brd ff:ff:ff:ff:ff:ff
4: eth2: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether c2:3b:04:cb:97:8b brd ff:ff:ff:ff:ff:ff

ethtool eth0 in case of running system:

Settings for eth0:
        Supported ports: [ TP    MII ]
        Supported link modes:   10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
                                2500baseT/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full
                                100baseT/Half 100baseT/Full
                                1000baseT/Full
                                2500baseT/Full
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Link partner advertised link modes:  10baseT/Half 10baseT/Full
                                             100baseT/Half 100baseT/Full
                                             1000baseT/Full
        Link partner advertised pause frame use: No
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        master-slave cfg: preferred slave
        master-slave status: master
        Port: Twisted Pair
        PHYAD: 0
        Transceiver: external
        MDI-X: Unknown
        Supports Wake-on: pumbg
        Wake-on: d
        Link detected: yes

ethtool eth0 in case of problem state:

Settings for eth0:
        Supports Wake-on: pumbg
        Wake-on: d
        Link detected: no

Contents of /etc/network/interfaces:

# Location: /etc/network/interfaces
# Please modify network settings via: dietpi-config
# Or create your own drop-ins in: /etc/network/interfaces.d/

# Drop-in configs
source interfaces.d/*

# Ethernet
allow-hotplug eth0
iface eth0 inet dhcp
address 192.168.0.100
netmask 255.255.255.0
gateway 192.168.0.1
#dns-nameservers 192.168.178.2

# WiFi
#allow-hotplug wlan0
iface wlan0 inet dhcp
address 192.168.0.100
netmask 255.255.255.0
gateway 192.168.0.1
#dns-nameservers 192.168.178.2
wpa-conf /etc/wpa_supplicant/wpa_supplicant.conf

Strange behaviour of MAC of eth0 and eth1: The values (given by ip a) change on reboot, see output above. MAC of eth2 stays on c2:3b:04:cb:97:8b

MichaIng commented 8 months ago

Can you check:

journalctl -u ifup@eth0

Good to know that the MAC addresses of the two LAN ports are randomised. But at least the interfaces do not seem to be switched (eth0 remains the LAN1 port).

StephanStS commented 8 months ago

In case of a failed power up, the output is:

Mär 09 12:11:59 r5s systemd[1]: Starting ifup@eth0.service - ifup for eth0...
Mär 09 12:11:59 r5s dhclient[405]: Internet Systems Consortium DHCP Client 4.4.3-P1
Mär 09 12:11:59 r5s ifup[405]: Internet Systems Consortium DHCP Client 4.4.3-P1
Mär 09 12:11:59 r5s ifup[405]: Copyright 2004-2022 Internet Systems Consortium.
Mär 09 12:11:59 r5s ifup[405]: All rights reserved.
Mär 09 12:11:59 r5s ifup[405]: For info, please visit https://www.isc.org/software/dhcp/
Mär 09 12:11:59 r5s dhclient[405]: Copyright 2004-2022 Internet Systems Consortium.
Mär 09 12:11:59 r5s dhclient[405]: All rights reserved.
Mär 09 12:11:59 r5s dhclient[405]: For info, please visit https://www.isc.org/software/dhcp/
Mär 09 12:11:59 r5s dhclient[405]:
Mär 09 12:12:00 r5s dhclient[405]: Listening on LPF/eth0/8e:31:b4:9c:c5:31
Mär 09 12:12:00 r5s ifup[405]: Listening on LPF/eth0/8e:31:b4:9c:c5:31
Mär 09 12:12:00 r5s ifup[405]: Sending on   LPF/eth0/8e:31:b4:9c:c5:31
Mär 09 12:12:00 r5s ifup[405]: Sending on   Socket/fallback
Mär 09 12:12:00 r5s dhclient[405]: Sending on   LPF/eth0/8e:31:b4:9c:c5:31
Mär 09 12:12:00 r5s dhclient[405]: Sending on   Socket/fallback
Mär 09 12:12:01 r5s dhclient[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Mär 09 12:12:01 r5s ifup[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
Mär 09 12:12:01 r5s ifup[405]: send_packet: Network is down
Mär 09 12:12:01 r5s ifup[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:01 r5s ifup[405]: receive_packet failed on eth0: Network is down
Mär 09 12:12:01 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:01 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:01 r5s dhclient[405]: receive_packet failed on eth0: Network is down
Mär 09 12:12:04 r5s dhclient[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
Mär 09 12:12:04 r5s ifup[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 6
Mär 09 12:12:04 r5s ifup[405]: send_packet: Network is down
Mär 09 12:12:04 r5s ifup[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:04 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:04 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:10 r5s dhclient[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Mär 09 12:12:10 r5s ifup[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Mär 09 12:12:10 r5s ifup[405]: send_packet: Network is down
Mär 09 12:12:10 r5s ifup[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:10 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:10 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:25 r5s dhclient[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Mär 09 12:12:25 r5s ifup[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 15
Mär 09 12:12:25 r5s ifup[405]: send_packet: Network is down
Mär 09 12:12:25 r5s ifup[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:25 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:25 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:40 r5s dhclient[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 16
Mär 09 12:12:40 r5s ifup[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 16
Mär 09 12:12:40 r5s ifup[405]: send_packet: Network is down
Mär 09 12:12:40 r5s ifup[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:40 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:40 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:56 r5s dhclient[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Mär 09 12:12:56 r5s ifup[405]: DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 5
Mär 09 12:12:56 r5s ifup[405]: send_packet: Network is down
Mär 09 12:12:56 r5s ifup[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:12:56 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:56 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.
Mär 09 12:13:01 r5s dhclient[405]: No DHCPOFFERS received.
Mär 09 12:13:01 r5s ifup[405]: No DHCPOFFERS received.
Mär 09 12:13:01 r5s ifup[405]: No working leases in persistent database - sleeping.
Mär 09 12:13:01 r5s dhclient[405]: No working leases in persistent database - sleeping.
Mär 09 12:13:01 r5s systemd[1]: Finished ifup@eth0.service - ifup for eth0.

MichaIng commented 8 months ago

It just does not get a DHCP lease. It does immediately work when you run systemctl restart ifup@eth0, and when you apply a static IP?

Hmm

Mär 09 12:12:40 r5s dhclient[405]: send_packet: Network is down
Mär 09 12:12:40 r5s dhclient[405]: dhclient.c:2600: Failed to send 300 byte long packet over eth0 interface.

This actually looks like it was not able to send the DHCPDISCOVER, as if it really has no carrier signal or so 🤔.

yjteng commented 6 months ago

It seems I am too having similar issue of unreliable/hang of ifup service during startup boot. Most of the time, waited for more that 1 minute and had to power it down and up again to try to get pass that upon new boot. Happy to help with troubleshooting if you need anything.

I thought this could be related to #6663, #6565

cxgth commented 5 months ago

I have the same problem and think I found a workaround for the moment.

I have 2 installations, one on a sdcard and one on the eMMC. The sdcard installation works every time I boot the device, the eMMC installation isn't able to bring up eth0 most of the time.

I checked both installations for multiple hours and found nothing, they seem to be identical. So my guess is, that the sdcard installation boots slower, which gives the network device enough time to come up.

My "dirty" workaround is the following:

edit the file /etc/systemd/system/ifup@.service.d/dietpi.conf
add a sleep statement

[Service] Type=oneshot ExecStart= ExecStart=sleep 10 ExecStart=/sbin/ifup --allow=hotplug %I

MichaIng commented 5 months ago

Can you both check my last comment and test whether restarting the service and/or assigning a static IP helps?

cxgth commented 5 months ago

Restarting the service (networking.service), when configured with dhcp, works most of the time (if I remember correctly). The static IP helps to at least boot into the system.

The main problem is, that the boot dialog hangs at ifup@eth.service for over 15 minutes.

If you give me some test cases (dhcp, static, reboot/poweroff, etc.), I can run them on my device.

MichaIng commented 5 months ago

Restarting the service (networking.service), when configured with dhcp, works most of the time (if I remember correctly).

This should not have any effect: networking.service restarts auto interfaces only, while we use allow-hotplug interfaces, controlled by ifup@.service instances only.

The main problem is, that the boot dialog hangs at ifup@eth.service for over 15 minutes.

Hmm, it should have a default startup timeout of 90 seconds. Similarly, the DHCP client itself has a 60 seconds timeout, before it gives of for a while (several minutes at least), which can be seen in Stephan's logs. If it hangs for more than 90 seconds in your case, then there must be a more low-level hang overriding the DHCP and systemd timeouts.

If you give me some test cases (dhcp, static, reboot/poweroff, etc.), I can run them on my device.

As said, it would be great to test whether a static IP does not cause that issues, i.e. whether only the DHCP client has issues at this stage.

Also can you add your ifup@eth0 logs in case it hangs for so long, to compare with Stephan's:

journactl -u ifup@eth0

cxgth commented 5 months ago

So, I reverted all changes and tried multiple tests with dhcp and static ip settings.

Before you read further: I think it has nothing to do with dhcp or static ip, instead with the device not initialize correctly or a wrong device state due to services not waiting.

test (dhcp)

system hangs for ~60 seconds
journal shows failed dhcp requests (same as Stephan's logs)
device sometimes not responding (eth0 down)
- can sometimes be revived by restarting service (networking)

test (static)

most of the time, all services pass and I can login without having to wait
device sometimes not responding (eth0 down)
- can sometimes be revived by restarting service (networking)

last test

sometimes the led initialization fails (wrong led is on, for example led 2 is on, cable is connected to led 1)
- led state won't change even when disconnect and reconnect the cable
system hangs during boot, timeouts (5min. default for systemd service) ignored
eth0 can't be revived with restart of service

Here a screenshot of the boot process and the log:

last-test_boot

journal.txt

Edit: as soon as I added the timeout (sleep command) both versions (dhcp and static ip) worked without a problem

MichaIng commented 5 months ago

Now I see the problem: ifup@eth0.service tries to start before ifupdown-pre.service has finished.

This service waits for udev events to settle, so that network interfaces are not tried to be brought up, before device detection has finished. This is important for auto interfaces, which are tried to be brought up at a fixed stage of the boot sequence, asynchronously to device detection, without ifupdown-pre.service. On stock Debian, ifupdown-pre.service does nothing if no auto interfaces are used, since the ifup@.service instances for allow-hotplug interfaces are added to the boot sequence by udev itself, once the related network adapter has been detected. However, this has the downside, that network.target and network-online.target can be reached, before network adapters have been even detected.

We hence changed ifupdown-pre.service to as well wait for udev events to settle, if only allow-hotplug interfaces are used. This assures that those network targets are never reached before network adapter device detection has finished, and all allow-hotplug interfaces for attached adapters have been brought up.

But, since on stock Debian, the two services are not related to each other, if is perfectly possible that ifup@.service instances run before ifupdown-pre,service, in case adapters are detected quickly compared to other devices. I am not 100% sure about the implications, but at least theoretically it is possible that the Ethernet adapter LEDs have not been detected yet. In this case, the adapter would be started to be brought up, then the our udev rule to attach them to the correct adapters/interfaces could start, which brings up and down the interface, to make the LED behave correctly. Totally makes sense that this breaks ifup and would explain everything perfectly.

What we hence need to assure is that udev settles, hence our LED udev rules have finished, before ifup@.service instances run, which can be done easiest by letting ifupdown-pre.service run/finish before network-pre.target. The service names alone indicate already that this is basically the intended order, matching also the practical order if auto interfaces are used.

Please try this:

G_CONFIG_INJECT '\[Unit\]' '[Unit]' /etc/systemd/system/ifupdown-pre.service.d/dietpi.conf
G_CONFIG_INJECT 'Before=' 'Before=network-pre.target' /etc/systemd/system/ifupdown-pre.service.d/dietpi.conf '\[Unit\]'
G_CONFIG_INJECT 'Wants=' 'Wants=network-pre.target' /etc/systemd/system/ifupdown-pre.service.d/dietpi.conf '\[Unit\]'
reboot

cxgth commented 5 months ago

Thanks for the detailed information and quick fix.

Unfortunately, it failed after the 13th attempt. 8 boot attempts with static ip worked flawlessly.

With dhcp, it worked 4 times and the 5th failed with wrong LED state (1 and 2 are on), the service ifupdown-pre.service failed and now it hangs at ifup@eth0.service and networking.service. As a side note, after the service sys-subsystem-net-devices-eth0.service I got an error for the CPU governors, which doesn't happen if the system boots up normally.

As soon as the system is accessible, I attach the journal log.

MichaIng commented 5 months ago

Would be good to know how ifupdown-pre.service failed. Probably something is keeping udev busy until the service times out. We would then need to investigate what is causing this. I mean, the fact that we need to enforce ordering in this case already indicates that udev takes longer than expected, as this does not happen on any other SBC. So we'd need to check this in any case, and above "fix" is more fixing symptoms, with the limitation of ifupdown-pre.service timeout or udevadm settle failure.

cxgth commented 5 months ago

Here is the journal.txt

Before I found this specific ticket/issue, I read some other NanoPi related ones which mentioned, that the SoC is not always reset properly. It is a wild guess but would match with the missing CPU governors.

MichaIng commented 5 months ago

Not related to the issue, but your network configuration seems to have issues as well:

Jun 02 21:57:02 DietPi systemd[1]: Starting ifup@eth0.service - ifup for eth0...
Jun 02 21:57:02 DietPi ifup[464]: ifup: /etc/network/interfaces.d/if.tar:1: misplaced option
Jun 02 21:57:02 DietPi systemd[1]: Starting networking.service - Raise network interfaces...
Jun 02 21:57:02 DietPi ifup[466]: ifup: /etc/network/interfaces.d/if.tar:1: misplaced option

However, the bigger problem is that indeed the network adapters seem to be ready very late:

Jun 02 21:56:03 DietPi (udev-worker)[373]: green:lan-1: Spawned process '/bin/ip l s up dev eth0' [410] is taking longer than 59s to complete
Jun 02 21:56:03 DietPi (udev-worker)[339]: green:lan-2: Spawned process '/bin/ip l s up dev eth1' [420] is taking longer than 59s to complete
Jun 02 21:56:03 DietPi systemd-udevd[318]: green:lan-1: Worker [373] processing SEQNUM=3360 is taking a long time
Jun 02 21:56:03 DietPi systemd-udevd[318]: green:wan: Worker [349] processing SEQNUM=3362 is taking a long time
Jun 02 21:56:03 DietPi systemd-udevd[318]: green:lan-2: Worker [339] processing SEQNUM=3361 is taking a long time
Jun 02 21:57:02 DietPi dash[344]: Timed out for waiting the udev queue being empty.

These are our LED udev rules, trying to bring up the interface to apply the LED trigger (else it would just be lid, regardless whether a cable is attached or not).

The adapters have been detected at that time:

Jun 02 21:55:00 DietPi kernel: r8169 0000:01:00.0 eth0: RTL8125B, 8e:72:25:1c:0e:58, XID 641, IRQ 79
Jun 02 21:55:00 DietPi kernel: r8169 0000:01:00.0 eth0: jumbo features [frames: 9194 bytes, tx checksumming: ko]
...
Jun 02 21:55:02 DietPi systemd[1]: Expecting device sys-subsystem-net-devices-eth0.device - /sys/subsystem/net/devices/eth0...
Jun 02 21:55:02 DietPi DietPi-PreBoot[392]: DietPi-CPU_set | CPU governors are not available on this system. This is probably a virtual machine. Aborting...
Jun 02 21:55:02 DietPi systemd[1]: Found device sys-subsystem-net-devices-eth0.device - RTL8125 2.5GbE Controller.

There also the missing CPU governors show up. So yeah, something is broken since latest Linux 6.6.32 update. This all worked perfectly fine some time ago, also with older Linux 6.6, IIRC. Can you try this:

apt install linux-{image,dtb}-current-rockchip64=24.2.1

Confirm the downgrade.

cxgth commented 5 months ago

Not related to the issue, but your network configuration seems to have issues as well:

Jun 02 21:57:02 DietPi systemd[1]: Starting ifup@eth0.service - ifup for eth0...
Jun 02 21:57:02 DietPi ifup[464]: ifup: /etc/network/interfaces.d/if.tar:1: misplaced option
Jun 02 21:57:02 DietPi systemd[1]: Starting networking.service - Raise network interfaces...
Jun 02 21:57:02 DietPi ifup[466]: ifup: /etc/network/interfaces.d/if.tar:1: misplaced option

Yeah, not a problem. I just saved my older config in a tar-file, so ignore the error message.

There also the missing CPU governors show up. So yeah, something is broken since latest Linux 6.6.32 update. This all worked perfectly fine some time ago, also with older Linux 6.6, IIRC. Can you try this:
apt install linux-{image,dtb}-current-rockchip64=24.2.1
Confirm the downgrade.

I will try it tomorrow, as soon as I get back from work. Thanks for the support.

MichaIng commented 5 months ago

I just saved my older config in a tar-file, so ignore the error message.

Our /etc/network/interfaces loads every file in this dir, regardless which file ending. Probably we should change it to load *.conf files only. Though, AFAIR stock Debian loads everything as well. However, yeah, I guess it does not cause issues either.

I'll also run some tests now on my R5S.

MichaIng commented 5 months ago

With SD card, it works fine here as well. I cannot test the eMMC currently, since it is still on an old 5.10.160 kernel, where I want to test the conversion to more recent Linux builds. I'll see whether I can finish this tomorrow evening.

Just to rule it out, does it change anything for you when you remove our udev rules?

mv /etc/udev/rules.d/dietpi-eth-leds.rules{,_bak}

However, since you have issues even with the CPU governor, I think there is a more fundamental problem with the latest kernel when booting from eMMC.

MichaIng commented 5 months ago

Another report, which seems to be the same issue at first view, booting with NVMe in this case: https://dietpi.com/forum/t/20268

cxgth commented 5 months ago

So, today it was way more difficult to reproduce the error.

Disabling the LED udev rule didn't change anything, I guess (stopped after 10 reboots). After that, I reverted the change and needed at least 20 reboots to get to the state.

Downgrading the kernel had some or no effect. I stopped after 30 reboots.

The only thing I can confirm is, that the CPU governor has nothing to do with the kernel versions or the failed state. On both versions (24.2.1 / 24.5.1) I got the error mentioned above. And on both versions, even when the governor detection failed, the initialization worked.

Changing the systemd service dependencies was a big step forward. For now, I have no clue how to force that rare state or it to be more replicable.

MichaIng commented 5 months ago

So far so good then, so enforcing the systemd order helped against the case where our udev rules brought down an interface which is in the process of ifup@.service. But boot does still take long?

If so, we could further investigate this way:

G_CONFIG_INJECT 'udev_log=' 'udev_log=debug' /etc/udev/udev.conf
reboot

This a mass of detailed udev logs, which device is detected when, which udev rules it triggered, what they did, in case whether they took very long etc etc:

journalctl

In case, makes sense to filter it a little, like

journalctl --grep eth0

to check logs for eth0 in particular.

Did we actually check for kernel errors already?

dmesg -l 0,1,2,3

cxgth commented 5 months ago

I removed eth1 and eth2 from /etc/network/interfaces to speed up the boot process. Due to the fact, that only the WAN interface has a fixed mac address, I used it to stop the randomization of the other to nics. But of course, without setting a static ip address, the service tried a dhcp request for both unconfigured nics.

If I remember correctly, there were no kernel errors during the boot process. But I try one more time to replicate the error with extended udev logs. Maybe it lead to something.

cxgth commented 5 months ago

So, after 80 or more reboots over two days, I can't reproduce the issue.

I also reverted the udev debug log, switch kernels and played with power supply, because the last time it happened, I switched the power supply to monitor the consumption via FNIRSi FNB58 (after our changes).

Either it is a weird power glitch/SoC reset problem or a timing problem between the services.

Update: Now, when I switch to static ip address, it sometimes hangs @ ifupdown-pre.service. I'll keep investigating.

MichaIng commented 5 months ago

Did the udev logs contain any hint why the Ethernet devices take so long to get ready, or some detach/attach loops?

cxgth commented 5 months ago

I disabled the logs and switched to static ip afterwards, because I wanted to setup my system for now.

I will re-enable the debug log and gather enough information. My first thought is, that via dhcp, the services have enough time to finish.

Update: Finally, I got the error again. Let's hope, that the error log will be enough to find the cause. (~15min. until I can access the login prompt)

cxgth commented 5 months ago

Hey, a little bit late with my response due to overtime.

I could not find any clue, but maybe you will. Here is my log journal.log.gz

MichaIng commented 5 months ago

Definitely try to remove our udev rule. Or did you try this already and still ran into the same issue?

So, today it was way more difficult to reproduce the error.

Disabling the LED udev rule didn't change anything, I guess (stopped after 10 reboots). After that, I reverted the change and needed at least 20 reboots to get to the state.

Does this mean, you were not able to trigger the issue within 10 reboots, when the udev rule was disabled, but you were able to trigger it after 20 reboots, with the udev rules in place? If so, it has not yet been ruled out, that those do cause the issue, for which, we would need to trigger the issue with the rules disabled:

mv /etc/udev/rules.d/dietpi-eth-leds.rules{,_bak}

Hence, I am still not sure whether its a symptom or reason, but all seems to process fine, until one of the LED rules triggers /bin/ip l s up dev eth1. This never finishes and times out after 3 minutes. In parallel, the other two LEDs hang at applying the netdev trigger, respectively (the other one) at applying the network interface to the trigger. So all 3 rules hang and time out at different stages. And in two cases, right after these timeouts were logged, udev rules were completely reloaded Udev rules need reloading. Since it is taking too long, ifupdown-pre.service times out, after which networking.service and all 3 ifup@.service instances start (the last change made them all wait for this service to finish), but this happens ~1 minute before the LED trigger workers are killed.

So our ordering, which seem to have helped you, did not help in this particular case, since the timeouts made ifup@.service instances still start, before our udev rules finished or were killed, leading to potentially parallel ip link set up/down calls for the same interface. All 4 network services then time out after another 5 minutes (increased default timeout due to TimeoutStartSec=5min in the units). Terminating them (SIGTERM) times out as well after another 90 seconds, the hard way SIGKILL times out as well, after another 90 seconds. But while it says Processes still around after SIGKILL. Ignoring., it actually tries another 3 times to SIGKILL them, before giving up, even a 4th time for networking.service. I was wondering why this happens, because normally SIGKILL is done after a few seconds already, and given up after another few seconds, normally, but I think it has to do with this, in those units:

ExecStart=/usr/sbin/ifup -a --read-environment
ExecStart=-/bin/sh -c 'if [ -f /run/network/restart-hotplug ]; then /usr/sbin/ifup -a --read-environment --allow=hotplug; fi'
ExecStop=/usr/sbin/ifdown -a --read-environment --exclude=lo
ExecStopPost=/usr/bin/touch /run/network/restart-hotplug
TimeoutStartSec=5min

Here the example of networking.service, which does not have a single start command only, but two start commands, one stop and one post-stop command. The first ExecStart takes 5 minutes to time out, due to TimeoutStartSec. Then its SIGERM times out after 90 seconds, then its SIGKILL after another 90 seconds, after which it is given up. The 2nd ExecStart and ExecStop are both skipped, but ExecStopPost runs, which also times out after 90 seconds, with SIGTERM and SIGKILL each adding another 90 seconds. So all together: 300s + 5*90s = 12.5 minutes. Verified again with logs:

Jun 04 23:23:28 DietPi systemd[1]: Starting networking.service - Raise network interfaces...
# ExecStart timeout causing SIGTERM
Jun 04 23:28:28 DietPi systemd[1]: networking.service: start operation timed out. Terminating.
# SIGTERM timeout causing SIGKILL
Jun 04 23:29:58 DietPi systemd[1]: networking.service: State 'stop-sigterm' timed out. Killing.
Jun 04 23:29:58 DietPi systemd[1]: networking.service: Killing process 481 (ifup) with signal SIGKILL.
# SIGKILL timeout
Jun 04 23:31:28 DietPi systemd[1]: networking.service: Processes still around after SIGKILL. Ignoring.
# ExecStopPost timeout causing SIGTERM
Jun 04 23:32:59 DietPi systemd[1]: networking.service: State 'stop-post' timed out. Terminating.
# SIGTERM timeout causing SIGKILL
Jun 04 23:34:29 DietPi systemd[1]: networking.service: State 'final-sigterm' timed out. Killing.
Jun 04 23:34:29 DietPi systemd[1]: networking.service: Killing process 481 (ifup) with signal SIGKILL.
# SIGKILL timeout
Jun 04 23:35:59 DietPi systemd[1]: networking.service: Processes still around after final SIGKILL. Entering failed mode.
Jun 04 23:35:59 DietPi systemd[1]: networking.service: Failed with result 'timeout'.
Jun 04 23:35:59 DietPi systemd[1]: networking.service: Unit process 481 (ifup) remains running after unit stopped.
Jun 04 23:35:59 DietPi systemd[1]: Failed to start networking.service - Raise network interfaces.

23:23:28 til 23:35:59 are 12.5 minutes. Totally insane, of course. For the ifup@.service instances, there is no ExecStopPost, hence they are given up 90 seconds earlier.

But I could not find any hint in the logs, why /bin/ip l s up dev eth1 and every other ip/ifup call time out in the first place, and seem to hang up not only the whole network, but more, so that even /usr/bin/touch /run/network/restart-hotplug times out, a command which only creates an empty file in a tmpfs, and has by itself not anything to do with network at all. This file is only a flag, but does not cause any action by itself.

In any case, we really need to find a better way to enable the Ethernet LEDs, so that they light up/blink on link/activity, but keep them off until there is really a cable connected and link up. The whole reason for this ip calls in our rules is that, once you set the LED to light up on link, it immediately lights up, regardless whether the link is up and a cable connected or not. And I did not find any other way to turn it off, but keeping the link trigger active, aside of setting the interface up and down once. A nasty limitation of this driver.

An alternative would be to leave the LEDs untouched at boot, and instead configure them via /etc/network/if-up.d script, enabling the trigger once the interface is really configured. It has the same issue, that the LED is then turned on, even if no cable is connected, but it is probably a rare case that admins configure an interface via /etc/network/interfaces|.d, without actually having a cable connected. And at least, when a cable is detached after the trigger was applied, the LED reacts correctly and turns off.

MichaIng commented 4 months ago

Could you (both of you) test whether this has been solved? We did some changes around network services.

yjteng commented 4 months ago

I'm still on v9.5.1, do you want me to update to give some tests?

yjteng commented 4 months ago

Uploading both logs for passed and failed scenarios, hope it helps. Note: I'm bridging my lan1 and lan2 ports. On the failed scenario, I noticed both my lan1 and lan2 LED are lit green. After reboot, only lan1 is lit green and it is working fine.

journal-fail.log journal-pass.log interfaces.txt

jysqice commented 2 months ago

https://github.com/unifreq/linux-6.1.y/commit/1c9dcaf4c98622b869a33942f131093bac27a6d2

MichaIng commented 2 months ago

@jysqice does this apply to mainline kernel (which we use) as well?

MichaIng commented 2 months ago

@yjteng In the failed case, the ip l s up dev eth commands from the udev rules timed out again. So removing those should indeed solve it (but leaves the LEDs unconfgured).

Armbian btw provides an overlay to apply the LED triggers via device tree: https://github.com/armbian/build/blob/main/patch/kernel/archive/rockchip64-6.6/overlay/rockchip-rk3568-nanopi-r5s-leds.dts I'll test this, whether this works without having the LEDs lid at boot, hence without the ip up/down hassle, and whether it is possible to apply the other attributes we set via udev rules as well, or with a combination of both.

It should not have an effect on the underlying issue, but I recognise you use ifupdown2, which at least bypasses the adjustments we did to ifupdown, and might behave a little different. Not that it is not newer, no successor or any such, but just a different implementation written in Python, naturally slower, larger overhead, Python runtime as dependency etc, and it is developed slower. Hence we keep using original C `ifupdown´ intentionally.

MichaIng / DietPi