freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
553 stars 325 forks source link

[v2022.1+] TP-Link WDR4300 hangs during reboot #2904

Closed grische closed 10 months ago

grische commented 1 year ago

Bug report

What is the problem? Occasionally (>10% of all devices), hang after an autoupdate and need a manual powercycle to reboot.

I managed to reproduce this while a serial cable was attached:

root@ffmuc-e894f6d457e8:/# autoupdater -f
Retrieving manifest from http://[2001:678:ed0:f000::]/experimental/sysupgrade/experimental.manifest ...
Stopping cron...
Stopping urngd...
Stopping micrond...
Stopping sysntpd...
Stopping gluon-radvd...
Stopping uhttpd...
Stopping sse-multiplexd...
Stopping gluon-respondd...
[  350.261277] sysctl (6540): drop_caches: 3
vm.drop_caches = 3
Downloading image:  6144 / 6144 KiB
[  372.830625] device client1 left promiscuous mode
[  372.835595] br-client: port 4(client1) entered disabled state
[  373.170806] device client0 left promiscuous mode
[  373.175746] br-client: port 5(client0) entered disabled state
Stopping network...
[  374.316737] batman_adv: bat0: Interface deactivated: mesh1
[  374.564283] batman_adv: bat0: Removing interface: mesh1
[  374.587411] batman_adv: bat0: Interface deactivated: mesh0
[  374.782898] batman_adv: bat0: Removing interface: mesh0
[  376.235548] br-client: port 3(bat0) entered disabled state
[  376.241674] br-client: port 2(local-port) entered disabled state
[  376.247889] br-client: port 1(eth0.1) entered disabled state
[  376.336981] device bat0 left promiscuous mode
[  376.341546] br-client: port 3(bat0) entered disabled state
[  376.393905] device eth0.1 left promiscuous mode
[  376.398752] br-client: port 1(eth0.1) entered disabled state
[  376.498386] device local-port left promiscuous mode
[  376.503522] br-client: port 2(local-port) entered disabled state
[  376.807475] batman_adv: bat0: Interface deactivated: primary0
[  376.813326] batman_adv: bat0: Removing interface: primary0
[  376.819582] batman_adv: bat0: Interface deactivated: mesh-vpn
[  376.825424] batman_adv: bat0: Removing interface: mesh-vpn
Cannot find device "bat0"
autoupdater: warning: execution of /usr/lib/autoupdater/upgrade.d/10stop-network exited with status code 1
[  379.953731] sysctl (6840): drop_caches: 3
vm.drop_caches = 3
Sat May 13 12:33:01 CEST 2023 upgrade: Saving config files...
Sat May 13 12:33:02 CEST 2023 upgrade: Commencing upgrade. Closing all shell sessions.
Watchdog handover: fd=3
- watchdog -
Watchdog did not previously reset the system
Sat May 13 12:33:04 CEST 2023 upgrade: Sending TERM to remaining processes ...
Sat May 13 12:33:04 CEST 2023 upgrade: Sending signal TERM to udhcpc (2029)
Sat May 13 12:33:04 CEST 2023 upgrade: Sending signal TERM to odhcp6c (2034)
Sat May 13 12:33:04 CEST 2023 upgrade: Sending signal TERM to dnsmasq (2883)
Sat May 13 12:33:08 CEST 2023 upgrade: Sending KILL to remaining processes ...
[  400.548946] stage2 (7004): drop_caches: 3
Sat May 13 12:33:15 CEST 2023 upgrade: Switching to ramdisk...
mount: mounting /dev/mtdblock4 on /overlay failed: Resource busy
[  404.592005] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds.  Have a nice day...
Sat May 13 10:33:18 UTC 2023 upgrade: Performing system upgrade...
[  404.686640] do_stage2 (7004): drop_caches: 3
Unlocking firmware ...

Writing from <stdin> to firmware ...
Appending jffs2 data from /tmp/sysupgrade.tgz to firmware..                                                                                                  
Sat May 13 10:33:49 UTC 2023 upgrade: Upgrade completed
Sat May 13 10:33:50 UTC 2023 upgrade: Rebooting system...
umount: can't unmount /dev: Resource busy
umount: can't unmount /tmp: Resource [  436.218456] reboot: Restarting system
▒

I am not sure if this is related to #185, but we were not able to reproduce it (yet) with a reboot.

What is the expected behaviour? That the WDR4300 comes back up after an update.

Gluon Version: v2022.1.2 and v2022.1.3 Probably also earlier v2022.x

We experienced similar behaviour during the initial v2022.1 deployment, but discarded it as "random". It was more severe with the v2022.1.3 deployment (probably just because of chance) and I was able to reproduce it with a serial cable attached when upgrading from v2022.1.3 to v2022.1.4.

Site Configuration: https://github.com/freifunkMUC/site-ffm/blob/833829e68f97e4781f175bdd688d7f498a7efe53/site.conf

Custom patches: https://github.com/freifunkMUC/site-ffm/tree/833829e68f97e4781f175bdd688d7f498a7efe53/patches

rotanid commented 1 year ago

does this also happen with the very similar WDR3600 ?

grische commented 1 year ago

does this also happen with the very similar WDR3600 ?

Probably. We had a few isolated cases where a WDR3600 needed a power cycle after an upgrade but it is not clear if this is at all related to the problem described here. We don't have enough (failing) devices to have a confident answer.

grische commented 1 year ago

It might be worth mentioning that the special symbol at the end of the log is printed during a boot as well. I'm not sure if this is printed before or after the bootloader loaded though.

EDIT: I was able to reproduce the hang without the special symbol appearing. As if it got stuck during reboot:

[  226.555153] br-client: port 4(client0) entered disabled state
Watchdog handover: fd=3
- watchdog -
Watchdog did not previously reset the system
[  226.596451] device client1 left promiscuous mode
[  226.601297] br-client: port 5(client1) entered disabled state
Wed Jul 12 21:19:50 CEST 2023 upgrade: Sending TERM to remaining processes ...
Wed Jul 12 21:19:51 CEST 2023 upgrade: Sending signal TERM to sse-multiplexd (2340)
Wed Jul 12 21:19:51 CEST 2023 upgrade: Sending signal TERM to dnsmasq (2782)
Wed Jul 12 21:19:55 CEST 2023 upgrade: Sending KILL to remaining processes ...
[  237.363401] stage2 (5324): drop_caches: 3
Wed Jul 12 21:20:01 CEST 2023 upgrade: Switching to ramdisk...
mount: mounting /dev/mtdblock4 on /overlay failed: Resource busy
[  241.391167] VFS: Busy inodes after unmount of jffs2. Self-destruct in 5 seconds.  Have a nice day...
Wed Jul 12 19:20:05 UTC 2023 upgrade: Performing system upgrade...
[  241.489244] do_stage2 (5324): drop_caches: 3
Unlocking firmware ...

Writing from <stdin> to firmware ...
Wed Jul 12 19:20:23 UTC 2023 upgrade: Upgrade completed
Wed Jul 12 19:20:24 UTC 2023 upgrade: Rebooting system...
umount: can't unmount /dev: Resource busy
umount: can't unmount /tmp: Resource [  261.048575] reboot: Restarting system
Grotax commented 1 year ago

We also had reports in our community when I rolled out 2022.1 but thought it was random, and we didn't have proper logs or anything else. #2655

smoe commented 1 year ago

We observed this when transitioning from 2022.1.2 to 2022.1.4 on WDR4300 and more frequently on Ubiquiti AC lite. In our observation, the update was fine when the machine was rebooted just prior to the update, which may be suggesting an out-of-memory issue.

grische commented 12 months ago

@smoe Just to clarify, we were able to reproduce the issue on a freshly booted device as well. I assume the WDRs and the AC Lite are different issues here.

blocktrron commented 11 months ago

@grische

One thing that comes to my mind is the usage of the newer ar934x SPI controller driver, at least no device reported in this issue uses the older ar71xx driver.

This driver was first shipped with OpenWrt 21.02, matching the observation it does not break with older releases based on OpenWrt 19.07 and older.

https://github.com/openwrt/openwrt/commit/ebf0d8dadeca443121f4f597c51bf6591e341caf

If you are still able to reproduce this issue, you can modify the ar934x DTSI to use the compatible for the ar71xx SPI controller. Ping me in case i should provide you with a patch. If this fixes the reboot issue, we have a better path where to look next.

grische commented 11 months ago

@blocktrron thank you for looking into this. To avoid misunderstandings, you suggest to do this change here in OpenWRT?

diff --git a/target/linux/ath79/dts/ar934x.dtsi b/target/linux/ath79/dts/ar934x.dtsi
index d88c7bfabc..15201b197e 100644
--- a/target/linux/ath79/dts/ar934x.dtsi
+++ b/target/linux/ath79/dts/ar934x.dtsi
@@ -199,15 +199,17 @@
                };

                spi: spi@1f000000 {
-                       compatible = "qca,ar934x-spi";
-                       reg = <0x1f000000 0x1c>;
+                       compatible = "qca,ar7240-spi",
+                                       "qca,ar7100-spi";
+                       reg = <0x1f000000 0x10>;

                        clocks = <&pll ATH79_CLK_AHB>;
+                       clock-names = "ahb";
+
+                       status = "disabled";

                        #address-cells = <1>;
                        #size-cells = <0>;
-
-                       status = "disabled";
                };
        };
blocktrron commented 11 months ago

@grische Almost. Just revert this commit in the file:

https://github.com/openwrt/openwrt/commit/ebf0d8dadeca443121f4f597c51bf6591e341caf#diff-45ad725f9ec8cc2da88738047b1d5c4d1e69df19194bd22394d3736e03093613

grische commented 10 months ago

@blocktrron I was able to reproduce a hang after reboot even with the above commit reverted using Gluon v2023.1: https://gist.github.com/grische/27e4e780530f9a0795d96afaf749a4ed

Here is the respective branch: https://github.com/grische/site-ffm/commits/test/revert-ath79-add-new-ar934x-spi-driver/

blocktrron commented 10 months ago

@grische Are these hangs only reproducible after writing a upgrade image or does a regular reboot invocation also trigger a spurious hang?

grische commented 10 months ago

I have a test WDR4300 device where I can reproduce the hangs during a reboot every other time. Surprisingly often actually. This device has a manually installed serial port and a serial cable attached to the port.

grische commented 10 months ago

On the exact same setup, I tested it with

grische commented 10 months ago

The bug was fixed upstream in