freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io

Ubiquiti UniFi AC Mesh broken after upgrading from v2022.1.1 to v2022.1.2 #2780

Closed grische closed 1 year ago

grische commented 1 year ago

Bug report

What is the problem? 3 out of 4 upgraded AC Mesh devices failed to come back up. Unfortunately, we only received details for one of them:

Power-cycling the AC Mesh made no difference. It gets an IP address but does not broadcast the SSID. It is also not reachable via IPv6, but SSH to the local IP works. Ping to google.de does not work, but ping to Google's address 2a00:1450:4001:829::2003 works.

logread is flooded with the following errors:

gluon_bat0 (30227): Error - can't open file '/sys/module/batman_adv/parameters/routing_algo': No such file or directory
Sun Feb  5 15:56:52 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:52 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:52 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:53 2023 daemon.notice netifd: gluon_bat0 (30227): Error - failed to add create batman-adv interface: Not supported
Sun Feb  5 15:56:53 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:53 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:53 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:53 2023 daemon.err modprobe: no module folders for kernel version 5.10.146 found
Sun Feb  5 15:56:53 2023 daemon.notice netifd: gluon_bat0 (30227): Error - interface bat0 is not present or not a batman-adv interface

What is the expected behaviour? That it survives an upgrade from v2022.1.1 to v2022.1.2

Gluon Version: v2022.1.2

Site Configuration: https://github.com/freifunkMUC/site-ffm/blob/stable/site.conf

Custom patches: https://github.com/freifunkMUC/site-ffm/tree/stable/patches

cc @rotanid @ecsv

ecsv commented 1 year ago

At first glance (going by the only existing log), it looks like it was not installed correctly when switching from Ubiquiti's OS to gluon. In the past, we saw that people installed the AC Mesh using a wrong tutorial. That tutorial said you had to write the gluon image to both kernel partitions, and it omitted the important fact that the bs partition had to be modified to let u-boot select kernel0 for loading the kernel image. As a result, after the next big upgrade (with a new kernel version), the rootfs (which always comes from kernel0 for OpenWrt) no longer matched the kernel (which was loaded from kernel1).

Could you check whether the original installation method from https://wiki.freifunk.net/Ubiquiti_Unifi_AC/Flash-Anleitung_mit_mtd was used? See also https://openwrt.org/toh/ubiquiti/unifiac#non-invasive_method_using_mtd

Can you check the bs partition (which should start with 0) and the kernel1/ubnt-airos partition (which should be completely empty, i.e. all 0xff or 0x00) with hexdump -C ...? Use cat /proc/mtd to find the actual /dev/mtd* devices to pass to hexdump -C. Then check whether /lib/modules/ contains the directory for a different kernel version.
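The lookup described above can be sketched in shell. This is only an illustration: the helper names `mtd_index_by_name` and `first_byte_hex` are made up for this example, the parsing assumes the usual `/proc/mtd` layout (`mtdN: size erasesize "name"`) with partition names that contain no spaces, and `od` is assumed to be available (the thread itself uses `hexdump -C`).

```shell
# Print the index N of the MTD partition called NAME, reading a table in
# /proc/mtd format (lines such as: mtd7: <size> <erasesize> "bs").
# Hypothetical helper; the optional second argument exists only so the
# function can be tried against a saved copy of the table.
mtd_index_by_name() {
    name=$1
    table=${2:-/proc/mtd}
    awk -v want="\"$name\"" '
        $4 == want {
            sub(/^mtd/, "", $1)   # strip the "mtd" prefix
            sub(/:$/, "", $1)     # strip the trailing colon
            print $1
            exit
        }' "$table"
}

# Print the first byte of FILE as two lowercase hex digits.
# Hypothetical helper; prints nothing for an empty file.
first_byte_hex() {
    od -An -tx1 -N1 "$1" | tr -d ' \n'
}
```

On a node, `first_byte_hex "/dev/mtd$(mtd_index_by_name bs)ro"` would then print the byte that is expected to be 00 on a correctly installed device.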

I would assume (if it was installed using the wrong method) that:

In this case, someone copied the v2022.1.1 sysupgrade image to both the kernel0 and kernel1 partitions when switching from Ubiquiti's OS to gluon. Now, during the sysupgrade, OpenWrt/gluon installed the new image to kernel0. But during the actual boot, the kernel is loaded from kernel1. Unfortunately, the rootfs is still taken from kernel0. As a result, the kernel modules from the rootfs do not match the kernel that was loaded.
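A quick way to test for this mismatch is to compare the running kernel version with the module directories in the rootfs. The following is a minimal sketch; the function name `has_modules_for` is made up for this example, and it only checks that a directory exists, not that the modules inside are loadable.

```shell
# Succeed if MODROOT (default: /lib/modules) contains a directory for
# kernel version KVER. Hypothetical helper for illustration only.
has_modules_for() {
    kver=$1
    modroot=${2:-/lib/modules}
    [ -d "$modroot/$kver" ]
}

# Report on the running system. A node in the state described above would
# be expected to take the second branch, matching the modprobe message
# "no module folders for kernel version ... found" in the log.
running=$(uname -r)
if has_modules_for "$running"; then
    echo "modules found for running kernel $running"
else
    echo "no modules for running kernel $running; available: $(ls /lib/modules 2>/dev/null | tr '\n' ' ')"
fi
```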

It would be interesting to know whether you can falsify this. For the device(s?) which disappeared here, it is more likely that the installation was done correctly (so kernel1 was never used) - but I cannot check myself what happened to these devices (at the moment). And our device(s?) were upgraded from v2021.1.2 to v2022.1.2.

grische commented 1 year ago

Indeed, they used this installation method to install Gluon on the AC Mesh: https://forum.darmstadt.freifunk.net/t/unifi-ap-erstinstallation/790 That guide also instructs users to overwrite kernel1 together with kernel0.

kernel1 is not completely empty

kernel1 (mtd6) starts with 00000000 27 05 19 56 02 00 50 19 5f e1 69 75 00 21 83 be |'..V..P._.iu.!..| and is filled up pretty well. Similarly, kernel (mtd3) starts with 00000000 27 05 19 56 02 00 50 19 5f e1 69 75 00 21 83 be |'..V..P._.iu.!..|.

bs partition is not starting with zero

bs (mtd7) does not start with 00:

# hexdump -C /dev/mtd7ro
00000000  80 00 00 00 a3 4d e8 2b  00 00 00 00 00 00 00 00  |.....M.+........|
00000010  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00020000

/lib/modules/5.10.161/ exists (from the kernel of 2022.1.2)

# uname -a
Linux myhostname 5.10.146 #0 Tue Dec 22 03:35:17 2020 mips GNU/Linux
# ls -al /lib/modules/5.10.161/
drwxr-xr-x    2 root     root          2531 Dec 22  2020 .
drwxr-xr-x    3 root     root            31 Dec 22  2020 ..
-rw-r--r--    1 root     root         10164 Dec 22  2020 act_csum.ko
-rw-r--r--    1 root     root          6176 Dec 22  2020 act_gact.ko
-rw-r--r--    1 root     root          9644 Dec 22  2020 act_mirred.ko
-rw-r--r--    1 root     root          9084 Dec 22  2020 act_pedit.ko
-rw-r--r--    1 root     root          5732 Dec 22  2020 act_simple.ko
-rw-r--r--    1 root     root          7452 Dec 22  2020 act_skbedit.ko
-rw-r--r--    1 root     root         27192 Dec 22  2020 ath.ko
...

rotanid commented 1 year ago

we also have this issue with all of our UniFi AC Mesh devices. other devices, even very similar ones, are not affected, e.g. the UniFi AC LR access point.

@ecsv i'm fairly certain we did also flash them correctly...

ecsv commented 1 year ago

@rotanid I didn't mean you. I was talking about the log at the beginning of this ticket. As written on IRC, I also assume that something is wrong with the AC Mesh, but I wanted to avoid spending time on the wrong installation method (which is likely the reason for the log shown by @grische).

@mweinelt Maybe you want to change the initial installation guide linked by @grische so that it no longer writes an image to kernel1, to avoid such behavior. I think neoraider recommended the mtd erase of kernel1 a while back.

@grische please use the following to switch to kernel0 for booting:

. /lib/functions.sh

# if it is called kernel1 for you (might fail because kernel1 is read-only for you)
mtd erase /dev/mtd$(find_mtd_index "kernel1")

# or if it is still called ubnt-airos for you (might fail because ubnt-airos is read-only for you)
mtd erase /dev/mtd$(find_mtd_index "ubnt-airos")

dd if=/dev/zero bs=1 count=1 of=/dev/mtd$(find_mtd_index "bs")
reboot
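
To confirm that the erase took effect (before the reboot, or once the node is back up), the partition can be checked for leftover data. This is a rough sketch: `is_blank` is a made-up helper name, it follows the definition of "empty" used earlier in the thread (only 0xff or 0x00 bytes), and it assumes a `tr` that accepts octal escapes for those two byte values.

```shell
# Succeed if FILE contains only 0xff and/or 0x00 bytes.
# Hypothetical helper; note that an empty file also passes.
is_blank() {
    # Delete every 0x00 and 0xff byte and count what is left over.
    rest=$(LC_ALL=C tr -d '\000\377' < "$1" | wc -c | tr -d ' ')
    [ "$rest" -eq 0 ]
}
```

With /lib/functions.sh sourced as above, `is_blank "/dev/mtd$(find_mtd_index "kernel1")" && echo "kernel1 is empty"` would report whether any image data remains in kernel1.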

rotanid commented 1 year ago

@ecsv thanks for pointing this out.

i suggest once the flashing issue has been checked and maybe ruled out by @grische , we/i open a new issue which contains a description matching the "bigger" problem with the device.

blocktrron commented 1 year ago

Duplicate of #1301