freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
550 stars 325 forks source link

Frequent Reboots under 2020.1 with ar71xx-* #1982

Closed margau closed 4 years ago

margau commented 4 years ago

In Frankfurt, there are reports of nodes often rebooting since 2020.1. Some nodes are collected in https://md.margau.net/ffffm-nodereboot, and it seems that the problem is related to ar71xx-*. Our stats, e.g. https://freifunk.fail/d/000000002/nodes?orgId=1&from=now-90d&to=now&var-name=ORPLID-SEE, suggest that it not an memory leak.

Bug report

Nodes are frequently rebooting without any visible cause.

What is the expected behaviour? Nodes should not reboot. Was no problem before 2020.1

Gluon Version: Tag 2020.1 and Tag 2020.1.1

Site Configuration: https://github.com/freifunk-ffm/site-ffffm

Custom patches:

blocktrron commented 4 years ago

Can you acquire a crashtrace from the routers console? Otherwise it's pure guesswork what is going on here.

mweinelt commented 4 years ago

I can confirm more frequent reboots since v2020.1.

https://stats.darmstadt.freifunk.net/dashboard/snapshot/cb8TAPiqn16keAicL3eVp32z7hyFXlIB?orgId=1

rotanid commented 4 years ago

how big are the layer2 segments (domains, hoods, ...) in which these nodes reside? in darmstadt's case, they aren't big as far as i know...

jluebbe commented 4 years ago

We (Braunschweig) are seeing this as well. It's happening on mostly idle nodes with a few batman neighbors (~3) as well. I've got only one Oops captured so far, but will try collect more:

[ 9037.708855] Data bus error, epc == 8012b1c0, ra == 8012b1b4                                                                                                       
[ 9037.714630] Oops[#1]:                                                                                                                                                                      
[ 9037.716987] CPU: 0 PID: 11688 Comm: logger Not tainted 4.14.171 #0                                                                                                                         
[ 9037.723371] task: 818f2d00 task.stack: 807b2000                                                                                                                                            
[ 9037.728042] $ 0   : 00000000 fffffff8 80a3d000 81042070                                                                                                                                    
[ 9037.733454] $ 4   : 00000001 014000c0 00000000 00000004                                                                                                                                    
[ 9037.738856] $ 8   : 00000014 80070a24 8011bb3c 00000048                                                                                                                                    
[ 9037.744259] $12   : 7f99a180 77f422a0 00000000 00000000                                                                                                                                    
[ 9037.749662] $16   : 80a3d000 80a3d010 80460000 01400000                                                                                                                                    
[ 9037.755064] $20   : 77f2776c 00000000 00000000 00000000                                                                                                                                    
[ 9037.760468] $24   : 00459050 77f05d50                                                                                                                                                      
[ 9037.765871] $28   : 807b2000 807b3ea8 77f3bed4 8012b1b4                                                                                                                                    
[ 9037.771283] Hi    : 000002f4                                                                                                                                                               
[ 9037.774258] Lo    : 0001cae9                                                                                                                                                               
[ 9037.777239] epc   : 8012b1c0 0x8012b1c0                                                                                                                                                    
[ 9037.781197] ra    : 8012b1b4 0x8012b1b4                                                                                                                                                    
[ 9037.785154] Status: 1000f403 KERNEL EXL IE                                                                                                                                                 
[ 9037.789483] Cause : 5080801c (ExcCode 07)                                                                                                                                                  
[ 9037.793623] PrId  : 00019374 (MIPS 24Kc)                                                                                                                                                   
[ 9037.797670] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath mac80211 cfg80211 xt_FLOWOFFLOAD xt_CT wireguard nf_flow_table_hw nf_flow_table compat gpio_button_hotplug       
[ 9037.814325] Process logger (pid: 11688, threadinfo=807b2000, task=818f2d00, tls=77f43efc)                                                                                                  
[ 9037.822766] Stack : 00000000 00000006 00000000 80460000 00000000 00082000 ffffff9c 77f2776c                                                                                                
[ 9037.831412]         7f99a23c 77f3bef4 77f3cdb0 8011ba50 00000001 807b3f28 8040ef20 00000000                                                                                                
[ 9037.840059]         00000000 00000004 00100875 00002000 000000a0 00000004 00000100 00000001                                                                                                
[ 9037.848705]         7f99a374 77f2776c 00080000 77f251e8 00000000 7f99a23c 77f3bef4 8007088c                                                                                                
[ 9037.857352]         00000012 00000000 7f9d87a8 00000000 77f422a0 00000000 00000000 00459050                                                                                                
[ 9037.865998]         ...                                                                                                                                                                    
[ 9037.868525] Call Trace:                                                                                                                                                                    
[ 9037.868539] [<8011ba50>] 0x8011ba50 linux-4.14.171/fs/open.c:1078
[ 9037.874664] [<8007088c>] 0x8007088c linux-4.14.171/arch/mips/kernel/scall32-o32.S:104
[ 9037.878274] [<80070ad4>] 0x80070ad4 linux-4.14.171/arch/mips/kernel/scall32-o32.S:208
[ 9037.881880] [<80091134>] 0x80091134 linux-4.14.171/kernel/sys.c:877
[ 9037.885488]                                                                                                                                                                                
[ 9037.887027] Code: 10400025  24510010  00408025  24070ff0  02202025  02802825  0c0f7fb8  00e03025                                                                                 
[ 9037.897116]                                                                                                                                                                                
[ 9037.899783] ---[ end trace fe7a5f183ed40ba3 ]---                                                                                                                                           
[ 9037.907443] Kernel panic - not syncing: Fatal exception                                                                                                                                    
[ 9037.914339] Rebooting in 3 seconds..    

The trace location were decoded using addr2line.

jluebbe commented 4 years ago

Here are some more crashes.txt and the corresponding vmlinux.gz.

mweinelt commented 4 years ago

Do you have an idea what triggers this trace?

jluebbe commented 4 years ago

Do you have an idea what triggers this trace?

None so far. :/ It is very random (every few days to a few every day). So far it happens

The traces all seem to touch fs/open.c, so it's not completely random...

tackin commented 4 years ago

Hmmm ... Don't see this issue at FFTR.

oszilloskop commented 4 years ago

@tackin Often you have to look for nodes with Wifi-Mesh and with client connection. At the map of FFTR I found a lot of fresh rebooted Nodes. If you don't have Grafana or something like that you can search with Map-Filter for them.

Bildschirmfoto 2020-04-15 um 17 52 18

margau commented 4 years ago

An Crashlog from FFM (TL-WR1043N/ND v4, actually on master gluon-v2020.1-73-gc5f43add):

<1>[10076.502356] Data bus error, epc == 80152abc, ra == 80152abc
<4>[10076.508124] Oops[#1]:
<4>[10076.510472] CPU: 0 PID: 17820 Comm: fw3 Not tainted 4.14.172 #0
<4>[10076.516580] task: 82d285a0 task.stack: 81c2c000
<4>[10076.521253] $ 0   : 00000000 00000001 833fb000 833f9000
<4>[10076.526656] $ 4   : 81082070 00000001 005d3b8d 005d3b8e
<4>[10076.532059] $ 8   : 00000005 80071ea4 80147f78 00000000
<4>[10076.537462] $12   : 7fdbac80 77fe82a0 a8c8717c 7fffffff
<4>[10076.542865] $16   : 00000001 ffffff9c 0042325c 80570000
<4>[10076.548268] $20   : 01400000 00000001 00000000 ffffff69
<4>[10076.553670] $24   : 00423260 77f9d0a4
<4>[10076.559073] $28   : 81c2c000 81c2de08 00000000 80152abc
<4>[10076.564477] Hi    : 0000000c
<4>[10076.567442] Lo    : 00000007
<4>[10076.570435] epc   : 80152abc getname_flags+0x44/0x190
<4>[10076.575646] ra    : 80152abc getname_flags+0x44/0x190
<4>[10076.580856] Status: 1100dc03      KERNEL EXL IE
<4>[10076.585174] Cause : 4080801c (ExcCode 07)
<4>[10076.589305] PrId  : 00019750 (MIPS 74Kc)
<4>[10076.593345] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath smsc95xx rndis_host pl2303 mcs7830 mac80211 ftdi_sio dm9601 cfg80211 cdc_subset cdc_ether cdc_eem$
<4>[10076.653650] Process fw3 (pid: 17820, threadinfo=81c2c000, task=82d285a0, tls=77fe9efc)
<4>[10076.661818] Stack : 00000000 00000000 00000000 00000000 00000001 ffffff9c 81c2de70 ffffff9c
<4>[10076.670454]         0042325c 81c2deb8 000007ff 80152f34 ffffff9c 00000000 7fdbacb0 00000001
<4>[10076.679092]         7fdbacb0 00000001 00000800 801474f0 00000000 00000000 00000000 00000000
<4>[10076.687720]         00000000 82e15640 834c2028 83ad26f0 83ad2700 7fdbacb0 7fdbc7d7 ffffff9c
<4>[10076.696348]         7fdbacb0 0042325c 77fe8000 77fe2db0 00000000 80147a0c 834bfe58 82e15640
<4>[10076.704976]         ...
<4>[10076.707505] Call Trace:
<4>[10076.710030] [<80152abc>] getname_flags+0x44/0x190
<4>[10076.714888] [<80152f34>] user_path_at_empty+0x34/0x60
<4>[10076.720099] [<801474f0>] vfs_statx+0x90/0x10c
<4>[10076.724596] [<80147a0c>] SYSC_stat64+0x2c/0x50
<4>[10076.729197] [<800719cc>] syscall_common+0x34/0x58
<4>[10076.734051] Code: afb00010  0c04e467  00c0b025 <10400046> 24510010  00408025  ac510000  24070ff0  02202025
<4>[10076.744124]
<4>[10076.746591] ---[ end trace d947e05201486236 ]---
tackin commented 4 years ago

Arrrg ... my node uptime-monitoring failed, so yes we are effected as well: A crashlog of an TP-Link TL-WR1043N/ND v4 on gluon 2020.1.1

Time: 1586971311.906410
Modules:        ath9k@831e0000+1ba28    ath9k_common@8307c000+31ad      batman_adv@83080000+28e2b       ath9k_hw@83180000+5ac8c ath@830d0000+4c53       mac80211@83100000+71928 cfg80211@83040000+38dea xt_FLOWOFFLOAD@83b5f000+c20 xt_CT@83b52000+a90       nf_flow_table_hw@83b45000+870   nf_flow_table@83b58000+363f     compat@83b50000+1977    ledtrig_usbport@83b43000+940    ehci_platform@83a2e000+10f0     ehci_hcd@83ac0000+8ad7  gpio_button_hotplug@83ad6000+1450    usbcore@83b00000+21062  nls_base@83a2a000+1420  usb_common@83a22000+61e
<6>[    0.000000] PID hash table entries: 256 (order: -2, 1024 bytes)
<6>[    0.000000] Dentry cache hash table entries: 8192 (order: 3, 32768 bytes)
<6>[    0.000000] Inode-cache hash table entries: 4096 (order: 2, 16384 bytes)
<6>[    0.000000] Writing ErrCtl register=00000000
<6>[    0.000000] Readback ErrCtl register=00000000
<6>[    0.000000] Memory: 58680K/65536K available (4145K kernel code, 206K rwdata, 984K rodata, 288K init, 213K bss, 6856K reserved, 0K cma-reserved)
<6>[    0.000000] SLUB: HWalign=32, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
<6>[    0.000000] NR_IRQS: 51
<6>[    0.000000] Clocks: CPU:775.000MHz, DDR:650.000MHz, AHB:258.333MHz, Ref:25.000MHz
<6>[    0.000000] clocksource: MIPS: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 4932285024 ns
<6>[    0.000007] sched_clock: 32 bits at 387MHz, resolution 2ns, wraps every 5541893118ns
<6>[    0.008219] Calibrating delay loop... 385.84 BogoMIPS (lpj=1929216)
<6>[    0.074795] pid_max: default: 32768 minimum: 301
<6>[    0.079831] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.086824] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.096134] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 19112604462750000 ns
<6>[    0.106535] futex hash table entries: 256 (order: -1, 3072 bytes)
<6>[    0.113973] NET: Registered protocol family 16
<6>[    0.119920] MIPS: machine is TP-LINK TL-WR1043ND v4
<6>[    0.378656] clocksource: Switched to clocksource MIPS
<6>[    0.384931] NET: Registered protocol family 2
<6>[    0.390290] TCP established hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.397637] TCP bind hash table entries: 1024 (order: 0, 4096 bytes)
<6>[    0.404393] TCP: Hash tables configured (established 1024 bind 1024)
<6>[    0.411231] UDP hash table entries: 256 (order: 0, 4096 bytes)
<6>[    0.417396] UDP-Lite hash table entries: 256 (order: 0, 4096 bytes)
<6>[    0.424219] NET: Registered protocol family 1
<7>[    0.428864] PCI: CLS 0 bytes, default 32
<4>[    0.432224] Crashlog allocated RAM at address 0x3f00000
<6>[    0.439217] workingset: timestamp_bits=30 max_order=14 bucket_order=0
<6>[    0.450300] squashfs: version 4.0 (2009/01/31) Phillip Lougher
<6>[    0.456445] jffs2: version 2.2 (NAND) (SUMMARY) (LZMA) (RTIME) (CMODE_PRIORITY) (c) 2001-2006 Red Hat, Inc.
<6>[    0.475432] io scheduler noop registered
<6>[    0.479607] io scheduler deadline registered (default)
<6>[    0.485382] Serial: 8250/16550 driver, 16 ports, IRQ sharing enabled
<6>[    0.493866] console [ttyS0] disabled
<6>[    0.517712] serial8250.0: ttyS0 at MMIO 0x18020000 (irq = 11, base_baud = 1562500) is a 16550A
<6>[    0.526816] console [ttyS0] enabled
<6>[    0.534373] bootconsole [early0] disabled
<4>[    0.549047] m25p80 spi0.0: found w25q128, expected m25p80
<6>[    0.565778] m25p80 spi0.0: w25q128 (16384 Kbytes)
<5>[    0.570689] 7 cmdlinepart partitions found on MTD device spi0.0
<5>[    0.576802] Creating 7 MTD partitions on "spi0.0":
<5>[    0.581774] 0x000000000000-0x000000020000 : "u-boot"
<5>[    0.588387] 0x000000020000-0x000000f50000 : "firmware"
<5>[    0.632250] 2 tplink-fw partitions found on MTD device firmware
<5>[    0.638377] 0x000000020000-0x0000001d3226 : "kernel"
<5>[    0.644150] 0x0000001d3226-0x000000f50000 : "rootfs"
<5>[    0.649901] mtd: device 3 (rootfs) set to be root filesystem
<5>[    0.655781] 1 squashfs-split partitions found on MTD device rootfs
<5>[    0.662216] 0x000000420000-0x000000f50000 : "rootfs_data"
<5>[    0.670229] 0x000000f50000-0x000000f70000 : "product-info"
<5>[    0.677054] 0x000000f70000-0x000000fc0000 : "config"
<5>[    0.683950] 0x000000fc0000-0x000000fd0000 : "partition-table"
<5>[    0.691085] 0x000000fd0000-0x000000ff0000 : "logs"
<5>[    0.697778] 0x000000ff0000-0x000001000000 : "ART"
<6>[    0.705427] libphy: Fixed MDIO Bus: probed
<6>[    0.711363] tun: Universal TUN/TAP device driver, 1.6
<6>[    0.733056] switch0: Atheros AR8337 rev. 2 switch registered on ag71xx-mdio.0
<6>[    1.448767] libphy: ag71xx_mdio: probed
<6>[    2.080290] ag71xx ag71xx.0: connected to PHY at ag71xx-mdio.0:00 [uid=004dd036, driver=Atheros AR8216/AR8236/AR8316]
<6>[    2.091767] eth0: Atheros AG71xx at 0xb9000000, irq 4, mode: sgmii
<6>[    2.099616] nf_conntrack version 0.5.0 (1024 buckets, 4096 max)
<6>[    2.105928] xt_time: kernel timezone is -0000
<6>[    2.110865] ip_tables: (C) 2000-2006 Netfilter Core Team
<6>[    2.117053] NET: Registered protocol family 10
<6>[    2.126718] Segment Routing with IPv6
<6>[    2.130653] ip6_tables: (C) 2000-2006 Netfilter Core Team
<6>[    2.136417] NET: Registered protocol family 17
<6>[    2.141085] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
<6>[    2.154474] Ebtables v2.0 registered
<6>[    2.158842] 8021q: 802.1Q VLAN Support v1.8
<6>[    2.164610] hctosys: unable to open rtc device (rtc0)
<6>[    2.175403] VFS: Mounted root (squashfs filesystem) readonly on device 31:3.
<6>[    2.184299] Freeing unused kernel memory: 288K
<4>[    2.188916] This architecture does not have kernel memory protection.
<5>[    2.558690] random: fast init done
<14>[    2.660005] init: Console is alive
<14>[    2.663731] init: - watchdog -
<14>[    3.344490] kmodloader: loading kernel modules from /etc/modules-boot.d/*
<6>[    3.385569] usbcore: registered new interface driver usbfs
<6>[    3.391350] usbcore: registered new interface driver hub
<6>[    3.396911] usbcore: registered new device driver usb
<6>[    3.407072] ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
<6>[    3.415300] ehci-platform: EHCI generic platform driver
<6>[    3.420832] ehci-platform ehci-platform.0: EHCI Host Controller
<6>[    3.426979] ehci-platform ehci-platform.0: new USB bus registered, assigned bus number 1
<6>[    3.437442] ehci-platform ehci-platform.0: irq 48, io mem 0x1b000000
<6>[    3.468694] ehci-platform ehci-platform.0: USB 2.0 started, EHCI 1.00
<6>[    3.476007] hub 1-0:1.0: USB hub found
<6>[    3.480221] hub 1-0:1.0: 1 port detected
<6>[    3.484678] ehci-platform ehci-platform.1: EHCI Host Controller
<6>[    3.490868] ehci-platform ehci-platform.1: new USB bus registered, assigned bus number 2
<6>[    3.501343] ehci-platform ehci-platform.1: irq 49, io mem 0x1b400000
<6>[    3.528700] ehci-platform ehci-platform.1: USB 2.0 started, EHCI 1.00
<6>[    3.536010] hub 2-0:1.0: USB hub found
<6>[    3.540261] hub 2-0:1.0: 1 port detected
<14>[    3.545225] kmodloader: done loading kernel modules from /etc/modules-boot.d/*
<14>[    3.563121] init: - preinit -
<6>[    4.121053] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
<5>[    4.145336] random: procd: uninitialized urandom read (4 bytes read)
<6>[    5.150321] eth0: link up (1000Mbps/Full duplex)
<6>[    5.155299] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
<5>[    7.367239] jffs2: notice: (490) jffs2_build_xattr_subsystem: complete building xattr subsystem, 19 of xdatum (2 unchecked, 17 orphan) and 39 of xref (17 dead, 0 orphan) found.
<14>[    7.385403] mount_root: switching to jffs2 overlay
<4>[    7.400586] overlayfs: upper fs does not support tmpfile.
<12>[    7.411600] urandom-seed: Seeding with /etc/urandom.seed
<6>[    7.510483] eth0: link down
<14>[    7.524414] procd: - early -
<14>[    7.527478] procd: - watchdog -
<14>[    8.110785] procd: - watchdog -
<14>[    8.114307] procd: - ubus -
<5>[    8.149661] random: ubusd: uninitialized urandom read (4 bytes read)
<5>[    8.168255] random: ubusd: uninitialized urandom read (4 bytes read)
<5>[    8.175262] random: ubusd: uninitialized urandom read (4 bytes read)
<14>[    8.182751] procd: - init -
<14>[    8.737767] kmodloader: loading kernel modules from /etc/modules.d/*
<6>[    8.792142] Loading modules backported from Linux version v4.19.98-0-gd183c8e2647a
<6>[    8.800009] Backport generated by backports.git v4.19.98-1-0-g8204eb99
<14>[    9.114842] urngd: v1.0.2 started.
<6>[    9.247799] batman_adv: B.A.T.M.A.N. advanced openwrt-2019.2-4 (compatibility version 15) loaded
<7>[    9.284444] ath: EEPROM regdomain: 0x0
<7>[    9.284452] ath: EEPROM indicates default country code should be used
<7>[    9.284455] ath: doing EEPROM country->regdmn map search
<7>[    9.284468] ath: country maps to regdmn code: 0x3a
<7>[    9.284472] ath: Country alpha2 being used: US
<7>[    9.284475] ath: Regpair used: 0x3a
<7>[    9.297028] ieee80211 phy0: Selected rate control algorithm 'minstrel_ht'
<6>[    9.298846] ieee80211 phy0: Atheros AR9561 Rev:0 mem=0xb8100000, irq=47
<14>[    9.389079] kmodloader: done loading kernel modules from /etc/modules.d/*
<5>[    9.432636] random: crng init done
<5>[    9.436156] random: 6 urandom warning(s) missed due to ratelimiting
<6>[   18.890100] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
<6>[   18.898002] eth0: link up (1000Mbps/Full duplex)
<6>[   18.908772] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
<6>[   18.920004] br-client: port 1(eth0.1) entered blocking state
<6>[   18.925856] br-client: port 1(eth0.1) entered disabled state
<6>[   18.932020] device eth0.1 entered promiscuous mode
<6>[   18.936972] device eth0 entered promiscuous mode
<6>[   19.011352] br-client: port 1(eth0.1) entered blocking state
<6>[   19.017206] br-client: port 1(eth0.1) entered forwarding state
<6>[   19.023408] IPv6: ADDRCONF(NETDEV_UP): br-client: link is not ready
<6>[   19.212705] br-wan: port 1(eth0.2) entered blocking state
<6>[   19.218294] br-wan: port 1(eth0.2) entered disabled state
<6>[   19.224176] device eth0.2 entered promiscuous mode
<6>[   19.341154] br-wan: port 1(eth0.2) entered blocking state
<6>[   19.346745] br-wan: port 1(eth0.2) entered forwarding state
<6>[   19.498937] IPv6: ADDRCONF(NETDEV_UP): local-node: link is not ready
<6>[   19.612478] IPv6: ADDRCONF(NETDEV_CHANGE): local-node: link becomes ready
<6>[   19.687972] br-client: port 2(local-port) entered blocking state
<6>[   19.694233] br-client: port 2(local-port) entered disabled state
<6>[   19.808797] device local-port entered promiscuous mode
<6>[   19.861825] br-client: port 2(local-port) entered blocking state
<6>[   19.868039] br-client: port 2(local-port) entered forwarding state
<6>[   19.878877] IPv6: ADDRCONF(NETDEV_CHANGE): br-client: link becomes ready
<6>[   20.330577] 8021q: adding VLAN 0 to HW filter on device bat0
<6>[   20.337952] br-client: port 3(bat0) entered blocking state
<6>[   20.343681] br-client: port 3(bat0) entered disabled state
<6>[   20.349621] device bat0 entered promiscuous mode
<6>[   20.354507] br-client: port 3(bat0) entered blocking state
<6>[   20.360226] br-client: port 3(bat0) entered forwarding state
<6>[   20.828718] batman_adv: bat0: No IGMP Querier present - multicast optimizations disabled
<6>[   20.837085] batman_adv: bat0: No MLD Querier present - multicast optimizations disabled
<6>[   22.345251] batman_adv: bat0: Adding interface: primary0
<6>[   22.350802] batman_adv: bat0: Interface activated: primary0
<7>[   22.368161] ath: EEPROM regdomain: 0x8114
<7>[   22.368169] ath: EEPROM indicates we should expect a country code
<7>[   22.368172] ath: doing EEPROM country->regdmn map search
<7>[   22.368176] ath: country maps to regdmn code: 0x37
<7>[   22.368181] ath: Country alpha2 being used: DE
<7>[   22.368184] ath: Regpair used: 0x37
<7>[   22.368189] ath: regdomain 0x8114 dynamically updated by user
<6>[   22.731860] batman_adv: bat0: Interface deactivated: primary0
<6>[   22.799851] IPv6: ADDRCONF(NETDEV_UP): primary0: link is not ready
<6>[   22.807681] batman_adv: bat0: Interface activated: primary0
<6>[   26.150220] IPv6: ADDRCONF(NETDEV_UP): wlan0-1: link is not ready
<6>[   26.256791] br-wan: port 2(wlan0-1) entered blocking state
<6>[   26.262515] br-wan: port 2(wlan0-1) entered disabled state
<6>[   26.268467] device wlan0-1 entered promiscuous mode
<6>[   27.835987] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0-1: link becomes ready
<6>[   27.842949] br-wan: port 2(wlan0-1) entered blocking state
<6>[   27.848627] br-wan: port 2(wlan0-1) entered forwarding state
<6>[   27.949908] br-client: port 4(client0) entered blocking state
<6>[   27.955853] br-client: port 4(client0) entered disabled state
<6>[   27.962564] device client0 entered promiscuous mode
<6>[   28.015800] IPv6: ADDRCONF(NETDEV_UP): client0: link is not ready
<6>[   28.022263] br-client: port 4(client0) entered blocking state
<6>[   28.028215] br-client: port 4(client0) entered forwarding state
<6>[   28.139227] IPv6: ADDRCONF(NETDEV_CHANGE): client0: link becomes ready
<6>[   28.502965] IPv6: ADDRCONF(NETDEV_UP): mesh0: link is not ready
<6>[   28.682873] IPv6: ADDRCONF(NETDEV_CHANGE): mesh0: link becomes ready
<6>[   28.998751] batman_adv: bat0: IGMP Querier appeared
<6>[   29.003801] batman_adv: bat0: MLD Querier appeared
<6>[   29.766543] batman_adv: bat0: Adding interface: mesh0
<6>[   29.771823] batman_adv: bat0: Interface activated: mesh0
<6>[   31.169543] batman_adv: bat0: Adding interface: mesh-vpn
<6>[   31.175103] batman_adv: bat0: The MTU of interface mesh-vpn is too small (1406) to handle the transport of batman-adv packets. Packets going over this interface will be fragmented on layer2 which could impact the performance. Setting the MTU to 1532 would solve the problem.
<6>[   31.200644] batman_adv: bat0: Interface activated: mesh-vpn
<1>[33430.557715] Data bus error, epc == 804656a0, ra == 80152b00
<4>[33430.563482] Oops[#1]:
<4>[33430.565830] CPU: 0 PID: 1050 Comm: netifd Not tainted 4.14.171 #0
<4>[33430.572117] task: 8389e1c0 task.stack: 833ae000
<4>[33430.576790] $ 0   : 00000000 805e0000 83243000 83242000
<4>[33430.582192] $ 4   : 83243010 7f81853c 00000ff0 00000ff0
<4>[33430.587595] $ 8   : 00000154 80071ca4 80147e54 00000001
<4>[33430.592998] $12   : 7f818518 77faa2a0 2f736269 77faa2a0
<4>[33430.598401] $16   : 83243000 83243010 7f81853c 80570000
<4>[33430.603804] $20   : 01400000 00004000 833afef4 fffffffe
<4>[33430.609207] $24   : 00000000 77f7c5f4
<4>[33430.614610] $28   : 833ae000 833afe80 ffffff69 80152b00
<4>[33430.620013] Hi    : 00000002
<4>[33430.622979] Lo    : 00000000
<4>[33430.625967] epc   : 804656a0 __strncpy_from_kernel_asm+0x0/0x70
<4>[33430.632095] ra    : 80152b00 getname_flags+0x68/0x190
<4>[33430.637306] Status: 1100dc03      KERNEL EXL IE
<4>[33430.641625] Cause : 0080041c (ExcCode 07)
<4>[33430.645766] PrId  : 00019750 (MIPS 74Kc)
<4>[33430.649814] Modules linked in: ath9k ath9k_common batman_adv ath9k_hw ath mac80211 cfg80211 xt_FLOWOFFLOAD xt_CT nf_flow_table_hw nf_flow_table compat ledtrig_usbport ehci_platform ehci_hcd gpio_button_hotplug usbcore nls_base usb_common
<4>[33430.671564] Process netifd (pid: 1050, threadinfo=833ae000, task=8389e1c0, tls=77fabefc)
<4>[33430.679910] Stack : 3aff71fd 00000000 00000000 00008295 00004000 ffffff9c 833afef8 7f818508
<4>[33430.688547]         00600000 00700000 ffffffea 80152f54 00000000 83b99320 0000000a 77ef8408
<4>[33430.697185]         00000011 7f81853c 00000011 80147ed4 00000000 80189aa8 00402455 800cbf64
<4>[33430.705822]         833afef4 0000001c 00008295 00000002 7f818754 00000000 838158d0 83578bb0
<4>[33430.714450]         00000011 7f818520 7f818754 00402455 77f7ed88 77faa000 77fa4db0 00000000
<4>[33430.723078]         ...
<4>[33430.725607] Call Trace:
<4>[33430.728130] [<804656a0>] __strncpy_from_kernel_asm+0x0/0x70
<4>[33430.733885] [<80152b00>] getname_flags+0x68/0x190
<4>[33430.738742] [<80152f54>] user_path_at_empty+0x34/0x60
<4>[33430.743954] [<80147ed4>] SyS_readlink+0x80/0x144
<4>[33430.748734] [<800719cc>] syscall_common+0x34/0x58
<4>[33430.753596] Code: 00000000  00000000  00000000 <8f820014> 00451024  14400013  00000000  00004025  00a01825
<4>[33430.763668]
<4>[33430.765771] ---[ end trace a11beaded94e5b58 ]--- 
jluebbe commented 4 years ago
<4>[10076.570435] epc : 80152abc getname_flags+0x44/0x190 <4>[10076.575646] ra : 80152abc getname_flags+0x44/0x190

This is the same spot we're hitting.

herrbett commented 4 years ago

Looking at the Reboot statistics of Freifunk Karlsruhe, I also see a lot of reboots for devices in the ath79-generic target. (devolo wifi pro 1200e and devolo wifi pro 1200i)

Ill try to obtain a crash log as soon as possible from a device in the ath79-generic target.

https://stats.karlsruhe.freifunk.net/d/000000018/nodes-summary?orgId=2&refresh=5m&from=1586417565351&to=1587022365351&fullscreen&panelId=8

Delta1977 commented 4 years ago

I see this Problem on TP-Link TL-WR940N v4 which was upgraded from 2019.1.1 to 2020.1. Memoryusage was over 80%, CPU 80-90 % After a "Upgrade" from 2020.1 to 2020.1 without applying the old settings in WEB-UI the node runs in same location very fine with 60-70 % Memory usage and smooth cpu

kpanic23 commented 4 years ago

Here a couple I have access to: 841v10-crashlog.txt 841v9-crashlog.txt unifi-crashlog.txt Whereas the Unifi seems to have had a classic OOM. The 841v10 is upgraded to 64MB RAM.

--EDIT Here's my vmlinux file, maybe someone more capable than I can make more use of it... vmlinux.debug.gz

jluebbe commented 4 years ago

@kpanic23 Do you still have the corresponding vmlinux, so you could use addr2line to decode the addresses?

kpanic23 commented 4 years ago

Should still be on my build server. Let me see.

kpanic23 commented 4 years ago

Okay, got it. What now?

mweinelt commented 4 years ago

Use https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/scripts/faddr2line to decode the addresses. README inline.

kpanic23 commented 4 years ago

Sure, but which addresses? I suppose the Program Counter?

kpanic@gemaecht:~/src/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny$ /home/kpanic/src/gluon/openwrt/staging_dir/toolchain-mips_24kc_gcc-7.5.0_musl/bin/mips-openwrt-linux-addr2line -f -e vmlinux.debug 8012afb4
getname_flags
??:?
kpanic@gemaecht:~/src/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny$ /home/kpanic/src/gluon/openwrt/staging_dir/toolchain-mips_24kc_gcc-7.5.0_musl/bin/mips-openwrt-linux-addr2line -f -e vmlinux.debug 80117b00
__kmalloc_track_caller
??:?
mweinelt commented 4 years ago

No, the addresses in the call trace, though pc and rip are also fine.

kpanic23 commented 4 years ago

0x802a59cc: pskb_expand_head ??:?

0x802a19bc: __kmalloc_reserve.isra.32 skbuff.c:?

0x80116a10: ___slab_alloc.isra.88.constprop.94 slub.c:?

0x802a59cc: pskb_expand_head ??:?

0x82ce38dc: ?? ??:0

0x8335500c: ?? ??:0

0x83353c04: ?? ??:0

0x800a2df0: dequeue_task_fair fair.c:?

0x83354460: ?? ??:0

0x803e5718: __schedule core.c:?

0x80095fe8: process_one_work workqueue.c:?

0x80096638: worker_thread workqueue.c:?

0x800963a4: worker_thread workqueue.c:?

0x8009b42c: kthread kthread.c:?

0x8009b324: kthread kthread.c:?

0x8009b324: kthread kthread.c:?

0x80066b18: ret_from_kernel_thread ??:?

kpanic23 commented 4 years ago

And for the v9:

8011b890: do_sys_open ??:?

8007086c: syscall_common arch/mips/kernel/scall32-o32.o:?

kpanic23 commented 4 years ago

I have no idea why I'm missing the line numbers. I've attached my vmlinux file to my prior post (https://github.com/freifunk-gluon/gluon/issues/1982#issuecomment-614627463), maybe someone else can have a look?

jluebbe commented 4 years ago

I have no idea why I'm missing the line numbers. I've attached my vmlinux file to my prior post (#1982 (comment)), maybe someone else can have a look?

Your vmlinux doesn't contain debug info. Did you use openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_tiny/linux-4.14.171/vmlinux? For me, that's ~26MB.

We have the following change in our gluon repo:

diff --git a/targets/generic b/targets/generic
index 65982ef4bfbd..9b34563e5d54 100644
--- a/targets/generic
+++ b/targets/generic
@@ -44,6 +44,8 @@ config 'CONFIG_PACKAGE_ATH_DEBUG=y'
 config '# CONFIG_KERNEL_IP_MROUTE is not set'
 config '# CONFIG_KERNEL_IPV6_MROUTE is not set'

+config 'CONFIG_COLLECT_KERNEL_DEBUG=y'
+
 try_config 'CONFIG_TARGET_MULTI_PROFILE=y'
 try_config 'CONFIG_TARGET_PER_DEVICE_ROOTFS=y'

This produces an openwrt/bin/targets/*/kernel-debug.tar.bz2 which you can keep with your images. I'm not sure if there's a better way to enable this via the site repo.

jluebbe commented 4 years ago

Using only the symbols, I found:

841v9-crashlog.txt

841v10-crashlog.txt

unifi-crashlog.txt

margau commented 4 years ago

We have disabled uhttpd on a node as @blocktrron suggested, but it doesn't really seem to help: https://freifunk.fail/d/000000002/nodes?orgId=1&var-name=60487-FrickelFritze&from=now-7d&to=now

blocktrron commented 4 years ago

This crashes an AC Mesh reproducible after around 1 - 2 million iterations (1-2 minutes).

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

char buf[255];
char path[] = "/lib/firmware/wireless";

int main(int argc, char *argv[])
{
    for (int i = 0; 1; i++) {
        readlink(path, buf, 255);
        printf("%d\n", i);
    }
    return 0;
}

So this might be helpful to determine whether something fixed the issue or not.

belzebub40k commented 4 years ago

@blocktrron what is the best way to compile this?

blocktrron commented 4 years ago

Download the OpenWrt 19.07 SDK for ar71xx and compile it using the GCC from staging_dir/toolchain-mips_24kc_gcc-7.5.0_musl/bin/.

blocktrron commented 4 years ago

Aforementioned snippet still seems to crash ar71xx when using next. However we passed >15 million iterations on a ath79 board without any issues so far (Kernel 4.19).

Will try to get a crashlog tomorrow.

belzebub40k commented 4 years ago

@blocktrron Is /lib/firmware/wireless a path that should exist? On an Unifi AC Lite I have only these files/folders.

# ls -lh /lib/firmware
drwxr-xr-x    1 root     root           0 Apr  5 19:23 ath10k
-rw-r--r--    1 root     root        4.1K Mar 31 21:50 regulatory.db
belzebub40k commented 4 years ago

Got it to crash on a Ubiquiti Unifi AC Lite with blocktrons code. 1.919.677 iterations.

[56463.575330] Call Trace:
[56463.577853] getname_flags (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/namei.c:139) 
[56463.582713] user_path_at_empty (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/namei.c:2633) 
[56463.587932] SyS_readlink (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:394 /gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:382 /gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:421 /gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/fs/stat.c:418) 
[56463.592712] syscall_common (/gluon/gluon/openwrt/build_dir/target-mips_24kc_musl/linux-ar71xx_generic/linux-4.14.176/arch/mips/kernel/scall32-o32.S:104) 

crash1-decoded.txt

belzebub40k commented 4 years ago

Produced an other crash on a TP-Link TL-WR842v3.

crash1-wr842v3-decoded.txt

neocturne commented 4 years ago

You should set CROSS_COMPILE to a sensible value when using the decode script, dumping MIPS machine code as x86-64 assembly is not very helpful (only affects the assembly dump part of the log though).

belzebub40k commented 4 years ago

I decoded the traces again and used _openwrt-sdk-19.07.2-ar71xx-generic_gcc-7.5.0_musl.Linux-x8664 in _CROSSCOMPILE.

crash1-unifi-ac-lite-decoded2.txt crash1-wr842v3-decoded2.txt

neocturne commented 4 years ago

So I've had a short look at this issue, and it seems the trap occurs exactly on the return from __getname() - which is just a define for kmem_cache_alloc(). It is unclear to me how such an error could possibly happen there - neither the instruction at epc (first instruction after return from kmem_cache_alloc()), nor the last instruction of kmem_cache_alloc() access any memory...

neocturne commented 4 years ago

Well, this is just great. From the MIPS 24K manual:

[...] a data-side bus error is usually caused by a load, and the (non-blocking) load which caused it may have happened a long time before the busy cycle finishes and the error is signalled. So a bus error exception caused by a load or store is imprecise; EPC does not necessarily (or even usually) point to the instruction causing the memory read.

I think it is likely that the error actually occurs in kmem_cache_alloc(), but I'll have to look a bit deeper to be sure.

neocturne commented 4 years ago

I recommend using this variant of @blocktrron's test program (only one line of output every 2^16 loop cycles, allowing it to run much faster):

#include <unistd.h>
#include <stdlib.h>
#include <stdio.h>

char buf[255];
char path[] = "/lib/firmware/wireless";

int main(int argc, char *argv[])
{
        for (unsigned i = 0; 1; i++) {
                readlink(path, buf, 255);
                if (i % 65536 == 0)
                        printf("%08x\n", i);
        }
        return 0;
}

I have also found a potential fix: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/mm/slub.c?id=0882ff9190e3bc51e2d78c3aadd7c690eeaa91d5

Backporting this patch from Linux 4.19 makes the issue disappear on my TL-WR841N v9 (running for ~300M loop cycles so far). I'll open a PR, so the change can get a bit more testing.

rotanid commented 4 years ago

maybe part of these reboots are also due to #2032