freifunk-gluon / gluon

a modular framework for creating OpenWrt-based firmwares for wireless mesh nodes
https://gluon.readthedocs.io
Other
553 stars 325 forks source link

wr841n reboots spontaneously - out of memory? #540

Closed jannic closed 9 years ago

jannic commented 9 years ago

Trying out current development firmware (commit f6f8350) on a wr841n v9, I get spontaneous reboots. In most cases, the router reboots ~30s after enabling the network.

The router is connected to the ffac mesh via wifi, and to a client machine via ethernet. Fastd is running, but can't connect, as there is no WAN available. ffac is using batman-adv-14. ibss meshing is enabled, 11s meshing is disabled in site.conf. The full site.conf is available here: https://github.com/jannic/site/tree/ffd24feb87bda67a119805f4286c78a6819e2274

If I disable fastd, the reboot doesn't happen, and the router is running stable for several hours. (I got a single reboot last night after several hours of uptime - might be related, but I'm not sure.)

With a serial cable I captured the following oops. It seems like there is not enough memory available?

[   56.150000] kswapd0: page allocation failure: order:0, mode:0x20
[   56.150000] CPU: 0 PID: 66 Comm: kswapd0 Not tainted 3.18.21 #2
[   56.150000] Stack : 00000000 00000000 00000000 00000000 803d4292 00000033 00000000 00000000
          00000000 00000000 80326e60 80380ba3 00000042 803d34d0 8182bc68 00000000
          00000000 00000000 803807fc 800a388c 00000006 8008049c 00000000 00000000
          8032a458 818cb864 00000000 00000000 00000000 00000000 00000000 00000000
          00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
          ...
[   56.150000] Call Trace:
[   56.150000] [<800710f8>] show_stack+0x50/0x84
[   56.150000] [<800cd110>] warn_alloc_failed+0xe8/0x10c
[   56.150000] [<800cf000>] __alloc_pages_nodemask+0x51c/0x68c
[   56.150000] [<802319a4>] __netdev_alloc_frag+0x78/0x1c8
[   56.150000] [<80236e88>] __netdev_alloc_skb+0x5c/0x118
[   56.150000] [<80d58090>] ath_rxbuf_alloc+0x30/0x98 [ath]
[   56.150000] [<80ee69a0>] ath_rx_tasklet+0x588/0xafc [ath9k]
[   56.150000] [<80ee453c>] ath9k_tasklet+0x23c/0x2c8 [ath9k]
[   56.150000] [<80083194>] tasklet_action+0x84/0xcc
[   56.150000] [<80082aa4>] __do_softirq+0xf8/0x230
[   56.150000] [<80082e10>] irq_exit+0x54/0x70
[   56.150000] [<80060830>] ret_from_irq+0x0/0x4
[   56.150000] [<8007b12c>] blast_icache32_page+0x80/0x9c
[   56.150000] [<8007bba0>] r4k_flush_cache_page+0x10c/0x18c
[   56.150000] [<800ef148>] try_to_unmap_one+0xac/0x2a8
[   56.150000] [<800efde8>] rmap_walk+0x1c0/0x29c
[   56.150000] [<800f00ec>] try_to_unmap+0x78/0xa0
[   56.150000] [<800d6288>] shrink_page_list+0x434/0xa74
[   56.150000] [<800d6da4>] shrink_inactive_list+0x250/0x3f0
[   56.150000] [<800d761c>] shrink_zone+0x34c/0x56c
[   56.150000] [<800d811c>] kswapd+0x460/0x65c
[   56.150000] [<80095fa4>] kthread+0xd8/0xe4
[   56.150000] [<80060878>] ret_from_kernel_thread+0x14/0x1c
[   56.150000] 
[   56.150000] Mem-Info:
[   56.150000] Normal per-cpu:
[   56.150000] CPU    0: hi:    0, btch:   1 usd:   0
[   56.150000] active_anon:772 inactive_anon:15 isolated_anon:0
[   56.150000]  active_file:1381 inactive_file:897 isolated_file:28
[   56.150000]  unevictable:0 dirty:0 writeback:0 unstable:0
[   56.150000]  free:63 slab_reclaimable:357 slab_unreclaimable:1842
[   56.150000]  mapped:509 shmem:31 pagetables:74 bounce:0
[   56.150000]  free_cma:0
[   56.150000] Normal free:252kB min:672kB low:840kB high:1008kB active_anon:3088kB inactive_anon:60kB active_file:5524kB inactive_file:3588kB unevictable:0kB isolated(anon):0kB isolated(file):112kB present:32768kB managed:28540kB mlocked:0kB dirty:0kB writeback:0kB mapped:2036kB shmem:124kB slab_reclaimable:1428kB slab_unreclaimable:7368kB kernel_stack:320kB pagetables:296kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[   56.150000] lowmem_reserve[]: 0 0
[   56.150000] Normal: 1*4kB (R) 5*8kB (R) 1*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 252kB
[   56.150000] 2313 total pagecache pages
[   56.150000] 0 pages in swap cache
[   56.150000] Swap cache stats: add 0, delete 0, find 0/0
[   56.150000] Free swap  = 0kB
[   56.150000] Total swap = 0kB
[   56.150000] 8192 pages RAM
[   56.150000] 0 pages HighMem/MovableOnly
[   56.150000] 1057 pages reserved
[   56.450000] ath: skbuff alloc of size 1974 failed
jannic commented 9 years ago

Another slightly different oops:

[   58.930000] CPU 0 Unable to handle kernel paging request at virtual address 0000001c, epc == 80c950ec, ra == 80c95e1c
[   58.940000] Oops[#1]:
[   58.940000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.21 #2
[   58.940000] task: 80380c90 ti: 8037a000 task.ti: 8037a000
[   58.940000] $ 0   : 00000000 00000001 00000078 00000283
[   58.940000] $ 4   : 0000000c 8087a3f8 00000080 00000078
[   58.940000] $ 8   : 2a032260 01140000 00000000 00000002
[   58.940000] $12   : 00000000 80e10019 00000000 00000000
[   58.940000] $16   : 0000000c 00000080 00000284 00000000
[   58.940000] $20   : 8087a3f8 00000000 00000001 00000000
[   58.940000] $24   : 00000000 80c867dc                  
[   58.940000] $28   : 8037a000 8037b900 00000284 80c95e1c
[   58.940000] Hi    : 00000960
[   58.940000] Lo    : 00000000
[   58.940000] epc   : 80c950ec ip6_route_cleanup+0xe50/0x19dc [ipv6]
[   58.940000]     Not tainted
[   58.940000] ra    : 80c95e1c fib6_add+0xa4/0x790 [ipv6]
[   58.940000] Status: 1000f403 KERNEL EXL IE 
[   58.940000] Cause : 00800008
[   58.940000] BadVA : 0000001c
[   58.940000] PrId  : 00019374 (MIPS 24Kc)
[   58.940000] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack macvlan iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 compat sch_teql sch_sfq sch_red sch_prio sch_pie sch_netem sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc batman_adv libcrc32c ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables tun ipv6 arc4 crypto_blkcipher act_police cls_basic sch_tbf sch_ingress gpio_button_hotplug crc16 crc32c_generic crypto_hash
[   58.940000] Process swapper (pid: 0, threadinfo=8037a000, task=80380c90, tls=00000000)
[   58.940000] Stack : 00000000 00000001 813da300 802693c4 8087a380 0000000c 81985900 00000000
      8037b998 00000000 00000001 00000000 00000284 80c95e1c 808209bc 80390000
      00000006 00000078 00000001 00000000 00000284 00000020 808209cc 807c7600
      81985900 8087a380 00000001 80fda000 808209bc 80390000 00000006 80c918a4
      00000000 00000000 808209cc 807c7600 00000000 808209de 00000000 803956b0
      ...
[   58.940000] Call Trace:
[   58.940000] [<80c950ec>] ip6_route_cleanup+0xe50/0x19dc [ipv6]
[   58.940000] 
[   58.940000] 
Code: afa70044  0000a821  8fa20044 <8e1e0010> 96130014  03c2b021  0233102a  10400006  00131142 
[   59.210000] ---[ end trace 8308ee7c3a8f6320 ]---
[   59.220000] Kernel panic - not syncing: Fatal exception in interrupt
[   59.220000] Rebooting in 3 seconds..ÿ
neocturne commented 9 years ago

The first one looks like OOM, but the second one is very interesting. As you can reproduce this issue easily, which of the errors is more common?

jannic commented 9 years ago

From three tries, I saw the second one twice. But I'll let it capture some more logs.

jannic commented 9 years ago

Ok, from a few more attempts, I got 11 times the second case (Unable to handle kernel paging request at virtual address 0000001c), once the "page allocation failure" and four times I rebooted manually after a while because the router didn't crash.

neocturne commented 9 years ago

Okay, the crashing function is actually fib6_add_1, which gets the invalid pointer 0000000c as its first argument root. It's quite annoying that the call trace is useless...

neocturne commented 9 years ago

Okay, it's a OOM issue as well, combined with a missing error check.

How many nodes and clients does your network have? How many radvds? It is weird that you have these OOM issues, here in Lübeck the master is running fine...

neocturne commented 9 years ago

My last conclusion might have been wrong... I'll have to investigate the problem further.

jannic commented 9 years ago

At the moment, Aachen has 871 online nodes and 2611 clients, and I see RAs from 4 routers.

neocturne commented 9 years ago

Please apply the debug patch https://home.universe-factory.net/neoraider/0001-Add-debug-patch-1.patch using git am in the "openwrt" subdirectory of the Gluon tree and rebuild (don't call make update afterwards, as that will revert the "openwrt" repo to the original state; also take care not to call make from the "openwrt" directory, but only from the Gluon root).

After the patch has been applied, make clean or make target/linux/clean must be called to get it applied to the kernel source.

jannic commented 9 years ago
[   68.400000] Kernel bug detected[#1]:
[   68.400000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.21 #1
[   68.400000] task: 80380c90 ti: 8037a000 task.ti: 8037a000
[   68.400000] $ 0   : 00000000 00000001 807490e0 00000000
[   68.400000] $ 4   : 00000000 00000001 00000000 00000000
[   68.400000] $ 8   : 8037b998 01140000 00000000 00000002
[   68.400000] $12   : 00000000 80d85cd9 00000000 00000000
[   68.400000] $16   : 80ab6b6c 80995e00 81879900 807490e0
[   68.400000] $20   : 00000001 80fd8000 80ab6b5c 80390000
[   68.400000] $24   : 00000000 80c867dc                  
[   68.400000] $28   : 8037a000 8037b968 00000006 80c918f8
[   68.400000] Hi    : 00000d20
[   68.400000] Lo    : 00000000
[   68.400000] epc   : 80c90574 rt6_lookup+0x1ec/0x418 [ipv6]
[   68.400000]     Not tainted
[   68.400000] ra    : 80c918f8 ip6_ins_rt+0x2c/0x53c [ipv6]
[   68.400000] Status: 1000f403 KERNEL EXL IE 
[   68.400000] Cause : 10800034
[   68.400000] PrId  : 00019374 (MIPS 24Kc)
[   68.400000] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERAD
E ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_l
imit xt_id xt_conntrack xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_de
frag_ipv4 nf_conntrack_rtcache nf_conntrack macvlan iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkt
type ebt_mark_m ebt_mark ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 compat sch_teql sch_sfq sch_red sch_prio sch_pie sch_netem 
sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_f
w sch_hfsc batman_adv libcrc32c ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables tun ipv6 
arc4 crypto_blkcipher act_police cls_basic sch_tbf sch_ingress gpio_button_hotplug crc16 crc32c_generic crypto_hash
[   68.400000] Process swapper (pid: 0, threadinfo=8037a000, task=80380c90, tls=00000000)
[   68.400000] Stack : 000a9900 80ab6b6c 807490e0 81879900 80ab6b6c 80c9069c 80ab6b6c 80c918f8
          80ab6b6c 80995e00 00000000 80ab6b7e 00000000 803956b0 00000000 81879900
          80ab6b7e 80c92f10 00000008 80064f30 00000000 00000001 00000000 00000000
          80ab6b7c 00000000 80ab6b84 00000000 00000000 00000000 0114ac01 32b5c2ff
          fecb3f84 00000000 00000000 81879900 80d85000 803956b0 00000000 00000000
          ...
[   68.400000] Call Trace:
[   68.400000] [<80c90574>] rt6_lookup+0x1ec/0x418 [ipv6]
[   68.400000] 
[   68.400000] 
Code: 00a04021  00c01821  2c850001 <00050336> 8f850014  24a50200  af850014  afa70010  2484000c 
[   68.670000] ---[ end trace 3fd43cd8456a7706 ]---
[   68.680000] Kernel panic - not syncing: Fatal exception in interrupt
neocturne commented 9 years ago

The next one: https://home.universe-factory.net/neoraider/0001-Add-debug-patch-2.patch (must replace the old one).

Besides further tracking the origin of the NULL deref, I've also increased the length of the stack dump... the hardcoded limit of 40 u32 was much too small.

jannic commented 9 years ago

At the moment, with 872 online nodes and 2228 clients, the router seems to be stable. I guess the number of clients will increase in the evening. Let's hope the crash is actually triggered by large number of clients, so I can reproduce it again, later.

neocturne commented 9 years ago

@jannic, the crash is triggered by a ICMPv6 redirect packet, maybe in combination with a race condition and/or OOM situation.

jannic commented 9 years ago

Got another crash. Unfortunately, the call trace doesn't look better:

[   74.410000] Kernel bug detected[#1]:
[   74.410000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.21 #1
[   74.410000] task: 80380c90 ti: 8037a000 task.ti: 8037a000
[   74.410000] $ 0   : 00000000 00000001 00000001 80a450dc
[   74.410000] $ 4   : 00000000 00000006 00000000 00000008
[   74.410000] $ 8   : 00000008 80064f30 744156ff febf9d63
[   74.410000] $12   : 00000000 80d3d0d9 00000000 2a032260
[   74.410000] $16   : 80a450b4 80d3d0c0 81879900 80a450de
[   74.410000] $20   : 00000001 80390000 00000000 80390000
[   74.410000] $24   : 00000000 80c867dc                  
[   74.410000] $28   : 8037a000 8037b9b0 00000006 80c92de4
[   74.410000] Hi    : 0000001b
[   74.410000] Lo    : 00ef7481
[   74.410000] epc   : 80c92e5c ip6_del_rt+0x1c8/0x34c [ipv6]
[   74.410000]     Not tainted
[   74.410000] ra    : 80c92de4 ip6_del_rt+0x150/0x34c [ipv6]
[   74.410000] Status: 1000f403 KERNEL EXL IE 
[   74.410000] Cause : 10800034
[   74.410000] PrId  : 00019374 (MIPS 24Kc)
[   74.410000] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack macvlan iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 compat sch_teql sch_sfq sch_red sch_prio sch_pie sch_netem sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc batman_adv libcrc32c ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables tun ipv6 arc4 crypto_blkcipher act_police cls_basic sch_tbf sch_ingress gpio_button_hotplug crc16 crc32c_generic crypto_hash
[   74.410000] Process swapper (pid: 0, threadinfo=8037a000, task=80380c90, tls=00000000)
[   74.410000] Stack : 32b9c2cb 3f840000 00000000 00000001 00000000 00000000 80a450dc 00000000
          80a450e4 00000000 00000000 00000000 0114ac01 32b5c2ff fecb3f84 00000000
          00000000 81879900 80d3d0c0 803956b0 00000000 00000000 80390000 00000000
          80390000 80c93090 80a44fe0 80c22e30 00000018 00000000 00000000 00000001
          00000000 00000000 00000000 2a032260 01140000 00000000 00000004 2a032260
          0114ac01 32b5c2ff fecb3f84 00000000 00000000 00000089 8061f420 80d3d0c0
          80a45114 00000089 00000000 80c9df54 8037bc08 8037bbe0 00000004 8037bc2c
          00008623 00000000 8037e500 80c2db24 0000000a 80c254f0 80d3d0c0 00000028
          00000000 00000089 00000000 80390000 00000001 80c9b78c 816c5440 80c2bca4
          80fcc000 80ea6238 00000028 00000000 8037e500 8037bc1c 00000000 80ca1a88
          80cdb38c 00000001 80c91db8 803956b0 00000028 00000000 80371108 8037bbe0
          80d3d0c0 80a450ec 80a45144 000000d4 80df1800 80c9b5dc 80df1800 80390000
          00000001 80390000 00000006 802331b4 00000000 00000000 80a450dc 00000000
          80a450e4 00000000 00000000 00000000 80fd1000 80238f00 80d3d0c0 00000089
          80a450b4 803956b0 80df1800 80390000 00000001 80390000 00000006 80ca1eb4
          8037bc18 8037da30 80d3d0c0 80fd1000 00000000 8037bc18 8037bc18 8037da30
          80d3d0c0 8026b898 00000000 000086dd 8037d1b8 8037d148 80c85cdc 80cb6ea4
          8037da30 80d3d0c0 80df1800 80cc2b80 0000003a 80cc2ea4 80a45094 80ca0d6c
          80fd1000 8026b988 80c85c20 80000000 00000006 8037d968 00000000 8037bc18
          80d3d0c0 80df1800 80d3d0c0 80df1800 80cb91f8 0000003a fffffffe 8037eb68
          00600000 80c85fe8 8037d1b8 8037d148 80c59788 80c8683c 80cbda20 8037d1a4
          80d3d0c0 80fd1000 00000000 000086dd 8037d1b8 8037d148 80c59788 802407d8
          81bfe600 80a4507e 8037bd38 803968e8 8037d1b8 8037d148 802d9228 802d8eb4
          80d3d0c0 80a4507e 81bfe600 80a4507e 8037bd38 803968e8 80380000 8037d148
          802d9228 802d9558 80d3d0c0 80e1e208 8037bcc0 8037bcc0 00000000 802d8edc
          80000000 00000000 00000000 80d3d0c0 00000001 816c5000 00000000 80240664
          00000000 00000000 00000000 00000000 8037d148 00000000 80d3dcd8 80e1cb00
          803808a0 00000000 80380894 80380874 00000040 80380868 80380000 80380000
          80354a34 802413b4 80d3d0c0 818d2038 00000001 80e1d720 803808a0 80380860
          00000009 0000012c 00000040 80240c50 80060000 800a5328 8037f338 00000000
          ffffa7e3 8037f338 00000008 803d3924 00000009 00000004 00000100 803d0000
          803d3918 00000003 0000000c 80082aa4 00000002 800a5fe0 80060000 00000002
          ...
[   74.410000] Call Trace:
[   74.410000] [<80c92e5c>] ip6_del_rt+0x1c8/0x34c [ipv6]
[   74.410000] 
[   74.410000] 
Code: 00000000  8e420050  2c420001 <00020336> 24020001  a6420032  8e350014  26160008  3c1780cc 
[   74.860000] ---[ end trace a2898d907d084fe9 ]---
[   74.870000] Kernel panic - not syncing: Fatal exception in interrupt
[   74.870000] Rebooting in 3 seconds..
neocturne commented 9 years ago

No problem, I can reconstruct the call trace from the stack dump myself. And with the longer stack dump, I should be able to get deeper than 2 stack frames at a time...

neocturne commented 9 years ago

Okay, I have a few theories what might be going wrong. If I'm right, the next patch will fix the issue: https://home.universe-factory.net/neoraider/0001-Add-debug-patch-3.patch

It will also add additional debug output, so even if it doesn't crash, please send me any "failed nexthop" lines you find in the log.

jannic commented 9 years ago

No crash yet, but got a failed nexthop message:

[   63.980000] IPv6: failed nexthop: 81985900 (81985600 81985800 81985700 81985900)
jannic commented 9 years ago

Another two:

[   63.500000] IPv6: failed nexthop: 81985900 (81985600 81985800 81985700 81985900)
[   63.510000] IPv6: failed nexthop: 81985900 (81985600 81985800 81985700 81985900)
neocturne commented 9 years ago

Okay, seems like it is fixed indeed. I'll prepare a proper patch.