Closed jannic closed 9 years ago
Another slightly different oops:
[ 58.930000] CPU 0 Unable to handle kernel paging request at virtual address 0000001c, epc == 80c950ec, ra == 80c95e1c
[ 58.940000] Oops[#1]:
[ 58.940000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.21 #2
[ 58.940000] task: 80380c90 ti: 8037a000 task.ti: 8037a000
[ 58.940000] $ 0 : 00000000 00000001 00000078 00000283
[ 58.940000] $ 4 : 0000000c 8087a3f8 00000080 00000078
[ 58.940000] $ 8 : 2a032260 01140000 00000000 00000002
[ 58.940000] $12 : 00000000 80e10019 00000000 00000000
[ 58.940000] $16 : 0000000c 00000080 00000284 00000000
[ 58.940000] $20 : 8087a3f8 00000000 00000001 00000000
[ 58.940000] $24 : 00000000 80c867dc
[ 58.940000] $28 : 8037a000 8037b900 00000284 80c95e1c
[ 58.940000] Hi : 00000960
[ 58.940000] Lo : 00000000
[ 58.940000] epc : 80c950ec ip6_route_cleanup+0xe50/0x19dc [ipv6]
[ 58.940000] Not tainted
[ 58.940000] ra : 80c95e1c fib6_add+0xa4/0x790 [ipv6]
[ 58.940000] Status: 1000f403 KERNEL EXL IE
[ 58.940000] Cause : 00800008
[ 58.940000] BadVA : 0000001c
[ 58.940000] PrId : 00019374 (MIPS 24Kc)
[ 58.940000] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack macvlan iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 compat sch_teql sch_sfq sch_red sch_prio sch_pie sch_netem sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc batman_adv libcrc32c ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables tun ipv6 arc4 crypto_blkcipher act_police cls_basic sch_tbf sch_ingress gpio_button_hotplug crc16 crc32c_generic crypto_hash
[ 58.940000] Process swapper (pid: 0, threadinfo=8037a000, task=80380c90, tls=00000000)
[ 58.940000] Stack : 00000000 00000001 813da300 802693c4 8087a380 0000000c 81985900 00000000
8037b998 00000000 00000001 00000000 00000284 80c95e1c 808209bc 80390000
00000006 00000078 00000001 00000000 00000284 00000020 808209cc 807c7600
81985900 8087a380 00000001 80fda000 808209bc 80390000 00000006 80c918a4
00000000 00000000 808209cc 807c7600 00000000 808209de 00000000 803956b0
...
[ 58.940000] Call Trace:
[ 58.940000] [<80c950ec>] ip6_route_cleanup+0xe50/0x19dc [ipv6]
[ 58.940000]
[ 58.940000]
Code: afa70044 0000a821 8fa20044 <8e1e0010> 96130014 03c2b021 0233102a 10400006 00131142
[ 59.210000] ---[ end trace 8308ee7c3a8f6320 ]---
[ 59.220000] Kernel panic - not syncing: Fatal exception in interrupt
[ 59.220000] Rebooting in 3 seconds..ÿ
The first one looks like OOM, but the second one is very interesting. As you can reproduce this issue easily, which of the errors is more common?
From three tries, I saw the second one twice. But I'll let it capture some more logs.
Ok, from a few more attempts, I got 11 times the second case (Unable to handle kernel paging request at virtual address 0000001c), once the "page allocation failure" and four times I rebooted manually after a while because the router didn't crash.
Okay, the crashing function is actually fib6_add_1
, which gets the invalid pointer 0000000c
as its first argument root
. It's quite annoying that the call trace is useless...
Okay, it's a OOM issue as well, combined with a missing error check.
How many nodes and clients does your network have? How many radvds? It is weird that you have these OOM issues, here in Lübeck the master is running fine...
My last conclusion might have been wrong... I'll have to investigate the problem further.
At the moment, Aachen has 871 online nodes and 2611 clients, and I see RAs from 4 routers.
Please apply the debug patch https://home.universe-factory.net/neoraider/0001-Add-debug-patch-1.patch using git am
in the "openwrt" subdirectory of the Gluon tree and rebuild (don't call make update
afterwards, as that will revert the "openwrt" repo to the original state; also take care not to call make
from the "openwrt" directory, but only from the Gluon root).
After the patch has been applied, make clean
or make target/linux/clean
must be called to get it applied to the kernel source.
[ 68.400000] Kernel bug detected[#1]:
[ 68.400000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.21 #1
[ 68.400000] task: 80380c90 ti: 8037a000 task.ti: 8037a000
[ 68.400000] $ 0 : 00000000 00000001 807490e0 00000000
[ 68.400000] $ 4 : 00000000 00000001 00000000 00000000
[ 68.400000] $ 8 : 8037b998 01140000 00000000 00000002
[ 68.400000] $12 : 00000000 80d85cd9 00000000 00000000
[ 68.400000] $16 : 80ab6b6c 80995e00 81879900 807490e0
[ 68.400000] $20 : 00000001 80fd8000 80ab6b5c 80390000
[ 68.400000] $24 : 00000000 80c867dc
[ 68.400000] $28 : 8037a000 8037b968 00000006 80c918f8
[ 68.400000] Hi : 00000d20
[ 68.400000] Lo : 00000000
[ 68.400000] epc : 80c90574 rt6_lookup+0x1ec/0x418 [ipv6]
[ 68.400000] Not tainted
[ 68.400000] ra : 80c918f8 ip6_ins_rt+0x2c/0x53c [ipv6]
[ 68.400000] Status: 1000f403 KERNEL EXL IE
[ 68.400000] Cause : 10800034
[ 68.400000] PrId : 00019374 (MIPS 24Kc)
[ 68.400000] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERAD
E ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_l
imit xt_id xt_conntrack xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_de
frag_ipv4 nf_conntrack_rtcache nf_conntrack macvlan iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkt
type ebt_mark_m ebt_mark ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 compat sch_teql sch_sfq sch_red sch_prio sch_pie sch_netem
sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_f
w sch_hfsc batman_adv libcrc32c ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables tun ipv6
arc4 crypto_blkcipher act_police cls_basic sch_tbf sch_ingress gpio_button_hotplug crc16 crc32c_generic crypto_hash
[ 68.400000] Process swapper (pid: 0, threadinfo=8037a000, task=80380c90, tls=00000000)
[ 68.400000] Stack : 000a9900 80ab6b6c 807490e0 81879900 80ab6b6c 80c9069c 80ab6b6c 80c918f8
80ab6b6c 80995e00 00000000 80ab6b7e 00000000 803956b0 00000000 81879900
80ab6b7e 80c92f10 00000008 80064f30 00000000 00000001 00000000 00000000
80ab6b7c 00000000 80ab6b84 00000000 00000000 00000000 0114ac01 32b5c2ff
fecb3f84 00000000 00000000 81879900 80d85000 803956b0 00000000 00000000
...
[ 68.400000] Call Trace:
[ 68.400000] [<80c90574>] rt6_lookup+0x1ec/0x418 [ipv6]
[ 68.400000]
[ 68.400000]
Code: 00a04021 00c01821 2c850001 <00050336> 8f850014 24a50200 af850014 afa70010 2484000c
[ 68.670000] ---[ end trace 3fd43cd8456a7706 ]---
[ 68.680000] Kernel panic - not syncing: Fatal exception in interrupt
The next one: https://home.universe-factory.net/neoraider/0001-Add-debug-patch-2.patch (must replace the old one).
Besides further tracking the origin of the NULL deref, I've also increased the length of the stack dump... the hardcoded limit of 40 u32 was much too small.
At the moment, with 872 online nodes and 2228 clients, the router seems to be stable. I guess the number of clients will increase in the evening. Let's hope the crash is actually triggered by large number of clients, so I can reproduce it again, later.
@jannic, the crash is triggered by a ICMPv6 redirect packet, maybe in combination with a race condition and/or OOM situation.
Got another crash. Unfortunately, the call trace doesn't look better:
[ 74.410000] Kernel bug detected[#1]:
[ 74.410000] CPU: 0 PID: 0 Comm: swapper Not tainted 3.18.21 #1
[ 74.410000] task: 80380c90 ti: 8037a000 task.ti: 8037a000
[ 74.410000] $ 0 : 00000000 00000001 00000001 80a450dc
[ 74.410000] $ 4 : 00000000 00000006 00000000 00000008
[ 74.410000] $ 8 : 00000008 80064f30 744156ff febf9d63
[ 74.410000] $12 : 00000000 80d3d0d9 00000000 2a032260
[ 74.410000] $16 : 80a450b4 80d3d0c0 81879900 80a450de
[ 74.410000] $20 : 00000001 80390000 00000000 80390000
[ 74.410000] $24 : 00000000 80c867dc
[ 74.410000] $28 : 8037a000 8037b9b0 00000006 80c92de4
[ 74.410000] Hi : 0000001b
[ 74.410000] Lo : 00ef7481
[ 74.410000] epc : 80c92e5c ip6_del_rt+0x1c8/0x34c [ipv6]
[ 74.410000] Not tainted
[ 74.410000] ra : 80c92de4 ip6_del_rt+0x150/0x34c [ipv6]
[ 74.410000] Status: 1000f403 KERNEL EXL IE
[ 74.410000] Cause : 10800034
[ 74.410000] PrId : 00019374 (MIPS 24Kc)
[ 74.410000] Modules linked in: ath9k ath9k_common iptable_nat ath9k_hw ath nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 mac80211 ipt_REJECT ipt_MASQUERADE ebtable_nat ebtable_filter ebtable_broute cfg80211 xt_time xt_tcpudp xt_state xt_quota xt_pkttype xt_physdev xt_owner xt_nat xt_multiport xt_mark xt_mac xt_limit xt_id xt_conntrack xt_comment xt_addrtype xt_TCPMSS xt_REDIRECT xt_LOG xt_CT nf_reject_ipv4 nf_nat_masquerade_ipv4 nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack macvlan iptable_raw iptable_mangle iptable_filter ip_tables ebtables ebt_vlan ebt_stp ebt_snat ebt_redirect ebt_pkttype ebt_mark_m ebt_mark ebt_limit ebt_ip6 ebt_ip ebt_dnat ebt_arpreply ebt_arp ebt_among ebt_802_3 compat sch_teql sch_sfq sch_red sch_prio sch_pie sch_netem sch_htb sch_gred sch_fq sch_dsmark sch_codel em_text em_nbyte em_meta em_cmp act_ipt act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc batman_adv libcrc32c ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter ip6_tables x_tables tun ipv6 arc4 crypto_blkcipher act_police cls_basic sch_tbf sch_ingress gpio_button_hotplug crc16 crc32c_generic crypto_hash
[ 74.410000] Process swapper (pid: 0, threadinfo=8037a000, task=80380c90, tls=00000000)
[ 74.410000] Stack : 32b9c2cb 3f840000 00000000 00000001 00000000 00000000 80a450dc 00000000
80a450e4 00000000 00000000 00000000 0114ac01 32b5c2ff fecb3f84 00000000
00000000 81879900 80d3d0c0 803956b0 00000000 00000000 80390000 00000000
80390000 80c93090 80a44fe0 80c22e30 00000018 00000000 00000000 00000001
00000000 00000000 00000000 2a032260 01140000 00000000 00000004 2a032260
0114ac01 32b5c2ff fecb3f84 00000000 00000000 00000089 8061f420 80d3d0c0
80a45114 00000089 00000000 80c9df54 8037bc08 8037bbe0 00000004 8037bc2c
00008623 00000000 8037e500 80c2db24 0000000a 80c254f0 80d3d0c0 00000028
00000000 00000089 00000000 80390000 00000001 80c9b78c 816c5440 80c2bca4
80fcc000 80ea6238 00000028 00000000 8037e500 8037bc1c 00000000 80ca1a88
80cdb38c 00000001 80c91db8 803956b0 00000028 00000000 80371108 8037bbe0
80d3d0c0 80a450ec 80a45144 000000d4 80df1800 80c9b5dc 80df1800 80390000
00000001 80390000 00000006 802331b4 00000000 00000000 80a450dc 00000000
80a450e4 00000000 00000000 00000000 80fd1000 80238f00 80d3d0c0 00000089
80a450b4 803956b0 80df1800 80390000 00000001 80390000 00000006 80ca1eb4
8037bc18 8037da30 80d3d0c0 80fd1000 00000000 8037bc18 8037bc18 8037da30
80d3d0c0 8026b898 00000000 000086dd 8037d1b8 8037d148 80c85cdc 80cb6ea4
8037da30 80d3d0c0 80df1800 80cc2b80 0000003a 80cc2ea4 80a45094 80ca0d6c
80fd1000 8026b988 80c85c20 80000000 00000006 8037d968 00000000 8037bc18
80d3d0c0 80df1800 80d3d0c0 80df1800 80cb91f8 0000003a fffffffe 8037eb68
00600000 80c85fe8 8037d1b8 8037d148 80c59788 80c8683c 80cbda20 8037d1a4
80d3d0c0 80fd1000 00000000 000086dd 8037d1b8 8037d148 80c59788 802407d8
81bfe600 80a4507e 8037bd38 803968e8 8037d1b8 8037d148 802d9228 802d8eb4
80d3d0c0 80a4507e 81bfe600 80a4507e 8037bd38 803968e8 80380000 8037d148
802d9228 802d9558 80d3d0c0 80e1e208 8037bcc0 8037bcc0 00000000 802d8edc
80000000 00000000 00000000 80d3d0c0 00000001 816c5000 00000000 80240664
00000000 00000000 00000000 00000000 8037d148 00000000 80d3dcd8 80e1cb00
803808a0 00000000 80380894 80380874 00000040 80380868 80380000 80380000
80354a34 802413b4 80d3d0c0 818d2038 00000001 80e1d720 803808a0 80380860
00000009 0000012c 00000040 80240c50 80060000 800a5328 8037f338 00000000
ffffa7e3 8037f338 00000008 803d3924 00000009 00000004 00000100 803d0000
803d3918 00000003 0000000c 80082aa4 00000002 800a5fe0 80060000 00000002
...
[ 74.410000] Call Trace:
[ 74.410000] [<80c92e5c>] ip6_del_rt+0x1c8/0x34c [ipv6]
[ 74.410000]
[ 74.410000]
Code: 00000000 8e420050 2c420001 <00020336> 24020001 a6420032 8e350014 26160008 3c1780cc
[ 74.860000] ---[ end trace a2898d907d084fe9 ]---
[ 74.870000] Kernel panic - not syncing: Fatal exception in interrupt
[ 74.870000] Rebooting in 3 seconds..
No problem, I can reconstruct the call trace from the stack dump myself. And with the longer stack dump, I should be able to get deeper than 2 stack frames at a time...
Okay, I have a few theories what might be going wrong. If I'm right, the next patch will fix the issue: https://home.universe-factory.net/neoraider/0001-Add-debug-patch-3.patch
It will also add additional debug output, so even if it doesn't crash, please send me any "failed nexthop" lines you find in the log.
No crash yet, but got a failed nexthop message:
[ 63.980000] IPv6: failed nexthop: 81985900 (81985600 81985800 81985700 81985900)
Another two:
[ 63.500000] IPv6: failed nexthop: 81985900 (81985600 81985800 81985700 81985900)
[ 63.510000] IPv6: failed nexthop: 81985900 (81985600 81985800 81985700 81985900)
Okay, seems like it is fixed indeed. I'll prepare a proper patch.
Trying out current development firmware (commit f6f8350) on a wr841n v9, I get spontaneous reboots. In most cases, the router reboots ~30s after enabling the network.
The router is connected to the ffac mesh via wifi, and to a client machine via ethernet. Fastd is running, but can't connect, as there is no WAN available. ffac is using batman-adv-14. ibss meshing is enabled, 11s meshing is disabled in site.conf. The full site.conf is available here: https://github.com/jannic/site/tree/ffd24feb87bda67a119805f4286c78a6819e2274
If I disable fastd, the reboot doesn't happen, and the router is running stable for several hours. (I got a single reboot last night after several hours of uptime - might be related, but I'm not sure.)
With a serial cable I captured the following oops. It seems like there is not enough memory available?