Open aparcar opened 6 years ago
sumpfralle:
Sorry - there was a confusing typo in my description above:
Above you see 13k “workingset_refault” events within 60 seconds. The “workingset_refault” value stays at zero for routers with the same kernel, that do now show this problem.
"that do now show this problem" -> "that do not show this problem"
sumpfralle:
Does anyone have an idea how I could debug this issue?
It is a bit sad to see that we are starting to replace the XM devices in our local wireless community because they cannot run the latest firmware releases due to the device-specific excessive load described above.
ynezz:
perf top could be a good start
sumpfralle:
"perf top could be a good start"
Thank you for this suggestion!
I just built an image with perf and tried to run it, but sadly it segfaults :(
I am a bit at a loss as to how to investigate this, given the limited resources of the device. Do you have any advice for me?
Thank you!
ynezz:
"I just built an image with perf and tried to run it, but sadly it segfaults :("
Is this happening on the latest ar71xx/ath79? Do you have perf support enabled in the kernel as well? I'm using the following configuration on ath79 (it should work on ar71xx as well), and perf top works for me:
CONFIG_KERNEL_DYNAMIC_DEBUG=y
CONFIG_KERNEL_DYNAMIC_FTRACE=y
CONFIG_KERNEL_FTRACE=y
CONFIG_KERNEL_FTRACE_SYSCALLS=y
CONFIG_KERNEL_FUNCTION_GRAPH_TRACER=y
CONFIG_KERNEL_FUNCTION_PROFILER=y
CONFIG_KERNEL_FUNCTION_TRACER=y
CONFIG_KERNEL_KPROBES=y
CONFIG_KERNEL_KPROBE_EVENT=y
CONFIG_KERNEL_KPROBE_EVENTS=y
CONFIG_KERNEL_PERF_EVENTS=y
CONFIG_KERNEL_PROFILING=y
CONFIG_PACKAGE_iperf3=y
CONFIG_PACKAGE_perf=y
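A quick way to confirm that such options actually ended up in the build configuration is to grep for them. The following is only a sketch: the helper name missing_config_symbols is made up for illustration, and it assumes the options live in the .config file at the top of the build tree.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenWrt): print every given symbol
# that is NOT set to "y" in the given config file.
# Usage: missing_config_symbols FILE SYMBOL...
missing_config_symbols() {
    file="$1"
    shift
    for sym in "$@"; do
        # Match the exact line "SYMBOL=y"; anything else counts as missing.
        grep -q "^${sym}=y\$" "$file" || echo "$sym"
    done
}
```

For example, "missing_config_symbols .config CONFIG_KERNEL_PERF_EVENTS CONFIG_PACKAGE_perf" prints nothing if both options are enabled.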
sumpfralle:
Our local wireless community uses a lot of Ubiquiti devices.
They all worked well with Chaos Calmer.
With LEDE 17.01 we started to see load issues with Nanostation M5 XM devices (the older Nanostation model, only 32 MB). We did not notice the issue with any other device up to now.
After a few hours of uptime the routers start to develop persistent high load (>8) and usually "recover" only after a reboot. Running "wifi up" or "wifi down" does not seem to affect the issue.
The problem is almost non-existent for devices using only a single ethernet port. Devices using both ethernet ports suffer greatly (problems usually start within 24 hours). Thus I could imagine that [[https://bugs.openwrt.org/index.php?do=details&task_id=296|issue #296]] is related (just wild guessing).
Traffic on the wireless interface seems to increase the likelihood of the problem (maybe CPU utilization in general does).
"top" and other tools do not show any processes that could cause the high load.
The only unusual metric that seems to be connected to the high-load situation is "workingset_refault" (see /proc/vmstat). See the following output:
root@AP-1-96:~# while sleep 10; do grep workingset /proc/vmstat; done
workingset_refault 1304983
workingset_activate 392198
workingset_nodereclaim 10330
workingset_refault 1308585
workingset_activate 393391
workingset_nodereclaim 10352
workingset_refault 1308671
workingset_activate 393412
workingset_nodereclaim 10352
workingset_refault 1310284
workingset_activate 393940
workingset_nodereclaim 10374
workingset_refault 1317360
workingset_activate 396226
workingset_nodereclaim 10454
workingset_refault 1317465
workingset_activate 396251
workingset_nodereclaim 10454
workingset_refault 1317540
workingset_activate 396292
workingset_nodereclaim 10454
workingset_refault 1324449
workingset_activate 398402
workingset_nodereclaim 10508
workingset_refault 1328418
workingset_activate 399908
workingset_nodereclaim 10536
workingset_refault 1328796
workingset_activate 400114
workingset_nodereclaim 10536
workingset_refault 1329186
workingset_activate 400213
workingset_nodereclaim 10546
workingset_refault 1333889
workingset_activate 401528
workingset_nodereclaim 10594

Above you see 13k "workingset_refault" events within 60 seconds. The "workingset_refault" value stays at zero for routers with the same kernel, that do now show this problem. Thus I could imagine, that this is related to the high load.
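To make the growth rate easier to read than the raw counters, the same sampling could print the increase per interval instead. This is only a sketch: the function names read_counter and watch_refaults are made up for illustration, and it assumes awk and shell arithmetic are available on the device (as with a typical busybox build).

```shell
#!/bin/sh
# Hypothetical helper: print one counter from a vmstat-style file
# ("name value" per line). FILE defaults to /proc/vmstat.
# Usage: read_counter NAME [FILE]
read_counter() {
    awk -v name="$1" '$1 == name { print $2; exit }' "${2:-/proc/vmstat}"
}

# Hypothetical helper: print how much workingset_refault grew in each interval.
# Usage: watch_refaults [INTERVAL_SECONDS] [SAMPLES] [FILE]
watch_refaults() {
    interval="${1:-10}"
    samples="${2:-6}"
    file="${3:-/proc/vmstat}"
    prev="$(read_counter workingset_refault "$file")"
    # Bail out if this kernel does not expose the counter at all.
    [ -n "$prev" ] || { echo "workingset_refault not found in $file" >&2; return 1; }
    i=0
    while [ "$i" -lt "$samples" ]; do
        sleep "$interval"
        cur="$(read_counter workingset_refault "$file")"
        echo "workingset_refault +$((cur - prev)) in ${interval}s (total $cur)"
        prev="$cur"
        i=$((i + 1))
    done
}
```

Calling "watch_refaults 10 6" takes six samples at 10-second intervals, which covers the same 60-second window as discussed above.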
Now I am running out of ideas for how to investigate the issue. Maybe someone can give me a hint about what I could try?
Just for reference: we are also discussing this issue in the bug tracker of our local wireless community (https://dev.opennet-initiative.de/ticket/187 - only in German). That discussion may be a bit hard to read, as we were hunting down different potential causes of the problem. Sadly, each of our theories dissolved without giving a hint about the root cause.