berlin-open-wireless-lab / DAWN

Decentralized WiFi Controller
GNU General Public License v2.0
369 stars 64 forks source link

Memory consumption grows until machine bogs down #155

Closed protectivedad closed 2 years ago

protectivedad commented 2 years ago

I have four routers all different brands/memory running DAWN. Also of note they all have ZRAM. Yesterday I updated all of them to the latest version of DAWN and reset all the configs. Since then two of the routers (one has 128 MB and the other 64 MB of RAM) have run into problems. The router runs fine for hours but after a while it becomes sluggish and will starts killing process with OOM (dnsmasq, openvpn, DAWN, etc). The first time I had to force reboot -f. The second time I wanted to find the problem application. Since DAWN was the last thing I updated I tried /etc/init.d/dawn restart, and voila the problem went away.

Before restart PS showed:

  PID USER       VSZ STAT COMMAND
    1 root      1628 S    /sbin/procd
    2 root         0 SW   [kthreadd]
    3 root         0 IW<  [rcu_gp]
    4 root         0 IW<  [rcu_par_gp]
    6 root         0 RW<  [kworker/0:0H+kb]
    8 root         0 IW<  [mm_percpu_wq]
    9 root         0 SW   [ksoftirqd/0]
   10 root         0 IW   [rcu_sched]
   11 root         0 SW   [migration/0]
   12 root         0 SW   [cpuhp/0]
   13 root         0 SW   [cpuhp/1]
   14 root         0 SW   [migration/1]
   15 root         0 SW   [ksoftirqd/1]
   17 root         0 IW<  [kworker/1:0H-kb]
   18 root         0 SW   [cpuhp/2]
   19 root         0 SW   [migration/2]
   20 root         0 SW   [ksoftirqd/2]
   22 root         0 IW<  [kworker/2:0H-kb]
   23 root         0 SW   [cpuhp/3]
   24 root         0 SW   [migration/3]
   25 root         0 SW   [ksoftirqd/3]
   27 root         0 IW<  [kworker/3:0H-kb]
   28 root         0 IW<  [netns]
   31 root         0 IW   [kworker/3:1-mm_]
  124 root         0 IW   [kworker/2:1-eve]
  153 root         0 SW   [oom_reaper]
  154 root         0 IW<  [writeback]
  155 root         0 SW   [kcompactd0]
  161 root         0 IW<  [pencrypt_serial]
  162 root         0 IW<  [pdecrypt_serial]
  174 root         0 IW<  [kblockd]
  175 root         0 IW<  [blkcg_punt_bio]
  192 root         0 SW   [watchdogd]
  214 root         0 SW   [kswapd0]
  296 root         0 IW<  [kthrotld]
  328 root         0 RW   [spi0]
  392 root         0 IW<  [dsa_ordered]
  437 root         0 SW   [irq/25-keys]
  438 root         0 SW   [irq/26-keys]
  448 root         0 IW   [kworker/2:2-rcu]
  537 root         0 SWN  [jffs2_gcd_mtd6]
  619 ubus      1268 S    /sbin/ubusd
  620 root       940 S    /sbin/askfirst /usr/libexec/login.sh
  654 root      1044 S    /sbin/urngd
  716 root         0 IW<  [cfg80211]
  730 root         0 SW   [mt76-tx phy0]
  801 root         0 SW   [mt76-tx phy1]
 1212 logd      1264 S    /sbin/logd -S 64
 1264 root      2228 S    /sbin/rpcd -s /var/run/ubus/ubus.sock -t 30
 1359 root      1184 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 1410 root      4376 D    /usr/sbin/hostapd -s -g /var/run/hostapd/global
 1411 root      4476 S    /usr/sbin/wpa_supplicant -n -s -g /var/run/wpa_supplicant/global
 1469 root      1784 S    /sbin/netifd
 1514 root         0 IW<  [kworker/3:2H]
 1576 root         0 IW<  [kworker/2:2H]
 1585 root      3656 D    /usr/sbin/uhttpd
 2421 root      1284 S    udhcpc -p 
 2473 root      1300 S    /bin/sh /usr/bin/watchcat.sh ping_reboot 
 2592 root      1288 S<   /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug -p 
 2672 root         0 IW   [kworker/3:2-rcu]
 3310 root      8384 S    /usr/sbin/dawn
 7792 root         0 IW   [kworker/1:1]
11958 root         0 IW   [kworker/1:2-eve]
14268 root         0 IW<  [kworker/1:2H]
17692 root         0 IW   [kworker/0:2-eve]
18634 root         0 IW<  [kworker/0:1H]
19230 root         0 IW   [kworker/u8:2-ev]
19308 root         0 IW<  [kworker/u9:1-na]
19356 root         0 IW   [kworker/u8:0-ph]
19428 root         0 IW   [kworker/u8:1-ph]
19462 root         0 IW<  [kworker/u9:0-na]
19488 root         0 IW   [kworker/0:0-eve]
19516 root         0 IW<  [kworker/u9:2-na]
19522 root         0 IW   [kworker/0:1-eve]
19528 root         0 IW<  [kworker/u9:3-na]
19540 root         0 IW<  [kworker/1:1H-kb]
19541 root         0 IW<  [kworker/0:2H-kb]
19542 root         0 IW<  [kworker/2:1H]
19543 root         0 IW<  [kworker/3:1H]
19573 root      1208 D    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
19575 root      1292 S    -ash
19588 root      1284 S    sleep 60
19590 root      1284 R    ps

after

  PID USER       VSZ STAT COMMAND
    1 root      1628 S    /sbin/procd
    2 root         0 SW   [kthreadd]
    3 root         0 IW<  [rcu_gp]
    4 root         0 IW<  [rcu_par_gp]
    6 root         0 IW<  [kworker/0:0H-kb]
    8 root         0 IW<  [mm_percpu_wq]
    9 root         0 SW   [ksoftirqd/0]
   10 root         0 IW   [rcu_sched]
   11 root         0 SW   [migration/0]
   12 root         0 SW   [cpuhp/0]
   13 root         0 SW   [cpuhp/1]
   14 root         0 SW   [migration/1]
   15 root         0 SW   [ksoftirqd/1]
   17 root         0 IW<  [kworker/1:0H-kb]
   18 root         0 SW   [cpuhp/2]
   19 root         0 SW   [migration/2]
   20 root         0 RW   [ksoftirqd/2]
   22 root         0 IW<  [kworker/2:0H-kb]
   23 root         0 SW   [cpuhp/3]
   24 root         0 SW   [migration/3]
   25 root         0 SW   [ksoftirqd/3]
   27 root         0 IW<  [kworker/3:0H-kb]
   28 root         0 IW<  [netns]
   31 root         0 IW   [kworker/3:1-mm_]
  124 root         0 IW   [kworker/2:1-mm_]
  153 root         0 SW   [oom_reaper]
  154 root         0 IW<  [writeback]
  155 root         0 SW   [kcompactd0]
  161 root         0 IW<  [pencrypt_serial]
  162 root         0 IW<  [pdecrypt_serial]
  174 root         0 IW<  [kblockd]
  175 root         0 IW<  [blkcg_punt_bio]
  192 root         0 SW   [watchdogd]
  214 root         0 SW   [kswapd0]
  296 root         0 IW<  [kthrotld]
  328 root         0 SW   [spi0]
  392 root         0 IW<  [dsa_ordered]
  437 root         0 SW   [irq/25-keys]
  438 root         0 SW   [irq/26-keys]
  448 root         0 IW   [kworker/2:2-rcu]
  537 root         0 SWN  [jffs2_gcd_mtd6]
  619 ubus      1268 S    /sbin/ubusd
  620 root       940 S    /sbin/askfirst /usr/libexec/login.sh
  654 root      1044 S    /sbin/urngd
  716 root         0 IW<  [cfg80211]
  730 root         0 SW   [mt76-tx phy0]
  801 root         0 SW   [mt76-tx phy1]
 1212 logd      1308 S    /sbin/logd -S 64
 1264 root      2228 S    /sbin/rpcd -s /var/run/ubus/ubus.sock -t 30
 1359 root      1184 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 1410 root      4376 S    /usr/sbin/hostapd -s -g /var/run/hostapd/global
 1411 root      4476 S    /usr/sbin/wpa_supplicant -n -s -g /var/run/wpa_supplicant/global
 1469 root      1784 S    /sbin/netifd
 1585 root      3656 S    /usr/sbin/uhttpd 
 1973 root      1576 S    /usr/sbin/umdns
 2421 root      1284 S    udhcpc -p 
 2473 root      1300 S    /bin/sh /usr/bin/watchcat.sh ping_reboot 
 2592 root      1288 S<   /usr/sbin/ntpd -n -N -S /usr/sbin/ntpd-hotplug 
 2672 root         0 IW   [kworker/3:2-rcu]
 7792 root         0 IW   [kworker/1:1-rcu]
11958 root         0 IW   [kworker/1:2-eve]
17692 root         0 IW   [kworker/0:2-eve]
19522 root         0 IW   [kworker/0:1-eve]
19540 root         0 IW<  [kworker/1:1H-kb]
19541 root         0 IW<  [kworker/0:2H-kb]
19542 root         0 IW<  [kworker/2:1H]
19543 root         0 IW<  [kworker/3:1H]
19573 root      1208 R    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
19575 root      1292 S    -ash
19619 root         0 IW<  [kworker/u9:2-na]
19630 root      2756 S    /usr/sbin/dawn
19785 root         0 IW   [kworker/u8:0-ph]
19846 root         0 IW<  [kworker/u9:0-na]
19852 root         0 IW   [kworker/u8:1-ev]
19874 root         0 IW<  [kworker/u9:1-na]
19880 root         0 IW   [kworker/u8:2-ev]
19885 root      1284 S    sleep 60
19886 root      1284 R    ps

The thing I notice DAWN's VSZ goes from 8384 to 2756. Unfortunately, I didn't get any other info. My other routers have DAWN's VSZ at 2856, 5604, and 10788. They all run different things and have different memory sizes.

I did have a router kill DAWN with an OOM before I updated. That was the reason I updated, I wanted DAWN to restart. But I had DAWN running for months without issue. Though if it was killing DAWN I'm not sure I would have noticed.

What can I do to narrow down what is using up all the memory? If it is DAWN, do you think that it is inherent and DAWN will require lots of memory or is there a memory leak/problem?

Thanks.

PolynomialDivision commented 2 years ago

OpenWrt 21.02 or trunk? ;)

PolynomialDivision commented 2 years ago

There were some important hostapd umdns fixes.

protectivedad commented 2 years ago

OpenWrt 21.02-SNAPSHOT r16312 Plus I cherry-picked the latest changes to DAWN.

PolynomialDivision commented 2 years ago

Can you try OpenWrt Snapshot and not 21.02? We had a lot of changes and buf fixes for hostapd.

protectivedad commented 2 years ago

Oh, you're killing me :) They just released 21.02.1 and I was hoping to get away from bleeding edge :). It will take me a while I have my own hardware changes that I need to merge in if I move over. I'll monitor the memory usage as I work to switch over to master.

PolynomialDivision commented 2 years ago

Oh, you're killing me :) They just released 21.02.1 and I was hoping to get away from bleeding edge :). It will take me a while I have my own hardware changes that I need to merge in if I move over. I'll monitor the memory usage as I work to switch over to master.

Maybe we can also migrate hostapd changes to 21.02. ;)

protectivedad commented 2 years ago

GIT never ceases to amaze me (I'm a old school Atari programmer). With a few commands I could get all my commits across 21.02 and cherry-pick them on a fresh openwrt master and run my build scripts. I'll update the firmwares and watch them for memory growth.

Ian-Clowes commented 2 years ago

I've been away and may be reading things wrongly, but...

it looks like DAWN_MEMORY_AUDIT is defined by default, so a kill -HUP to the process should put some details of memory that DAWN has allocated into the system log. If that info is getting longer as memory is consumed then DAWN may be at fault, but if it staying stable then it could well be a different library that is leaking memory.

Ian-Clowes commented 2 years ago

... so a kill -HUP to the process should put some details of memory that DAWN has allocated into the system log.

Just got DAWN up and running again to check this. It's not quite right as the memory logging goes to stdout. You can make stdout appear in logread if you edit the obvious line in /etc/init.d/dawn

procd_set_param stdout 1

Ian-Clowes commented 2 years ago

Example after DAWN has been running for a week or so on my router. The NNN= values are a sequential indicator that helps show what is a new or old allocation.

root@OpenWrt:~# ps|grep dawn
 3773 root      2688 S    /tmp/dawn

root@OpenWrt:~# kill -HUP 3773
root@OpenWrt:~# logread | grep dawn
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit: Currently recorded allocations...
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:        3=X - ubus.c@589: 0
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:  1125075=M - msghandler.c@150: 96
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:        4=C - ubus.c@1320: 744
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:  1124391=M - msghandler.c@150: 96
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:  1125087=M - msghandler.c@150: 96
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:  1125085=M - msghandler.c@475: 608
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:  1125105=M - msghandler.c@150: 96
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:        0=X - dawn_uci.c@417: 0
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit:        1=X - dawn_uci.c@436: 0
Mon Dec 20 23:17:32 2021 daemon.info dawn[3773]: mem-audit: [End of list: 1736 bytes]
Ian-Clowes commented 2 years ago

@protectivedad - Is this resolved now? If not I'll try and repro it to fix.

protectivedad commented 2 years ago

Sorry for the long delay. I was never able to switch to the latest master as discussed. When I attempted it the dnsmasq stop working, and other things started breaking. To the point where my wife and kids were very upset :). I updated 21.02 on Nov 27 which captured a lot of hostapd fixes. Since Nov 27 I haven't had any issues. I have changed the stdout param and will continue to monitor. Thanks for getting back to me and again sorry for the long delay on responding,

protectivedad commented 2 years ago

Seems to be fine now. The process isn't being killed and all my routers DAWN VSZ are all the abouth the same. This is with the latest DAWN and openwrt-21.02 from 01-23-2022. Thanks for taking the time.