TheThingSystem / steward

The Thing System is a set of software components and network protocols that aims to fix the Internet of Things. Our steward software is written in node.js making it both portable and easily extensible. It can run on your laptop, or fit onto a small single board computer like the Raspberry Pi.
http://thethingsystem.com
Other
346 stars 81 forks source link

Steward consumes 100% CPU in 1.6+ #197

Closed jukkar closed 10 years ago

jukkar commented 10 years ago

I have this system with multiple network interfaces. If those interfaces have an IP address when steward is started, then the steward starts to consume 100% CPU and eventually it runs out of memory and is killed by the OS. If there is only one interface with IP address when steward starts, I see no problems.

Perhaps this is a side effect of the fix to this bug https://github.com/TheThingSystem/steward/issues/185

My steward is at this commit id bbc9e0c9aec8f3087e67f2093f7884340644e4c1

mrose17 commented 10 years ago

well that's a new one. what does the steward output say?

jukkar commented 10 years ago

@mrose17

The log is below. The log are from image that I am running in the VirtualBox. I also tried running TTS in Intel NUC device and I currently do not see the same error there, the results are quite confusing. I am doing more tests for this.

The log from steward looks pretty normal to my eyes:

debug: [steward] begin
debug: [devices] begin
debug: [server] begin
*** WARNING *** The program 'node' uses the Apple Bonjour compatibility layer of Avahi.
*** WARNING *** Please fix your application to use the native API of Avahi!
*** WARNING *** For more information see <http://0pointer.de/avahi-compat?s=libdns_sd&e=node>
*** WARNING *** The program 'node' called 'DNSServiceRegister()' which is not supported (or only supported partially) in the Apple Bonjour compatibility layer of Avahi.
*** WARNING *** Please fix your application to use the native API of Avahi!
*** WARNING *** For more information see <http://0pointer.de/avahi-compat?s=libdns_sd&e=node&f=DNSServiceRegister>
debug: [discovery] begin
debug: [manage] begin
info: [steward] scanning enp0s3
info: [steward] scanning tether
info: [steward] loading place actor
warning: [steward] place/1 event=read .git/logs/HEAD, diagnostic=ENOENT, open '/opt/TheThingSystem/steward/actors/../../.git/logs/HEAD'
info: [devices] loading climate driver
debug: [climate] begin
info: [devices] loading gateway driver
debug: [gateway] begin
info: [devices] loading indicator driver
debug: [indicator] begin
info: [devices] loading lighting driver
debug: [lighting] begin
info: [devices] loading media driver
debug: [media] begin
info: [devices] loading motive driver
debug: [motive] begin
info: [devices] loading presence driver
debug: [presence] begin
info: [devices] loading sensor driver
debug: [sensor] begin
info: [devices] loading switch driver
debug: [switch] begin
info: [devices] loading wearable driver
debug: [wearable] begin
info: [climate] loading ecobee-control driver
info: [climate] loading flower-power-sensor driver
info: [climate] loading koubachi-sensor driver
info: [climate] loading nest-control driver
info: [climate] loading netatmo-sensor driver
info: [climate] loading oregon-scientific-sensor driver
info: [climate] loading samsung-control driver
debug: [discovery] listening network_interface=enp0s3, ipaddr=10.237.67.149, portno=1900
debug: [discovery] listening network_interface=tether, ipaddr=192.168.0.1, portno=1900
info: [climate] loading yoctopuce-meteo driver
info: [discovery] SSDP listening on http://0.0.0.0:1900
debug: [discovery] multicasting network_interface=enp0s3, ipaddr=10.255.255.255, portno=1900
info: [discovery] SSDP listening on http://0.0.0.0:1900
debug: [discovery] multicasting network_interface=tether, ipaddr=192.168.0.255, portno=1900
info: [gateway] loading automatic-cloud gateway
info: [gateway] loading ecobee-cloud gateway
info: [devices] failing ecobee-cloud gateway (continuing) diagnostic=Cannot find module 'ecobee-api'
info: [gateway] loading flower-power-cloud gateway
info: [gateway] loading insteon-9761 gateway
info: [gateway] loading koubachi-cloud gateway
info: [gateway] loading lockitron-cloud gateway
info: [gateway] loading nest-cloud gateway
info: [gateway] loading netatmo-cloud gateway
info: [gateway] loading openzwave-usb gateway
info: [gateway] loading reelyactive-reel gateway
info: [gateway] loading rfxcom-usb gateway
info: [gateway] loading tesla-cloud gateway
info: [gateway] loading wink-cloud gateway
info: [gateway] loading yoctopuce-hub gateway
info: [gateway] loading zigbee-gmo gateway
info: [discovery] reelyactive-reel driver listening on  udp://*:7018
info: [indicator] loading cassandra-nosql driver
info: [indicator] loading dweetio-sensor driver
info: [indicator] loading grovestreams-sensor driver
info: [indicator] loading mqtt-text driver
info: [indicator] loading nma-text driver
info: [indicator] loading prowl-text driver
info: [indicator] loading wink-dial driver
info: [indicator] loading wink-nimbus driver
info: [indicator] loading xively-sensor driver
info: [lighting] loading blink1-led driver
info: [lighting] loading heroicrobotics driver
info: [lighting] loading hue driver
info: [lighting] loading insteon-led driver
info: [lighting] loading lifx-bulb driver
info: [lighting] loading robosmart-led driver
Mon, 07 Apr 2014 09:44:19 GMT hci-ble hciBle = /opt/TheThingSystem/steward/node_modules/noble/lib/linux/../../build/Release/hci-ble
info: [lighting] loading tabu-lumen driver
info: [lighting] loading tcpi-led driver
*** WARNING *** The program 'node' called 'DNSServiceRegister()' which is not supported (or only supported partially) in the Apple Bonjour compatibility layer of Avahi.
*** WARNING *** Please fix your application to use the native API of Avahi!
*** WARNING *** For more information see <http://0pointer.de/avahi-compat?s=libdns_sd&e=node&f=DNSServiceRegister>
info: [lighting] loading yoctopuce-color driver
info: [lighting] loading yoctopuce-powercolor driver
info: [discovery] PixelPusher listening on udp://*:7331
info: [media] loading audio-sonos driver
info: [media] loading video-appletv driver
info: [media] loading video-chromecast driver
info: [media] loading video-roku driver
info: [motive] loading automatic-vehicle driver
info: [motive] loading lockitron-lock driver
info: [motive] loading tesla-model-s driver
info: [presence] loading fob driver
info: [presence] loading mobile-mqtt driver
info: [presence] loading reelyactive-tag driver
info: [switch] loading insteon-dimmer driver
info: [switch] loading insteon-onoff driver
info: [switch] loading wemo-onoff driver
info: [switch] loading wink-onoff driver
info: [switch] loading wink-pivot-power-genius driver
info: [switch] loading zwave-dimmer driver
info: [switch] loading zwave-onoff driver
info: [wearable] loading watch driver
debug: [discovery] multicasting network_interface=enp0s3, ipaddr=10.255.255.255, portno=1900
debug: [discovery] multicasting network_interface=tether, ipaddr=192.168.0.255, portno=1900
Mon, 07 Apr 2014 09:45:53 GMT hci-ble buffer = "adapterState poweredOn\n"
Mon, 07 Apr 2014 09:45:53 GMT hci-ble line = adapterState poweredOn
Mon, 07 Apr 2014 09:45:53 GMT hci-ble adapterState = poweredOn
debug: [discovery] SSDP response address=192.168.0.1, family=IPv4, port=1900, size=189
debug: [discovery] SSDP response address=192.168.0.1, family=IPv4, port=1900, size=189
debug: [discovery] multicasting network_interface=enp0s3, ipaddr=10.255.255.255, portno=1900
debug: [discovery] multicasting network_interface=tether, ipaddr=192.168.0.255, portno=1900
info: [sensor] loading nest-smoke driver
info: [sensor] loading texas-instruments-sensortag driver
./start-steward.debug: line 21:   584 Killed                  DEBUG=hci-ble node $DIR/server.js $@

For some reason the steward went crazy as I see this in dmesg:

[  262.141277] device wlan0 entered promiscuous mode
[  262.229413] tether: port 1(wlan0) entered forwarding state
[  262.229625] tether: port 1(wlan0) entered forwarding state
[  324.378180] device enp0s3 entered promiscuous mode
[  324.420286] device tether entered promiscuous mode
[  453.426491] systemd-journal invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[  453.427260] systemd-journal cpuset=/ mems_allowed=0
[  453.555625] CPU: 0 PID: 255 Comm: systemd-journal Tainted: G        W    3.14.0-rc2-bt6lowpan #2
[  453.557062] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  453.558242]  00000000 00000000 c71f7ca4 c180e9a8 c4e7c140 c71f7d04 c180b918 c1a53900
[  453.560409]  c4e7c500 000201da 00000000 00000000 00000206 c1b70160 c71f7cd0 c108e1eb
[  453.562172]  c71f7ce0 c18166e1 c1b70160 00000001 c71f7d04 c1380d9a c79f2b80 000201da
[  453.563807] Call Trace:
[  453.564221]  [<c180e9a8>] dump_stack+0x4b/0x75
[  453.564650]  [<c180b918>] dump_header.isra.9+0x76/0x269
[  453.565155]  [<c108e1eb>] ? trace_hardirqs_on+0xb/0x10
[  453.565537]  [<c18166e1>] ? _raw_spin_unlock_irqrestore+0x41/0x70
[  453.565996]  [<c1380d9a>] ? ___ratelimit+0x8a/0x110
[  453.566409]  [<c110c132>] oom_kill_process+0x1c2/0x380
[  453.566788]  [<c104d6b4>] ? has_capability_noaudit+0x14/0x20
[  453.567169]  [<c110c753>] out_of_memory+0x2c3/0x310
[  453.567575]  [<c11106ab>] __alloc_pages_nodemask+0x94b/0x9b0
[  453.567962]  [<c110a7c9>] filemap_fault+0x2d9/0x510
[  453.568373]  [<c112ab7a>] ? __do_fault+0x5a/0x4c0
[  453.568805]  [<c112ab7a>] __do_fault+0x5a/0x4c0
[  453.569120]  [<c11479d0>] ? sock_update_memcg+0x220/0x220
[  453.569499]  [<c112de66>] handle_mm_fault+0x156/0x8f0
[  453.569870]  [<c181a280>] ? __do_page_fault+0x110/0x570
[  453.570105]  [<c181a2dc>] __do_page_fault+0x16c/0x570
[  453.570466]  [<c116dc2a>] ? mntput_no_expire+0x6a/0x220
[  453.570848]  [<c1152b91>] ? __fput+0x121/0x200
[  453.571352]  [<c1817451>] ? restore_all+0xf/0xf
[  453.571714]  [<c1061a97>] ? task_work_run+0x97/0xb0
[  453.572085]  [<c181a6e0>] ? __do_page_fault+0x570/0x570
[  453.572529]  [<c181a6ed>] do_page_fault+0xd/0x10
[  453.572896]  [<c1817beb>] error_code+0x5f/0x64
[  453.574092]  [<c1810000>] ? e1000_regdump+0x34b/0x381
[  453.574482]  [<c181a6e0>] ? __do_page_fault+0x570/0x570
[  453.706181] Mem-Info:
[  453.706529] DMA per-cpu:
[  453.706862] CPU    0: hi:    0, btch:   1 usd:   0
[  453.791091] Normal per-cpu:
[  453.791432] CPU    0: hi:   18, btch:   3 usd:   0
[  453.791804] active_anon:14943 inactive_anon:797 isolated_anon:0
[  453.791804]  active_file:36 inactive_file:124 isolated_file:0
[  453.791804]  unevictable:0 dirty:0 writeback:0 unstable:0
[  453.791804]  free:413 slab_reclaimable:2505 slab_unreclaimable:2160
[  453.791804]  mapped:5563 shmem:1092 pagetables:197 bounce:0
[  453.791804]  free_cma:0
[  454.081175] DMA free:548kB min:192kB low:240kB high:288kB active_anon:11368kB inactive_anon:496kB active_file:104kB inactive_file:288kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15992kB managed:15916kB mlocked:0kB dirty:0kB writeback:0kB mapped:552kB shmem:700kB slab_reclaimable:572kB slab_unreclaimable:1068kB kernel_stack:88kB pagetables:136kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1155 all_unreclaimable? yes
[  454.308911] lowmem_reserve[]: 0 89 89 89
[  454.328974] Normal free:1108kB min:1112kB low:1388kB high:1668kB active_anon:48404kB inactive_anon:2692kB active_file:48kB inactive_file:208kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:114624kB managed:93500kB mlocked:0kB dirty:0kB writeback:0kB mapped:21692kB shmem:3668kB slab_reclaimable:9448kB slab_unreclaimable:7572kB kernel_stack:672kB pagetables:652kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:1325 all_unreclaimable? yes
[  454.605629] lowmem_reserve[]: 0 0 0 0
[  454.649197] DMA: 37*4kB (EMR) 2*8kB (R) 22*16kB (R) 1*32kB (R) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 548kB
[  454.835953] Normal: 277*4kB (UE) 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1108kB
[  455.083528] 1258 total pagecache pages
[  455.083870] 0 pages in swap cache
[  455.171160] Swap cache stats: add 0, delete 0, find 0/0
[  455.171645] Free swap  = 0kB
[  455.171969] Total swap = 0kB
[  455.277845] 32654 pages RAM
[  455.309081] 0 pages HighMem/MovableOnly
[  455.309445] 5281 pages reserved
[  455.391165] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[  455.391847] [  255]     0   255     2735      664       7        0             0 systemd-journal
[  455.477368] [  278]     0   278     2301       84       6        0         -1000 systemd-udevd
[  455.562152] [  314]     0   314      622       13       5        0             0 syslogd
[  455.650065] [  316]     0   316     1371       55       6        0             0 connman-vpnd
[  455.650548] [  318]     0   318     1301       81       5        0             0 ofonod
[  455.837121] [  320]     0   320      606       15       5        0             0 klogd
[  455.837833] [  322]     0   322      763       37       5        0             0 start-eca-web
[  455.928057] [  327]     0   327    24330     1370      16        0             0 python
[  455.928731] [  334]   998   334      730       78       5        0          -900 dbus-daemon
[  456.035061] [  364]     0   364      736       45       5        0             0 systemd-logind
[  456.035743] [  366]     0   366     1511       70       6        0             0 connmand
[  456.207744] [  367]   999   367      819       54       5        0             0 avahi-daemon
[  456.340193] [  368]   999   368      788       38       4        0             0 avahi-daemon
[  456.340823] [  369]     0   369     1066       47       5        0             0 bluetoothd
[  456.426548] [  379]     0   379     1336       68       6        0             0 wpa_supplicant
[  456.617146] [  387]     0   387      789       69       5        0             0 sh
[  456.617811] [  581]     0   581      726       32       5        0             0 start-steward.d
[  456.794684] [  582]     0   582      607       15       5        0             0 tee
[  456.836164] [  584]     0   584    36160    17006      67        0             0 node
[  456.924057] [  591]     0   591      537       21       5        0             0 hci-ble
[  456.924673] [  609]     0   609      600       21       5        0             0 systemd-cgroups
[  457.026127] [  610]     0   610      600       22       5        0             0 systemd-cgroups
[  457.026753] [  611]     0   611     1185      258       4        0             0 (agetty)
[  457.144358] Out of memory: Kill process 584 (node) score 605 or sacrifice child
[  457.208414] Killed process 591 (hci-ble) total-vm:2148kB, anon-rss:84kB, file-rss:0kB
jukkar commented 10 years ago

I was too optimistic. After running fine for some time in Intel NUC, steward started to behave badly there also. OOM-killer kills steward, it is then restarted automatically by systemd but because of this restart cycle, the web UI is pretty much unusable.

aallan commented 10 years ago

Interesting, so I'm guessing here, but I'm presuming that one of the network interfaces really doesn't like dropping into promiscuous mode,

[  262.141277] device wlan0 entered promiscuous mode
[  262.229413] tether: port 1(wlan0) entered forwarding state
[  262.229625] tether: port 1(wlan0) entered forwarding state
[  324.378180] device enp0s3 entered promiscuous mode
[  324.420286] device tether entered promiscuous mode

what can you tell us about enp0s3 and tether ..?

jukkar commented 10 years ago

enp0s3 is normal ethernet interface tether is bridged wlan interface

aallan commented 10 years ago

Okay, hmm… a bridged wlan interface. A lot of wireless adaptors don't support promiscuous mode, it's possible that, since it's a bridged interface the kernel isn't spotting it, trying to throw it into promiscuous mode, and getting trapped into a 100% CPU utilisation loop..?

Can you try throwing it into promiscuous mode manually form the command line and seeing if you get the same result?

jukkar commented 10 years ago

The wlan adapter (Intel Centrino Advanced-N 6235) I am using supports promiscuous mode just fine.

I tried with two ethernet interfaces (one pci based and the other usb-ethernet dongle) and disabled the wlan and I see the same results i.e., steward process uses 100% cpu and the memory consumption increases all the time.

aallan commented 10 years ago

Okay, if that's ruled out, the next thing to do then is to look at the fact it's killing BLE related processes. Could you go to steward/steward/discovery/discovery-ble.js and disable the module by adding the following two lines right at the top,

   exports.start = function() {};
   return;

and restart the steward.

jukkar commented 10 years ago

I think the hci-ble is killed also because the parent process (steward) was killed first.

Anyway, editing discovery-ble.js causes this error

alert: [steward] exception diagnostic=Object #<Object> has no method 'register'
alert: [steward] exception stack=[{"fileName":"/opt/TheThingSystem/steward/devices/devices-lighting/lighting-robosmart-led.js","lineNumber":202,"functionName":"Object.exports.start","typeName":"Object","methodName":"exports.start","columnNumber":46,"native":false},{"fileName":"/opt/TheThingSystem/steward/core/utility.js","lineNumber":170,"functionName":null,"typeName":null,"methodName":null,"columnNumber":41,"native":false},{"fileName":"fs.js","lineNumber":107,"functionName":"Object.oncomplete","typeName":"Object","methodName":"oncomplete","columnNumber":15,"native":false}]
uncaught exception: TypeError: Object #<Object> has no method 'register'
jukkar commented 10 years ago

I am not sure how useful this log is but I straced steward and the log can be found here https://docs.google.com/file/d/0B92N-GBvquUaSUV6eTZ5Y3NPVFk/edit?pli=1

Steward is doing something crazy as there is constant flow of these lines in the log

poll([{fd=37, events=POLLIN}], 1, 0)    = 0 (Timeout)
epoll_wait(5, {{EPOLLIN, {u32=36, u64=36}}}, 1024, 4) = 1
mrose17 commented 10 years ago

@jukkar - thanks. could i test your patience a bit? what i would like to understand is whether the issue is related to a particular interface or a combination of interfaces.

would it be possible for you to run a few tests to narrow this down? i.e., run it with tether only, then enps03 only. and see if it gets killed or not. thanks!

finally: i forgot to mention earlier, the fix to https://github.com/TheThingSystem/steward/issues/185 occurs when the steward starts, once it starts listening and "doing things" any code associated with that fix is no longer running...

jukkar commented 10 years ago

@mrose17 - Some test results below

Host: Intel NUC device OS: Linux 3.10.19 TTS: 1.6+ (commit id bbc9e0c9ae) node.js: 0.10.26

No network interfaces available when steward starts:

One interface up and having IP address:

Two interfaces up and having IP address:

Three interface up and having IP address:

Four interface up and having IP address:

It seems that having multiple IP addresses in the device is the triggering condition. Multiple network interfaces without IP address did not trigger the memory issue.

Just by accident I used TTS v1.5 in Intel NUC and still got the same memory issue. For some reason I did not see the "two interface up" issue fixed by #185 this time in Intel NUC hw. So the fix to #185 is clearly not causing this issue described in this bug, just as you suggested in your previous comment.

mrose17 commented 10 years ago

thanks very much for running the tests.

tomorrow evening (thursday) i'll run some tests here on a system with two interfaces active, perhaps i can see some pattern going on…

thanks for your patience. hopefully we can get to the bottom of this very soon!

ps: i'm not sure about the steward waiting for an interface to come up before starting, but that's another issue. (-;

mrose17 commented 10 years ago

well, i can reproduce it, which is good news. so now i'm going to try to figure out where it's going sideways...

mrose17 commented 10 years ago

got it. had to move to the current pcap module (which i've wanted to do anyway for a while). i'm running with three interfaces and seeing < 1.2% CPU on average

mrose17 commented 10 years ago

to get the fix you need to "get the latest", viz.,

    cd steward
    git pull
    cd steward
    rm -rf node_modules
    npm -l install

and then restart the steward. thanks for reporting this and for your patience!