WebThingsIO / gateway

WebThings Gateway
http://webthings.io/gateway
Mozilla Public License 2.0
2.61k stars 335 forks source link

Gateway crash with assert in libuv #2400

Closed cdelston0 closed 4 years ago

cdelston0 commented 4 years ago

Raspberry Pi running gateway at commit 8a594194c71d14e8b89ade6875995880b96b7c35 things-url-adapter addon at 0.4.5

Using test code from https://github.com/cdelston0/gateway-test

Test command is:

taskset 0xF7 python3 test.py --gateway-url=http://10.0.0.16:8080 --gateway-user=testuser@email.com --gateway-password=testpass --things-quantity=10 --property-changes=10
MultipleThingProfiling.test_4_property_change_via_WS_no_wait_for_gateway

Doesn't crash on first invocation, does seem to reliably crash at start of second invocation.

Gateway log attached indicates an assert in libuv, and preliminary searches indicate perhaps a double close on a libuv resource (https://github.com/joyent/libuv/issues/949)

pi_gateway_crash.log

mrstegeman commented 4 years ago

I'm unable to reproduce in either of the following environments:

  1. Gateway master branch, latest commit, running with Node 12 on macOS.
  2. Gateway 0.10 Docker image (which uses Node 8).

I'm running the test script in a python:3.7 Docker container.

cdelston0 commented 4 years ago

This issue doesn't occur on my x86_64 / Linux target either. But does seem to affect both the Raspberry Pi 3 B+ and the Beaglebone Enhanced. So ARM target related. The beaglebone crash isn't the same, but is also an assert in libuv. It's possible the libuv on debian ARM is compiled with asserts enabled where other platforms aren't?

mrstegeman commented 4 years ago

Now I can't get the test to even finish once. It keeps hanging:

2020-01-13 17:21:25.464934 Waiting for 10 webthings
2020-01-13 17:21:29.446703 All webthings added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
POST failed: 400 - Web thing already added
2020-01-13 17:21:29.700130 All webthings connected
--- Sending 10 property changes to 10 webthings via POST ( waiting for gateway response)
http---192.168.86.26-8800 sequence correct: True
http---192.168.86.26-8801 sequence correct: True
http---192.168.86.26-8802 sequence correct: True
http---192.168.86.26-8803 sequence correct: True
http---192.168.86.26-8804 sequence correct: True
http---192.168.86.26-8805 sequence correct: True
http---192.168.86.26-8806 sequence correct: True
http---192.168.86.26-8807 sequence correct: True
http---192.168.86.26-8808 sequence correct: True
http---192.168.86.26-8809 sequence correct: True
===
Mean propagation time: 29.30ms
Longest propagation time: 75.93ms
===
.2020-01-13 17:21:34.650883 All webthings connected
--- Sending 10 property changes to 10 webthings via POST ( not waiting for gateway response)
http---192.168.86.26-8800 sequence correct: True
http---192.168.86.26-8801 sequence correct: True
http---192.168.86.26-8802 sequence correct: True
http---192.168.86.26-8803 sequence correct: True
http---192.168.86.26-8804 sequence correct: True
http---192.168.86.26-8805 sequence correct: True
http---192.168.86.26-8806 sequence correct: True
http---192.168.86.26-8807 sequence correct: True
http---192.168.86.26-8808 sequence correct: True
http---192.168.86.26-8809 sequence correct: True
===
Mean propagation time: 29.32ms
Longest propagation time: 75.40ms
===
.2020-01-13 17:21:39.927786 All webthings connected
--- Sending 10 property changes to 10 webthings via WS:// ( waiting for gateway response)
http---192.168.86.26-8800 sequence correct: True
http---192.168.86.26-8801 sequence correct: True
http---192.168.86.26-8802 sequence correct: True
http---192.168.86.26-8803 sequence correct: True
http---192.168.86.26-8804 sequence correct: True
http---192.168.86.26-8805 sequence correct: True
http---192.168.86.26-8806 sequence correct: True
http---192.168.86.26-8807 sequence correct: True
http---192.168.86.26-8808 sequence correct: True
http---192.168.86.26-8809 sequence correct: True
===
Mean propagation time: 221.66ms
Longest propagation time: 326.78ms
===
.Task was destroyed but it is pending!
task: <Task pending coro=<WebSocketProtocol13._receive_frame_loop() done, defined at /usr/local/lib/python3.7/site-packages/tornado/websocket.py:1115> wait_for=<Future pending cb=[IOLoop.add_future.<locals>.<lambda>() at /usr/local/lib/python3.7/site-packages/tornado/ioloop.py:690, <TaskWakeupMethWrapper object at 0x75734b10>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /usr/local/lib/python3.7/site-packages/tornado/ioloop.py:690]>
2020-01-13 17:21:40.597623 All webthings connected
--- Sending 10 property changes to 10 webthings via WS:// ( not waiting for gateway response)
http---192.168.86.26-8800 sequence correct: True
http---192.168.86.26-8801 sequence correct: True
http---192.168.86.26-8802 sequence correct: True
http---192.168.86.26-8803 sequence correct: True
http---192.168.86.26-8804 sequence correct: True
http---192.168.86.26-8805 sequence correct: True
http---192.168.86.26-8806 sequence correct: True
http---192.168.86.26-8807 sequence correct: True
http---192.168.86.26-8808 sequence correct: True
http---192.168.86.26-8809 sequence correct: True
===
Mean propagation time: 237.98ms
Longest propagation time: 482.44ms
===
.2020-01-13 17:21:41.500477 All webthings connected
--- Changing property of 10 webthings 10 times, confirming change via gateway websocket
Task was destroyed but it is pending!
task: <Task pending coro=<WebSocketProtocol13._receive_frame_loop() running at /usr/local/lib/python3.7/site-packages/tornado/websocket.py:1118> wait_for=<Future pending cb=[IOLoop.add_future.<locals>.<lambda>() at /usr/local/lib/python3.7/site-packages/tornado/ioloop.py:690, <TaskWakeupMethWrapper object at 0x75734650>()]> cb=[IOLoop.add_future.<locals>.<lambda>() at /usr/local/lib/python3.7/site-packages/tornado/ioloop.py:690]>
cdelston0 commented 4 years ago

The output looks like you're running all of MultipleThingProfiling, rather than just MultipleThingProfiling.test_4_property_change_via_WS_no_wait_for_gateway

I know for a fact I was doing something stupid in test_5_property_change_at_webthing, so I don't currently expect the whole class to pass.

cdelston0 commented 4 years ago

I have committed a fix to the test harness which means that all cases do now pass on x86_64/Linux for me here https://github.com/cdelston0/gateway-test/commit/fb800b778fa6e22f50e1a2d828876b68c9a837b1

mrstegeman commented 4 years ago

Ah, yes. Had a bad copy/paste.

Things are passing for me now, with no crash (on ARM). Has your crash disappeared as well, or do you still have it?

cdelston0 commented 4 years ago

I can still crash both the beaglebone and the pi. It seems to happen on the second invocation of a test case, not the first. I.e. the first test will pass but second crashes the gateway. I've seen various causes, usually assert in libuv, sometimes a plain old segfault :(

mrstegeman commented 4 years ago

I ran several times, multiple iterations, and have been unable to crash...

cdelston0 commented 4 years ago

I've tried adding https://www.npmjs.com/package/segfault-handler to the build to improve the output when the crash occurs. Unfortunately it doesn't give much more information:

2020-01-14 11:40:59.160 INFO   : Loading add-on: thing-url-adapter
2020-01-14 11:41:00.048 INFO   : Opened a new things socket
2020-01-14 11:41:02.262 INFO   : thing-url: Opening database: /root/.mozilla-iot/config/db.sqlite3
2020-01-14 11:41:04.658 INFO   : thing-url: Loading add-on thing-url-adapter from /root/.mozilla-iot/addons/thing-url-adapter
2020-01-14 11:41:05.516 INFO   : thing-url: Starting mDNS discovery
PID 2163 received SIGSEGV for address: 0x38
/root/gateway/node_modules/segfault-handler/build/Release/segfault-handler.node(+0x20d4)[0xb64b80d4]
/lib/arm-linux-gnueabihf/libc.so.6(+0x24fe0)[0xb6cf1fe0]
Segmentation fault
npm ERR! code ELIFECYCLE
npm ERR! errno 139
npm ERR! webthings-gateway@0.10.1 start: webpack --display errors-only && node build/gateway.js
npm ERR! Exit status 139
npm ERR! 
npm ERR! Failed at the webthings-gateway@0.10.1 start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /root/.npm/_logs/2020-01-14T11_41_08_978Z-debug.log
cdelston0 commented 4 years ago

Things I've tried:

For clarity, there do seem to be a couple of different issues which can occur:

Can you suggest any methods I can deploy to get more information about where these might be occurring?

cdelston0 commented 4 years ago

Better bt extracted from core file:

Core was generated by `node build/gateway.js'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  uv__io_stop (loop=0x0, w=w@entry=0x2f97a44, events=events@entry=8199) at ../deps/uv/src/unix/core.c:922
922 ../deps/uv/src/unix/core.c: No such file or directory.
[Current thread is 1 (Thread 0xb6c36000 (LWP 7329))]
(gdb) bt
#0  uv__io_stop (loop=0x0, w=w@entry=0x2f97a44, events=events@entry=8199) at ../deps/uv/src/unix/core.c:922
#1  0x00daed94 in uv__poll_stop (handle=0x2f97a10) at ../deps/uv/src/unix/poll.c:103
#2  uv_poll_start (handle=0x2f97a10, pevents=1, poll_cb=0xb1f0df7d <NanomsgReadable(uv_poll_s*, int, int)>) at ../deps/uv/src/unix/poll.c:125
#3  0xb1f0de12 in PollCtx::begin_poll(int, bool) () from /root/gateway-new/node_modules/nanomsg/build/Release/node_nanomsg.node
#4  0xb1f0de44 in PollCtx::PollCtx(int, bool, v8::Local<v8::Function>) () from /root/gateway-new/node_modules/nanomsg/build/Release/node_nanomsg.node
#5  0xb1f0b070 in PollSocket(Nan::FunctionCallbackInfo<v8::Value> const&) () from /root/gateway-new/node_modules/nanomsg/build/Release/node_nanomsg.node
#6  0xb1f0ad4e in Nan::imp::FunctionCallbackWrapper(v8::FunctionCallbackInfo<v8::Value> const&) ()
   from /root/gateway-new/node_modules/nanomsg/build/Release/node_nanomsg.node
#7  0x00707b90 in v8::internal::MaybeHandle<v8::internal::Object> v8::internal::(anonymous namespace)::HandleApiCallHelper<false>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::HeapObject>, v8::internal::Handle<v8::internal::FunctionTemplateInfo>, v8::internal::Handle<v8::internal::Object>, v8::internal::BuiltinArguments) ()
#8  0x00709564 in v8::internal::Builtin_HandleApiCall(int, unsigned int*, v8::internal::Isolate*) ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) 
mrstegeman commented 4 years ago

Looks to be yet another issue with nanomsg. 😞

mrstegeman commented 4 years ago

Potentially fixed by #2424.