Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
456 stars 277 forks source link

Unhandled promise rejection when stopping the coordinator #555

Closed ercksen closed 1 year ago

ercksen commented 1 year ago

Hey, I encountered a bug today where my Node process exited due to an unhandled promise rejection and finally found out that stopping the coordinator was the cause.

Following code can reproduce the issue:

const {Controller} = require("zigbee-herdsman");

const coordinator = new Controller({
    databasePath: 'test.db',
    serialPort: {path: '/dev/ttyACM0'}
});

let date = new Date();

coordinator
    .start()
    .then(() => {
        coordinator.stop()
            .then(() => console.log("Stopped", new Date() - date))
            .catch(err => console.log("Catch error (2):", new Date() - date, err));
        console.log("Controller created", new Date() - date);
    })
    .catch(err => console.log("Catch error (1):", new Date() - date, err))

I added dates for debugging purposes but when the app crashes no error handler is invoked. I used my stopwatch and it takes 10 seconds to crash. Output:

Controller created 211
Stopped 319
node:internal/process/promises:279
            triggerUncaughtException(err, true /* fromPromise */);
            ^

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "TIMEOUT".] {
  code: 'ERR_UNHANDLED_REJECTION'
}

I tried to get the stack using the unhandled rejection event but it was undefined, I do not know where to search. It would be nice to restart the controller without terminating Node itself. I am using CC253X with DECONZ Adapter.

Maybe this information is helpful to debug the problem. If you have any further questions, let me know. I am using v0.14.40 and the test.db file had no previous devices in it.

Koenkk commented 1 year ago

Can you provide the herdsman debug logging of this? Run your process with DEBUG=zigbee-herdsman:*

ercksen commented 1 year ago

Here is the log output. I marked the point at which the stop() function of the controller is invoked. The lines before are from regular communication I think but I have attached some of the last lines before the stop was initiated. I forgot to mention that I am using Docker with a mounted device, if that makes any difference.

2022-07-06T17:39:09.679Z zigbee-herdsman:deconz:adapter checking coordinator simple descriptor
2022-07-06T17:39:09.681Z zigbee-herdsman:deconz:driver DATA_REQUEST - destAddr: 0x0 EP:0 SeqNr. 3 request id: 1
2022-07-06T17:39:09.682Z zigbee-herdsman:deconz:frameParser DATA_REQUEST RESPONSE - request id: 1 status: 0
2022-07-06T17:39:09.682Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:09.683Z zigbee-herdsman:deconz:frameParser DEVICE_STATE changed: 10101110
2022-07-06T17:39:09.683Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 1 apsDataIndication: 1 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:09.686Z zigbee-herdsman:deconz:driver query aps data confirm
2022-07-06T17:39:09.690Z zigbee-herdsman:deconz:driver DATA_CONFIRM - sending data state request - SeqNr. 4
2022-07-06T17:39:09.691Z zigbee-herdsman:deconz:driver query aps data indication
2022-07-06T17:39:09.692Z zigbee-herdsman:deconz:frameParser DATA_CONFIRM RESPONSE - destAddr: 0x0 request id: 1 confirm status: 0
2022-07-06T17:39:09.692Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 1 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:09.692Z zigbee-herdsman:deconz:driver query aps data indication
2022-07-06T17:39:09.695Z zigbee-herdsman:deconz:driver DATA_INDICATION - sending read data request - SeqNr. 5
2022-07-06T17:39:09.697Z zigbee-herdsman:deconz:frameParser DATA_INDICATION RESPONSE - seqNr. 5 srcAddr: 0x0 destAddr: 0x0 profile id: 0x0 cluster id: 0x8004 lqi: 17
2022-07-06T17:39:09.697Z zigbee-herdsman:deconz:frameParser response payload: 1,0,0,0,20,1,4,1,5,0,1,3,0,0,10,0,25,0,3,1,0,32,0,0,5
2022-07-06T17:39:09.697Z zigbee-herdsman:deconz:adapter resolve data request without a transSeq Nr.
2022-07-06T17:39:09.697Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:09.697Z zigbee-herdsman:deconz:adapter RECEIVING SIMPLE_DESCRIPTOR - addr: 0x0 EP:1 inClusters: 0,10,25 outClusters: 1,32,1280
2022-07-06T17:39:09.697Z zigbee-herdsman:deconz:adapter EP: 1
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter profile ID: 260
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter device ID: 5
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter input cluster: 0x0
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter input cluster: 0xa
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter input cluster: 0x19
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter output cluster: 0x1
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter output cluster: 0x20
2022-07-06T17:39:09.698Z zigbee-herdsman:deconz:adapter output cluster: 0x500
2022-07-06T17:39:09.701Z zigbee-herdsman:deconz:driver DATA_INDICATION - sending read data request - SeqNr. 6
<CONTROLLER STOP REQUEST>
2022-07-06T17:39:16.183Z zigbee-herdsman:controller:log Disable joining
2022-07-06T17:39:16.185Z zigbee-herdsman:deconz:adapter zclFrame to all - zclFrame.payload:
2022-07-06T17:39:16.186Z zigbee-herdsman:deconz:adapter { options: 10, commisioningWindow: 0 }
2022-07-06T17:39:16.186Z zigbee-herdsman:deconz:adapter sendZclFrameToAll - message send
2022-07-06T17:39:16.186Z zigbee-herdsman:deconz:driver DATA_REQUEST - destAddr: 0xfffd EP:242 SeqNr. 7 request id: 2
2022-07-06T17:39:16.188Z zigbee-herdsman:deconz:frameParser DATA_REQUEST RESPONSE - request id: 2 status: 0
2022-07-06T17:39:16.188Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:16.188Z zigbee-herdsman:deconz:frameParser DEVICE_STATE changed: 10100110
2022-07-06T17:39:16.188Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 1 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:16.192Z zigbee-herdsman:deconz:driver query aps data confirm
2022-07-06T17:39:16.197Z zigbee-herdsman:deconz:driver DATA_CONFIRM - sending data state request - SeqNr. 8
2022-07-06T17:39:16.198Z zigbee-herdsman:deconz:frameParser DATA_CONFIRM RESPONSE - destAddr: 0xfffd request id: 2 confirm status: 0
2022-07-06T17:39:16.198Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:16.202Z zigbee-herdsman:deconz:driver DATA_REQUEST - destAddr: 0xfffc EP:0 SeqNr. 9 request id: 3
2022-07-06T17:39:16.204Z zigbee-herdsman:deconz:frameParser DATA_REQUEST RESPONSE - request id: 3 status: 0
2022-07-06T17:39:16.204Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:16.242Z zigbee-herdsman:deconz:frameParser DEVICE_STATE changed: 10100110
2022-07-06T17:39:16.242Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 1 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:16.243Z zigbee-herdsman:deconz:driver query aps data confirm
2022-07-06T17:39:16.248Z zigbee-herdsman:deconz:driver DATA_CONFIRM - sending data state request - SeqNr. 10
2022-07-06T17:39:16.249Z zigbee-herdsman:deconz:frameParser DATA_CONFIRM RESPONSE - destAddr: 0xfffc request id: 3 confirm status: 0
2022-07-06T17:39:16.249Z zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1
2022-07-06T17:39:16.250Z zigbee-herdsman:deconz:adapter PERMIT_JOIN - 0 seconds
<stop() function returned without an error>
2022-07-06T17:39:16.256Z zigbee-herdsman:deconz:driver send write parameter request from queue. seqNr: 11 paramId: 33 param: 0
2022-07-06T17:39:26.257Z zigbee-herdsman:deconz:driver Timeout for request - CMD: 0xb seqNr: 11
 Unhandled Rejection  at: Promise Promise { <rejected> 'TIMEOUT' } reason: TIMEOUT stack: undefined
Koenkk commented 1 year ago

@ChrisHae since this is in the deconz adapter code, can you take a look at this?

ercksen commented 1 year ago

I think I figured out why it crashes. In line 101 of src/adapter/deconz/adapter/deconzAdapter.ts there is a missing await for the drivers close function. But close() returns a promise that is not returned or caught. So if it rejects, Node will terminate. Of course that still would mean that close() throws an error but that would probably be another topic. I will do some testing on the next days when I have more time and let you know.

ercksen commented 1 year ago

Okay I did some investigation now and that was actually not the reason - although I see this as a problem and there should maybe an await be added. But the reason the promise is rejected lays in deconz/driver/driver.js. I don't know the architecture of this library well but it seems as there are no more rejection handlers after the controller stops.

I found a simple workaround (or fix?) by changing

while (i--) { (line 394)

to while (i-- && this.initialized) {

This will stop the queue processing and error rejection. Additionally, now the loop terminates gracefully. If later calling start again on the controller re-uses the old driver, it should rather be changed to

while (i--) { if (!this.initialized) return; } ...

But as I said before I don't know what is more suitable.

Also I found similar occurrences in lines 622 and 964 that look almost equally to me.

While reading the code I also saw some vague expressions that looked a little suspicious to me for unhandled rejections but they were not of importance here. If you are interested, I can have a look at it again and create an MR but idk how this is done on GitHub tbh.

EDIT: It must be the second variant. Otherwise it breaks communication with my devices. But still, after re-opening the controller, I run into problems. But that could also be a problem of my application.

ercksen commented 1 year ago

Finally I got a better workaround (maybe this time we could even call it a fix).

The problem was that there were arrays that were used by a controller instance but declared module-wide so that they are shared even if a new controller is instantiated after stopping the old one. I just set the length of all arrays to 0 (deletes all entries without destroying references), plus, I put all setInterval() calls in an array and cleared the intervals when 'closed' is emitted. Currently, the intervals get orphans after the driver was closed, probably still consuming CPU time.

Koenkk commented 1 year ago

I think I figured out why it crashes. In line 101 of src/adapter/deconz/adapter/deconzAdapter.ts there is a missing await for the drivers close function. But close() returns a promise that is not returned or caught. So if it rejects, Node will terminate. Of course that still would mean that close() throws an error but that would probably be another topic. I will do some testing on the next days when I have more time and let you know.

I've added the await, can you check if it is fixed now?

ercksen commented 1 year ago

I already tested that yesterday and that was not the reason why it crashed (as I mentioned in https://github.com/Koenkk/zigbee-herdsman/issues/555#issuecomment-1179168263). But anyways, probably good that it is added now. In the comment linked were the steps explained that I did to make it working.

And finally I found a decent solution as explained in https://github.com/Koenkk/zigbee-herdsman/issues/555#issuecomment-1179269021. Seems like you have missed these somehow.

Koenkk commented 1 year ago

I see your solution but this still looks a bit hacky to me, can we clear these arrays on a stop in the adapter code?

ercksen commented 1 year ago

Jup, this is exactly what I am doing:

constructor() {
    // ...
    this.on('close', () => {
        this.intervals.forEach(i => clearInterval(i)); // intervals is the array were I push the
        queue.length = 0;                              // results of setInterval() into
        busyQueue.length = 0;
        apsQueue.length = 0;
        apsBusyQueue.length = 0;
        apsConfirmIndQueue.length = 0;
        timeoutCounter = 0;
    });
}

Or are you concerned about the fact that it is hooked into the event emitter? We could also put this directly inside the stop function. But the close event is called from three different locations inside the driver. I thought it would be better to catch all driver closes to ensure that all resources are freed.

Koenkk commented 1 year ago

This looks good to me, could you make a pr?

ercksen commented 1 year ago

I wanted to push the commit to a new branch but got a 403. Should I fork the project instead and create a PR then? I am coming from Gitlab and actually don't know how to create the PR properly.

ercksen commented 1 year ago

driver.ts.zip

Deconz driver: Clear intervals and shared arrays after close; catch all unhandled promises and forward errors to debugger

If it's easier, I zipped the file (.ts not supported) instead.

Koenkk commented 1 year ago

You can click the edit icon here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/deconz/driver/driver.ts (pencil) then make a pr

ercksen commented 1 year ago

Wow! Didn't know of that. Thanks.

Koenkk commented 1 year ago

Thanks for the mr, I assume this can be closed now.

Dodobibi commented 1 year ago

Hi,

The problem recurs when using socket port instead of serial port.

The socket was destroyed before clearing intervals ??

Here is the full debug log: {"path":"tcp://192.168.1.40:2002","adapter":"deconz"}

////// CALL await controller.start()
  zigbee-herdsman:deconz:driver Set delay to 0 +0ms
  zigbee-herdsman:controller:log Starting with options '{"network":{"networkKeyDistribute":false,"networkKey":[*********],"panID":*******,"extendedPanID":[********],"channelList":[11]},"serialPort":{"path":"tcp://192.168.1.40:2002","adapter":"deconz"},.................' +0ms
  zigbee-herdsman:deconz:driver Opening TCP socket with 192.168.1.40:2002 +2ms
  zigbee-herdsman:deconz:driver Socket connected +336ms
  zigbee-herdsman:deconz:driver Socket ready +0ms
  zigbee-herdsman:controller:log Started with result 'resumed' +338ms
  zigbee-herdsman:controller:log Injected database: true, adapter: true +0ms
  zigbee-herdsman:deconz:driver send read parameter request from queue. seqNr: 1 paramId: 1 +4ms
  zigbee-herdsman:deconz:frameParser read firmware version response - version: 50,110,101,116 +0ms
  zigbee-herdsman:deconz:frameParser MAC: 0x00212effff0413c3 +1ms
  zigbee-herdsman:deconz:driver send read parameter request from queue. seqNr: 2 paramId: 7 +31ms
  zigbee-herdsman:deconz:frameParser NWKADDR: 0 +26ms
resumed
  zigbee-herdsman:controller:log Disable joining +2s
  zigbee-herdsman:deconz:adapter zclFrame to all - zclFrame.payload: +0ms
  zigbee-herdsman:deconz:adapter { options: 10, commisioningWindow: 0 } +0ms
  zigbee-herdsman:deconz:adapter sendZclFrameToAll - message send +1ms
  zigbee-herdsman:deconz:driver DATA_REQUEST - destAddr: 0xfffd EP:242 SeqNr. 3 request id: 1 +2s
  zigbee-herdsman:deconz:frameParser DATA_REQUEST RESPONSE - request id: 1 status: 0 +2s
  zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1 +19ms
  zigbee-herdsman:deconz:frameParser DEVICE_STATE changed: 10100110 +3ms
  zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 1 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1 +3ms
  zigbee-herdsman:deconz:driver query aps data confirm +8ms
  zigbee-herdsman:deconz:driver DATA_CONFIRM - sending data state request - SeqNr. 4 +15ms
  zigbee-herdsman:deconz:frameParser DATA_CONFIRM RESPONSE - destAddr: 0xfffd request id: 1 confirm status: 0 +40ms
  zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1 +17ms
  zigbee-herdsman:deconz:driver DATA_REQUEST - destAddr: 0xfffc EP:0 SeqNr. 5 request id: 2 +15ms
  zigbee-herdsman:deconz:frameParser DATA_REQUEST RESPONSE - request id: 2 status: 0 +33ms
  zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1 +18ms
  zigbee-herdsman:deconz:frameParser DEVICE_STATE changed: 10100110 +127ms
  zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 1 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1 +127ms
  zigbee-herdsman:deconz:driver query aps data confirm +7ms
  zigbee-herdsman:deconz:driver DATA_CONFIRM - sending data state request - SeqNr. 6 +16ms
  zigbee-herdsman:deconz:frameParser DATA_CONFIRM RESPONSE - destAddr: 0xfffc request id: 2 confirm status: 0 +40ms
  zigbee-herdsman:deconz:driver networkstate: 2 apsDataConfirm: 0 apsDataIndication: 0 configChanged: 0 apsRequestFreeSlots: 1 +17ms
  zigbee-herdsman:deconz:adapter PERMIT_JOIN - 0 seconds +264ms
////// Sleeping for 2 seconds...
////// CALL await controller.stop()
  zigbee-herdsman:deconz:driver Port closed +5ms
  zigbee-herdsman:deconz:driver send write parameter request from queue. seqNr: 7 paramId: 33 param: 0 +0ms
  zigbee-herdsman:deconz:driver Error writing socket Port: Cannot call write after a stream was destroyed +1ms
  zigbee-herdsman:deconz:adapter checking coordinator simple descriptor +350ms
  zigbee-herdsman:deconz:driver DATA_REQUEST - destAddr: 0x0 EP:0 SeqNr. 8 request id: 3 +361ms
  zigbee-herdsman:deconz:driver Error writing socket Port: Cannot call write after a stream was destroyed +1ms
  zigbee-herdsman:deconz:driver Timeout for request - CMD: 0xb seqNr: 7 +10s

[UnhandledPromiseRejection: This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason "TIMEOUT".] {
  code: 'ERR_UNHANDLED_REJECTION'
}
 ELIFECYCLE  Command failed with exit code 1.