Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
481 stars 301 forks source link

Waitress timer is being set at the wrong time #856

Closed corporategoth closed 3 months ago

corporategoth commented 10 months ago

The waitress timer for messages is being set at the wrong time.

Specifically, it is being set when a command is requested, NOT when it is put on the zigbee device. This can cause a command to time out (which, at least on EZSP, can cause the whole app to crash) without the command ever having gotten onto the zigbee device.

To test this, create a device with concurrency of 1. Send it 2 commands, both of which will get a response at 2/3rds the timeout period. The expected behavior is both will succeed, because from the time we sent it to the zigbee device, we got the required response before the timeout. However, because the timeout is set when the command is requested (ie. before it went into the concurrency-limited queue), NOT when it was sent to the zigbee device, the second command will be timed out, despite the fact it should succeed.

In practical terms, this means that either when there is a lot of commands going out (which hits the concurrency limiter), OR it Z2M is particularly busy (meaning the main loop gets tied up with a lot of MQTT traffic, say, in the case of HA connecting and doing it's auto-configuration), commands can time out before ever hitting the zigbee device.

Suggested fix: Put the waitress timeout on the message ONLY when it's being send to the actual device, NOT when it's enqueued to the concurrency limited queue.

corporategoth commented 10 months ago

NOTE - I am using the EZSP driver - I understand the timeout and queuing is driver specific. But I think that the EZSP driver follows the same pattern as other drivers, ie.

  1. set a timeout
  2. push to a concurrency limited queue
  3. send messages to zigbee device
  4. cancel timer when we receive a response.

The order should be:

  1. push to concurrency limited queue
  2. set a timeout
  3. send message to zigbee device
  4. cancel timer when we receive a response
Koenkk commented 10 months ago

The crashing may already have been fixed by https://github.com/Koenkk/zigbee-herdsman/commit/33fa223e19dfbd4094c3179dd70bd3499359bd29 (available in the dev branch)

corporategoth commented 10 months ago

@Koenkk It still leaves the situation of, if I have a lot of things to go out at once, the current concurrency queue doesn't work as it should (ie. things can/will still timeout before ever hitting the zigbee device).

The concurrency limiter is supposed to delay some actions until there is 'space' available on the zigbee device, and we should only consider the event timed out if it's taken more than <timeout time, currently 10s> after being sent to the zigbee network.

So the ticket is still valid, though it'd be great if the crashing was stopped (I use Z2M on TrueNAS, so it's not easy for me to test a dev version of Z2M, so I can't confirm).

Koenkk commented 10 months ago

@Nerivec you are currently looking into the EZSP queue, is this problem related to that?

Nerivec commented 9 months ago

I fixed a few waiters that were being started too early a while back, that should take care of what you mentioned in your 2nd post. The logic should be as follow everywhere now (unless I missed one):

  1. create a request
  2. create a waiter for the request
  3. send request
  4. start waiter
    • resolve on response
    • cancel on error (usually if lower-level logic detected an error and stopped the request in its track)
    • reject on timeout

The actual buffering/draining to the port is handled by serialport+node stream (the adapter should also do its own buffering logic, dictated by firmware config). There hasn't been any need to override that behavior so far...

Are you still experiencing issues with latest release (1.35.2)? What size network do you have? With about 35 devices (HAOS on NUC), I had 5 timeouts in nearly a month of Z2M running non-stop, all immediately retried with success, so just general network congestion timeouts. (Been keeping track since your post :smile:)

corporategoth commented 9 months ago

Actually, looking further, this MIGHT already be resolved.

The concurrency queue is here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L58

With most commands being send via. the queue here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L435

Which goes here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/driver.ts#L544 to here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L676 to here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L635 - which is where the timer is set before passing through to the UART, here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/uart.ts#L377

Notably, the concurrency is happening BEFORE the timer is set in this sequence, which would make this ticket moot. There are 3 queues involved here:

  1. The queue in the adapter, which is limited by concurrency
  2. The queue in ezsp, which is unbounded and seems to have no real purpose, except possibly serialization (though that is already achieved elsewhere)?
  3. The queue in the uart, which is limited to 1, to presumably prevent simultaneous accessing of the underlying serial device.
corporategoth commented 9 months ago

Are you still experiencing issues with latest release (1.35.2)? What size network do you have? With about 35 devices (HAOS on NUC), I had 5 timeouts in nearly a month of Z2M running non-stop, all immediately retried with success, so just general network congestion timeouts. (Been keeping track since your post 😄)

I have still been on 1.35.1, and this tends to be more an issue on startup with high load, which I have obviated by turning logging to warn (which really slows things down with HA autoconfig). I have 180 devices, which translates to more than 12k home assistant 'entities', we've been tracking issues with this in https://github.com/Koenkk/zigbee2mqtt/issues/20648. We are hoping that with https://github.com/home-assistant/core/pull/109030 in home assistant, Z2M can support device-based discovery and cut down the MQTT traffic too, greatly reducing the load on Z2M.

But yeah, it's likely your fixes from 3 weeks ago did fix this specific issue already.

Nerivec commented 9 months ago

The queue in ezsp, which is unbounded and seems to have no real purpose, except possibly serialization (though that is already achieved elsewhere)?

https://github.com/Koenkk/zigbee-herdsman/blob/43f6ab7338ceceaee1dc8c6a4f6a1919e3bcbfd2/src/adapter/ezsp/driver/ezsp.ts#L335 Not specifying the count defaults to 1: https://github.com/Koenkk/zigbee-herdsman/blob/43f6ab7338ceceaee1dc8c6a4f6a1919e3bcbfd2/src/utils/queue.ts#L11 But yes, I'll look into removing it eventually (it's basically holding just to pass to the UART queue).

Actually, you might get some feedback on your issues after that PR is merged, since it should then support alerting on NCP-reported issues like lack of buffers, network congestion related stuff (detected by the firmware). I'm guessing you're eating those buffers quite fast with that many devices (assuming most are not far-apart reporting).

As for the startup, I'll look if there are any improvements to be made in the number of calls to the NCP, that should help network-wise. There is also room for improvement as far as caching static or near-static values, that would avoid quite a few NCP calls. I'm currently looking into writing a lot more tests for the whole stack though; many changes, many potential mistakes... 😉

Nerivec commented 9 months ago

@corporategoth I believe I read somewhere, your background was C/C++, if you want to take a look at the SDK logic, to see if you spot features that'd improve Z2M's handling... here is the link, if you didn't have it already https://github.com/SiliconLabs/gecko_sdk I'm always up for a bit of brainstorming.

It works on callbacks, so the logic is a bit of a pain to follow (though my background's not C, so I'm probably far slower than you would). And it is split in multiple parts that interact with each other based on activated plugins/device type... which doesn't really help keep code paths separate...

If you need some bedtime reading material...

corporategoth commented 9 months ago

@Nerivec I am primarily a C/C++ developer, this is true. But I know nothing of what you are sending to/from the device, how the device works, or are trying to achieve / what shortfalls there are. The documentation you linked seems quite comprehensive though. As Z2M is in typescript, unless you're having trouble deciphering the docs, C/C++ coding ability won't help you. Reading the actual source is not useful, since what you really want to do is implement the specifications outlined in the docs.

The first doc, the gateway protocol seems fairly straightforward, it describes a wire protocol, which is not language specific. I assume this is already implemented in the EZSP implementation in Z2M.

The second doc, the reference guide, while giving information using C types, should also be fairly straightforward. Since Z2M is not written in C/C++, you can't use it directly, but it seems to just be listing the basic field names, types and order of the 'data' components referenced in the wire protocol. For Z2M purposes though, this should be fairly easy to interpret, since every field is resolving essentially down to a uint8_t (unsigned 8-bit integer), uint16_t (unsigned16-bit integer, etc). I can interpret this document pretty easily, as it's broken down into sections:

  1. The configuration items that are mutable in the device itself, what their minimum/maximum values are and what they're for.
  2. General byte formatting rules for sending/receiving frames with the device - presumably this is already coded into Z2M or you could not communicate at all.
  3. Type aliases, so that the rest of the doc can refer to types by name, rather than primitives (ie. a C typedef more or less)
  4. Structure definitions - which tell you for each message structure the order and exact fields to be sent/received.
  5. Field identifiers for the configuration items mentioned in part 1
  6. Value identifiers, presumably for a bunch of the type aliases in part 3 (think of these like enum values)
  7. And finally the actual messages you can send, including the opcode, and data structure used for the message and it's response.
  8. Although this is all using C++ types/nomenclature, you just need to copy this into typescript to make it work for Z2M.

The third document, the uart host interfacing guide, seems fairly useless, it seems to just be describing 'how to talk to a UART device' ... nothing special.

The fourth document seems more about how to packet capture on the device, or use their specific IDE to send data to/from the device. Not really useful here, since you already have the serial interface code that is talking to the device, and have bidirectional communication.

The last document, while interesting, seems more about how to use their C API. Something useful if you were coding in C, and were using the API they provide directly. But Z2M is not written in C/C++, and doesn't use their C API. What you're actually doing is re-implementing the wire API in Typescript (with the help of docs 1 and 2). So this doc also is useless for Z2M purposes. Though I guess it does describe some of the basics of initialing, sending and receiving, etc to the device - so not completely useless, but stuff that's already implemented in Z2M, otherwise we would have no EZSP driver in Z2M.

In short, I'm not really sure what you want my help on? But I am pretty sure if you wanted me to look at this, it would have to be something done real time, to try and answer your questions live, as a github comment is not a good forum for that :)

corporategoth commented 9 months ago

@Nerivec if you are on the Z2M discord, you should find me there as PreZ - feel free to reach out to me there.

Nerivec commented 9 months ago

Wow, didn't waste any time there! It was more a general suggestion really, in case you saw any interesting "lower-level" features that you thought Z2M could benefit from, the EZSP part anyway. But yes all the base protocol stuff is done, it just needs a bit of cleanup & tweaking, and then the addition/use of newer commands. It does make for exceptionally large (dangerous) commits though; so I'm focusing a bit on automated testing at the moment.

I've gone through more silabs docs than I can count, but I'm one of these people who learns faster from reading code than docs 😄

corporategoth commented 9 months ago

@Nerivec I know next-to nothing about how zigbee works, let alone EZSP itself. So I doubt I could provide much insights of any features not being used that would benefit from Z2M. Especially since I don't know what we're already using, or what anything does.

If I studied it long enough and looked at the current code, I could probably figure out how Zigbee works, how EZSP is implementing it and what features would be good to use. I did see some stuff around memory management/ram usage that would be cool to use to ensure we either use the EZSP to it's full capacity, but also don't run out of memory or whatever. I saw stuff about callbacks, etc. But I don't know really what it's doing right now, let alone what it could be doing if we only used more functions.

corporategoth commented 9 months ago

That said, one thing I would probably have done is built a library specifically for interacting with the EZSP driver that replicates all the reference doc's functionality. So declares all the structures / enums, declares functions for each message that can be sent, taking and returning the appropriate structures, etc. Then wire up all those functions so they send to or receive from the device appropriately. And ensure we can accept unsolicited messages and invoke callbacks for each message we could receive in an unsolicited manner.

Once that's done, I'd package that up as the EZSP implementation for typescript/JS. Keeping in mind there would be basically no business logic in there, just implementing the spec, so implementing all possible communications to/from the EZSP device. THEN I would use that TS library in the EZSP driver in Z2M, using the APIs and structures defined in the EZSP library created above.

ie. I would remove any/all low-level communication code from Z2M, and make Z2M operate via. an API (that you created by implementing the reference docs). THEN worry about how much of the API could you put to use in Z2M (ie. looking at all API functions, and thinking 'Am I using it? Could I use it more? Do I need it?') and code accordingly.

While this would take a little longer, it would be a better solution IMO. And you would then have a JS-based EZSP library that could be used so different projects could interact with EZSP cards, not just Z2M. But JS is not one of the languages I really use often, or well, so doing said implementation is not something I could dive into. I'm actually not a huge fan of implementing low-level functionality in high-level languages actually, even though it's possible. I would love to see Z2M as a C++ app grin - the performance would be off the charts if written well, even if you had a JS front end (UI) tallking to a C++ backend (zigbee/MQTT). But whose got time for that? :)

Nerivec commented 9 months ago

I'm sure the implementation could use some love memory-wise, serialization is a bit "bloated" at the moment in my opinion. But same, that carries a lot of risk until we have more tests.

corporategoth commented 9 months ago

I'm sure the implementation could use some love memory-wise, serialization is a bit "bloated" at the moment in my opinion. But same, that carries a lot of risk until we have more tests.

What are these "tests" you speak of ... never heard of them ;)

github-actions[bot] commented 3 months ago

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 30 days