Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
476 stars 296 forks source link

EZSP does not implement concurrency correctly #770

Closed corporategoth closed 11 months ago

corporategoth commented 11 months ago

Based on simple greps (in zigbee-herdsman/adapter): grep -ri utils_1.Queue *

deconz/adapter/deconzAdapter.js:        this.queue = new utils_1.Queue(concurrent);
ezsp/driver/ezsp.js:        this.queue = new utils_1.Queue();
z-stack/znp/znp.js:        this.queue = new utils_1.Queue();
z-stack/adapter/zStackAdapter.js:        this.queue = new utils_1.Queue(concurrent);
zigate/driver/zigate.js:        this.queue = new utils_1.Queue(1);
zigate/adapter/zigateAdapter.js:        this.queue = new utils_1.Queue(concurrent);

grep -ri concurr *

deconz/adapter/deconzAdapter.js:        const concurrent = this.adapterOptions && this.adapterOptions.concurrent ?
deconz/adapter/deconzAdapter.js:            this.adapterOptions.concurrent : 2;
deconz/adapter/deconzAdapter.js:        this.queue = new utils_1.Queue(concurrent);
tstype.d.ts:    concurrent?: number;
z-stack/adapter/zStackAdapter.js:        const concurrent = this.adapterOptions && this.adapterOptions.concurrent ?
z-stack/adapter/zStackAdapter.js:            this.adapterOptions.concurrent :
z-stack/adapter/zStackAdapter.js:        debug(`Adapter concurrent: ${concurrent}`);
z-stack/adapter/zStackAdapter.js:        this.queue = new utils_1.Queue(concurrent);
zigate/adapter/zigateAdapter.js:        const concurrent = this.adapterOptions && this.adapterOptions.concurrent ?
zigate/adapter/zigateAdapter.js:            this.adapterOptions.concurrent : 2;
zigate/adapter/zigateAdapter.js:        debug.log(`Adapter concurrent: ${concurrent}`);
zigate/adapter/zigateAdapter.js:        this.queue = new utils_1.Queue(concurrent);

The EZSP adapter does not implement any kind of concurrency. This means not only will the concurrency option in the config do nothing, but it will always try to send messages as fast as possible over the EZSP adapter.

This would be a large contributor to the issue I see of, when I am manipulating a lot of devices, some of them never get the memo to do the thing. This could also possibly explain why some other crashes are happening with EZSP - if it's not implementing concurrency, then some messages may never get out to the network, which will result in timeouts (possibly causing the interview crash, crash on startup due to timeouts, and much more).

Concurrency should be implemented for EZSP like it is for other adapters.

kirovilya commented 11 months ago

Your search results indicate that the EZSP adapter does not use the "concurrent" parameter... but this does not mean that the adapter does not work in concurrency mode. For example https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/driver.ts#L98C19

corporategoth commented 11 months ago

OK, so two things. 1) the concurrency SHOULD be able to be changed, as I said, I have a problem on my network where messages get dropped, so I need to reduce the concurrency. So this ticket is still valid.

2) and more importantly, I see you constructing that queue, as you say - but not actually using it.

For example, in ezsp.js, you create an unbounded queue: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L255 And then use it: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L553

However in this driver.js, you create a hard-coded bounded queue: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/ezsp.ts#L553 But it's never used, everything just directly calls through to ezsp to execute, eg. https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/driver/driver.ts#L497

So as I said, you don't have any bounded queues that are actually being used. Everything goes as fast as possible from what I can tell. I might be wrong, JS is not my primary language, and I acknowledge I suck at JS. But from my view, even the single bounded queue you create is unused, and where you DO use a queue, it's unbounded (and not configurable).

corporategoth commented 11 months ago

On further examination, I do see this:

adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        debug('sendZclFrameToEndpointInternal %s:%i/%i (%i,%i,%i)', ieeeAddr, networkAddress, endpoint, responseAttempt, dataRequestAttempt, this.driver.queue.count());
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {
adapter/ezspAdapter.js:        return this.driver.queue.execute(async () => {

The adapter code seems to be using the driver's hard-coded queue ... though I'm not sure how the adapter code fits in with the driver code. But it could mean I'm wrong, and the queue IS used, but only via. the adapter code? It still should be configurable though, so I can tweak it so I don't get dropped signals.

Also, I'm not sure how the concurrency is supposed to work, but docs say that it's supposed to wait for a reply to free up a queue slot. Not sure if that's happening or if this is just doing fire and forget. If it's doing fire and forget (ie. not consuming that queue slot) then the queue does nothing. If it's waiting for the response to free up the queue slot, then cool. I don't know where that code is or how to check it.

I'm mentioning this as I assume you know where to check that, and could do so in seconds to verify that works correctly. All I know is that, if I try to control too many zigbee devices at once (for example a bunch of blinds, lights, etc) then not all the intended actions take place.

corporategoth commented 11 months ago

This seems inconsistent. The unicast command to send a frame to an endpoint doesn't use the queue at all: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L457

(group and all do): https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L486 https://github.com/Koenkk/zigbee-herdsman/blob/master/src/adapter/ezsp/adapter/ezspAdapter.ts#L504

In other words.

  1. Unicast calls don't use a queue at all.
  2. The queue inside the EZSP class is unbounded, meaning effectively there is no queue at all (it doesn't really do much, other than turn the sync? serial call into an async one by introducing a timeout).
  3. This means anything invoking execCommand from the EZSP class, or via. the Driver that doesn't use the driver's queue will have no concurrency protection at all
  4. There are a lot of execCommands originating from the EZSP class itself, NOT via. the driver (ie. that have 0 concurrency):
    driver/ezsp.js:        const result = await this.execCommand("version", { desiredProtocolVersion: version });
    driver/ezsp.js:            await this.execCommand("version", { desiredProtocolVersion: result.protocolVersion });
    driver/ezsp.js:        const result = await this.execCommand("networkInit");
    driver/ezsp.js:        const result = await this.execCommand("leaveNetwork");
    driver/ezsp.js:        const ret = await this.execCommand('setConfigurationValue', { configId: configId, value: value });
    driver/ezsp.js:        const ret = await this.execCommand('getConfigurationValue', { configId: configId });
    driver/ezsp.js:        const ret = await this.execCommand('getMulticastTableEntry', { index: index });
    driver/ezsp.js:        const ret = await this.execCommand('setMulticastTableEntry', { index: index, value: entry });
    driver/ezsp.js:        const ret = await this.execCommand('setInitialSecurityState', { state: entry });
    driver/ezsp.js:        const ret = await this.execCommand('getCurrentSecurityState');
    driver/ezsp.js:        const ret = await this.execCommand('setValue', { valueId, value });
    driver/ezsp.js:        const ret = await this.execCommand('getValue', { valueId });
    driver/ezsp.js:        const ret = await this.execCommand('setPolicy', { policyId: policyId, decisionId: value });
    driver/ezsp.js:    async execCommand(name, params = null) {
    driver/ezsp.js:        const v = await this.execCommand("formNetwork", { parameters: params });
    driver/ezsp.js:        return this.execCommand('sendUnicast', {
    driver/ezsp.js:        return this.execCommand('sendMulticast', {
    driver/ezsp.js:        const res = await this.execCommand('setConcentrator', {
    driver/ezsp.js:            await this.execCommand('setSourceRouteDiscoveryMode', { mode: 1 });
    driver/ezsp.js:        return this.execCommand('sendBroadcast', {
    driver/ezsp.js:            await this.execCommand('nop');
  5. Since the driver doesn't use it's own concurrency queue, and it's reliant on the caller to do so, anything that doesn't use the concurrency queue itself will have no concurrency protection.

Ideally, the driver.js itself should ensure that any API it provides uses the concurrency it provides, and not rely on external callers (ie. ezspAdapter.js) to invoke it's concurrency externally. But that's now how the code is currently designed. And the omission of the Unicast messages to use the driver's concurrency means that any unicast message is not concurrency protected. Ideally, if the driver enforced concurrency instead of relying on the caller to use it, this could not have happened.

kirovilya commented 11 months ago

Yes, the queue created in the driver is used a little higher - in the adapter level. This is an architectural feature: previously there was a queue at the adapter level, but I removed it. Technically, the driver queue should be in the adapter, but I didn't move it. Yes, currently the queue is not used when sending a message directly to the device.

I recently removed the use of a queue when sending messages to the device https://github.com/Koenkk/zigbee-herdsman/pull/727 This removed the delay and waiting for a message to be sent to one device. It became possible to send up to 8 messages in parallel and wait for their execution.

It manifested itself like this: the user pressed the brightness increase button 5 times, but the first message to the device had delivery problems and blocked other messages. After a few seconds, the timeout handler will fire and report an error in the first message. Then the remaining messages in the queue for that device timed out. As a result, no messages were delivered to the device. After the queue is deleted (in the PR link above), all messages are sent. Even if the first message does not reach the device, it will be able to receive the remaining messages. As a result, a timeout error will only occur for those messages that did not reach the device, and not for all of them.

I don't understand your problem a little. There is parallelism when sending messages - present. Waiting for a response when sending messages - present. What do you want to influence? You can try to return the queue when sending messages, you can try to increase/decrease the number of parallel messages and check how it works for you.

corporategoth commented 11 months ago

By removing the queue for the device (in #727), you have effectively allowed unlimited concurrency. The problem is not no parallelization, it's too much!

Meaning that, rather than limiting it to 8 (the hard-coded queue limit) concurrent actions, if I tried to send a command to 80 devices, it would try and send all 80 at once - which will fail for most of them.

I'm not worried so much about concurrency of messages sent to the SAME device, I'm more worried about concurrency across devices. Right now, because of it's unbounded nature, I have to carefully construct my scenes on Home Assistant, and manually add delays, or things won't work.

For example, consider this scene.

And during the holidays, also:

This is 2 group messages (subject to the concurrency), and 15 unicast messages. In the holidays, add another 2 group messages and 8 unicast messages.

So that's 4 group messages and 23 unicast messages. The group messages abide by concurrency limits, but as the unicast messages don't, they not only all try to send at the same time, the group messages ALSO won't get rate limited because the unicast messages don't use the rate limiter (queue).

This essentially means 27 messages will alll go out at once. And a bunch of those will be dropped.

So yeah,

  1. Please re-introduce concurrency effecting Unicast messages
  2. Please allow for that concurrency to be configurable
  3. Please change the concurrency to ensure that nothing (including messages sent by the ezsp.js code itself) can bypass it.

I can't really do this myself, as the way zigbee2mqtt is deployed in my environment, it's in a docker container I can't really edit reliably. The container shuts down on z2m restart (meaning that I can't log into it and edit the files). Additionally, mounting in source code does not work (everything fails). Plus I can't specify a custom image to use, as I'm using TrueNAS for my Z2M, and it HELM deployments, which don't let me specify the image.

Thanks.

corporategoth commented 11 months ago

The TL:DR is, that I should not have to hand-craft my HA scenes to try and ensure I don't get dropped zigbee messages - that's what this concurrency is supposed to BE for. But right now, any significant zigbee scene with the EZSP driver will cause some of the actions to fail because too many messages are sent at once, and some of them are dropped.

MattWestb commented 11 months ago

For group command (its broadcast in the 802.15.4) its one underlying limiting in the 802.15.4 that is protecting form broadcast storm that can killing the network. They is doing it be have max 8 broadcast in 9 seconds (its one FIFO with the broadcast table with timer). If trying sending more broadcast all router SHALL ignoring them silent. Also the old way is sending the broadcast 3 times for being sure its being OK broadcasted (counting as one FIFO position) but good Zigbee 3 routers is listening if if all neighbors is re sending it and if OK then is sending only 1 time or 2 (passive acking).

If tuning this like Z2M have doing in the TI firmware so its possible sending more broadcast then its only the first level of routers getting the frame and they is not relaying it to the second level then its being blocked. By doing so its working if all router have direct RF contact with the coordinator but its braking the network completely and also the router discovery that is using broadcast is being broke or getting strange problems and then normal uni cast commands is not working well then the network cant finding devices in its network.

If broadcast is working good in the network and have good routers source routing is also working great that is making uni cast working better but if not its better disabling source routing in the network then its being more stable at least with larger EZSP networks (advanced ZHA users experience with around 200 devices).

The unicast problem i think you 2 is better digging in it but in the end its need broadcast working OK for working or the network cant finding its devices and its being stucked and blocking the transmitting in the network.

corporategoth commented 11 months ago

@MattWestb I believe this issue is specifically about overwhelming the EZSP device with requests - specifically unicast requests, though I imagine it doesn't matter whether the request is unicast or broadcast.

I'm not saying broadcast is not working correctly. But if you look at my situation, I have an HA scene which results in 23 unicast and 4 broadcast (group) messages. Right now, due to the lack of concurrency limiting in Z2M for the EZSP driver, it will attempt to send all of them back to back (as fast as it can), without any concurrency limiting. THAT will cause the dropped messages.

I don't have any problem communicating to any group or individual node on it's own as a one-off action.

Koenkk commented 11 months ago

Assuming this can be closed now since https://github.com/Koenkk/zigbee-herdsman/pull/771 is merged, thanks @kirovilya !

kirovilya commented 11 months ago

@corporategoth try the dev/edge version of z2m and the ability to configure the "concurrent" parameter. write whether it worked or not