Issues with request queue behavior for sendWhen=active and sendWhen=immediate

What happened?

There are several issues and design shortcomings in the current message queue for devices with defaultSendRequestWhen="active" and defaultSendRequestWhen="immediate". I did some fixes for some of them with PR 676, but, apparently, this only made the remaining issues more visible.

Specifically, I see the following issues:

The queue behavior is inconsistent. After startup, messages are being send directly to the device, until one message gets queued for whatever reason. Only after that, the queue is activated and the following issues become visible (fix in PR 676)
There is no aging mechanism in the message queue. In case of e.g. a lost connection, old requests will be sent as soon as the device is back online, even after days (fix in PR 676)
Setting the default sendWhen behavior based on the checkin interval does not make sense. The checkin interval is totally unrelated to the MAC polling interval that the device is using to poll data from its parent. This is especially the case for setting the default sending behavior to fastpoll. There are devices with a MAC polling every ~7 seconds and a poll interval of e.g. 10 minutes, 30 minutes, etc. For none of these devices does it make sense to hold back data until the checkin, instead of letting the controller send it immediately upon the next MAC poll by the device. So setting fastpoll as the default behavior for these devices does not make any sense, and I would even argue that it does not make sense for any device at all. It can even be dangerous if e.g. an "unlock" command to a smart lock is seemingly not delivered immediately (lock does not react) but then the command is delivered several minutes or hours later
Even sendWhen=active should not be required for many sleepy end devices, if they poll data every 6-7 seconds. And if they don't, they cannot expect a guarantee that messages are delivered. (Only thing that we might consider is changing the NWK_INDIRECT_MSG_TIMEOUT in the Z-Stack firmware from 7 seconds to a couple more seconds. And I am not sure though, if all other coordinators also do have an internal message queue we can rely on.) *Sending a single message from the queue can take up to ~1 minute if the send request fails. During that time, several implicit checkins can happen, and sending the entire queue will be triggered multiple times in parallel. This can lead to a big mess of multiple send attempts in random order, re-sending of already delivered messages, etc.

Since this would require a bigger change, I'd like to put this up for discussion before starting to implement some stuff. I would propose following redesign of the sendWhen logic: Instead of distinguishing the Zigbee internal mechanisms of sending (sendWhen = active/immediate/fastpoll), I think it would be more meaningful for the sender to select a sending "policy", i.e. different levels of how reliably a message is delivered, whether messages must be sent in order, how urgently a message shall be delivered. I would suggest the following 3 sendingPolicies instead of the current sendWhen:

*immediate (~= immediate) "Standard" ZigBee 3.0 compliant way of sending. Sends immediately to the coordinator which uses the device/coordinator data polling mechanism. No guarantee of delivery, if the device polls less often than every 7.xx seconds ( as defined in Zigbee 3.0); no retry on failure. This is reasonable as a default for most devices, no matter if RX when idle is on or off. For example, it should be the default policy for smart locks, where unsuccessful commands should be discarded immediately or at most after several minutes; replaying an "open" command afer several hours may be even dangerous.

*retry-latest (~= active) Same as above, but on failure, stores the latest command in zigbee-herdsman and retries on explicit or implicit checkin. If a new message with 'retry-latest' arrives, the previous message is discarded. Reasonable e.g. for thermostat setpoint commands, where it is reasonable to deliver later, but where a newer command should overwrite the last one instead of replaying the whole history of commands. This is reasonable as a default for devices with known data polling interval >7 seconds or when not losing commands is important.

*bulk (~= fastpoll) If PollCtrl cluster is configured, buffer all messages with 'bulk', enable fastpoll and send them in correct order. Without PollCtrl configured, send immediately. No guarantee when the messages will be delivered, can take up to one pollPeriod. No further retry after one sending attempt (either fastpoll or immediately). This is reasonable to send a e.g. bunch of configuration messages or firmware update. I do not see any case where this is a reasonable default behavior for a device.

Note that the current sendWhen='active' would be closer to a retry-all behavior where not only the latest message remains queued. We could also add this as a policy, but I can't think of a use case where this would be really advantageous to retry-latest. For example for a thermostat, retry-all would replay all temperature set commands in order, while retry-latest would only send the last set command, which seems more reasonable to me.

What did you expect to happen?

No response

How to reproduce it (minimal and precise)

Select a device with defaultSendRequestWhen=active or defaultSendRequestWhen=fastpoll. Send a few commands to make sure the device works properly.
You need to get a failed request into the message queue first. What worked for me with the BTH-RA was the following, this might also work with other devices: Select the device in the frontend, go to the developer console, select any unsupported endpoint, select any attribute and then read this attribute. The queue behavior should now be activated.
Send one command to the device (e.g. lock/unlock, temperature change, depends on your device). The device will seemingly not respond to the command any more (in fact, it is only stored for later sending).
Send a few more commands to the device to fill the queue a little.
Wait a few minutes to hours (depends on the device's checkin interval and on how long it has been since the last checkin. For sendWhen=active devices, this should take a little shorter.
The device will now execute all commands in order. As a bonus for sendWhen=active, it might happen that the entire queue is sent multiple times in parallel. Then the device will execute all commands multiple times, and in random order. Depending on the device, and if you're lucky, this can end up in a never-ending execution of commands.
Repeat as you like. To get the device back to "normal" behavior, restart zigbee2mqtt.

Zigbee2MQTT version

1.30.2

Adapter firmware version

20221226

Adapter

Smartlight.me

Debug log

No response

Thanks for taking the time to make the initial PR, and now for typing all this up so clearly and having another go at this issue!

I'm not deeply familiar with Zigbee-specific queuing behavior, but I am very familiar with robustness in distributed systems on the software side. The proposed new delivery policies make sense to me, broadly speaking, from the perspective of an application on top of zigbee2mqtt although I don't understand all of the nuances for how they would be used in Zigbee devices, especially for things like configuration, binding, etc. But for basic get/set messages, they definitely make sense.

I was curious about some of the terminology, like fastpoll, and found this description of the various polling modes in zigbee networks https://www.qorvo.com/design-hub/blog/demystifying-polling-control-in-zigbee-networks - if you have other resources to point me to, I'd be happy to take a look.

This is potentially a high-impact change (as we found when that PR was reverted) - so thanks for your openness to discussion to get this right! Do you have ideas on how this change could be made safer? Is it possible to divide this problem, eg splitting the "messages don't age out of the queue" issue from larger changes to the sendRequestWhen enum?

Is there a need to separate queue handling behavior for "internal" zigbee messages (eg, from a switch to a light, where both the sender and the receiver are zigbee devices) vs "external" messages sent by a user or application through mqtt (eg a zigbee2mqtt/<device>/set {"state":"ON"} mqtt message). As an application developer, I'd love to be able to specify the send policy for my command message. On the other hand, I wouldn't want to change zigbee internal message handling behavior.

This sounds like a good refactor to me, especially the bulk/fastpoll mode seems to introduce problems with little added practical benefit. I would also like to ask the opinion of @kennylevinsen (since this was introduced in https://github.com/Koenkk/zigbee-herdsman/pull/446).

Hmm. I agree that queuing is suboptimal, but don't quite follow the conclusion.

Setting the default sendWhen behavior based on the checkin interval does not make sense. The checkin interval is totally unrelated to the MAC polling interval that the device is using to poll data from its parent. This is especially the case for setting the default sending behavior to fastpoll. There are devices with an in-spec MAC polling every ~7 seconds and a poll interval of e.g. 10 minutes, 30 minutes, etc.

The mac polling interval for a Sleepy End Device (SED) is not a particularly useful parameter. When not in fast poll, the polling interval defined by LongPollInterval in genPollCtrl can be very long - up to almost 21 days, although 10-30 minutes is more reasonable - and when it finally does poll it will only wake up to see if the parent had any messages queued in that instant and then immediately go back to sleep. This means that to send a message successfully to a SED while relying purely on the long MAC poll timings you would have to send a packet in a particular 7.68 second window every, say, 30 minutes. That's a 0.48% chance of success if the device did not have any messages of its own to send, which would of course prolong the window.

The check-in interval is the time where the SED actively gives its genPollCtrl clients (z2m) a chance to talk to it to push state changes if present, with the ability to ask the device to keep its radio on. Without this, even if we knew the device was on the network we would have no idea if the device would remain long enough to receive a message.

The queuing behavior is suboptimal though. The purpose of checkin is to ask the genPollCtrl servers if there is any current state changes, and not to empty a queue of 30 minutes of stale messages (which is also what causes the referenced "unlock" issue, although more on that later). A thermostat coordinator for example would use checkin to send the current sensor readings and setpoints, and would not waste any time transmitting old data. There are two ways to achieve this "proper" that I can think of:

Rather than treat values as a command in a queue, have Z2M cache the value and send the most recent one on next checkin. It might also try an immediate send, should it be lucky. This overlaps to a certain extend with the retry-latest suggestion when combined with checkin/fastpoll.
Notify MQTT subscribers of checkin and give them a fastpoll window to send the latest state.

I think either would be a significant improvement over any queuing.

t can even be dangerous if e.g. an "unlock" command to a smart lock is seemingly not delivered immediately (lock does not react) but then the command is delivered several minutes or hours later.

The example of the smartlock is a good one where you have a battery powered end device that is presumably not sleepy (curious: is this a real example of a non-sleepy device exposing genPollCtrl or a theoretical one?) , as nobody wants to wait 5 minutes for a lock to unlock. Implementing any queuing on a non-sleepy end device is not useful, and in the case you mention possibly harmful.

On the other hand, it does seem like the exception rather than the rule, with the general case being that you want commands delivered successfully, assuming MQTT clients are not supposed to implement their own error handling, retry loops and manual sendWhen control. In that case, the lock/unlock commands would be the ones with a sendWhen override, rather than everyone else.

Long responses about specs and technicalities aside, as long as my SED's work reliably with long polling and checkin intervals, I'll be happy. I will certainly not be opposed to simplifications.

EDIT: Clarified a bit around retry-latest.

Rather than treat values as a command in a queue, have Z2M cache the value and send the most recent one on next checkin.

It might also be interesting to note that some converters currently have a manual approximation of such behavior where the intended state is stored, e.g.: https://github.com/Koenkk/zigbee-herdsman-converters/blob/40ac912e280a12fe5e8110e97c9b520dcc25f5aa/converters/fromZigbee.js#L7392-L7414

Interaction between this queuing mechanism and optimistic state updates is also important: If one MQTT client issues a series of setpoint changes in a "delivery-guaranteed" mode, another client asking for the current value would likely need to see the last value queued or cached, rather than the last value successfully submitted or retrieved. Otherwise you might end up with a tug-of-war situation and confused clients.

@junosuarez , @Koenkk , @kennylevinsen, first of all thank you a lot for your feedback. I think I'll just go through them in order:

if you have other resources to point me to, I'd be happy to take a look.

@junosuarez I found this slide deck which has a quite good and compact explanation of the mechanisms on the first couple of slides https://raw.githubusercontent.com/wiki/MarkDing/IoT-Developer-Boot-Camp/files/ZB-2019Q4-ZMGC-Training/Zigbee-Sleepy-End-device.pdf

Do you have ideas on how this change could be made safer? Is it possible to divide this problem, eg splitting the "messages don't age out of the queue" issue from larger changes to the sendRequestWhen enum?

I like this suggestion. How about this:

Message aging and avoidance of simultaneous sending would be rather low-risk/high-benefit steps that should come first.
Remove fastpoll as a default behavior and set the default behavior from the long poll interval instead of the checkin interval.
Change active queue to have only the latest value for each command. Refactor semantics from sendWhen to sendingPolicies to reflect new behavior (could also be subsequent steps)
Fix inconsistent behavior that the queue is only activated after first error. This should be implemented only afer all the other issues with the queue are fixed.
As a further step, the queue and the sendingPolicies could be made available for all devices ( sleepy and non-sleepy), possibly with different defaults
Exposing the sendingPolicies to other applications should then come last

Is there a need to separate queue handling behavior for "internal" zigbee messages (eg, from a switch to a light, where both the sender and the receiver are zigbee devices) vs "external" messages sent by a user or application through mqtt (eg a zigbee2mqtt/<device>/set {"state":"ON"} mqtt message). As an application developer, I'd love to be able to specify the send policy for my command message. On the other hand, I wouldn't want to change zigbee internal message handling behavior.

No worries, the policy would not alter "internal" Zigbee behavior between devices, only "external" messages that get sent from mqtt to the ZigBee network.

The mac polling interval for a Sleepy End Device (SED) is not a particularly useful parameter. When not in fast poll, the polling interval defined by LongPollInterval in genPollCtrl can be very long - up to almost 21 days, although 10-30 minutes is more reasonable - and when it finally does poll it will only wake up to see if the parent had any messages queued in that instant and then immediately go back to sleep.

@kennylevinsen This is really interesting, I believe you have sleepy devices that behave totally different than the ones I've seen so far (not very many tbh). I sounds like your devices really go to deep sleep for a very long time - I would be very curious which devices you are using there.

Anyway, I think I was unprecise with my wording. I was referring to the setting of the default behavior in zigbee-herdsman. The default sending behavior for a device is derived based on the checkin interval, not the long poll interval. This is what does not make any sense to me. Particularly, the thermostat https://www.zigbee2mqtt.io/devices/BTH-RA.html has a checkin interval of >10 minutes, and therefore defaultSendRequestWhen=active. The thermostat https://www.zigbee2mqtt.io/devices/014G2461.html has a checkin interval of <10 minutes and therefore defaultSendRequestWhen=fastpoll. Both continuously poll from the parent with a long poll interval of only ~7s. So they would work just fine without any queueing at all (which they do after startup until the first message gets queued). As soon as something is in the message queue, they react only with at least several minutes of delay.

This means that to send a message successfully to a SED while relying purely on the long MAC poll timings you would have to send a packet in a particular 7.68 second window every, say, 30 minutes. That's a 0.48% chance of success if the device did not have any messages of its own to send, which would of course prolong the window.

Disabling the sending on PollControl checkins altogether was not what I intended to suggest. I'd rather suggest to try once (with .48% chance of success in your example), try again on each implicit checkin (with maybe a 10% chance of success), and then try again on the next PollControl checkin (with a 100% chance of success).

The example of the smartlock is a good one where you have a battery powered end device that is presumably not sleepy (curious: is this a real example of a non-sleepy device exposing genPollCtrl or a theoretical one?) , as nobody wants to wait 5 minutes for a lock to unlock. Implementing any queuing on a non-sleepy end device is not useful, and in the case you mention possibly harmful.

I had this "endless switching-back-and-forth" behavior with the thermostats and I can reproduce this with the steps described in the ticket. The lock example refers to this issue https://github.com/Koenkk/zigbee2mqtt/issues/17160, which sounds to me like the exact same behavior.

However it seems to me that we might have a different understanding of "sleepy". So to clarify, "sleepy" for me means "RX off when idle" and "has genPollCtrl cluster" - and therefore has DefaultSendRequestWhen=fastpoll or active and uses the queue logic. It does not imply that the device has to go to sleep for a particularly long time. So the thermostats would be "sleepy" according to my definition (they turn RX off 99% of the time and wake up only every 7 seconds, and they do expose genPollCtrl), but I believe they would not be sleepy according to your definition.

Rather than treat values as a command in a queue, have Z2M cache the value and send the most recent one on next checkin. It might also try an immediate send, should it be lucky. This overlaps to a certain extend with the retry-latest suggestion when combined with checkin/fastpoll.

Notify MQTT subscribers of checkin and give them a fastpoll window to send the latest state.

Both suggestions are much better than just storing the single latest command. I would go for the first one because it does not change the current interface towards applications. I put it to the sequence of steps above as step 3.

On the other hand, it does seem like the exception rather than the rule, with the general case being that you want commands delivered successfully, assuming MQTT clients are not supposed to implement their own error handling, retry loops and manual sendWhen control. In that case, the lock/unlock commands would be the ones with a sendWhen override, rather than everyone else.

I agree, but I would see this as a later step or even out of scope of this ticket because it is also not the way how it is handled for "normal" ZigBee devices today (and it can be quite annoying when a devices stays switched on because it lost connection right when the "switch off" command was sent).

as long as my SED's work reliably with long polling and checkin intervals, I'll be happy.

Fair enough. I'm counting on your feedback on any changes I do ;-)

Long story short, I sense no general disagreement, so I will see if I can implement step 1 over the weekend. That would already give a big improvement over the status quo and we will have a piece of code we can use for any further discussion :-)

However it seems to me that we might have a different understanding of "sleepy". So to clarify, "sleepy" for me means "RX off when idle" and "has genPollCtrl cluster". [...] It does not imply that the device has to go to sleep for a particularly long time. So the thermostats would be "sleepy" according to my definition (they turn RX off 99% of the time and wake up only every 7 seconds, and they do expose genPollCtrl), but I believe they would not be sleepy according to your definition.

They are sleepy by definition, but... With insomnia. A Sleepy End Device is Zigbee End Device whose radio is off by default, only waking up at its long poll interval to submit a MAC data request. Very power restricted devices or devices older than Zigbee 3.0 will not have the poll control cluster.

As for expected poll intervals, SI Labs Zigbee Application Framework documentation^1 refer to a default of 5 minutes in their implementation for long poll, and notes that the end device timeout is the soft limit to look out for. It can sleep for longer, but after that it will have to rejoin next time it wakes, which might be a poor trade-off.

STMicroelectronics sleepy end device development guide^2 also notes the following:

There are two rates of MAC data poll, slow and fast. The slow rate is determined entirely by the device. It can be as slow as once every 30 minutes or even once per day, depending on the application. [...] While in fast poll mode the end device stays awake and polls at the fast poll rate which is a value less than nwkTransactionPersistenceTime (7.68 seconds). Commonly a fast poll value of 7.5 seconds is used.

And has this interesting note about the use of the parent's messaging queue:

The indirect message queue mechanism is not designed to be a general-purpose mechanism for arbitrary network nodes to communicate with sleepy end nodes. Its purpose is to allow sleepy end nodes to associate with their parent during the joining process.

It should be noted that when the poll control cluster is available, we might be able to adjust long poll^3, but for a battery powered device you want it as long as possible as transmitting a MAC data request is expensive for a battery device. On some devices this is low out of the box, but I suspect the intention is that the coordinator increases the value after setup.

The general theme is that these intervals can be expected to be very long, making the odds of delivery outside checkin unlikely. Even when you do send at the right time, your message needs to fit in a queue with a very limited size. Anywho, on to the changes!

Your plan does look sound. The main point of question is with respect to heuristics and fastpoll: Indeed, the current heuristic is definitely broken, but I don't necessairly think it would be right to change behavior based on LongPollInterval either - whether an "unlock" command should be queued and retried isn't a function of whether the smartlock polls every 5, 10 or 15 seconds.

I imagine it would be nicer if we could avoid the heuristics and just always have the same behavior, with e.g. aging being what controls time-sensitive commands.

Regardless, good luck and keep me (loosely) in the loop. :)

First, thank you all for your comments, feedback, and further reading. Here is a short update on the rework. The first step is already merged.

Based on your feedback, I have now adapted and refined the further steps a little. @kennylevinsen I like your suggestion to use message aging as the primary way to control time-critical behavior and have an otherwise consistent sending behavior. This can also be very nicely combined with a more sophisticated message caching instead of a plain queue. The idea is to give the sender full control on how the message shall be delivered via a combination of the send policy and the request timeout, but have reasonable defaults for both of them, so that the sender does not have to specify anything in most practical cases. The only reasonable overrides I could think of that a sender would actually want would be bulk for e.g. a complex set of configuration messages, or a reduced / zero timeout if we don't want a message to be delivered later, e.g. a "door unlock" command.

The proper default send policy / caching behavior depends on the type of the Zcl message, so I had to extend the list of policies a little, see below. E.g. an "off" command should in general overwrite a previous "on" command, but if the second command is e.g. part of a larger "write undivided" command, it would be better to not remove the first one. Also, response messages should be treated differntly and should always be sent immediately and should not be queued at all. The reasonable defaults for all Zcl messages are also listed below.

Implementation steps:

Implement message aging for 'fastpoll' & 'active'. Prevent simultaneous send attempts. Always send response messages immediately and without queueing. ✅
Refactor to support filtering of requests while still keeping current queue and request behavior as-is. ✅
Introduce sendPolicies (see below) and implement request filtering in the message queue for 'fastpoll' & 'active'. All requests are filtered by default based on their default policy. If this is explicitly not desired, 'bulk' policy can be used. ✅
Change send behavior for 'fastpoll' and 'active' to always try to send a new message once, no matter if something is already queued or not. This relaxes the current strict message ordering a bit in favor of a consistent send behavior when messages are in the queue and when not. The relaxation is ok because message filtering ensures that older competing requests will be removed from the queue before sending (e.g. if a new "switch on" request arrives, all old "switch on/off" requests in the queue would be removed before sending). Also, if the end device really has its Rx off, the message will be queued and the order will still be preserved. And if immediate sending does succeed, then the other messages in the queue are likely stuck there for other reasons than the end device being asleep and should not block other messages from being sent. ✅
Remove failed messages only on timeout or when overwritten by a new request, not after a failed implicitCheckin send attempt. I.e. on each implicit checkin, all previously failed messages are attempted to be sent once until the message is aged out. ✅
Use same behavior for 'fastpoll': attempt to send new messages immediately, queue on failure, retry on every implicit checkin (a 'real' sleepy device will not be awake and send anything between poll control checkins anyway, and if it does, it does not hurt trying). Finally send all messages in fastpoll after the next poll control checkin. (if anything fails to be sent there too, it will be left in the queue and retried on implicit checkins (if any), but it will be removed before the next checkinInterval).
Deprecate now-obsolete sendWhen and remove it from endpoint.ts. ✅

Send request policies:

'bulk':             Message must be sent together with other messages in the correct sequence.
                    No immediate delivery required.
'immediate':        Request shall be sent immediately and not be kept for later retries (e.g. response message).
'keep-payload':     Request shall be sent as soon as possible.
                    If immediate delivery fails, the exact same payload is only sent once, even if there were
                    multiple requests.
'keep-command':     Request shall be sent as soon as possible.
                    If immediate delivery fails, only the latest command for each command ID is kept for delivery.
'keep-cmd-undiv':   Request shall be sent as soon as possible.
                    If immediate delivery fails, only the latest undivided set of commands is sent for each unique
                    set of command IDs.

Default policies:

const defaultSendPolicy: {[key: number]: SendPolicy} = {
    0x00: 'keep-payload',   // Read Attributes
    0x01: 'immediate',      // Read Attributes Response
    0x02: 'keep-command',   // Write Attributes
    0x03: 'keep-cmd-undiv', // Write Attributes Undivided
    0x04: 'immediate',      // Write Attributes Response
    0x05: 'keep-command',   // Write Attributes No Response
    0x06: 'keep-payload',   // Configure Reporting
    0x07: 'immediate',      // Configure Reporting Response
    0x08: 'keep-payload',   // Read Reporting Configuration
    0x09: 'immediate',      // Read Reporting Configuration Response
    0x0a: 'keep-payload',   // Report attributes
    0x0b: 'immediate',      // Default Response
    0x0c: 'keep-payload',   // Discover Attributes
    0x0d: 'immediate',      // Discover Attributes Response
    0x0e: 'keep-payload',   // Read Attributes Structured
    0x0f: 'keep-payload',   // Write Attributes Structured
    0x10: 'immediate',      // Write Attributes Structured response
    0x11: 'keep-payload',   // Discover Commands Received
    0x12: 'immediate',      // Discover Commands Received Response
    0x13: 'keep-payload',   // Discover Commands Generated
    0x14: 'immediate',      // Discover Commands Generated Response
    0x15: 'keep-payload',   // Discover Attributes Extended
    0x16: 'immediate',      // Discover Attributes Extended Response
};

Sounds good @slugzero, many thanks again for pushing this!

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

Not stale. Next PR will follow soon

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

Koenkk / zigbee2mqtt