Request timing for Sleepy End Devices

kennylevinsen commented 2 years ago

This summarizes information from https://github.com/Koenkk/zigbee-herdsman-converters/pull/3240

zigbee-herdsman currently treats Sleepy End Devices (SEDs) the same as any other device, assuming the device will always be responsive and relying on some simple retry logic. To fix certain SEDs that have shown problems with this behavior, a sendWhenActive option is used in a few places to delay a send attempt until the next time data is received, which seems to work but with caveats.

Instead, we should have some general handling of Sleepy End Devices that can be enabled for all such end devices.

Overview

Polling intervals

Sleepy End Devices (SEDs) turn off their radio, making them unable to receive messages most of the time. They instead poll their parent for messages using a MAC data request. The parent is only required to hold a single message for up to 7.68 seconds.

SEDs have two polling intervals:

Long polling, defaulting to once every 5 seconds. This polling mode is the normal mode of operation.
Short polling, defaulting to once every 0.5 seconds. Short polling is intended for whenever the device wishes to be responsive, or when "fast poll" is requested through the Poll Control Cluster.

Long polling for battery-powered devices that do not need reliable Rx at all times may be much longer than the 7.68 second TTL for messages queued with their parent - which means that there is no guarantee that a request sent at an arbitrary point in time to a sleepy end device will be successfully delivered.
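
To put a rough number on that, here is a small sketch. It assumes the request is sent at a uniformly random point in the device's poll cycle and that the parent holds it for exactly 7.68 seconds (real parents may hold longer or drop sooner, and only one message is guaranteed to be held). The function name is made up for illustration.

```typescript
// Parent hold time for a queued message, from the text above (7.68 s).
const PARENT_HOLD_MS = 7680;

/**
 * Fraction of send times for which a device polling every `pollIntervalMs`
 * fetches the message before the parent is allowed to drop it.
 * Assumption: send times are uniformly distributed over the poll cycle.
 */
function deliveryFraction(pollIntervalMs: number): number {
    if (pollIntervalMs <= 0) {
        throw new Error('poll interval must be positive');
    }
    return Math.min(1, PARENT_HOLD_MS / pollIntervalMs);
}

console.log(deliveryFraction(5000));  // default long poll: 1
console.log(deliveryFraction(60000)); // one poll per minute: 0.128
```

So with the default 5 second long poll a single request should get through, while a device polling once a minute would miss most requests sent at arbitrary times.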

We are not directly informed of what polling mode a device is in, nor of when the device polls.

The Sleepy End Device must poll faster than the End Device Poll Timeout in order to not be aged out of the network. This timeout is communicated through an End Device Timeout Request.
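
For reference, the timeout in that request is an enumeration rather than a raw duration. The sketch below decodes it under my understanding of the encoding (0 means 10 seconds, n from 1 to 14 means 2^n minutes); the packet capture later in this thread shows value 8 as 256 min, which matches. Please check it against the spec before relying on it; the function name is hypothetical.

```typescript
/**
 * Convert a Requested Timeout Enumeration value to seconds.
 * Assumed encoding: 0 -> 10 seconds, n in 1..14 -> 2^n minutes.
 */
function endDeviceTimeoutSeconds(enumValue: number): number {
    const isInteger = Math.floor(enumValue) === enumValue;
    if (!isInteger || enumValue < 0 || enumValue > 14) {
        throw new Error('timeout enumeration must be an integer in 0..14');
    }
    return enumValue === 0 ? 10 : Math.pow(2, enumValue) * 60;
}

console.log(endDeviceTimeoutSeconds(8) / 60); // 256 (minutes)
```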

Poll control cluster

The Poll Control Cluster also provides us with the following:

A check-in command, sent by default once every 1 hour. This should be longer than the long poll interval.

When a client (which in this case would be zigbee2mqtt) binds to this cluster, we should receive this command at the specified interval, and should respond with whether we want the server to enter fast poll, and for how long. We also have the ability to cancel a fast poll early with a fast poll stop command. While the server is in fast poll, it will use the short polling interval.
The ability to configure and query check-in, long poll and short poll intervals, as well as minimums and timeouts.
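
As a sketch of the client side of a check-in: decide whether to ask for fast poll based on how many requests are waiting. The payload field names are modeled on the check-in response fields as I understand them (a boolean plus a timeout in quarter-seconds), and the per-request time budget is an arbitrary assumption, not something zigbee-herdsman defines.

```typescript
// Shape of a check-in response; field names are illustrative.
interface CheckinResponse {
    startFastPolling: boolean;
    fastPollTimeout: number; // quarter-seconds
}

const QUARTER_SECONDS_PER_SECOND = 4;

/**
 * Ask for fast poll only when something is pending, and only for as long
 * as the pending requests are expected to need (capped at the uint16 max).
 */
function buildCheckinResponse(pendingRequests: number, secondsPerRequest: number): CheckinResponse {
    if (pendingRequests <= 0) {
        return {startFastPolling: false, fastPollTimeout: 0};
    }
    const quarterSeconds = Math.ceil(pendingRequests * secondsPerRequest * QUARTER_SECONDS_PER_SECOND);
    return {startFastPolling: true, fastPollTimeout: Math.min(quarterSeconds, 0xffff)};
}

console.log(buildCheckinResponse(3, 2)); // { startFastPolling: true, fastPollTimeout: 24 }
```

Declining fast poll when nothing is pending lets the device go straight back to sleep.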

The problem

Sleepy End Devices should sleep as long as possible to save battery, and are for all intents and purposes allowed to sleep as long as they want. No matter what a sleeping device does, there is a risk of losing messages: if the poll interval is shorter than 7.68 seconds, the device may successfully receive no more than 1 message per long poll (the parent only has to hold a single message), while if the poll interval is longer, the device might not even receive that.

This means that we can't really send messages to a sleepy end device and assume that it'll work out, even if the long poll interval is short. We can assume that a device is likely to be in short poll for a little while after we receive a message from it, but there are no guarantees there.

The check-in command is designed specifically to allow us to occasionally take control of polling on the end device, giving ourselves however much time we want to send pending messages and chat with the device. This is not currently handled by us.

Improvements

Configure and use checkin for pending messages.

This has the benefit of giving us full control, and is specifically designed for this purpose. However, the default check-in interval is way too long - I wouldn't want my thermostats to change setpoint an hour later - so if used for this it would probably need to be lowered.

UPDATE: Upon properly binding genPollCtrl, it does seem as if these get sent regularly, much more often than the configured interval.

Improve sendWhenActive's pending behavior.

Currently, sendWhenActive unconditionally queues messages for next time we receive messages from the device, even if the device just sent us one. Instead, we could change the behavior so that we always try immediately, but queue as pending request if the attempt failed with an error that could indicate a sleeping device.

I believe this logic ends up being more reliable and more responsive than my previous suggestion of extending the "send when active" window to 500 ms or similar. It also lowers risk when we enable this for all sleepy end devices, as the new behavior only triggers in the event that normal behavior failed.

If needed, we could consider retrying as a pending request more than once.
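
A minimal sketch of that "try now, queue on failure" idea. It is not the zigbee-herdsman implementation: the class name and the synchronous `trySend` callback (true on success) are assumptions made to keep the example small, whereas a real send is asynchronous and fails with an error.

```typescript
// Hypothetical send callback: returns true when the device acknowledged.
type TrySend = (payload: string) => boolean;

class RetryWhenActiveSender {
    private pending: string[] = [];
    private trySend: TrySend;

    constructor(trySend: TrySend) {
        this.trySend = trySend;
    }

    /** Always attempt immediately; only queue when the attempt fails. */
    send(payload: string): void {
        if (!this.trySend(payload)) {
            this.pending.push(payload);
        }
    }

    /** Call when a message arrives from the device; retries each queued request once. */
    onDeviceMessage(): string[] {
        const toRetry = this.pending;
        this.pending = [];
        return toRetry.filter((payload) => !this.trySend(payload)); // still failed
    }

    pendingCount(): number {
        return this.pending.length;
    }
}

let radioOn = false;
const delivered: string[] = [];
const sender = new RetryWhenActiveSender((p) => {
    if (radioOn) delivered.push(p);
    return radioOn;
});
sender.send('setpoint=21'); // device asleep: queued
radioOn = true;
sender.send('setpoint=22'); // device awake: goes out immediately
sender.onDeviceMessage();   // queued request is retried now
console.log(delivered);     // [ 'setpoint=22', 'setpoint=21' ]
```

Note that the retried request can arrive after a newer one, so ordering is something this approach has to deal with.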

Enable these behaviors for Sleepy End Devices by default.

Even if some devices may seem to work okay, this handling applies to all Sleepy End Devices, and should not be detrimental to any.

Alternatives

The problem could be pushed up the stack to the other mqtt bus members, but that seems like too much of a leaky abstraction to me.

Can't think of others - up for discussion.

kennylevinsen commented 2 years ago

I am implementing some experiments here: https://github.com/kennylevinsen/zigbee-herdsman/tree/pollcontrol

Implemented at the time of writing:

Check-in handling, including decline and early termination of fast-poll when it isn't necessary (current sendPendingRequest changed to implicitCheckin)
A "retry-when-active" style reimplementation of sendWhenActive, which just tries to send a request immediately, and queues it if it failed.
Automatic enable of these behaviors for battery-powered end devices

Current TODO:

[x] Figure out if explicit check-in should be the only "pending request flush" driver when present as is currently the case.
[x] Figure out if the handling of explicit check-in is in the right place, and possibly how to silence "missing converter" errors when the check-in bubbles up.
[x] Ordering is a little broken for "retry-when-active" - we might want to discuss how it should work. Maybe we should always put the message in the queue and try flushing from head.
[ ] Consider how overwriting the sleepy state should work.
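
On the ordering item above, one possible shape for "always put the message in the queue and try flushing from head" is sketched below. This is only an illustration under the same simplifying assumption as before (a synchronous, hypothetical `trySend` callback returning true on success), not what the branch does.

```typescript
class OrderedPendingQueue {
    private queue: string[] = [];
    private trySend: (payload: string) => boolean;

    constructor(trySend: (payload: string) => boolean) {
        this.trySend = trySend;
    }

    /** Enqueue first, then try to drain, so new requests never overtake older ones. */
    send(payload: string): void {
        this.queue.push(payload);
        this.flush();
    }

    /** Send from the head and stop at the first failure; returns how many went out. */
    flush(): number {
        let sent = 0;
        while (this.queue.length > 0) {
            const head = this.queue[0];
            if (head === undefined || !this.trySend(head)) {
                break;
            }
            this.queue.shift();
            sent++;
        }
        return sent;
    }

    size(): number {
        return this.queue.length;
    }
}

let radioUp = false;
const log: string[] = [];
const queue = new OrderedPendingQueue((p) => {
    if (radioUp) log.push(p);
    return radioUp;
});
queue.send('setpoint=21'); // asleep: stays queued
radioUp = true;
queue.send('setpoint=22'); // awake: drains both, oldest first
console.log(log);          // [ 'setpoint=21', 'setpoint=22' ]
```

The cost is that one request that keeps failing blocks everything behind it, so some retry limit or expiry would likely be needed.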

kennylevinsen commented 2 years ago

Through https://github.com/Koenkk/zigbee-herdsman/pull/446 and https://github.com/Koenkk/zigbee-herdsman/pull/447 we now have check-in support (must be enabled through device.configurePollControl on device configure) and a more opportunistic sendWhenActive behavior.

Still missing a "sleepy" device/endpoint state and possibly a heuristic for default enable.

sjorge commented 2 years ago

Any example of where configurePollControl is used?

I have two Develco devices that seem to have genPollCtrl, so it would be interesting to see if I can configure that for them.

Edit: it looks like it is a function on the device, and it's hardcoding endpoint 1, or am I reading this wrong? https://github.com/Koenkk/zigbee-herdsman/blob/3ee4af3c13a2a4d186c6d79d451c1c314a4e3d0f/src/controller/model/device.ts#L198-L207

For the Develco AirQuality and Heat sensors, they live on different endpoints...

kennylevinsen commented 2 years ago

That's very interesting. The hardcoding was an assumption that seemed sound but clearly failed. Adding a source endpoint argument would be needed for your device. I'll write an update.

I was planning to use this for Danfoss thermostats once a "sleepy device" mode that auto-enables sendWhenActive for all sends is in.

EDIT: Had erroneously said the assumption was from reporting.bind, I misremembered.

sjorge commented 2 years ago

@Koenkk I think we could auto discover if genPollCtrl is available from the data stored in database.db right? If so, we could be smart about it and do the auto enable magic if the endpoint has the cluster, otherwise do the fallback?

Koenkk commented 2 years ago

@sjorge that sounds like a good idea, maybe this can be done in the device.interview()? @kennylevinsen for sure the configurePollControl has to be moved to endpoint and not device.

sjorge commented 2 years ago

Oh, that's a good idea. When discovering the clusters during interview, we can set the field/config for every endpoint that has the genPollCtrl server cluster (I've not seen it on a client endpoint, and I'm not even sure the ZCL allows that?)

kennylevinsen commented 2 years ago

> @Koenkk I think we could auto discover if genPollCtrl is available from the data stored in database.db right? If so, we could be smart about it and do the auto enable magic if the endpoint has the cluster, otherwise do the fallback?

> @sjorge that sounds like a good idea, maybe this can be done in the device.interview()?

Just to understand correctly, we want to set up the binding directly from within device.interview() if we have the poll control cluster? I believe we lack the coordinator endpoint to do that. I don't see much reason to not enable poll control when present, so if we can get the coordinator endpoint it should be fine.

If it's just storing if and where it was present, we could just as well loop through the endpoints and call supportsInputCluster on all of them to find the right endpoint during configurePollControl.
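
Roughly like this, as a sketch. `EndpointLike` is a stand-in interface for just the parts of an endpoint used here, and `fakeEndpoint` is a hypothetical helper for the example; neither is part of zigbee-herdsman.

```typescript
// Minimal stand-in for the parts of an endpoint this sketch needs.
interface EndpointLike {
    ID: number;
    supportsInputCluster(clusterName: string): boolean;
}

/** Return every endpoint that exposes the poll control cluster as an input (server) cluster. */
function findPollControlEndpoints(endpoints: EndpointLike[]): EndpointLike[] {
    return endpoints.filter((ep) => ep.supportsInputCluster('genPollCtrl'));
}

// Hypothetical helper to build example endpoints from a list of cluster names.
function fakeEndpoint(id: number, inputClusters: string[]): EndpointLike {
    return {
        ID: id,
        supportsInputCluster: (name: string) => inputClusters.indexOf(name) !== -1,
    };
}

const found = findPollControlEndpoints([
    fakeEndpoint(1, ['genBasic']),
    fakeEndpoint(38, ['genBasic', 'genPollCtrl']),
]);
console.log(found.map((ep) => ep.ID)); // [ 38 ]
```

Returning all matches rather than the first avoids baking in another single-endpoint assumption.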

> @kennylevinsen for sure the configurePollControl has to be moved to endpoint and not device.

Would we still set the device-level useLegacyCheckin from there? If not, it's just the bind/unbind.

MattWestb commented 2 years ago

Just because a (perhaps sleepy) end device has a poll control cluster on an endpoint does not mean that it is working. If the device (a Zigbee 3 device) has a Zigbee 3 parent, it sends an "End Device Timeout Request" and its parent replies with success if the request was OK. If the parent is not a Zigbee 3 device or does not have the poll control cluster implemented, this does not work and the end device does not do check-ins. Also, if the device jumps to a new parent, it may or may not work, but in the end the end device always makes the request to set up its parent for poll control.

ZigBee Network Layer Command, Dst: 0x0000, Src: 0xf28e
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0x0000
    Source: 0xf28e
    Radius: 1
    Sequence Number: 85
    Destination: Ember_ff:fe:10:0a:49 (00:0d:6f:ff:fe:10:0a:49)
    Extended Source: SiliconL_ff:fe:5d:59:e8 (04:cd:15:ff:fe:5d:59:e8)
    ZigBee Security Header
    Command Frame: End Device Timeout Request
        Command Identifier: End Device Timeout Request (0x0b)
        Requested Timeout Enumeration: 256 min (8)
        End Device Configuration: 0x00

ZigBee Network Layer Command, Dst: 0xf28e, Src: 0x0000
    Frame Control Field: 0x1a09, Frame Type: Command, Discover Route: Suppress, Security, Destination, Extended Source Command
    Destination: 0xf28e
    Source: 0x0000
    Radius: 1
    Sequence Number: 6
    Destination: SiliconL_ff:fe:5d:59:e8 (04:cd:15:ff:fe:5d:59:e8)
    Extended Source: Ember_ff:fe:10:0a:49 (00:0d:6f:ff:fe:10:0a:49)
    ZigBee Security Header
    Command Frame: End Device Timeout Response, Success
        Command Identifier: End Device Timeout Response (0x0c)
        Status: Success (0)
        Parent Information: 0x03, MAC Data Poll Keepalive, End Device Timeout Request Keepalive

I think the best approach is to reset the poll control function after getting a Device Announcement, and to start using poll control again once the device makes its first poll control check-in from its new parent; otherwise it is very likely not to work correctly in all scenarios.

kennylevinsen commented 2 years ago

> If the device (a Zigbee 3 device) has a Zigbee 3 parent, it sends an "End Device Timeout Request" and its parent replies with success if the request was OK. If the parent is not a Zigbee 3 device or does not have the poll control cluster implemented, this does not work and the end device does not do check-ins.

Hmm, why would the parent matter in poll control? Poll control is communicated solely between the server (end device) and the client (in this case zigbee2mqtt), and the parent merely forwards the messages.

Even if it jumps to a new parent, the binding between server and client should remain operational. That is, unless I misunderstood something?

I believe a failed end device timeout request will only lead to the end device timeout remaining unchanged, and thus possibly being too long or too short, leading to a high rate of rejoins if the device sleeps too long, or too slow an age-out for a portable device.

sjorge commented 2 years ago

I don't think it matters, as the end device does the check-in with, in our case, the coordinator. The other direction is what I think Matt is talking about? When the coordinator sends a message, not all routers will hold it, but that is what Poll Control will fix, right?

MattWestb commented 2 years ago

I think the best way is to pair an SED with a Zigbee 3 router and verify that it is doing check-ins, then power off the router, let the device jump to a non-Zigbee 3 router as its parent, and see if the end device still sends check-ins to the coordinator.

PS: The new IKEA lights (third generation) don't have poll control implemented, but the first (ZLL) and second generations do.

In Zigbee 3, routers should only hold messages for their children for a very short time and let commands be sent to them through the check-in mechanism. It may be possible to tweak the poll control parameters (if the SED and its parent accept it), but if you do it wrong, the SED rejoins every time it polls its parent and you end up with drained batteries.

kennylevinsen commented 2 years ago

Hmm, so if a device fails to negotiate a long enough timeout and only wakes up for check-in, then automatically binding poll control would increase the rejoin rate.

I don't think that problem statement is strictly valid: the check-in interval must be longer than the long poll interval, so the device will do MAC data polls more often than it sends check-ins. I assume that the MAC data polls will either keep the child alive or be where the rejoin happens, rather than the check-in being the "point of failure".

In this case, the only solutions would be to configure the poll control cluster with a shorter interval, hack around it with a shorter reporting interval, or replace the router.
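
To make the interval relationship concrete, here is a sketch that checks it and counts how many long polls fit between two check-ins. It assumes all three intervals are expressed in quarter-seconds and treats equal values as acceptable, both of which should be checked against the spec; the field and function names are illustrative, and any special values are ignored.

```typescript
// Poll control intervals, all assumed to be in quarter-seconds.
interface PollControlIntervals {
    checkinInterval: number;
    longPollInterval: number;
    shortPollInterval: number;
}

// Defaults mentioned earlier in this issue: 1 hour, 5 s, 0.5 s.
const DEFAULT_INTERVALS: PollControlIntervals = {
    checkinInterval: 3600 * 4,
    longPollInterval: 5 * 4,
    shortPollInterval: 2,
};

/** List violations of check-in >= long poll >= short poll; empty means consistent. */
function intervalProblems(cfg: PollControlIntervals): string[] {
    const problems: string[] = [];
    if (cfg.shortPollInterval <= 0) {
        problems.push('short poll interval must be positive');
    }
    if (cfg.longPollInterval < cfg.shortPollInterval) {
        problems.push('long poll interval is shorter than short poll interval');
    }
    if (cfg.checkinInterval < cfg.longPollInterval) {
        problems.push('check-in interval is shorter than long poll interval');
    }
    return problems;
}

/** Number of long polls that happen between two check-ins. */
function longPollsPerCheckin(cfg: PollControlIntervals): number {
    return Math.floor(cfg.checkinInterval / cfg.longPollInterval);
}

console.log(intervalProblems(DEFAULT_INTERVALS));    // []
console.log(longPollsPerCheckin(DEFAULT_INTERVALS)); // 720
```

With the defaults, the device does hundreds of MAC data polls per check-in, so a check-in is unlikely to be the first thing that trips a timeout.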

Koenkk commented 2 years ago

@kennylevinsen

> If it's just storing if and where it was present, we could just as well loop through the endpoints and call supportsInputCluster on all of them to find the right endpoint during configurePollControl.

See https://github.com/Koenkk/zigbee-herdsman/blob/81502a27653cf49991ab2011337843790b47d591/src/controller/model/device.ts#L641 for how to retrieve the endpoint: const coordinatorEndpoint = coordinator.endpoints[0]

kennylevinsen commented 2 years ago

In https://github.com/Koenkk/zigbee-herdsman/pull/454 I get rid of configurePollControl and do all of this automatically as part of the interview process. Take a look and see if this is what you had in mind. :)

We can also use this logic to set the default sendWhenActive behavior.

kennylevinsen commented 2 years ago

For genPollCtrl capable devices, https://github.com/Koenkk/zigbee-herdsman/pull/453 is the last puzzle piece that will cause "sleepy" behavior to auto-enable.

For other devices, there will for now be the option to set device.defaultSendWhenActive = true within configure(). However, we may still be interested in a fallback heuristic for these devices, especially as the new sendWhenActive behavior no longer has a negative side effect if immediate send works.

sjorge commented 2 years ago

So for example we should set defaultSendWhenActive for the very tricky-to-pair SONOFF devices? cc: @Koenkk

Koenkk commented 2 years ago

I believe yes, @kennylevinsen can you confirm?

kennylevinsen commented 2 years ago

It should work for the devices, but not sure if it will help pairing. If set in configure manually, it won't take effect until after interview.

sjorge commented 2 years ago

> It should work for the devices, but not sure if it will help pairing. If set in configure manually, it won't take effect until after interview.

Hmm yes, true, if we set it in configure() it will only take effect afterwards... would having this as a meta flag be better? Although that might have the same problem, as we need to do some interviewing before we can match the device.

kennylevinsen commented 2 years ago

We could consider just making interview use sendWhenActive, as it should be harmless for non-sleepy devices, but right now the tests time out if we do so. I have not looked into why (maybe the mock needs to be adjusted, or the manual retries need to be tuned a bit?), but it should be fixable.

Endpoints also currently do not use the sendRequest mechanism for binding, which means that they're not yet affected by sendWhenActive. That might also help.


Koenkk / zigbee-herdsman