dresden-elektronik / deconz-rest-plugin-v2

deCONZ REST-API version 2 development repository.
BSD 3-Clause "New" or "Revised" License
17 stars 0 forks source link

Bindings and Reporting #6

Open manup opened 3 years ago

manup commented 3 years ago

This is a discussion issue on how bindings and ZCL attribute reporting could be managed in the new Device code.

Excerpt from a IKEA GU10 color temperature DDF file:

{
    ...
    "reporting": [
        ["report/5", 1, "0x0006", "0x0000", 5, 1800],
        ["report/6", 1, "0x0008", "0x0000", 5, 1800, 1],
        ["report/5", 1, "0x0300", "0x0008", 1, 1800],
        ["report/6", 1, "0x0300", "0x0003", 5, 1795, 10],
        ["report/6", 1, "0x0300", "0x0004", 5, 1795, 10],
        ["report/6", 1, "0x0300", "0x0007", 5, 1800, 1],
        ["report/5", 1, "0x0300", "0x0008", 1, 1800]
    ],
    "bindings": [
        ["bind/2", 1, "0x0006"],
        ["bind/2", 1, "0x0008"],
        ["bind/2", 1, "0x0300"]
    ]
}

Here the corresponding clusters and attributes are specified which should be configured. The hard coded values should be properly specified as string constants e.g. "$BASIC_CLUSTER_ID" instead of 0x0000 for better readability and automatic validation checks.

The specified functions could be read as:

["report/5", Endpoint, ClusterId, AttributeId, Min Report Interval, Max Report Interval]
["report/6", Endpoint, ClusterId, AttributeId, Min Report Interval, Max Report Interval, Reportable Change]
["bind/2", Endpoint, ClusterId]

The given order of bindings and reports is relevant and followed in the process, for some devices it is crucial to configure them so they wake up from time to time, for example via Power Configuration Cluster.

Each Device has a state machine and after initial setup sub-state machines running in parallel doing various tasks. One of these is the Binding sub-state machine which gets activated once the DDF is loaded for a device.

Binding State Machine

Diagram source: device_bindings.puml

(Quick and dirty draft of the Binding state machine with some details like timers omitted.)

The whole process is fault tolerant and self-healing, it is used at initial setup and afterwards runs in the background. For mains powered devices and light sleepers this is pretty easy, deep sleepers are covered as well since the state machine is also triggered on the awake event.

ebaauw commented 3 years ago

What the /5 or /2 after the report or bind? Some sort of priority (cf. fsck)?

Not sure multiple state engines running in parallel is a good idea for deep sleepers. Not sure how many messages the parent is willing (supposed) to keep, and the core/device firmware won't queue too many for the same device.

I recognise some devices needing the correct setup fast, or they won't wake on shake or button press. Typically this is the setup for reporting motion, vibration, or button press; not for battery. I haven't kept a list of which devices, but typically I will have made a note in the issue for supporting the device.

SwoopX commented 3 years ago

Also had some thoughts on this lately. So what I recognized over time (and there was an issue created around this just recently) is that the reporting configuration is somehow applied "continuously". I understand this is the self-healing intention, and good by the way, which in most cases is totally fine. However, if one chose to individually configure the reporting other than the hardcoded values, this could become aching, especially if you raised the values. I guess that can and would be covered by actually reading the current config, as this is currently missing.

There's also devices out there where bindings do not need or must not go hand in hand with reporting. Good examples in this regard are our beloved friends from China, Xiaomi (and probably others). Even with the ZB3.0 Mija devices, the bindings can be done (and do not seem to cause any harm) but are ignored or not necessary for the reporting to work. Speaking of the reporting config, that is also totally ignored, so they do their own thing. Considering this, any mechanics relying or following after the binding/reporting should also be designed to not mandatorily rely on it but flexible enough to just work with what's coming in automagically. Maybe that must/should be relfected in the DDF?

Regarding the swiftness of binding and reporting, I made the same observation as Erik. The Develco motion sensors I added recently barely got any configuration at all. I would have considered them as light sleepers, also due to the fact that they've a poll control cluster. It's probably related to the "unfinished" support of it, but they react either just when they just send anything (which can be some time) or when put into a special mode physically. So those should have obtained their config as fast as possible during pairing with some sort of rapid fire.

The very same device would also have like 10 attributes to configure (rather complex device), so having like 10 of them would result in 100 parallel state machines. No clue if that might be too much. Or would that be just one per device (it would create 5 or 6 sensors)?

Lastly, I know this goes into API direction, but I'd love to be able to apply (and store) individual attribute configs, which override the DDF values.

manup commented 3 years ago

What the /5 or /2 after the report or bind? Some sort of priority (cf. fsck)?

The numbers represent the count of arguments a function has, or tuple as in this case. I've stolen this from Erlang where it is called arity of a function. It might seem a bit superfluous and bogus as first but helps to parse, verify and catch errors.

I have considered the following express the same functionality:

["bind/2", Endpoint, ClusterId]   // current version
[Endpoint, ClusterId]  // danger the intend isn't clear, have to trust the DDF writer to do the right thing
["bind-unicast", Endpoint, ClusterId]   // better but more complicated/code to verify
["bind-unicast/2", Endpoint, ClusterId]  // ok with more specific name
{"bind": "unicast", "ep": 1, "clid": "0x0000"}  // not sure, gets more complex for larger function/tuples

The /2 allows that the parser and validator could check automatically that the right number arguments are specified. This is a generic function which just checks an array [something/N, ....] has N parameters regardless what "something" is.

For example:

["bind/2", 1, "0x0006", "0x0001"]   // error, expected 2 arguments but got 3
["report/6", 1, "0x0300", "0x0004", 5, 1795]   // error 5 arguments instead 6

I'm not sure if this is the best way to express such things in DDF files but kinda like it. But would love to hear some thoughts on this or alternative approaches.

Not sure multiple state engines running in parallel is a good idea for deep sleepers. Not sure how many messages the parent is willing (supposed) to keep, and the core/device firmware won't queue too many for the same device.

Surprisingly the state machines need to know little to nothing of deep sleepers the only difference is that their processing / activation is done when the awake event is emitted, processing of light sleepers and routers are triggered by other events (currently I'm using poll) which is scheduled from a higher order place.

In this hindsight we can have hundreds of these little state machines running, most of the time they do nothing. The state machines themself are very light weight on resources, it's basically just a variable/pointer representing the state.

The key is that they aren't completely acting on their own, for certain tasks Devices are controlled from a higher layer. For example the processing of bindings is scheduled and serialized at the desired timing from higher level controller which just sends events to the device to poke the process — similar like the Poll Manager only handles one device at a time.

This higher layer knows how busy the queues are and if we are currently pairing a device (or multiple), based on that it tells specific devices to act.

The strange thing with all that is that the code gets much faster and CPU usage goes down, since we don't need hundreds of for loops and places to check for dozens of devices on each incoming command. Commands can be handled more locally in the respective Device.

I recognise some devices needing the correct setup fast, or they won't wake on shake or button press. Typically this is the setup for reporting motion, vibration, or button press; not for battery. I haven't kept a list of which devices, but typically I will have made a note in the issue for supporting the device.

The Device code and it's state machines are designed to be very fast if needed and work at a slower pace in later idle mode. So during setup process it can hammer like crazy and recovers quickly if something goes wrong and need to be retried.

The error catching will be more precise for example the machine can recover quickly if sending failed due MAC_NO_ACK 0xE9 APS Confirm without even waiting for a response and timeout.

Imho speed wise it can't get any faster or robust and runs circles around the current code :) ... bold claim but that's a goal.

One thing I really like is that the logic and behavior is completely testable since everything is based on events and data.

ebaauw commented 3 years ago

I'm not sure if this is the best way to express such things in DDF files but kinda like it. But would love to hear some thoughts on this or alternative approaches.

The last example, using an object, is the only viable long-term solution in my view. Effectively it used named parameters instead of positional parameters. This provides implicit documentation; I'd happily pay the price of some more typing during development (remember software is read more often than written). This allows adding new (optional) parameters, and even deprecating old parameters, without breaking existing definitions. Modern programming languages like Swift and even NodeJS/Javascript (using the options parameter) use named parameters.

The key is that they aren't completely acting on their own, for certain tasks Devices are controlled from a higher layer.

OK. My worry wasn't the resource utilisation by the multiple state engines, but the concurrent communication with (sleepy) end-devices. If the "higher layer" handles this, I'm happy.

manup commented 3 years ago

The last example, using an object, is the only viable long-term solution in my view. Effectively it used named parameters instead of positional parameters. This provides implicit documentation; I'd happily pay the price of some more typing during development (remember software is read more often than written).

Good point, I highly agree that readability is most important. In general I can live with the named parameter approach, more tedious to implement but can be figured out. In hindsight I expect once a few files are in place new files will often be created by copy pasting existing parts/files, so for the writer it might not be too bad.

manup commented 3 years ago

Took a while, I had some time to think about this while experimenting with the code.

Also had some thoughts on this lately. So what I recognized over time (and there was an issue created around this just recently) is that the reporting configuration is somehow applied "continuously". I understand this is the self-healing intention, and good by the way, which in most cases is totally fine. However, if one chose to individually configure the reporting other than the hardcoded values, this could become aching, especially if you raised the values. I guess that can and would be covered by actually reading the current config, as this is currently missing.

The simplest approach — for testing out different settings — is just to edit the DDF file. There is now a REST API call so that the file can be reloaded and everything reconfigured at runtime. I too find the possibility to have personal settings per device very useful, here an overwrite mechanism could be added which overwrites the parameters from the DDF file. These could be stored in a database table and edited via API or the GUI.

There's also devices out there where bindings do not need or must not go hand in hand with reporting. Good examples in this regard are our beloved friends from China, Xiaomi (and probably others). Even with the ZB3.0 Mija devices, the bindings can be done (and do not seem to cause any harm) but are ignored or not necessary for the reporting to work. Speaking of the reporting config, that is also totally ignored, so they do their own thing. Considering this, any mechanics relying or following after the binding/reporting should also be designed to not mandatorily rely on it but flexible enough to just work with what's coming in automagically. Maybe that must/should be relfected in the DDF?

The DDF files only contain configurations for bindings and reporting which should be actively applied and monitored. So in case for our special devices like Xiaomi we would just not specify any bindings / reporting. For devices like some switches only bindings are specified but no reporting. The actual C++ code completely relies on what's in the DDF files and won't act on it's own.

Regarding the swiftness of binding and reporting, I made the same observation as Erik. The Develco motion sensors I added recently barely got any configuration at all. I would have considered them as light sleepers, also due to the fact that they've a poll control cluster. It's probably related to the "unfinished" support of it, but they react either just when they just send anything (which can be some time) or when put into a special mode physically. So those should have obtained their config as fast as possible during pairing with some sort of rapid fire.

My current approach is to "hammer" the device setup while Permit Join is enabled as fast as possible (taking timeouts and APS confirm tracking into account). If we know a sleeping device is also accepting commands when certain (but not any) messages are received the machinery is also triggered due a special "awake" event.

For example here is the state/battery ResourceItem from a DDF file which when being updated by a ZCL report, would trigger the "awake" event:

{
    "name": "state/battery",
    "parse": {"fn": "zcl", "ep": 1, "cl": "0x0001", "at": "0x0021", "eval": "Item.val = Attr.val"},
     awake: true
}

The very same device would also have like 10 attributes to configure (rather complex device), so having like 10 of them would result in 100 parallel state machines. No clue if that might be too much. Or would that be just one per device (it would create 5 or 6 sensors)?

The current approach which seems to work nice so far is one Device class which has one top level state machine. It controls some smaller state machines when the Idle/Operational state is reached. Since we can send only so much APS requests at a time, a series of configurations (bindings/reporting/writing attributes) would be serialized, each of its own is a small state machine which does its thing and makes sure the thing succeeded. Oh my, this sounds horribly abstract :smile:

While experimenting with the code I noticed that a device which is completely managed by Device code becomes immune to bugs which are local to other devices DDF file changes. We "only" need to get the generic Device code correct and fully tested. A DDF created for a Hue Dimmer Switch will then be solid for the time being.

Lastly, I know this goes into API direction, but I'd live to be able to apply (and store) individual attribute configs, which override the DDF values.

Absolutely, I think this can be done quite nicely and be handy.

manup commented 3 years ago

Update on the Binding and ZCL reporting configuration in Device code.

The first working version of the state machine is now implemented, currently without actually reading the Binding Table in favor of monitoring it, the code is independent from the current code in bindings.cpp. The state machine diagram looks a bit more messy as is really is ;)

Binding State Machine

Diagram source: device_bindings.puml

Note the Idle state is entered when all is done and it can start verification after a certain time or reconfiguration via user action by going back to the beginning state Binding. This allows reconfiguration via REST API or while changing DDF files.

I'm currently testing with IKEA KADRILJ which has 3 bindings and two ZCL reporting configurations. The binding and reporting parts in the DDF file is now changed to favor named parameters instead the Erlang array approach.

Example: IKEA KADRILJ roller blind, DDF: devices/ikea/kadrilj_blind.json

"bindings": [
        {
            "bind": "unicast",
            "src.ep": 1,
            "cl": "0x0001",
            "report": [
                {"at": "0x0021", "dt": "0x20", "min": 300, "max": 600, "change": "0x01" }
            ]
        },
        {
            "bind": "unicast", "src.ep": 1, "cl": "0x0020"
        },
        {
            "bind": "unicast",
            "src.ep": 1,
            "cl": "0x0102",
            "report": [
                {"at": "0x0008", "dt": "0x20", "min": 1, "max": 600, "change": "0x01" }
            ]
        }
    ]

Notes

Here is how the processing of the Window Covering cluster binding looks in action. image

MauricioXavier13 commented 1 year ago

When adding a Binding it doesn't show a report configuration table in my Edit DDF section.

Do you know or suspect why it happens?

Thanks