dresden-elektronik / deconz-rest-plugin

deCONZ REST-API plugin to control ZigBee devices
BSD 3-Clause "New" or "Revised" License
1.9k stars 501 forks source link

Multiple fired events on one button press #4227

Closed petosak closed 3 years ago

petosak commented 3 years ago

Describe the bug

On some Zigbee switches i've got multiple events on one press. On TS0043 i've got two events, so one turns on the light and second event immediately after turn the light off. On Aquara single button i have even 6 fired events!

Steps to reproduce the behavior

I don't know. It happen only on some switches.

Expected behavior

I expect only one event.

Screenshots

Deconz log: image

Environment

HASSIO core 2021.1.4 ConBeeII on RPI4 USB3 port Version 2.07.01 / 8. 12. 2020 Firmware 26660700

deCONZ Logs

HA deconz_event listener: { "event_type": "deconz_event", "data": { "id": "car_switch", "unique_id": "00:15:8d:00:04:50:76:40", "event": 1002, "device_id": "8ca77365f3dcfc3ee5b140653b39858c" }, "origin": "LOCAL", "time_fired": "2021-01-16T13:24:25.247702+00:00", "context": { "id": "c6d94224df8f774ad7d40762250b9ecc", "parent_id": null, "user_id": null } } Event 100 fired 2:24 PM: { "event_type": "deconz_event", "data": { "id": "car_switch", "unique_id": "00:15:8d:00:04:50:76:40", "event": 1002, "device_id": "8ca77365f3dcfc3ee5b140653b39858c" }, "origin": "LOCAL", "time_fired": "2021-01-16T13:24:24.423746+00:00", "context": { "id": "878047f82de764ac9dced361b7350ed5", "parent_id": null, "user_id": null } } Event 99 fired 2:24 PM: { "event_type": "deconz_event", "data": { "id": "car_switch", "unique_id": "00:15:8d:00:04:50:76:40", "event": 1002, "device_id": "8ca77365f3dcfc3ee5b140653b39858c" }, "origin": "LOCAL", "time_fired": "2021-01-16T13:24:24.003543+00:00", "context": { "id": "b22bd330fcbf63b1876cee681e498903", "parent_id": null, "user_id": null } } Event 98 fired 2:24 PM: { "event_type": "deconz_event", "data": { "id": "car_switch", "unique_id": "00:15:8d:00:04:50:76:40", "event": 1002, "device_id": "8ca77365f3dcfc3ee5b140653b39858c" }, "origin": "LOCAL", "time_fired": "2021-01-16T13:24:23.795933+00:00", "context": { "id": "6fc82d1cd42fe6f22923d60a1dc7c261", "parent_id": null, "user_id": null } } Event 97 fired 2:24 PM: { "event_type": "deconz_event", "data": { "id": "car_switch", "unique_id": "00:15:8d:00:04:50:76:40", "event": 1002, "device_id": "8ca77365f3dcfc3ee5b140653b39858c" }, "origin": "LOCAL", "time_fired": "2021-01-16T13:24:23.696384+00:00", "context": { "id": "28d007542612cc64a8a88b1bfdb35d6e", "parent_id": null, "user_id": null } } Event 96 fired 2:24 PM: { "event_type": "deconz_event", "data": { "id": "car_switch", "unique_id": "00:15:8d:00:04:50:76:40", "event": 1002, "device_id": "8ca77365f3dcfc3ee5b140653b39858c" }, "origin": "LOCAL", "time_fired": "2021-01-16T13:24:23.529084+00:00", "context": { "id": "392550fe480ee14dda5c179deef288dc", "parent_id": null, "user_id": null } }

Additional context

I suspect this happens on switches that I use multiple times. Maybe some remains in Deconz config?

Smanar commented 3 years ago

Yep someone is working on a patch here https://github.com/dresden-elektronik/deconz-rest-plugin/issues/3611#issuecomment-762288831

No solution yet, but he is making a workaround.

This is for tuya, for aquara its something new ....

philsson commented 3 years ago

This has been happening to me for quite some time also on aqara switches. It can often be solved by a home assistant restart. Sometimes a full system restart is needed or a restart of the docker service. I've been thinking of implementing some kind of "debounce"functionality to solve this.

petosak commented 3 years ago

I did restart and reboot many times. Still the same.....

philsson commented 3 years ago

Another thing worth trying is rebuilding the deConz module.

On Tue, 19 Jan 2021, 08:37 petosak, notifications@github.com wrote:

I did restart and reboot many times. Still the same.....

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/dresden-elektronik/deconz-rest-plugin/issues/4227#issuecomment-762659719, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABV6MAWZJT4POZDWCHU7SWTS2UZDFANCNFSM4WH2XURA .

petosak commented 3 years ago

how could i do that? I'm only user with RPI4 and HASSIO on it.

Smanar commented 3 years ago

Not possible with hassio, it use docker.

sebres commented 3 years ago

Not possible with hassio, it use docker.

Well, let alone you could install hassio in rpi directly, without containering stuff (dockers etc), what could this issue have with the hassio? Strictly seen, one must apply a patch, rebuild deconz-rest-plugin (and replace it in deCONZ), and the hassio in docker can access a new (replaced) API over REST (HTTP) from container (docker) too, in the same way it made that without a fix. @Smanar I don't really know what exactly you call as impossible.

Just it could work for TS0043, but if I correctly understand the issue is also about Aquara, so applying the patch (e. g. 203bc71b86a5308ff232e98f5dba0e04d84eeb49) directly would not help, because there is a check for the manufacturer in line 4538, then it should be extended with yet another manufacturer condition, e. g.:

- && sensor->manufacturer().startsWith(QLatin1String("_TZ3000_"))
+ && (sensor->manufacturer().startsWith(QLatin1String("_TZ3000_")) 
+  || sensor->manufacturer().startsWith(QLatin1String("Aquara_OR_Whatever_Manufacturer_Name_Prefix_For_That_Device"))
+ )

@petosak

how could i do that? I'm only user with RPI4 and HASSIO on it.

Well it is a bit complex for a normal user (without developer know-how), but you can try it:

  1. firstly make a backup of original plug-in version /usr/share/deCONZ/plugins/libde_rest_plugin.so
  2. you'd need to install git if you still don't have it (sudo apt install git)
  3. get dev-tools and source code:
    sudo apt install deconz-dev; # to be able build plug-in
    git clone --branch tuya https://github.com/sebres/deconz-rest-plugin.git; # get source code from repo containing patch
    cd deconz-rest-plugin
    git checkout 203bc71b86a5308ff232e98f5dba0e04d84eeb49; # check-in for 2.7.x fix (current tuya branch contains 2.9.1 fix)
  4. build and replace plugin:
    qmake && make -j2; # build
    sudo cp ../libde_rest_plugin.so /usr/share/deCONZ/plugins; # replace plugin
  5. restart deCONZ (or deCONZ GUI)
Mimiix commented 3 years ago

Well it is a bit complex for a normal user (without developer know-how), but you can try it:

And that's what smanar tries to say. Hassio is way to complicated for the users they serve with it. Nevertheless: The addon is nothing official, meaning we don't develop it.

sebres commented 3 years ago

Hassio is way to complicated for the users they serve with it.

There is nothing about hassio, if you take a look attentively, I was just talking about how to patch and (re)build deconz-rest-plugin.

Nevertheless: The addon is nothing official, meaning we don't develop it.

What do you talking about? deconz-rest-plugin is this repository and has only marginal to do with hassio. The double event occurring in deconz (I know it is either a "bug" of device firmware or whatever), but the workaround was to discard second (wrong) event from the same device if it's arriving too early (too often) in deCONZ, let alone without Smanar's tuya branch (#3611) it is impossible to bind TS004x devices as a switch, because they are recognized as a light, again in deconz).

Hassio just using conbee gateway over the API and can only consider events "published" by deconz. Nevertheless: forget the hassio here and let us speak about a matter of this repository.

Mimiix commented 3 years ago

Hassio is way to complicated for the users they serve with it.

There is nothing about hassio, if you take a look attentively, I was just talking about how to patch and (re)build deconz-rest-plugin.

Nevertheless: The addon is nothing official, meaning we don't develop it.

What do you talking about? deconz-rest-plugin is this repository and has only marginal to do with hassio. The double event occurring in deconz (I know it is either a "bug" of device firmware or whatever), but the workaround was to discard second (wrong) event from the same device if it's arriving too early (too often) in deCONZ, let alone without Smanar's tuya branch (#3611) it is impossible to bind TS004x devices as a switch, because they are recognized as a light, again in deconz).

Hassio just using conbee gateway over the API and can only consider events "published" by deconz. Nevertheless: forget the hassio here and let us speak about a matter of this repository.

I was just backing up on Smanar's "Not possible with hassio, it use docker." and explaining his stance :)

I totally agree with you.

Smanar commented 3 years ago

I have no problem with hassio, I have problem with dockers.

I say that because it s not the first one that is trying. I think you can't imagine how many time we have explained how to compile the code, or just to install correctly deconz.

Well, let alone you could install hassio in rpi directly, without containering stuff (dockers etc), what could this issue have with the hassio?

User generaly install hassio AND deconz in docker, they generaly don't do much more than waiting the docker is officialy updated and update it on their turn. I can't force them to install it a second time oustide the docker on their production machine.

firstly make a backup of original plug-in version /usr/share/deCONZ/plugins/libde_rest_plugin.so

Sorry but there is no "/usr/share/deCONZ/plugins/libde_rest_plugin.so" on docker installation, at least the last time I have tried, and the file compiled outside the docker can be incompatible inside it.

If you want to explain how to do on HASSIO, no problem, but I have already tried, I give up on my side, there is nothing to help you, first you need to explain how to install SSH, serioulsy even the principal is not in place. Good luck to explain how to install dev-tools, even on raspian some users don't succed.

For the patch with Xiaomi, IDK, for tuya, I know the hardware (and the issues) I have see all the logs with APS, I m sure the device realy send itlself 2 times the request and on the good cluster, the 0x0006

for aquara , on the log on first post, you don't see APS log, and on ZCL log you can see the code react too on cluster 0x0012.

So for me need more investigations for the aquara issue, I m realy not sure it s the same problem. For tuya it s a workaround, it s perhaps possible to solve the problem for the second brand.

sebres commented 3 years ago

I have problem with dockers. Sorry but there is no "/usr/share/deCONZ/plugins/libde_rest_plugin.so" on docker installation

I also don't prefer docker (either real virtualization), but what is the problem here? Start shell within (docker exec -t -i mycontainer /bin/bash) and search for the file (find / -name libde_rest_plugin.so) in container together with Copy files/folders between a container and the local filesystem could help here to find and replace plugin-file in docker (if deconz is dockerized at all).

In docker it is even more safe, because you can create a snapshot (and restore it if something becomes wrong).

And your sentence was "Not possible with hassio, it use docker", what is basically totally irrelevant for aforementioned reasons (hassio has nothing to do with the subject, our patch, etc, it just notices events from deconz, so doesn't use direct gateway communique). It's like to bother the editor program about broken file, if the hard disk is full with bad sectors.

I m sure the device realy send itlself 2 times the request and on the good cluster, the 0x0006

Nothing else I wrote above: "I know it is either a bug of device firmware or whatever" and that it was just a workaround of me, not a real solution.

So for me need more investigations for the aquara issue, I m realy not sure it s the same problem.

Clear, therefore I also wrote that this patch would probably not help here (also if one would extend the condition)... But one could try.

I understand that many people (calling themselves "normal user") could need several how-to's and that you don't have time (and probably the intention to navigate them or provide tips here and there), but some of them are probably smart enough to read manuals or even to use google to find some how-to self-dependently, so to tell them "it is impossible" is not only wrong, it is simply contra productive.

Back to the aquara issue (if it is really the same), I could also help to debug/investigate deeper here, but either need more logs or even the device (which I don't have at the moment). One friend of me told me that he doesn't have any issue with the button events... imported this blueprint in hassio... anyway it looks like model-id of switch is the same (lumi.remote.b1acn01). So it could be also a defect device or older firmware or it needs reset or whatever else affecting a particular device.

@petosak is it a single aquara device or do you have possible more of them? if so are other affected too in the same way?

sebres commented 3 years ago

OK, we were eventually wrong with "either a bug of device or its firmware": replaced Phoscon integration with ZHA integration (via zigpy-deconz library) in HASS and tried the same TS0044 using ZHA now. I guess it uses this blueprint.

And as result: it does not generate double events in HASS using the same rpi, same gateway (conbee 2) and same switch.

excerpt of events log (listening for "zha_event") and "home-assistant.log" ... ``` Event 0 fired 12:26 AM: // -- B1 single click -- { "event_type": "zha_event", "time_fired": "2021-01-19T23:26:32.268327+00:00", "data": { "endpoint_id": 1, "cluster_id": 6, "command": "remote_button_short_press", "args": [] } } Event 1 fired 12:26 AM: // -- B1 double click -- { "event_type": "zha_event", "time_fired": "2021-01-19T23:26:53.542416+00:00", "data": { "endpoint_id": 1, "cluster_id": 6, "command": "remote_button_double_press", "args": [] } } Event 2 fired 12:27 AM: // -- B1 long click -- { "event_type": "zha_event", "time_fired": "2021-01-19T23:27:10.254356+00:00", "data": { "endpoint_id": 1, "cluster_id": 6, "command": "remote_button_long_press", "args": [] } } Event 3 fired 12:27 AM: // -- B2 single click -- { "event_type": "zha_event", "time_fired": "2021-01-19T23:27:37.367256+00:00", "data": {"endpoint_id": 2, "cluster_id": 6, "command": "remote_button_short_press", "args": [] } } ... ``` At the same time the hass was logging into `home-assistant.log`: ``` 2021-01-20 00:26:32 WARNING (MainThread) [zigpy.zcl] [0x04e7:1:0x0006] Data remains after deserializing ZCL frame 2021-01-20 00:26:53 WARNING (MainThread) [zigpy.zcl] [0x04e7:1:0x0006] Data remains after deserializing ZCL frame 2021-01-20 00:27:10 WARNING (MainThread) [zigpy.zcl] [0x04e7:1:0x0006] Data remains after deserializing ZCL frame 2021-01-20 00:27:37 WARNING (MainThread) [zigpy.zcl] [0x04e7:2:0x0006] Data remains after deserializing ZCL frame ... ``` so I'm still unsure whether some stuff is indeed ignored by zigpy (or it's even not correct considered by deconz), must still check the spec and source code of zigpy.

So maybe (I say only maybe) it is indeed an issue of deconz, since I don't know any prevention such my workaround zigpy-deconz may do to avoid too fast coming events (must check the source code yet).

Anyway @petosak can you possibly try your Aquara switch with ZHA instead of phoscon in hassio (just stop deconz/deconz-gui services and in hassio add ZHA integration, add new device, goto devtool->events and listen for zha_event)? Interesting what you would see there.

sebres commented 3 years ago

An amend to my previous comment... Upgrade of HA from 2021.1.1 to 2021.1.4 causes that one sees 2 events by 1 click now (and the warnings about remaining data disappear), but the events deviate a bit, so (due to blueprint? or some meta?) it seems to be able distinguish both events (only one containing a command "button_short_press", whereas another "press_type"):

diff of 2 events occurring by decoding ZCL-frame in ZHA... ```diff -Event 0 fired 3:08 AM: +Event 1 fired 3:08 AM: { "event_type": "zha_event", "data": { @@ -7,16 +7,14 @@ Event 0 fired 3:08 AM: "device_id": "...", "endpoint_id": 1, "cluster_id": 6, - "command": "press_type", - "args": [0] + "command": "remote_button_short_press", + "args": [] }, - "time_fired": "2021-01-20T02:08:24.869018+00:00", + "time_fired": "2021-01-20T02:08:24.869259+00:00", } ```

Also note the time of events gets fired (reached the gateway or processed in zigpy) - since both events are done almost immediately (second only 240µs later), I guess it is indeed arriving in the same ZCL frame, so for some reason the deconz, if it receives the same data, causes a frame decoder issue and even retards the second part for 1 second (what looks at least suspicious). If this strange delay by decoding could be "fixed", the second part of the frame can be either ignored or somehow marked in order to distinguish the commands correctly. The specification of zigbee cluster library seems to be a bit ambiguous here (but maybe I found some old man, and there is some newer variant specifying another type of ZCL-layer frame or its processing more precise).

Basically I have nothing further to add here, excepting that I don't believe anymore the error is with the device.

BTW, it is opening also a possibility to provide a new common command for automation (which may be given in blueprint) like "remote_button_any_press" reacting on "press_type", which can trigger for any type of key press (combining several variants to the single common trigger), so "if any key of switch is pressed, do ...".

philsson commented 3 years ago

I can add that when this happens with my aqara switches it happens to all of them. I cannot say if it happens to switches of other brands as my other switches (IKEA) are not used in home assistant but directly with groups in deconz, and they don't experience this issue when present on the aqara ones.

sebres commented 3 years ago

So is it either sporadic issue with aquara? If it is reproducible to some extent, you could probably also try ZHA integration in HA, whether it becomes "healed" then.

philsson commented 3 years ago

I can debug some more next time I get the issue. It normally gets resolved with a restart of HA. The issue normally arises randomly, but sometimes comes after a HA restart as well, e.g. after an update. Never has it resolved itself. And as mentioned it affects all my aqara switches. I think I at some point even got three equal events on one press after trying to resolve it with a restart of HA or/and deConz.

sebres commented 3 years ago

Just as tip: zigpy could apparently have the issues with false multiple presses also, https://github.com/zigpy/zha-device-handlers/issues/712 (e. g. with Philips RWL021), so I don't think it has on board some special tricks to avoid repeated events by default. But because they seem to arrive really immediately in the same or adjacent ZCL frame, one could easy discard them regularly. Whereas in deconz due to delay ca. 1 second, it would remain a workaround either. And because the gateway and device seems to do it immediately in ZHA (at least with TS0044, see the log under spoiler in https://github.com/dresden-elektronik/deconz-rest-plugin/issues/4227#issuecomment-763219318), one should really investigate firstly where this strange latency is coming from and eliminate it. Is it the issue of an event cycle? Is it either poll with some sleep (so one could read again before entering the sleep or slow increase sleep interval successively) or "real" asynchronous notify (some locks in callback or again sleep somewhere within)? etc.

Smanar commented 3 years ago

But my question is, why now ? Seriously, we have the same issue caused by different problems (I have it too on Legrand hwardware), and just in 3/4 months, nothing before.

sebres commented 3 years ago

dunno... it isn't so simply to answer... but software evolves (deconz growing on codebase, devices get new firmware etc), so something conflicting there... and in between it is hardly conceivable to me the device is to blame (at least not alone;).

SwoopX commented 3 years ago

Gents, although it is always great to see a vivid an in-depth discussion, I wonder during my fly by why not check the actual root cause? I mean, all relevant data can be made available by setting the right debug flags. With DBG_INFO_L2, you get the ZCL payload length, attribute, data type and actual data and with DBG_APS_L2, you move up one layer and also get the full asdu, also containing the ZCL sequence number.

Checking all that info should cease any speculation about what exactly has been send and if the identical message was just repeatedly send by whatever device.

As a side not with regard to Xiaomi devices, they are not too famous for adhering to the specs. I've seen devices seemingly stopping to send reports at will or randomly reporting different clusters and endpoints. Also got one single Tuya based device in my hands, which had a rock solid appearance on the very first glimpse, but also revealed some totally random behaviour on the longer run...

Smanar commented 3 years ago

For tuya https://github.com/dresden-elektronik/deconz-rest-plugin/issues/3611#issuecomment-761701258

23:23:50:187 APS-DATA.indication srcAddr: 0x6c26, srcEp: 0x01 dstAddrMode: 2, profile: 0x0104, cluster: 0x0006, lqi: 63, rssi: -86 23:23:50:188 asdu: 014afd00 23:23:50:191 [INFO] - Button 1002 B1 short TS0044

23:23:51:193 APS-DATA.indication srcAddr: 0x6c26, srcEp: 0x01 dstAddrMode: 2, profile: 0x0104, cluster: 0x0006, lqi: 63, rssi: -86 23:23:51:194 asdu: 014afd00 23:23:51:200 [INFO] - Button 1002 B1 short TS0044

sebres commented 3 years ago

Checking all that info should cease any speculation about what exactly has been send...

Strictly seen it brings more questions as answers.

For tuya #3611 (comment)

Yep... but indirectly they are not really raw data (but an info how deconz API disassembled the frame), and there is this strange delay (so one could not be sure the second cycle obtains the same data as if first attempt, e. g. no shift in frame buffer). Anyway there is a bit more chatty log in the comment containing only one Node data and ZCL attribute report message resulting(?) to 2 APS-DATA.indication messages (and 2 equal events), what is at least strange.

I'll still provide the raw data if I get time to debug zigpy (hopefully this evening), perhaps it is a bit different, at least we will not see this extra 1 second latency in-between :)

But traversing over source code of zigpy, I found this one - https://github.com/zigpy/zha-device-handlers/pull/611/files#diff-3398e3e5af3c0cd6aa2a1854e46bc12638ebcc6c542da27de73df9e37a944bd5R400, and following blame points to this - https://github.com/tube0013/zha-device-handlers/commit/cb7efe9124c3cf392ed361b7546a3c571a90a424#diff-3398e3e5af3c0cd6aa2a1854e46bc12638ebcc6c542da27de73df9e37a944bd5R396-R400 with this code initially implemented for TS0044 (but the same can be found for TS0043 etc, so it was combined later):

        if self.last_code != message[1]:
            self.last_code = message[1]
            super().handle_message(profile, cluster, src_ep, dst_ep, message)
        else:
            _LOGGER.debug("TS0044: not handling duplicate frame")

Thus the TuyaSmartRemote.handle_message has indeed a prevention against duplicate frame before it supply that to handle_message of superclass CustomDevice.

Smanar commented 3 years ago

Lol, so, same than us, and using work around too. BUT BTW, yes I have a question for you, on their code, they just ignore a second event if this one is the same than the previous one, without using delay.

So it mean the device send the button event, (in some situation the same again), then another event but different (not used by code). Else one time you have used the device, you will be not able to use it a second time.

sebres commented 3 years ago

No, it works (tested this as first several times).

I assume in this message the byte1 (which is compared with last known message[1]) will be incremented in the next sent buffer of next click, or it gets rewritten with some next one message device sends hereafter (you know, this one arriving 240µs later). We will see more details if I get time to debug it, but again it work pretty stable without any issue in ZHA since yesterday.

Smanar commented 3 years ago

Ok so you are right.

        new_message[0] = 0x01 = frame control field
        new_message[1] = sequence number
        new_message[2] = 0xFD = command
        new_message[3] = 0x00 = payload rest ?

So they are not checking the command, but the sequence number, as you have said before , they are filtering the double command using the same sequence number. And for me this can be done for ALL stuff, I don't thing there is a device that send 2 command using the same sequence number, but BTW, if I remember ebawnn have already do something like that somewhere in the code.

Edit: Only used for 2 device ^^

    else if (sensor->modelId().startsWith(QLatin1String("SYMFONISK")))
    {
        if (zclFrame.sequenceNumber() == sensor->previousSequenceNumber)
        {
            return;
        }
        sensor->previousSequenceNumber = zclFrame.sequenceNumber();
        checkReporting = true;
    }
sebres commented 3 years ago

So a better place found to move our "workaround" to :) But some configuration enhancement were nice, so provide options there, in order to put this as some flag into config options, then it could be a regular configurable case - just extend button_maps.json for certain device and that's it, like:

+  "options": ["discard_same_sequence": true]

Fixed coding is ugly and is not too strict only, but also pollute the code with hacks used for every single device only.

Smanar commented 3 years ago

But for me it s not a hack, I m almost sure, devices never send 2 differents commands with the same sequence number (except ofc when it have cycled the values). So this code need to be used for all devices.

I will ask to ebaauwn, he have already do some test on this part.

sebres commented 3 years ago

I know, just...

devices never send 2 differents commands with the same sequence number

Unless you'll find some that doing that (at some point or probably already now), so next "hack" needs to be implemented then to support such a device (or even "options": ["dont_discard_same_sequence": true]. :) Can you really put your hand in fire for no one of the devices missed such sequence byte increment already? My (but not only my) feeling is that, zigbee is a standard, supported by manufacturers how they want and understand that (or even - it works somehow and good is).

Smanar commented 3 years ago

Ok so I have talked with ebaauw, and he have a third opinion, for him, better to correct the problem.

It seem from log the device have "disable defaut response" to false, so it s waiting for an answer from the conbee, that never arrive, so it repeat the request with the same sequence number. Unfortunately it seem deconz don't make defaut response for unicast.

So another thing you can try is making the API code sending this response, I have tested nothing (the test to be sure it s unicast is totaly random), but the code can be something like

        //Test for tuya to send DefaultResponse if not disabled for switch
        if ((!(zclFrame.frameControl() & deCONZ::ZclFCDisableDefaultResponse)) &&
            (ind.clusterId() == ONOFF_CLUSTER_ID) &&
            (!(ind.dstAddress().isNwkBroadcast())) )

        {
            sendZclDefaultResponse(ind, zclFrame, deCONZ::ZclSuccessStatus);
        }

To put in fonction apsdeDataIndication() for exemple in the loop if ((ind.profileId() == HA_PROFILE_ID) || (ind.profileId() == ZLL_PROFILE_ID))

But it s 100% theory, and it seem other dev have tried to do that, unsuccessfully I think because we can see hacks in the code.

104738958-73f9a580-5746-11eb-90fe-e87d78135154

sebres commented 3 years ago

But zigpy seems to send unicast response, so I don't really know... Anyway I'll check that and try some variants in both API's. Thanks.

SwoopX commented 3 years ago

See, with a sniff, everything gets wonderfully apparent. As a default respone would be required, but non is send, conclusion is that deconz has a bug in dealing with that correctly. And that seems indeed to be the case.

For the requirements to send a default response, one should take a close look at the specs (all criteria must be met):

  1. A device receives a unicast command that is not a Default Response command.
  2. No other command is sent in response to the received command, using the same Transaction sequence number as the received command.
  3. The Disable Default Response bit of its Frame control field is set to 0 (see 2.4.1.1.4) or when an error results.
  4. The “Effect on Receipt” clause for the received command does not override the behavior of when a Default Response command is sent

Having a look at where those are generated, we see this is currently not adhered to

https://github.com/dresden-elektronik/deconz-rest-plugin/blob/05a033b1857704103619209eb3994602840e78ec/de_web_plugin.cpp#L11626-L11630

Conclusions:

I hope all this will not get too picky in terms of timing. So let's say, if a device expects a default response within 100 ms before retransmitting. Eventually, we than have to also consider where to place whatever update in the code...

sebres commented 3 years ago

OK, the new finding about zigpy, the api seemed don't to send a unicast response to TS0044 or if it doing, doing it wrong, there are only device state changed and data indication responses in log and I see indeed 1 second later a duplicate frame from device (as well as aforementioned ignoring duplicate frame from tuya inherited class).

log excerpt of zigpy... ``` 2021-01-21 22:02:40 DEBUG [zigpy_deconz.api] Device state changed response: [, 0] 2021-01-21 22:02:40 DEBUG [zigpy_deconz.api] Command Command.aps_data_indication (1, 1) 2021-01-21 22:02:40 DEBUG [zigpy_deconz.api] APS data indication response: [27, , , 1, , 2, 260, 6, b'\x01u\xfd\x01', 0, 175, 71, 16, 195, 3, 0, -85] 2021-01-21 22:02:40 DEBUG [zigpy.zcl ] [0x04e7:2:0x0006] ZCL deserialize: manufacturer=None tsn=117 command_id=253> 2021-01-21 22:02:40 DEBUG [zigpy.zcl ] [0x04e7:2:0x0006] ZCL request 0x00fd: [1] 2021-01-21 22:02:40 DEBUG [zigpy_deconz.api] 'aps_data_indication' response from , ep: 2, profile: 0x0104, cluster_id: 0x0006, data: b'0175fd01' ... 2021-01-21 22:02:41 DEBUG [zigpy_deconz.api] Device state changed response: [, 0] 2021-01-21 22:02:41 DEBUG [zigpy_deconz.api] Command Command.aps_data_indication (1, 1) 2021-01-21 22:02:41 DEBUG [zigpy_deconz.api] APS data indication response: [27, , , 1, , 2, 260, 6, b'\x01u\xfd\x01', 0, 175, 63, 16, 195, 3, 0, -86] 2021-01-21 22:02:41 DEBUG [zigpy.zcl ] [0x04e7:2:0x0006] ZCL deserialize: manufacturer=None tsn=117 command_id=253> 2021-01-21 22:02:41 DEBUG [zigpy.zcl ] [0x04e7:2:0x0006] ZCL request 0x00fd: [1] 2021-01-21 22:02:41 DEBUG [zhaquirks.tuya ] TS004X: ignoring duplicate frame 2021-01-21 22:02:41 DEBUG [zigpy_deconz.api] 'aps_data_indication' response from

Trying to find a way to send unicast (since in python I can do it without to rebuild/deploy/restart as in deconz-rest-plugin, let alone replace of plugin on the fly causing crash of deconz with SF). If it works with zigpy (without a resent from tuya side) I can try to implement/check it for the deconz plugin.

Regardless it does work or not, I think the option "discard_same_sequence" is a good idea, because if a response (to a battery powered device!) doesn't reach it for some reasons (in 1s - Xms, dunno), it will send the duplicate frame nevertheless, so it could be safely ignored then.

sebres commented 3 years ago

A device receives a unicast command that is not a Default Response command

@SwoopX do you mean it must be emberAfSendResponse, corresponding the spec "Sending a response to an incomming command" and not emberAfSendDefaultResponse with a ZCL status?

sebres commented 3 years ago

\<offtop> BTW, as for segfaulting (well either buserror) by attempt to replace plugin on the fly in deCONZ, I'll provide the info here instead of opening new issue, because your evil bot closes them for some reason (and one have no intention to find what it exactly does not like)...

Backtrace ... ``` Thread 1 "deCONZ" received signal SIGBUS, Bus error. --- #0 0xb5fb31dc in QMetaObject::activate(QObject*, int, int, void**) () from /usr/lib/arm-linux-gnueabihf/libQt5Core.so.5 #1 0xb62a711c in ?? () from /usr/lib/arm-linux-gnueabihf/libQt5Network.so.5 Backtrace stopped: previous frame identical to this frame (corrupt stack?) ```

No idea it is ever possible at all, but if it'd work, it'll be a lot easier to try some build without to restart etc.

sebres commented 3 years ago

I made that it works pretty well in zigpy now (without repeated frame). I'll commit and provide a PR soon.

So, back to deconz, and tried to fix it there...

After first commit c1a89be03d6eb8e01bf2795597eb6205eca1101b (reverting 203bc71b86a5308ff232e98f5dba0e04d84eeb49) we have following in log, so I can confirm the "fix" works:

log excerpt 1 ... ``` # --- click B1 --- 22:32:44:600 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x0001 command: 0x0A payload[0]: 021 22:32:44:600 ZCL attribute report 0xBC33ACFFFE5E6F69 for cluster: 0x0001, ep: 0x01, frame control: 0x08, mfcode: 0x0000 22:32:44:895 [INFO] - Button 1002 - TS0044 endpoint: 0x01 cluster: ONOFF (0x0006) command: B1 short payload[0]: 000 22:32:45:916 [INFO] - Discard second event (seq-num = 13) TS0044 # --- click B1 --- 22:33:05:491 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x0001 command: 0x0A payload[0]: 021 22:33:05:492 ZCL attribute report 0xBC33ACFFFE5E6F69 for cluster: 0x0001, ep: 0x01, frame control: 0x08, mfcode: 0x0000 22:33:05:801 [INFO] - Button 1002 - TS0044 endpoint: 0x01 cluster: ONOFF (0x0006) command: B1 short payload[0]: 000 22:33:06:805 [INFO] - Discard second event (seq-num = 15) TS0044 # --- click B2 --- 22:33:10:966 [INFO] - Button 2002 - TS0044 endpoint: 0x02 cluster: ONOFF (0x0006) command: B2 short payload[0]: 000 22:33:11:986 [INFO] - Discard second event (seq-num = 16) TS0044 # --- click B2 --- 22:33:14:597 [INFO] - Button 2002 - TS0044 endpoint: 0x02 cluster: ONOFF (0x0006) command: B2 short payload[0]: 000 22:33:15:600 [INFO] - Discard second event (seq-num = 17) TS0044 ... ```

In case of B1, there is still additional event before caused No button handler, so the next click command coming always with seqnum + 2 (from previous click), in case of B2-B4 it is always seqnum + 1 (and no such thing before), duplicate with the same seq-num is discarded in any case.

Now test 2 (providing default response as suggested) - with second commit 2eae0df1964215eda22c9ccf569c3697ea22b485 I see this now:

log excerpt 2 ... ``` # --- click B2 --- (OK) 23:23:37:737 ******* (b) ** send response 1 ? yes 23:23:37:738 [INFO] - Button 2002 - TS0044 endpoint: 0x02 cluster: ONOFF (0x0006) command: B2 short payload[0]: 000 # --- click B3 --- (OK) 23:23:38:858 ******* (b) ** send response 1 ? yes 23:23:38:860 [INFO] - Button 3002 - TS0044 endpoint: 0x03 cluster: ONOFF (0x0006) command: B3 short payload[0]: 000 # --- click B4 --- (OK) 23:23:39:184 ******* (b) ** send response 1 ? yes 23:23:39:184 [INFO] - Button 4002 - TS0044 endpoint: 0x04 cluster: ONOFF (0x0006) command: B4 short payload[0]: 000 # --- click B1 --- (WRONG!) 23:23:40:810 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x000A command: 0x00 payload[0]: 007 23:23:40:944 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x0001 command: 0x0A payload[0]: 021 23:23:40:945 ZCL attribute report 0xBC33ACFFFE5E6F69 for cluster: 0x0001, ep: 0x01, frame control: 0x08, mfcode: 0x0000 23:23:40:946 ******* (a) ** send response 8 ? yes 23:23:41:174 [INFO] - Button 1002 - TS0044 endpoint: 0x01 cluster: ONOFF (0x0006) command: B1 short payload[0]: 000 23:23:42:174 [INFO] - Discard second event (seq-num = 106) TS0044 23:23:44:199 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x0001 command: 0x0A payload[0]: 021 23:23:44:200 ZCL attribute report 0xBC33ACFFFE5E6F69 for cluster: 0x0001, ep: 0x01, frame control: 0x08, mfcode: 0x0000 23:23:44:200 ******* (a) ** send response 8 ? yes 23:23:44:558 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: BASIC (0x0000) command: 0x0A payload[0]: 0DF 23:23:44:558 ZCL attribute report 0xBC33ACFFFE5E6F69 for cluster: 0x0000, ep: 0x01, frame control: 0x08, mfcode: 0x0000 23:23:44:559 ******* (a) ** send response 8 ? yes ```

It was indeed not sent previously (but now it does). As you can see it seems to work for B2-B4, but totally wrong for B1 click, because of this unhandled event (No button handler) it sends default response too earlier, and this is not what I added to code marked as (b), but already available code marked as (a) (see the check-in 2eae0df1964215eda22c9ccf569c3697ea22b485). So it sends for the first zclframe with 0x0A and doesn't for the second frame with 0x01 (where it must).

The code of deconz is really a mess, so I don't know right now, excepting to disable this "no button handler" stuff completely and/or disable sending response after that for this first zclframe (which is also const there). Thus, I'm done for today with deconz.

sebres commented 3 years ago

Ah, forgotten... Without default response the device seems to be able to send next press command soonest after 3 seconds, with default response it can send that much faster (few dozen milliseconds), see https://github.com/zigpy/zha-device-handlers/pull/736#issuecomment-765074477

Smanar commented 3 years ago

Ha nice ^^.

But I don't understand why the (b) part is not trigger on B1, it s because all this part is skipped ? https://github.com/dresden-elektronik/deconz-rest-plugin/blob/2eae0df1964215eda22c9ccf569c3697ea22b485/de_web_plugin.cpp#L1017

Because I don't see the problem if the code send a default response too earlier with the cluster 0x0001, It can be too send one for 0x0008.

SwoopX commented 3 years ago

@sebres I'm afraid your implementation is not appropriate as well as the place you've chosen for your code. It does not adhere to the spec requirements as mentioned earlier. The approach @Smanar has previously suggested was almost right. It should be

if (!(zclFrame.frameControl() & deCONZ::ZclFCDisableDefaultResponse) &&
            (zclFrame.frameControl() & deCONZ::ZclFCClusterCommand) &&
            (ind.dstAddress().isNwkUnicast()))
    {
        sendZclDefaultResponse(ind, zclFrame, deCONZ::ZclSuccessStatus);
    }

Probably best to add that code to handleBasicClusterIndication() in basic.cpp. This should also ensure the response is send a few ms earlier. It might help getting rid of the unexpected behaviour for the B1 event...

So it sends for the first zclframe with 0x0A and doesn't for the second frame with 0x01 (where it must).

It is exactly the other way around, given I understand you correctly here. and you mean the clusters. Clusters shouldn't be of any interest here, it's only the command that matters.

For cluster: 0x000A command: 0x00, there shouldn't be any default response (command is read atributes iirc) since a response is expected. Hence, none is send. For cluster: 0x0001 command: 0x0A, there must be a default response, if requested via frame control. As it is, the default response is send.

sebres commented 3 years ago

your implementation is not appropriate as well as the place you've chosen for your code

@SwoopX As you probably seen, it is not a PR or so, but just a test branch. You are free to provide better implementation as well as to move it to better suitable place. This was just a test from my side (in hope to help you and people requesting new devices). I'm ready with deconz for several reasons... one of them for example is that my engineers soul crying by the (re)view of the code like that. Savvy?

Smanar commented 3 years ago

Probably best to add that code to handleBasicClusterIndication() in basic.cpp. This should also ensure the response is send a few ms earlier. It might help getting rid of the unexpected behaviour for the B1 event...

Why here ? During button press, handleBasicClusterIndication is not trigger ?

Smanar commented 3 years ago

Ok so we have talked with Swoops, adn to resume he is right, not possible to put a general fonction in void DeRestPluginPrivate::apsdeDataIndication(const deCONZ::ApsDataIndication &ind) because there is too much "special case".

So the better place to put the code, at least for our problem would be in handleOnOffClusterIndication()

IDK if a manuafacture check is usefull ? I think not But a unicast check yes.

If you can make a test, pls ?

sebres commented 3 years ago

Sure, here you go - a2990c0d0673a4250ac7b8dbd123da2aace41c6a, results:

13:55:56:640 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x0001 command: 0x0A payload[0]: 021
13:55:56:642 ZCL attribute report 0xBC33ACFFFE5E6F69 for cluster: 0x0001, ep: 0x01, frame control: 0x08, mfcode: 0x0000 
13:55:56:643 ******* (a) ** send response 8 ? yes (seq-num = 113)
13:55:56:943 ******* (b) ** send response 1 ? yes (seq-num = 114)
13:55:56:944 [INFO] - Button 1002 - TS0044 endpoint: 0x01 cluster: ONOFF (0x0006) command: B1 short payload[0]: 000
13:55:57:203 ******* (b) ** send response 1 ? yes (seq-num = 115)
13:55:57:204 [INFO] - Button 2002 - TS0044 endpoint: 0x02 cluster: ONOFF (0x0006) command: B2 short payload[0]: 000

Now looks good for B1 too (excepting this unhandled stuff), but the response is available, and there is no second duplicate frame (and almost no delay between B1 and B2).

Smanar commented 3 years ago

Thx a lot ^^. We are looking on the code impact, to be sure to not break something, and make a PR for nothing.

Smanar commented 3 years ago

To remove useless

13:55:56:640 [INFO] - No button handler for: TS0044 endpoint: 0x01 cluster: 0x0001 command: 0x0A payload[0]: 021

You can try to edit in fonction checkSensorButtonEvent()

    if (ind.clusterId() == OTAU_CLUSTER_ID)
    {
        return;
    }

With adding cluster POWER_CONFIGURATION_CLUSTER_ID and ELECTRICAL_MEASUREMENT_CLUSTER_ID (this one too make boring notification)

I hope the battery is not using this part by an indirect way, but from the fonction name, I think not.

IDK for others devs, but for me it look fine, you can make a PR pls, I think the modification is strict enought to don't perturbate the rest of code, and realy needed.

Smanar commented 3 years ago

But you can remove (ind.clusterId() == ONOFF_CLUSTER_ID) && this code part is only trigger for on/off cluster

but let (zclFrame.frameControl() & deCONZ::ZclFCClusterCommand) && else this check can be trigger for reporting too, and need to use it only for cluster command.

sebres commented 3 years ago

you can remove (ind.clusterId() == ONOFF_CLUSTER_ID) &&...

I don't think so, if I do that, the duplicate event appears again (despite the default response is sent, this time even twice, for duplicate also):

22:29:08:016 Send zcl default response 0, seqno: 87
22:29:08:017 [INFO] - Button 2002 - TS0044 endpoint: 0x02 cluster: ONOFF (0x0006) command: B2 short payload[0]: 000
22:29:09:031 Send zcl default response 0, seqno: 87
22:29:09:032 [INFO] - Discard second event (seq-num = 87) TS0044
Smanar commented 3 years ago

Hu ? How it can be possible ?

I have checked the code again, and I m sure the fonction handleOnOffClusterIndication() is only called if ind.clusterId() == ONOFF_CLUSTER_ID

I realy don't see the difference ...

(zclFrame.frameControl() & deCONZ::ZclFCClusterCommand) That, on the other hand, can have an impact, handleOnOffClusterIndication is trigger for all operation on this cluster, adding this check will enable the defaut response only for Cluster command, and not reporting for exemple (already managed in the previous fonction)

sebres commented 3 years ago

I have no idea, but it does (verified twice). In place change of return value of ind.clusterid() somewhere? (I don't know how and where this "property" could be changed, since this seems not to be a part of open source).

As for exclude POWER_CONFIGURATION_CLUSTER_ID and ELECTRICAL_MEASUREMENT_CLUSTER_ID, I guess they should be (also) "BASIC_CLUSTER_ID", "TIME_CLUSTER_ID", because I saw 0x00 and 0x0A there, so "No button handler" is still there for B1 sporadically (with resulting repeated frame, delay etc).

I guess this switch (I made to filter the clusters) would just grow with the time, why it shouldn't be either just comparison with ONOFF_CLUSTER_ID?

-    switch (ind.clusterId()) {
-        case BASIC_CLUSTER_ID:
-        case POWER_CONFIGURATION_CLUSTER_ID:
-        case TIME_CLUSTER_ID:
-        case OTAU_CLUSTER_ID:
-        case ELECTRICAL_MEASUREMENT_CLUSTER_ID:
-            return;
+    if (ind.clusterId() != ONOFF_CLUSTER_ID) {
+        return;
     }
Smanar commented 3 years ago

IDK for ONOFF_CLUSTER_ID, lol, have take a look at the code a second time, it s realy fantastic ...

Yep possible too, but you need too, at least ONOFF_CLUSTER_ID + COLOR_CLUSTER_ID + LEVEL_CLUSTER_ID + SCENE_CLUSTER_ID + BASIC_CLUSTER_ID (this one is used too in the code), et perhaps more I haven't see in the code can be used in the fonction.