dresden-elektronik / deconz-rest-plugin

deCONZ REST-API plugin to control ZigBee devices
BSD 3-Clause "New" or "Revised" License

Device binding: Infinite loop trying to read date code on Tuya Blinds Motor (AM43) #7538

Closed Monofin closed 5 months ago

Monofin commented 6 months ago


Describe the bug

Hi all: I've pulled apart the logs from trying to get a _TZE200_rddyvrci (an AM43 blinds motor that uses the Tuya cluster) to join the network, and there's definitely something broken. It's completely repeatable, and plausibly a regression: these same devices were paired to the network under previous versions of deCONZ, but removing one and then trying to re-add it fails.

It looks like a part of the simple descriptors code (the date code read is line 14387 in file de_web_plugin.cpp) is getting stuck in an infinite loop trying to get the date code from the device, but timing out. The loop continues until the network is closed for joining.

It seems possible that the dateCodeAvailable flag is being set erroneously (a read of attribute 0x0006 always seems to time out on these devices after binding).

Procedure for joining these devices is usually this: Open net for joining, device in pairing mode, wait until it shows up in GUI, then re-read the simple descriptors. If this second read doesn't complete, then it doesn't become a window covering and remains stuck as a battery or smart plug.

The loop occurs after the device has joined as a 'battery', but after the second read of the clusters it enters this loop: (Device ID is 0x847127FFFECD7AAF)

The log snip: (More logs available)

17:27:11:872 Clear fast probe timeout for cluster 0x0000, 0x847127FFFECD7AAF
17:27:11:889 [4.1] Get date code
17:27:11:890 [4.2] get basic cluster attr 0x0006 for 0x847127fffecd7aaf
17:27:12:028 TY_DATA_REPORT: seq 17, dpid: 0x09, type: 0x02, length: 4, val: 0
17:27:12:029 TY_DATA_REPORT: seq 17, dpid: 0x09, type: 0x02, length: 4, val: 0
17:27:12:029 TY_DATA_REPORT: seq 17, dpid: 0x09, type: 0x02, length: 4, val: 0
17:27:12:029 TY_DATA_REPORT: seq 17, dpid: 0x09, type: 0x02, length: 4, val: 0
17:27:12:030 TY_DATA_REPORT: seq 17, dpid: 0x09, type: 0x02, length: 4, val: 0
17:27:12:076 MAC poll fastEnddeviceProbe() 0x847127FFFECD7AAF
17:27:12:076 wait response fastEnddeviceProbe() 0x847127FFFECD7AAF, elapsed 186 ms
17:27:12:080 MAC poll fastEnddeviceProbe() 0x847127FFFECD7AAF
17:27:12:080 wait response fastEnddeviceProbe() 0x847127FFFECD7AAF, elapsed 190 ms
17:27:12:084 MAC poll fastEnddeviceProbe() 0x847127FFFECD7AAF
17:27:12:084 wait response fastEnddeviceProbe() 0x847127FFFECD7AAF, elapsed 194 ms
17:27:12:113 FP indication 0x0104 / 0x0000 (0x847127FFFECD7AAF / 0x8462)
17:27:12:113 ... (0x847127FFFECD7AAF / 0x8462)
17:27:12:114 Clear fast probe timeout for cluster 0x0000, 0x847127FFFECD7AAF

I'll have a look at the code here and see if I can see anything obvious - this seems to be a regression for me, as I have several of the same devices (including this one) that previously joined up perfectly with the same process.

Steps to reproduce the behavior

Procedure for joining these devices is usually this: Open net for joining, device in pairing mode, wait until it shows up in GUI, then re-read the simple descriptors.

If this second read doesn't complete, then it doesn't become a window covering and remains stuck as a battery or smart plug.

Expected behavior

The device should be identified and appear in Phoscon as a light (!): it is described as a Thermostat in Phoscon and as a Smart Plug in the GUI, but named as a Battery in both. It will then function correctly as a window covering.

Monofin commented 6 months ago

(deCONZ logs were too long to attach directly)

Smanar commented 6 months ago

Hello, according to this capture, I think your device doesn't have a date code or a SW build ID: https://github.com/dresden-elektronik/deconz-rest-plugin/issues/4663#issue-842739755

If I'm right, after some tries the legacy code should skip this part because of these lines:

        // manufacturer, model id, sw build id
        if (!sensor || modelId.isEmpty() || manufacturer.isEmpty() || (swBuildId.isEmpty() && dateCode.isEmpty() && (dateCodeAvailable || swBuildIdAvailable)))
        {

dateCodeAvailable and swBuildIdAvailable should be false after the first try.
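
To make the condition above easier to reason about, here is a minimal standalone sketch of the same boolean structure using only the standard library. The struct `BasicClusterState` and the function `needsBasicClusterRead` are hypothetical names for illustration, not deCONZ types, and `std::string` stands in for `QString`.

```cpp
#include <string>

// Hypothetical stand-in for the state the legacy code inspects. The field
// names mirror the quoted condition, but this struct is not a deCONZ type.
struct BasicClusterState {
    bool hasSensor = false;
    std::string modelId;
    std::string manufacturer;
    std::string swBuildId;
    std::string dateCode;
    bool dateCodeAvailable = true;
    bool swBuildIdAvailable = true;
};

// Same boolean structure as the quoted condition: true means
// "keep querying the Basic cluster".
bool needsBasicClusterRead(const BasicClusterState &s)
{
    return !s.hasSensor || s.modelId.empty() || s.manufacturer.empty() ||
           (s.swBuildId.empty() && s.dateCode.empty() &&
            (s.dateCodeAvailable || s.swBuildIdAvailable));
}
```

With a known sensor, model ID and manufacturer but empty `swBuildId` and `dateCode`, the function keeps returning true for as long as either availability flag is true, which would match the repeated reads seen in the log. Once both flags are false it returns false.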

Inclusion is random for Tuya devices with the legacy code, and from memory this code hasn't changed in a long time.

New devices now use the DDF core, so they don't have this issue, but the problem is that it's not possible to make a DDF for a Tuya covering (one using the Tuya cluster).

You can try the method using deCONZ, though I'm not sure it still works. The device needs to be included in deCONZ and visible in the GUI, but not included in the API: no device at all, no battery sensor and no plug/light. Make sure deCONZ knows the device's model ID and manufacturer name (you can request the Basic cluster attributes). Only then set Phoscon to permit join and ask for the descriptors (right-click on the node, then the first 3 requests). This triggers an inclusion, and as deCONZ already knows the model ID, it uses a special hack in the code to directly manage the device as a light.

Monofin commented 6 months ago

Pulling the code apart a bit, it looks like the 0x0006 attribute is not being flagged as 'unavailable', possibly because the device mis-reports or misuses attribute 0x0006 in cluster 0x0000. The result is that the device has reported (once, probably on first bind) that cluster 0x0000 has attribute 0x0006, but then does not answer reads of it properly. This is similar to the Lumi devices, which seem to have a special skip over some of these issues.

I think one fix may be to force the flag of the attribute to 'unavailable' when working on the device type...

Monofin commented 6 months ago

By excluding the specific model number from the date code read, I've prevented the infinite loop, which is a start. However, the device still doesn't become a window covering and remains a battery (!)
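
The actual patch is not shown in this thread. As a rough sketch of the kind of exclusion described, assuming the check keys on the manufacturer name string `_TZE200_rddyvrci`, it could look like the following. The names `kSkipDateCodeManufacturers` and `shouldReadDateCode` are hypothetical and do not exist in the plugin.

```cpp
#include <array>
#include <string>

// Hypothetical skip list. Only this one entry comes from the issue; the
// list itself is an assumption made for illustration.
static const std::array<const char *, 1> kSkipDateCodeManufacturers = {{"_TZE200_rddyvrci"}};

// Returns false for devices on the skip list, otherwise defers to the
// availability flag computed elsewhere.
bool shouldReadDateCode(const std::string &manufacturer, bool dateCodeAvailable)
{
    for (const char *name : kSkipDateCodeManufacturers) {
        if (manufacturer == name) {
            return false; // never queue the 0x0006 read for this device
        }
    }
    return dateCodeAvailable;
}
```

A skip list like this only stops the read from being queued. It does not explain why the attribute was never marked unavailable in the first place.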

It now seems to be stuck in this loop... more debugging :-)

12:50:35:466 MAC poll fastEnddeviceProbe() 0x847127FFFECD7AAF
12:50:35:467 don't create binding for attribute reporting of sensor Battery 132
12:50:35:467 skip check bindings for client clusters (no group)
12:50:35:470 don't create binding for attribute reporting of sensor Battery 132

Smanar commented 6 months ago

It needs to be marked as not available by this line:

dateCodeAvailable = std::find(unavailBasicAttr.cbegin(), unavailBasicAttr.cend(), 0x0006) == unavailBasicAttr.cend();
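
For readers unfamiliar with the idiom, the line above sets the flag to true when 0x0006 is *not* found in the list of unavailable attributes. A minimal standalone sketch follows, assuming the list is a container of 16-bit attribute IDs. `isAttrAvailable` is a hypothetical helper name, not a plugin function.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Returns true when attrId is NOT in the list of attributes known to be
// unavailable, i.e. the attribute is still considered readable.
bool isAttrAvailable(const std::vector<std::uint16_t> &unavailBasicAttr, std::uint16_t attrId)
{
    return std::find(unavailBasicAttr.cbegin(), unavailBasicAttr.cend(), attrId) == unavailBasicAttr.cend();
}
```

So if the device never causes 0x0006 to be added to that list, the flag stays true on every pass.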

In the legacy code there are 2 parts, one for sensors and one for routers. This device is an end device, so by default it uses the sensor part.

So there is a hack to force the detection

{"_TZE200_rddyvrci", "TS0601", "Moes", "Tuya_COVD AM43-0.45/40-ES-EZ(TY)"},

            if (R_GetProductId(&lightNode).startsWith(QLatin1String("Tuya_COVD")) || //Battery covering
                R_GetProductId(&lightNode) == QLatin1String("NAS-AB02B0 Siren"))     // Tuya siren
            {
                hasServerOnOff = true;
            }

The manufacturer code NEEDS to be 0x1002.
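
The two snippets above work together: a table maps manufacturer name and model ID to a product ID, and the light-node code then checks the product ID prefix. A minimal standalone sketch of that relationship, using the standard library instead of Qt, is shown here. `ProductMapEntry`, `kProducts`, `lookupProductId` and `treatAsOnOffServer` are hypothetical names, and the column meanings are inferred from the quoted row.

```cpp
#include <string>
#include <vector>

// Hypothetical row layout modelled on the quoted table entry.
struct ProductMapEntry {
    std::string manufacturer;
    std::string modelId;
    std::string vendor;
    std::string productId;
};

static const std::vector<ProductMapEntry> kProducts = {
    {"_TZE200_rddyvrci", "TS0601", "Moes", "Tuya_COVD AM43-0.45/40-ES-EZ(TY)"},
};

// Returns the product ID for a known manufacturer/model pair, or an empty
// string when either value is unknown.
std::string lookupProductId(const std::string &manufacturer, const std::string &modelId)
{
    for (const auto &e : kProducts) {
        if (e.manufacturer == manufacturer && e.modelId == modelId) {
            return e.productId;
        }
    }
    return {};
}

// Mirrors the quoted check: battery coverings (prefix match) and the Tuya
// siren (exact match) are treated as having a server On/Off cluster.
bool treatAsOnOffServer(const std::string &productId)
{
    return productId.rfind("Tuya_COVD", 0) == 0 || productId == "NAS-AB02B0 Siren";
}
```

In this sketch, an unknown manufacturer name or model ID yields an empty product ID, so the prefix check fails.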

The router part runs before deCONZ knows the manufacturer name and the model ID (and so the product ID), so this check can't work at the start. But the "magic" needs to happen in the "router" function (void DeRestPluginPrivate::addLightNode(const deCONZ::Node *node)); your logs are from the "sensor" part, which is used only for the battery.

And now the problem: it's not "synchronous" at all. Honestly, I can't say in which order all the actions happen; I just know that sometimes it worked and sometimes not. From memory, the "router" part is just ignored, and I can't say why.

With the DDF core this is no longer an issue and devices are correctly recognised, but for a Tuya covering it's not possible to create a state/open field. It's possible with a hack, using state/on instead for example, but the DDF core doesn't support Tuya coverings yet.

Edit: Do you know if it's possible to include a device using a DDF, then remove the DDF so that the device is handled by the legacy code? If so, that could be a solution for you. I will ask the other devs.

ebeasant-arm commented 6 months ago

So, I did actually manage to get it included: after a day of leaving the device included as 'just a battery', I re-opened the network for joining and hit 'read basic descriptors', and magic happened.

I'm currently trying to debug why, and if it can be made reliable on the first read....

Smanar commented 6 months ago

I'm currently trying to debug why, and if it can be made reliable on the first read

Honestly, don't spend too much time on it; the legacy core is no longer supported.

github-actions[bot] commented 5 months ago

As there has not been any response in 21 days, this issue has been automatically marked as stale. @ OP: Please either close this issue or keep it active. It will be closed in 7 days if no further activity occurs.

github-actions[bot] commented 5 months ago

As there has not been any response in 28 days, this issue will be closed. @ OP: If this issue is solved post what fixed it for you. If it is not solved, request to get this opened again.