OpenZWave / open-zwave

a C++ library to control Z-Wave Networks via a USB Z-Wave Controller.
http://www.openzwave.net/
GNU Lesser General Public License v3.0

Power meter sometimes reports wrong values #2333

Open RDols opened 4 years ago

RDols commented 4 years ago

After more than 2 years of deleting wrong values, it's getting out of hand. I have too many meters. So I started hunting for a solution, so that I don't have to search for and delete these faulty data points. So far I have found 1 really broken plug and a startup glitch in Domoticz (not fixed at this moment). To find problems I compiled a Windows debug version of Domoticz and OpenZWave, and I am running everything under the VS debugger.

I also created a booby trap for faulty values in "Meter::HandleReport", and it triggers a few times a day. I have been looking into the Silicon Labs documentation to understand the serial protocol. This is what I noticed when receiving a wrong value (typically a ridiculously high kWh value).

The message seems OK at first glance. The checksum is OK. The size of the message is OK. But on closer inspection, the size of the meter value is set to 6.

My questions:

Some details.

Regards, Richard

RicardP commented 4 years ago

Very interesting, Richard. Corrupt data is a big issue of course, and I have experienced a lot of it as well and reported it here, even corrupt/added ghost CCs. However, with later versions of OZW I see fewer occasions, though it still happens from time to time. I am also running Domoticz (normally the latest betas), but on Win10 systems. I am a user only, so I cannot contribute to your bug tracking, but I am sure your efforts and findings are important for the solution. Nice to hear that the CRC check did not fail in your test, because that could not be resolved within OZW. Thanks a lot!!

RDols commented 4 years ago

I'm not sure that the checksum never fails. I suspect the Aeotec stick drops packets with a wrong checksum. Unfortunately, last night I encountered a different error.

The checksum is correct, but the reported value is wrong while still in a "valid" range; it just does not fit the previous values. "Fixing" this reveals that two bytes are corrupted in the same bit, so the checksum still succeeds. At this moment I have no clue how this can be solved. I'm also not sure whether the data got corrupted during transmission or the plug simply reports a wrong value.
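To illustrate why a same-bit corruption in two bytes can slip through, here is a minimal sketch of an XOR-style checksum. It assumes the frame check is a plain XOR over the covered bytes seeded with 0xFF, which is my understanding of the serial frame format; `xorChecksum` and `flipBit` are hypothetical names, not OpenZWave functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// XOR-style checksum: start at 0xFF and XOR in every covered byte.
// Assumption: this mirrors the serial frame check; it is not a real CRC.
std::uint8_t xorChecksum(const std::vector<std::uint8_t>& covered)
{
    std::uint8_t sum = 0xFF;
    for (std::uint8_t b : covered)
        sum ^= b;
    return sum;
}

// Hypothetical helper: flip one bit (0..7) of the byte at `index`.
void flipBit(std::vector<std::uint8_t>& bytes, std::size_t index, unsigned bit)
{
    bytes[index] ^= static_cast<std::uint8_t>(1u << bit);
}
```

Flipping bit 3 in one byte changes the checksum, but flipping bit 3 in two different bytes cancels out and leaves it unchanged. That is why an XOR checksum cannot catch this kind of error, whereas a real CRC normally would.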

RDols commented 4 years ago

All kinds of things go wrong in the data. Some can be detected, but not all. I have seen faulty meter values that were just 200 Wh off.

I also noticed that the newer NAS-WR01Z (0x1027 / 0x0200) are much worse than the older ones (0x1087 / 0x0003).

I'm able to reproduce the problems mentioned in the next two issues:

1724

2062

RDols commented 4 years ago

@RicardP Question: are you using an Aeotec Stick Gen5?

RicardP commented 4 years ago

@RDols Sorry for the slow response to your feedback. I see your results, and it is interesting that the devices themselves seem to be that buggy, randomly reporting bogus data... but I am not convinced :) As previously reported to @Fishwaldo, I've seen that occurrences of bogus data are closely related to bad RF link quality and "jumping Nodes", i.e. Nodes depending on packets relayed via neighbors. Nodes with a good RF link and a direct connection to the Controller almost never report bogus data.

Yes, I am using the Aeotec Gen5 Stick in my production system, just because this stick supports backup/restore! In my experience it is not very reliable compared to the ZWay.me UZB. As soon as (probably not very soon...) backup is supported by ZWave.me in a "user friendly" way, I will switch to their UZB instead.

Fishwaldo commented 4 years ago

Essentially - OZW just reports whatever the devices send. We don't "massage" values etc. In this case, I've seen devices occasionally send weird values.

As you note - the CRC is correct (it's checked in the OZW code very early on), so "corruption" at the RF level, while not impossible, is unlikely. I generally see this occurring with a small subset of vendors/devices (it's not a universal thing), and based on other weird reporting issues, it is, as far as I can tell, related to the SDK version of Z-Wave used on some devices. I've reported it previously to SiLabs and they acknowledged it, but closed the ticket with a "please test with the latest SDK version".

The documentation states only sizes of 1, 2 and 4 are valid. Maybe it would be good to reject other sizes. Or are there meters that use this field out-of-spec?

Correct. The spec only does byte, short and int sizes. I'll add something to the code to drop anything else.
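As a rough sketch of such a check (not the actual OZW change), the function below decodes the precision/scale/size byte and rejects any size other than 1, 2 or 4. The bit layout (precision in bits 7-5, scale in bits 4-3, size in bits 2-0) is my understanding of the value encoding; `DecodedValue` and `decodeValue` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

struct DecodedValue {
    std::int32_t raw;        // sign-extended integer, before applying precision
    std::uint8_t precision;  // number of decimal places
    std::uint8_t scale;      // unit selector (only the bits in this byte)
};

// `data` points at the precision/scale/size byte; `length` is the number of
// bytes available from there. Returns false for out-of-spec or truncated input.
bool decodeValue(const std::uint8_t* data, std::size_t length, DecodedValue& out)
{
    if (data == nullptr || length < 1)
        return false;

    const std::uint8_t size = data[0] & 0x07;
    if (size != 1 && size != 2 && size != 4)
        return false;                        // e.g. the size of 6 seen above
    if (length < static_cast<std::size_t>(size) + 1)
        return false;                        // value bytes are missing

    std::uint32_t bits = 0;
    for (std::uint8_t i = 0; i < size; ++i)
        bits = (bits << 8) | data[1 + i];    // big-endian on the wire

    // Sign-extend according to the declared size.
    if (size == 1)
        out.raw = static_cast<std::int8_t>(bits);
    else if (size == 2)
        out.raw = static_cast<std::int16_t>(bits);
    else
        out.raw = static_cast<std::int32_t>(bits);

    out.precision = static_cast<std::uint8_t>((data[0] & 0xE0) >> 5);
    out.scale = static_cast<std::uint8_t>((data[0] & 0x18) >> 3);
    return true;
}
```

A caller would drop the whole report when this returns false instead of passing a half-decoded value on to the application.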

RDols commented 4 years ago

I see a lot of things go wrong. Simple things like wrong sizes, wrong message version (also not checked), wrong meter type. I also noticed that when a message is really messed up, it is received several times just milliseconds apart. I have even seen "merged" reports of two different plugs. The Aeotec stick is not off the hook as a problem source.

I do not know the details of the Z-Wave protocol or the serial protocol of Silicon Labs. I just found the documentation and jumped to the parts I'm interested in. So when you say CRC, do you mean a "real" CRC or the simpler XOR checksum?

What's the point of view of the community concerning specific filtering? In my fork I created a "filter" that rejects reports of kWh meters with extreme values, or when the value differs too much from the previous value (if present). It's not elegant, even ugly... but in the last two days it caught 100% of the remaining errors.
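A filter along those lines could look roughly like the sketch below. This is not the code from the fork; `KwhFilter` and both thresholds are hypothetical and would need tuning per installation.

```cpp
#include <cmath>

// Plausibility filter for a cumulative kWh counter (illustrative sketch).
class KwhFilter {
public:
    // maxAbsolute: largest reading considered believable (assumed value).
    // maxStep: largest increase allowed between two reports (assumed value).
    KwhFilter(double maxAbsolute, double maxStep)
        : m_maxAbsolute(maxAbsolute), m_maxStep(maxStep),
          m_hasPrevious(false), m_previous(0.0) {}

    // Returns true if the reading is accepted and becomes the new baseline.
    bool accept(double kwh)
    {
        if (!std::isfinite(kwh) || kwh < 0.0 || kwh > m_maxAbsolute)
            return false;                    // extreme or negative value
        if (m_hasPrevious) {
            const double delta = kwh - m_previous;
            if (delta < 0.0 || delta > m_maxStep)
                return false;                // counter went back or jumped
        }
        m_previous = kwh;
        m_hasPrevious = true;
        return true;
    }

private:
    double m_maxAbsolute;
    double m_maxStep;
    bool m_hasPrevious;
    double m_previous;
};
```

Two limitations of this approach: a corrupt but plausible first reading becomes the baseline, and a legitimate counter reset is rejected until the filter itself is reset.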

RDols commented 4 years ago

FYI,

The newer NAS-WR01Z (0x1027 / 0x0200) plugs are horrible.

RicardP commented 4 years ago

@RDols How do you know the power reports are corrupted before sending? In my experience it sounds a lot like bad RF link quality, especially as you say it works only a few meters away (from the controller, I guess?). Bad RF/antenna design spiced up with buggy FW... China crap once again!?

RDols commented 4 years ago

I have 20 "old" NAS-WR01Z (0x1087 / 0x0003).
I have 2 Fibaro plugs.
I have several built-in switches with power reports.

None of the above have this issue. (with the corrupted data)

I have 3 NAS-WR01Z (0x1027 / 0x0200), all sending a "corrupt" report every other report. If no power is delivered, all messages report the same kWh, but sometimes it gets stuck on the "corrupt" report and sometimes it's correct, until the counter is running again. This is not an RF issue; it's way too predictable and selective. Oh, and it also applies to the previous value in the message.

Note that I'm only talking about the faulty reports with high negative values.

The bad RF and the "illegal" dialing back of the kWh counter I have also seen in the older NAS-WR01Z (0x1087 / 0x0003) plugs.

RicardP commented 4 years ago

The really annoying thing with ZWave is that it has grown to be the major wireless automation system, but the owner SiLabs still does not seem to mind how manufacturers perform in products and quality... I have many different brands in my ZWave network and not one is really living up to the ZWave marketing. Nodes need to be power cycled after some time, and Nodes are barely able to communicate/respond (i.e. lots of timeouts) over short distances, even with two or three powered neighbors available to relay packets. Why is there no real built-in track and trace for issues in a ZWave network and its devices? Maybe that is the reason no one complains about ZWave: "no one knows what is going on"?

I have certainly purchased ZWave products for 3500 USD, and about half of those items did not work as they should, but I just replaced them with another brand or version, since it is not possible to track down the real cause... It is a BIG SHAME that SiLabs accepts all the crappy products on the market. They should be more careful with people who put their trust (money) in them!

markruys commented 3 years ago

It seems to be a bug in this type of device. I get these consecutive kWh metering values reported:

       37.86          0x00000ECA
-21474798.43  0xFFFFFFFF80000EDD
       38.26          0x00000EF2
-21474798.11  0xFFFFFFFF80000EFD
       38.40          0x00000F00

So what happens, I think, is that the most significant bit is randomly being set. OZW expects a 32-bit signed int, hence the negative readings. According to the docs of this device, the maximum report value is 21474836.47 kWh, which is 0x7FFFFFFF. The only workaround for this device bug would be to reset this most significant bit to 0 somewhere in the software stack. Then we would have:

37.86
38.05
38.26
38.37
38.40
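
The arithmetic behind those numbers can be sketched as follows. This is only an illustration of the proposed workaround, assuming a 4-byte value with 2 decimals; `asSigned`, `clearSignBit` and `toKwh` are hypothetical helper names.

```cpp
#include <cstdint>

// Interpret the 4 raw bytes as a signed 32-bit value, as OZW does.
std::int32_t asSigned(std::uint32_t raw)
{
    return static_cast<std::int32_t>(raw);
}

// Proposed workaround: force the most significant bit to 0.
// Caveat: this also makes genuinely negative readings impossible.
std::uint32_t clearSignBit(std::uint32_t raw)
{
    return raw & 0x7FFFFFFFu;
}

// Apply the precision of 2 decimals used by this device.
double toKwh(std::int32_t value)
{
    return value / 100.0;
}
```

For example, 0x80000EDD read as signed is -2147479843 (-21474798.43 kWh), while clearing the top bit gives 0x00000EDD = 3805 (38.05 kWh).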

olavt commented 2 years ago

I have written my own controller code from scratch and I'm also seeing messages from devices with correct checksum, but with payloads that are invalid according to the specification.

I'm starting to add checks to gather more evidence on this.

Take a look here:

https://community.silabs.com/s/question/0D58Y00009KIx2zSAD/invalid-commandclasssensormultilevel-messages-received

olavt commented 2 years ago

I now have evidence (from running Zniffer) that many nodes in a Z-Wave network will send invalid messages (incorrect payload). This seems to occur somewhat at random and typically affects only a small portion of the messages (around 0.08%). I have seen this problem across nodes with Z-Wave versions 4.5, 6.4 and 6.7.