Open petergebruers opened 4 years ago
Hi @petergebruers, sorry to bother you. I've got a problem, but I'm not sure it is related to this.
I'm seeing some strange behavior. I'm running OZW with Domoticz, and sometimes some devices show strange "Current Value" labels in ozw-cp.
Example: a couple of FGWPE/F (Fibaro wall plug) devices, same version. Both of them start at time 0 with the same labels.
After some days I get different values: see pic 1 (wrong: note the air temperature, which does not exist on that device) and pic 2 (good), taken from two different FGWPE devices.
If I run a Node Refresh from ozw-cp, the wrong label vanishes.
This kind of problem results in random Z-Wave errors like this one (from a different device type, but it's the same problem): 2020-02-11 16:26:19.540 Error: OpenZWave: ValueChanged: Tried adding value, not succeeded!. Node: 122 (0x7a), CommandClass: SENSOR MULTILEVEL, Label: Unknown, Instance: 1, Index: 0
I'm using version: 1.6-1004-g71220f43
Regards
The problem might be less scary than the title suggests, because this happened while I was debugging OZW, frequently starting and stopping, adding and deleting nodes and so on... It might not happen very often "in real life" but I think it was worth investigating/mitigating.
At one point, I got a SIGSEGV while testing ozwcp or MinOZW, similar to this:
./MinOZW: line 11: 89769 Segmentation fault: 11
It halted in "vector" code while retrieving a List value. That's odd, so I added some logging and got an implausible list size, something like:
18446744073684291199
That looked a lot like memory corruption, so I decided to find the root cause.
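A small sketch of why that number is implausible as a list size: it sits just below 2^64, which is what a small negative number looks like when it is read as an unsigned 64-bit size. The helper name `DistanceBelowWrap` is hypothetical, and the sketch assumes a 64-bit size type; it only does the arithmetic and says nothing about where the bad value came from.

```cpp
#include <cassert>
#include <cstdint>

// The size logged above (the issue says "something like" this value).
constexpr std::uint64_t kReportedSize = 18446744073684291199ULL;

// Distance from v up to 2^64, using well-defined unsigned wrap-around.
constexpr std::uint64_t DistanceBelowWrap(std::uint64_t v) {
    return std::uint64_t{0} - v;
}

// Usage: the reported size is only about 25 million short of 2^64,
// far beyond anything a real list of strings could hold.
inline void CheckReportedSize() {
    assert(DistanceBelowWrap(kReportedSize) == 25260417ULL);
}
```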
The Value Type is determined after loading ozwcache, and I noticed...
That cannot be right... If you look at the elements only...
Should be...
Unfortunately, I cannot reproduce this ozwcache issue, but I did save the ozwcache files.
So it dawned on me... whenever the definition in ozwcache does not match the C++ code of the CommandClass... Crash.
For example, if OZW does this (which is in every CC)
It casts the Value it has "in the store" to type class ValueDecimal...
But if ozwcache has
type="list"
OZW starts (blindly) writing data with the memory layout of ValueDecimal into an object of type ValueList, which has a different memory layout.
I thought... Why not try dynamic_cast to avoid this situation? This led me to discover another interesting side effect...
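The difference between the two casts can be sketched roughly as follows. The classes here are hypothetical stand-ins, not the real OZW definitions, and `SetDecimal` is an illustrative helper name; the point is only that `dynamic_cast` returns a null pointer on a type mismatch, where `static_cast` would compile and then write through the wrong layout.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for OZW's value classes; not the real definitions.
struct Value {
    virtual ~Value() = default;  // polymorphic base, required for dynamic_cast
};

struct ValueDecimal : Value {
    std::string value;
};

struct ValueList : Value {
    std::vector<std::string> items;
    std::int32_t selected = 0;
};

// Returns true only if `stored` really is a ValueDecimal and was updated.
// A static_cast here would compile even for a ValueList and then write a
// std::string over its members, which is undefined behaviour.
bool SetDecimal(Value* stored, const std::string& text) {
    auto* decimal = dynamic_cast<ValueDecimal*>(stored);
    if (decimal == nullptr) {
        return false;  // type mismatch, e.g. the cache said type="list"
    }
    decimal->value = text;
    return true;
}

// Usage: the decimal is updated, the list is rejected and left untouched.
inline void CheckCasts() {
    ValueDecimal decimal;
    ValueList list;
    list.items = {"a", "b"};
    assert(SetDecimal(&decimal, "21.5"));
    assert(!SetDecimal(&list, "21.5"));
    assert(list.items.size() == 2);
}
```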
In SensorMultiLevel.cpp...
This correctly logged the fact that my Value of type Decimal wasn't in the store, but digging a little deeper I uncovered a potential issue with the simplified per-node store IDs, if this "mismatch" happens.
Because the per-node store does not use all parts of the ValueID, the ID matches that of a "similar enough" ID. So I've added this "paranoid test" to Node::GetValue:
The idea is simple: if we call GetValueStore() with _id, the Store must return a value with the same _id. If the ids do not match, something is really wrong.
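A minimal model of that check, assuming a store whose key leaves out the value type. Everything here (`ValueID` fields, `ValueStore`, `Key`, `GetValueChecked`) is a simplified, hypothetical stand-in and not the actual OZW code; the real ValueID and GetValueStoreKey carry more information.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Hypothetical, simplified model; the real ValueID has more fields.
enum class ValueType : std::uint8_t { Decimal, List };

struct ValueID {
    std::uint8_t commandClass;
    std::uint8_t instance;
    std::uint16_t index;
    ValueType type;
    bool operator==(const ValueID& o) const {
        return commandClass == o.commandClass && instance == o.instance &&
               index == o.index && type == o.type;
    }
};

struct Value {
    ValueID id;
};

class ValueStore {
public:
    // The key deliberately ignores the type, so two IDs that differ only
    // in type map to the same slot.
    static std::uint32_t Key(const ValueID& id) {
        return (std::uint32_t{id.commandClass} << 24) |
               (std::uint32_t{id.instance} << 16) | id.index;
    }
    // Returns false if the slot is already taken.
    bool Add(const ValueID& id) { return values_.emplace(Key(id), Value{id}).second; }
    const Value* Find(const ValueID& id) const {
        auto it = values_.find(Key(id));
        return it == values_.end() ? nullptr : &it->second;
    }
private:
    std::map<std::uint32_t, Value> values_;
};

// "Paranoid" lookup: reject a hit whose full ID differs from the request.
const Value* GetValueChecked(const ValueStore& store, const ValueID& id) {
    const Value* found = store.Find(id);
    if (found != nullptr && !(found->id == id)) {
        return nullptr;  // same simplified key, different ValueID
    }
    return found;
}

// Usage: the cache registers a list first; a later decimal lookup collides.
inline void CheckCollision() {
    ValueStore store;
    const ValueID fromCache{0x31, 1, 0, ValueType::List};
    const ValueID fromCode{0x31, 1, 0, ValueType::Decimal};
    assert(store.Add(fromCache));
    assert(store.Find(fromCode) != nullptr);              // plain lookup hits the list
    assert(GetValueChecked(store, fromCode) == nullptr);  // checked lookup refuses it
}
```

In this model the plain lookup hands back the list entry for a decimal request, which is the situation where an unchecked cast goes wrong; the checked lookup turns it into a detectable miss.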
The GetAsString() function does not exist in OZW yet; I'll make a PR for that.
Then add similar logging in ValueStore.cpp:
If I leave the "static cast" in, the AddValue will log:
It points to the mismatch between the type read from ozwcache (which got read first), "Type list", and the type the MultiSensor wants to add, "Type decimal". With all other parts of the ID equal, this leads GetValueStoreKey to retrieve the same (simplified) key...
The memory corruption can be reproduced imho with almost any device, by first adding the device, then picking any of the values it reports as decimal and setting that to list. In my case I used a Neo Motion sensor and changed
to
I don't know how to reproduce the ozwcache file issue that led to the crash.