OpenZWave / open-zwave

a C++ library to control Z-Wave Networks via a USB Z-Wave Controller.
http://www.openzwave.net/
GNU Lesser General Public License v3.0
1.05k stars 917 forks source link

Ozwcache corruption leading to memory corruption and crash #2038

Open petergebruers opened 4 years ago

petergebruers commented 4 years ago

The problem might be less scary than the title suggests, because this happened while I was debugging OZW, frequently starting and stopping, adding and deleting nodes and so on... It might not happen very often "in real life" but I think it was worth investigating/mitigating.

At one point, I got a SIGSEGV while testing ozwcp or MinOZW, similar to this:

./MinOZW: line 11: 89769 Segmentation fault: 11

It halted in "vector" code while retrieving the list. That's odd. I added some logging:

    ValueList::Item const *ValueList::GetItem() const
    {
        try
        {
            Log::Write(LogLevel_Warning, "m_items.size: %zu, requesting m_valueIdx: %d %s", m_items.size(), m_valueIdx, GetID().GetAsString().c_str());
            if (m_items.size()>255)
            {
                Log::Write(LogLevel_Warning, "Crash Ahead!");
            }
            return &m_items.at(m_valueIdx);
        }

That happened when I was retrieving a List value and after adding debugging code I got an implausible size for the list size, something like:

18446744073684291199

That looked a lot like memory corruption, so I decided to find the root cause.

The Value Tyoe is determined after loading ozwcache and I noticed...

<CommandClass id="112" name="COMMAND_CLASS_CONFIGURATION">
    <Compatibility />
    <State>
        <CCVersion>1</CCVersion>
        <InNif>true</InNif>
        <StaticRequests>4</StaticRequests>
    </State>
    <Instance index="1" />
    <Value type="list" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" units="" read_only="false" write_only="false" verify_changes="false" poll_intensity="0" min="8" max="255" vindex="1" size="1">
        <Help>Adapt the sensitivity of the motion detection.</Help>
        <Item label="High Sensitivity" value="8" />
        <Item label="Normal Sensitivity" value="12" />
        <Item label="Lower Sensitivity" value="128" />
        <Item label="Lowest Sensitivity" value="255" />
    </Value>
    <Value type="short" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" units="second" read_only="false" write_only="false" verify_changes="false" poll_intensity="0" min="5" max="600" value="30">
        <Help>Adapt the sensitivity of the motion detection.</Help>
    </Value>
    <Value type="byte" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" units="" read_only="false" write_only="false" verify_changes="false" poll_intensity="0" min="0" max="255" value="255">
        <Help>Adapt the sensitivity of the motion detection.</Help>
    </Value>
    <Value type="list" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" units="" read_only="false" write_only="false" verify_changes="false" poll_intensity="0" min="0" max="0" vindex="0" size="1">
        <Help>Adapt the sensitivity of the motion detection.</Help>
        <Item label="Disable" value="0" />
        <Item label="Enable" value="255" />
    </Value>
    (...)

That cannot be right... If you look at the elements only...

<Value type="list" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" ...
<Value type="short" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" ...
<Value type="byte" genre="config" instance="0" index="1" label="Ambient Light Intensity Calibration" ...

Should be...

<Value type="list" genre="config" instance="1" index="1" label="Motion detection sensitivity" ...
<Value type="short" genre="config" instance="1" index="2" label="Motion detection ON time" ...
<Value type="byte" genre="config" instance="1" index="3" label="Basic Set Level" ...

Unfortunately, I cannot reproduce this ozwcache issue, but I did save the ozwcache files.

So It dawned upon me... whenever the definition in ozwcache does not match the C++ code of the CommandClass... Crash.

For example, if OZW does this (which is in every CC)

Internal::VC::ValueDecimal* value = static_cast<Internal::VC::ValueDecimal*>(GetValue(_instance, sensorType));

It casts the Value it has "on the store" to be of type class ValueDecimal...

But if ozwcache has type="list" OZW starts writing data with memory layout ValueDecimal (blindly) to a type ValueList which has a different memory layout.

I thought... Why not try dynamic_cast to avoid this situation? This lead me to discover another interesting side effect...

In SensorMultiLevel.cpp...

auto value = dynamic_cast<Internal::VC::ValueDecimal*>(value_base);
if (value == nullptr)
{
    Log::Write(LogLevel_Warning, "SensorMultilevelCmd_Report variable value is nullptr, creating new value");
    node->CreateValueDecimal(ValueID::ValueGenre_User, GetCommandClassId(), _instance, sensorType, SensorMultiLevelCCTypes::Get()->GetSensorName(sensorType), "", true, false, "0.0", 0);
    value = dynamic_cast<Internal::VC::ValueDecimal*>(GetValue(_instance, sensorType));
    if (value == nullptr){
        Log::Write(LogLevel_Warning, "SensorMultilevelCmd_Report adding new value failed, it might exist as a different type");
        return false;
    }
}

This correctly logged the fact that my Value of type Decimal wasn't in the store, but digging a little deeper I uncovered a potential issue withe the simplified per node store IDs, if this "mismatch" happens.

Because the per node store does not all parts of the ValueID, the ID matches that of a "similar ennough ID". So I've added this "paranoid test" to Node::GetValue

Internal::VC::Value* Node::GetValue(ValueID const& _id)
{
    // This increments the value's reference count
    auto value = GetValueStore()->GetValue(_id.GetValueStoreKey());

    if (value->GetID().GetId() != _id.GetId())
    {
        Log::Write(LogLevel_Error, m_nodeId, "Node::GetValue called with: %s but GetValueStore returned: %s",
                   _id.GetAsString().c_str(), value->GetID().GetAsString().c_str());

        value->Release();
        return nullptr;
    }
    return value;
}

The idea is simple: if we call GetValueStore() with _id, the Store must return a value with the same _id. If the ids do not match, something is really wrong.

The GetAsString() function does not exist in OZW yet, I'll make a PR for that.

Then add similar logging in ValueStore.cpp:

bool ValueStore::AddValue(Value* _value)
{
    if (!_value)
    {
        return false;
    }

    uint32 key = _value->GetID().GetValueStoreKey();
    map<uint32, Value*>::iterator it = m_values.find(key);
    if (it != m_values.end())
    {
        // There is already a value in the store with this key, so we give up.
        Log::Write(LogLevel_Warning, _value->GetID().GetNodeId(), "ValueStore::AddValue called to add %s but store already has %s",
                    _value->GetID().GetAsString().c_str(), it->second->GetID().GetAsString().c_str());
        return false;
    }

If I leave the "static cast" in, the AddValue will log:

Warning, Node009, ValueStore::AddValue called to add HomeID: 0xd96a77cd, ValueID: (Id 0x00010000094c4012, NodeID 9, Genre user, CC 0x31, Instance 1, Index 1, Type decimal) but store already has HomeID: 0xd96a77cd, ValueID: (Id 0x00010000094c4014, NodeID 9, Genre user, CC 0x31, Instance 1, Index 1, Type list)
2019-12-18 10:44:55.986 Thread: 0x700003ebd000 Warning, SensorMultilevelCmd_Report adding new value failed, it might exist as a different type

It points to the mismatch between the type read from ozwcache (which got read first) "Type list" and the MultiSensor wants to add "Type decimal" with all other parts of the ID equal, this leads GetValueStoreKey to retrieve the same (simplified) key...

The memory corruption can be reproduced imho with almost any device, by first adding the device, then picking any of the values it reports as decimal and setting that to list. In my case I used a Neo Motion sensor and changed

<CommandClass id="49" name="COMMAND_CLASS_SENSOR_MULTILEVEL">
(...)
<Value type="decimal" genre="user" instance="1" index="1" label="Air Temperature" ...

to

<Value type="list" genre="user" instance="1" index="1" label="Air Temperature" ...

I don't know how to reproduce the ozwcache file issue that lead to the the crash.

candrea77 commented 4 years ago

Hi @petergebruers , sorry to bother you .... I've got a problem but I'm not sure is related to this.

I'm going into some strange behavior .... I'm running ozw with domoticz , sometimes happen than on ozw-cp some devices have strange "Current Value" labels.

Example : a couple of FGWPE/F (Fibaro wall plug) same version. On both of them I'm starting at time 0 with same label.

After some days I'm getting different value : see pic 1 (wrong : see air temperature wich does not exists on that device) and 2 (good) of different FGWPE

FGWPE-1 FGWPE-2

If I run a Node Refresh from ozw-cp, the wrong label vanish.

This kind of problem result into some random zwave error like this (from a different device type ... but it's the same problem) : _2020-02-11 16:26:19.540 Error: OpenZWave: ValueChanged: Tried adding value, not succeeded!. Node: 122 (0x7a), CommandClass: SENSOR MULTILEVEL, Label: Unknown, Instance: 1, Index: 0

I'm using : Version: 1.6-1004-g71220f43

Regards