OpenZWave / qt-openzwave

QT5 Wrapper for OpenZWave
GNU Lesser General Public License v3.0
105 stars 30 forks source link

Weird topic behavior after factory resetting a node #67

Closed kpine closed 2 years ago

kpine commented 4 years ago

I accidentally factory reset a node that was in the network while testing. It looks like OZW responded by auto-failing the node and the removing the failed node. During that time there was something weird going on with the published topics.

First, this node 0 topic was published, then eventually removed:

OpenZWave/1/node/0/instance/0/commandclass/0

Second, for the node that was removed, it has a left over topic and message:

OpenZWave/1/node/5/association/1/,{    "Name": "All Reports",    "Help": "",    "MaxAssociations": 5,    "Members": [        "1.0"    ],    "TimeStamp": 1588206032}

Attached are the logs during the reset.

failed.txt

Fishwaldo commented 4 years ago
kpine commented 4 years ago

Remove Association Topics when Shutting down.

Another data point to add, it seems like the association topic is just not removed when a node in general is removed. I just did a normal "remove node" and everything besides the association topic are gone for that node. Restarting clears it out.

Fishwaldo commented 4 years ago

This should be fixed now. Can you test?

(we also remove the statistics topic when removing a node)

kpine commented 4 years ago

Sure, I'll give it a try within the next couple of days.

kpine commented 4 years ago

So I think everything is looking OK. I tried two sequences, 1) add a node and remove it, and 2) add a node and factory reset it. Both seemed to remove all the node topics.

The only weirdness I saw was with case 2. After the factory reset the node is failed then removed. In MQTT the statistics topic is still still appears, in a strange way. This could just be an MQTT Explorer bug, but I didn't see the same then when I did a normal remove.

image

At the end of the factory reset sequence, I see the following in the ozwdaemon logs:

Finalize PUBLISH: topic: QMqttTopicName("OpenZWave/1/node/10/statistics/")  payloadLength: 0

Does a payload length of 0 means it is removing the topic? If so, then I'd have to give that one to MQTT explorer. Restarting it shows the expected values. The MQTT dump from HA also does not show any remnants of the removed nodes.

I attached the dumps and logs just in case they are of interest. Unless you see anything strange that would explain the MQTT Explorer issue, I'd consider this closed.

mqtt_dumps.zip ozwd_log.zip

kpine commented 4 years ago

I did do something problematic though. I tried the factory reset sequence, and the device send the local factory reset command, which would normally trigger the failed node removal. However, in this case OZW "revived" the node (I don't know why) and did not fail it.

[20200507 0:26:34.774 PDT] [ozw.library] [debug]: Detail - Node: 12   Received: 0x01, 0x0a, 0x00, 0x04, 0x00, 0x0c, 0x02, 0x5a, 0x01, 0xc9, 0x00, 0x6d
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:26:34.775 PDT] [ozw.library] [info]: Info - Node: 12 Received Device Reset Locally from node 12
[20200507 0:26:34.775 PDT] [ozw.library] [info]: Info - Node: 12 NoOperation::Set - Routing=true
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 12 Queuing (WakeUp) NoOperation_Set (Node=12): 0x01, 0x09, 0x00, 0x13, 0x0c, 0x02, 0x00, 0x00, 0x25, 0xa6, 0x68
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 12 Queuing (Controller) Has Node Failed
[20200507 0:26:34.775 PDT] [ozw.library] [info]: Info - Node: 0 Requesting whether node 12 has failed
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 255 Queuing (Command) ControllerCommand_HasNodeFailed: 0x01, 0x04, 0x00, 0x62, 0x0c, 0x95
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 12 Notification: HasNodeFailed ControllerCommand - Starting
[20200507 0:26:34.775 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:26:34.776 PDT] [ozw.library] [info]: Info - Node: 255 Sending (Command) message (Callback ID=0x00, Expected Reply=0x62) - ControllerCommand_HasNodeFailed: 0x01, 0x04, 0x00, 0x62, 0x0c, 0x95
[20200507 0:26:34.776 PDT] [ozw.library] [info]: Info - Node: 255 Encrypted Flag is 0
[20200507 0:26:34.776 PDT] [ozw.notifications] [debug]: Notification pvt_controllerCommand  6 1 0
[20200507 0:26:34.776 PDT] [ozw.mqtt.publisher] [debug]: Publishing Event controllerCommand 12 NotificationTypes::Ctrl_Cmd_HasNodeFailed NotificationTypes::Ctrl_State_Starting NotificationTypes::Ctrl_Error_None
[20200507 0:26:34.776 PDT] [qt.mqtt.connection] [debug]: qint32 QMqttConnection::sendControlPublish(const QMqttTopicName&, const QByteArray&, quint8, bool, const QMqttPublishProperties&) QMqttTopicName("OpenZWave/1/event/hasnodefailed/")  Size: 83  bytes. QoS: 0  Retain: f
alse
[20200507 0:26:34.776 PDT] [qt.mqtt.connection.verbose] [debug]: bool QMqttConnection::writePacketToTransport(const QMqttControlPacket&)  DataSize: 119
[20200507 0:26:34.778 PDT] [qt.mqtt.connection.verbose] [debug]: void QMqttConnection::transportReadReady()
[20200507 0:26:34.778 PDT] [qt.mqtt.connection.verbose] [debug]: Received PUBLISH
[20200507 0:26:34.778 PDT] [qt.mqtt.connection.verbose] [debug]: Finalize PUBLISH: topic: QMqttTopicName("OpenZWave/1/event/hasnodefailed/")  payloadLength: 83
[20200507 0:26:34.782 PDT] [ozw.library] [debug]: Detail - Node: 255   Received: 0x01, 0x04, 0x01, 0x62, 0x00, 0x98
[20200507 0:26:34.782 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:26:34.782 PDT] [ozw.library] [warning]: Warning - Node: 12 Received reply to FUNC_ID_ZW_IS_FAILED_NODE_ID - node 12 has not failed
[20200507 0:26:34.782 PDT] [ozw.library] [critical]: Error - Node: 12 WARNING: node revived
[20200507 0:26:34.782 PDT] [ozw.library] [debug]: Detail - Node: 255   Expected reply was received
[20200507 0:26:34.782 PDT] [ozw.library] [debug]: Detail - Node: 255   Message transaction complete
[20200507 0:26:34.783 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:26:34.783 PDT] [ozw.library] [debug]: Detail - Node: 255 Removing current message
[20200507 0:26:34.783 PDT] [ozw.library] [debug]: Detail - Node: 12 Notification: Notification - Node Alive
[20200507 0:26:34.783 PDT] [ozw.library] [debug]: Detail - Node: 12 Notification: HasNodeFailed ControllerCommand - NodeOK

The node ID and all its values persist (even after a restart), but as far as I can tell the node is not communicating any status changes. Triggering the tamper or door notifications are never picked up by ozwd. When I try to remove the node, I get an error.

[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 255   Received: 0x01, 0x07, 0x00, 0x4b, 0x26, 0x01, 0x00, 0x00, 0x94
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:41:14.793 PDT] [ozw.library] [info]: Info - Node: 0 FUNC_ID_ZW_REMOVE_NODE_FROM_NETWORK:
[20200507 0:41:14.793 PDT] [ozw.library] [info]: Info - Node: 0 REMOVE_NODE_STATUS_LEARN_READY
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 255   Expected callbackId was received
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 255   Expected reply was received
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 255   Message transaction complete
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 255 Removing current message
[20200507 0:41:14.793 PDT] [ozw.library] [debug]: Detail - Node: 0 Notification: RemoveDevice ControllerCommand - Waiting                                                                                                                                                        [20200507 0:41:14.793 PDT] [ozw.library] [info]: Info - Node: 0 WriteNextMsg Controller nothing to do
[20200507 0:41:14.793 PDT] [ozw.notifications] [debug]: Notification pvt_controllerCommand  4 4 0
[20200507 0:41:14.793 PDT] [ozw.mqtt.publisher] [debug]: Publishing Event controllerCommand 0 NotificationTypes::Ctrl_Cmd_RemoveNode NotificationTypes::Ctrl_State_Waiting NotificationTypes::Ctrl_Error_None
[20200507 0:41:14.793 PDT] [qt.mqtt.connection] [debug]: qint32 QMqttConnection::sendControlPublish(const QMqttTopicName&, const QByteArray&, quint8, bool, const QMqttPublishProperties&) QMqttTopicName("OpenZWave/1/event/removenode/")  Size: 66  bytes. QoS: 0  Retain: false
...
[20200507 0:41:25.848 PDT] [ozw.library] [debug]: Detail - Node: 0   Received: 0x01, 0x07, 0x00, 0x4b, 0x26, 0x02, 0x00, 0x00, 0x97
[20200507 0:41:25.848 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:41:25.848 PDT] [ozw.library] [info]: Info - Node: 0 FUNC_ID_ZW_REMOVE_NODE_FROM_NETWORK:
[20200507 0:41:25.848 PDT] [ozw.library] [info]: Info - Node: 0 REMOVE_NODE_STATUS_NODE_FOUND
[20200507 0:41:25.848 PDT] [ozw.library] [debug]: Detail - Node: 0 Notification: RemoveDevice ControllerCommand - InProgress
[20200507 0:41:25.849 PDT] [ozw.notifications] [debug]: Notification pvt_controllerCommand  4 6 0
[20200507 0:41:25.849 PDT] [ozw.mqtt.publisher] [debug]: Publishing Event controllerCommand 0 NotificationTypes::Ctrl_Cmd_RemoveNode NotificationTypes::Ctrl_State_InProgress NotificationTypes::Ctrl_Error_None
[20200507 0:41:25.849 PDT] [qt.mqtt.connection] [debug]: qint32 QMqttConnection::sendControlPublish(const QMqttTopicName&, const QByteArray&, quint8, bool, const QMqttPublishProperties&) QMqttTopicName("OpenZWave/1/event/removenode/")  Size: 69  bytes. QoS: 0  Retain: false
[20200507 0:41:25.849 PDT] [qt.mqtt.connection.verbose] [debug]: bool QMqttConnection::writePacketToTransport(const QMqttControlPacket&)  DataSize: 102
[20200507 0:41:25.850 PDT] [qt.mqtt.connection.verbose] [debug]: void QMqttConnection::transportReadReady()
[20200507 0:41:25.850 PDT] [qt.mqtt.connection.verbose] [debug]: Received PUBLISH
[20200507 0:41:25.851 PDT] [qt.mqtt.connection.verbose] [debug]: Finalize PUBLISH: topic: QMqttTopicName("OpenZWave/1/event/removenode/")  payloadLength: 69
[20200507 0:41:26.440 PDT] [ozw.library] [debug]: Detail - Node: 0   Received: 0x01, 0x17, 0x00, 0x4b, 0x26, 0x03, 0x00, 0x10, 0x04, 0x07, 0x01, 0x5e, 0x72, 0x5a, 0x80, 0x73, 0x86, 0x84, 0x85, 0x59, 0x71, 0x70, 0x7a, 0x98, 0x2c
[20200507 0:41:26.440 PDT] [ozw.library] [debug]: Detail - Node: 0
[20200507 0:41:26.440 PDT] [ozw.library] [info]: Info - Node: 0 FUNC_ID_ZW_REMOVE_NODE_FROM_NETWORK:
[20200507 0:41:26.440 PDT] [ozw.library] [info]: Info - Node: 0 REMOVE_NODE_STATUS_REMOVING_SLAVE
[20200507 0:41:26.440 PDT] [ozw.library] [warning]: Warning - Node: 0 Remove Node Failed - NodeID 0 Returned
[20200507 0:41:26.440 PDT] [ozw.library] [debug]: Detail - Node: 0 Notification: RemoveDevice ControllerCommand - Failed

It seems like the node and the controller disagree with each other. "Remove failed node" has the same result.

kpine commented 4 years ago

So I used the Si Labs PC Controller software to remove the node. Normal remove failed the same way. Then I clicked the "Is Failed" button (Sends a Failure signal to a node), it turned red in the UI listing, then a "Remove Failed" selection resulting in the node being removed.

Reading the docs, sounds like I could have accomplished the same in OZW using the hasnodefailed command followed by the removefailednode command.

Fishwaldo commented 4 years ago

The only weirdness I saw was with case 2. After the factory reset the node is failed then removed. In MQTT the statistics topic is still still appears, in a strange way.

Looks like a Explorer Bug

Does a payload length of 0 means it is removing the topic?

Being a bit of a noob at MQTT, I can "assume" yes, but not sure if there is a flag or something in the payload that indicates it should be deleted.

the device send the local factory reset command, which would normally trigger the failed node removal. However, in this case OZW "revived" the node (I don't know why) and did not fail it.

Clarification - in the logs, it sends a "Is Node Failed" command that "fails" - and I suspect a timing issue here, as the Controller still believes the node is alive. (from memory there are some timeouts/retries related to when a controller marks a node as failed - so it would make sense that it isn't marked immediately as failed)

Reading the docs, sounds like I could have accomplished the same in OZW using the hasnodefailed command followed by the removefailednode command.

Yes it should be the same. Maybe I need to add a "timeout" to the HasNodeFailed commands on receiving the Locally Reset command - But I'm sure users would still be confused that its not "removed" immediately.

Not sure exactly how to approach this... other than document the behavior...