Selecting HEAL (maybe) on ozw-admin nuked my Aeon Labs G5 ZStick twice in two months

Madelaide commented 3 years ago

This issue occurred for the second time following an update to OZW 0.7.0; however, that may have been a coincidence.

I selected heal from OZWadmin (Mac app) for a node that had been moved and had stopped being visible in the UI. The OZW addon then goes into a state where it queries a few nodes, they time out, and it then repeats "ping response" requests. Sorry this is a poor report; both times it has happened I have frantically tried all avenues to get the system going again, including:

1) snapshot
2) VM snapshot
3) new VDI image and reinstall from scratch (did this the first time a month ago, very painful)
4) tried Domoticz with the same result: the stick was nuked, BUT it WORKED from the PC Controller software from Silicon Labs
4.1) EVERYTHING worked, I could turn nodes on, off, all off, all on, etc.
4.2) but connected to the hassio VM or the Domoticz VM it was a no go
5) flashed the ZStick from the backup I made after the last event (3 weeks old), and it seems to have fixed it
5.1) it even let me use the snapshotted Hassio, and the devices so far seem to be there

Something obviously gets into a bad state with the zstick. I did try reapplying the zstick image I took after the lockup to my spare zstick. It had the same issue.

This makes me extremely hesitant to change anything on my network now.

Sorry again for this poor quality Issue report.

Madelaide commented 3 years ago

Update: When it fails, the log, in both HA and ozwadmin, usually shows nothing apart from timed-out ping messages. But IF it does show something, it only shows messages with hex, without the usual expanded useful text following, i.e. (but imagine a different node and hex):

[20201124 12:00:49.467 ACDT] [ozw.library] [debug]: Detail - Node: 5 Received: 0x01, 0x0b, 0x00, 0x04, 0x00, 0x05, 0x05, 0x70, 0x06, 0x0d, 0x01, 0x02, 0x88
[20201124 12:00:49.467 ACDT] [ozw.library] [debug]: Detail - Node: 5 Received: 0x01, 0x0b, 0x00, 0x04, 0x00, 0x05, 0x05, 0x70, 0x06, 0x0d, 0x01, 0x02, 0x88
[20201124 12:00:49.467 ACDT] [ozw.library] [debug]: Detail - Node: 5 Received: 0x01, 0x0b, 0x00, 0x04, 0x00, 0x05, 0x05, 0x70, 0x06, 0x0d, 0x01, 0x02, 0x88
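For anyone staring at hex-only lines like these, here is a rough Python sketch of how such a frame can be picked apart by hand. It assumes the usual Z-Wave serial framing (start byte, length, type, function ID, data, XOR checksum seeded with 0xFF). The function and constant names are made up for this sketch and are not OpenZWave identifiers. As far as I know, 0x70 / 0x06 in the payload is a Configuration command class report, but treat that as my reading rather than something the log confirms.

```python
from functools import reduce

# Labels chosen for this sketch only; they are not OpenZWave names.
SOF = 0x01                               # start-of-frame byte
FUNC_APPLICATION_COMMAND_HANDLER = 0x04  # "a node sent us a command"


def parse_hex_list(text):
    """Turn '0x01, 0x0b, ...' (as printed in the log) into a list of ints."""
    return [int(tok, 16) for tok in text.replace(",", " ").split()]


def checksum(frame):
    """XOR of every byte between SOF and the checksum byte, seeded with 0xFF."""
    return reduce(lambda acc, b: acc ^ b, frame[1:-1], 0xFF)


def decode(frame):
    """Split a serial data frame into its basic fields."""
    if frame[0] != SOF:
        raise ValueError("not a data frame")
    if frame[1] != len(frame) - 2:
        raise ValueError("length byte does not match frame size")
    info = {
        "checksum_ok": checksum(frame) == frame[-1],
        "type": "request" if frame[2] == 0x00 else "response",
        "function": frame[3],
    }
    if frame[3] == FUNC_APPLICATION_COMMAND_HANDLER:
        # Layout assumed here: rx status, source node, payload length, payload.
        info["source_node"] = frame[5]
        n = frame[6]
        info["payload"] = frame[7:7 + n]
    return info


frame = parse_hex_list(
    "0x01, 0x0b, 0x00, 0x04, 0x00, 0x05, 0x05, 0x70, 0x06, 0x0d, 0x01, 0x02, 0x88"
)
info = decode(frame)
```

For the example line above, the checksum validates and the source node is 5, so the frame itself is intact; the library is just not expanding it into readable text.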

rrozema commented 3 years ago

You most likely have a very busy network, and starting a network-wide heal creates a storm of messages on top of the already many messages going through. The result is that no messages, or only 'damaged' messages, come through to your controller. If this happens again, try the following to see if I'm correct: switch your controller off for an hour or so (cut its power so it doesn't respond to any requests). After that time, most nodes will have given up on getting their messages through to the controller and the network will have gone mostly silent again, so see if you can get your network back into operation then.

This is of course only a work-around for getting back control over your network. The real solution is fixing the topology of your network: find which node(s) are causing the heavy load on your network and make them less 'chatty'. Fix the routing, and reconfigure nodes to send fewer (or better yet, no) unsolicited messages at all. There are advanced methods of finding your 'problem node(s)', like using a zniffer. Those work excellently and fast, but getting hold of a zniffer can be hard and interpreting a zniffer's output can be intimidating.

Common sense can get you a long way too. You just need to be observant to see what you can do to fix things:

If you've got one or more 'dead nodes' in your node list: remove them from your network so that the other nodes don't waste bandwidth trying to connect to those dead nodes over and over again.
If you've got a battery powered device that drains its battery very fast, faster than other devices of a similar type, that could be a symptom of the device being broken or misconfigured. It can be worthwhile to exclude it from your network for a while to see whether it is causing your problems by constantly broadcasting, and draining the battery quickly in doing so.
Consider whether you really need so many unsolicited messages: do you really need the power, voltage, usage, amperes, etc. of every switch you own every 10 seconds, or is it sufficient to just switch most devices off and on at times and disable automatic power reports and such in most of your switches?
If a device can't be configured to not send unsolicited messages, set it to repeat only after a very long time.
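If you have no zniffer, a crude way to spot a chatty node is to tally how often each node shows up in the debug log. The sketch below is only that, a sketch: it assumes log lines contain "Node: N Received:" as in the excerpt earlier in this thread, and the function names are invented for the example.

```python
import re
from collections import Counter

# Matches the "Node: 5 Received:" part of a debug log line.
# The pattern is based only on the log excerpt quoted in this thread.
LINE_RE = re.compile(r"Node:\s*(\d+)\s+Received:")


def count_received(lines):
    """Count how many 'Received' lines each node id produced."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[int(m.group(1))] += 1
    return counts


def chattiest(lines, top=5):
    """Return the `top` nodes with the most received frames, busiest first."""
    return count_received(lines).most_common(top)


sample = [
    "[ozw.library] [debug]: Detail - Node: 5 Received: 0x01, 0x0b",
    "[ozw.library] [debug]: Detail - Node: 5 Received: 0x01, 0x0b",
    "[ozw.library] [debug]: Detail - Node: 9 Received: 0x01, 0x09",
    "[ozw.library] [info]: some unrelated line",
]
result = chattiest(sample)
```

Run it over a few minutes of log output, for example with `chattiest(open(path))` where `path` points at your own log file. A node that dwarfs the others is a good first candidate for turning down its reports.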

The poor-man's method of fixing the routing is to heal only a single node at a time: start with healing the powered nodes only and of those start with the one closest to your controller and then work your way out to the powered nodes furthest away. Only after you've made sure all powered nodes have working access paths to and from the controller, you may repeat the same procedure for your battery powered devices. Unlike with the powered devices, you will most likely have to wake each battery powered device up before it'll accept the heal command, so these battery powered devices are going to be very cumbersome to heal. Many times however the battery powered nodes start behaving better automatically once the powered devices have proper routes, so you may get lucky there.
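The ordering described above can be written down as a small sort. This is purely illustrative: the `Node` record and its `hops` field (a rough guess at how far a node is from the controller) are hypothetical, and nothing here talks to OpenZWave or triggers a heal. It only produces the order in which you would heal nodes one at a time.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical record for this sketch; fill it in by hand from your node list.
    node_id: int
    powered: bool  # True for mains powered, False for battery powered
    hops: int      # rough distance from the controller (1 = direct neighbour)


def heal_order(nodes):
    """Powered nodes first, nearest to the controller first, then battery nodes."""
    return sorted(nodes, key=lambda n: (not n.powered, n.hops, n.node_id))


nodes = [
    Node(node_id=7, powered=False, hops=1),
    Node(node_id=3, powered=True, hops=2),
    Node(node_id=2, powered=True, hops=1),
    Node(node_id=9, powered=False, hops=3),
    Node(node_id=4, powered=True, hops=3),
]
order = [n.node_id for n in heal_order(nodes)]
```

Heal one node from the list, wait until it has finished and the network is quiet again, then move on to the next. Remember that the battery powered nodes at the end of the list will most likely need to be woken up first.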

Madelaide commented 3 years ago

Thanks for the feedback, I have moved on to zwjs as ozw is effectively dead.

Much happier network now.

Madelaide commented 3 years ago

Ozw dead. Closing.

OpenZWave / open-zwave

Selecting HEAL (maybe) on ozw-admin nuked my Aeon Labs G5 ZStick twice in two months #2479