OpenZWave / open-zwave

a C++ library to control Z-Wave Networks via a USB Z-Wave Controller.
http://www.openzwave.net/
GNU Lesser General Public License v3.0
1.05k stars 918 forks source link

OZW does not complete Node Interviews. #1935

Open RicardP opened 4 years ago

RicardP commented 4 years ago

Hello folks!

https://github.com/domoticz/domoticz/issues/3597

Any idea? :)

Best regards, Ricard

petergebruers commented 4 years ago

Hi. This is not a "known issue". So it might be specific to your situation or you have discovered a new bug. To find out, we need a detailed log, of a fresh start of your network. Pleas follow these steps:

BTW OZW might post new data to Domoticz, when you enable accept new hardware you might get new devices (because OZW 1.6 supports more features)

RicardP commented 4 years ago

Hello @petergebruers thanks! As usual, when showing something is wrong there is no wrong when turning on debug logging... Or at least it was not any issues at first when cache file was restored. Nodes looks initiated ok, but Node 78 does not show same data as Node 67 even it is same type of device (Fibaro Wall Plug). I performed Refresh Node Info but same result. Another thing that happened was that Domoticz Web/GUI freeze at 14:55, same time logging stopped and 14:57 GUI responded again as well logging was continued. Hmmm.... I am thinking that the missing data from Node Interview is most likely caused by Domoticz/GUI freeze I described, if/when it happens during Node Interview process, The question is if the freeze is caused by OZW or Domoticz...!?

OZW_Log.zip

Fishwaldo commented 4 years ago

@ricardp can you explain what is different between nodes? And please also confirm the firmware version of each node is the same. Vendors often change or implement new features between firmware versions, so that could be a explanation.

As for the freezing - I would hesitate a guess that’s something in the App, or your hardware getting very busy. Everything is asynchronous in OZW, but in Notification callbacks there is a possibility that a application can block OZW from performing certain functions.

Or there is a mutex deadlock in OZW, but those cases would usually result in OZW hanging indefinitely and not only for a short time. So for that issue you might want to see if there is a way to turn up the app logging and discuss on the domoticz forum.

petergebruers commented 4 years ago

I have seen this before, but on a completely different system and I don't know what causes it

2019-09-07 14:55:37.747 Info, Node024, Sending (Poll) message (Callback ID=0xa7, Expected Reply=0x04) - MeterCmd_Get (Node=24): 0x01, 0x0a, 0x00, 0x13, 0x18, 0x03, 0x32, 0x01, 0x00, 0x25, 0xa7, 0x4c
2019-09-07 14:55:37.747 Info, Node024, Encrypted Flag is 0
2019-09-07 14:57:41.243 Detail, Node024,   Received: 0x01, 0x04, 0x01, 0x13, 0x00, 0xe9

It is as if the system freezes for 2 minutes.

Afther that you'll notice things like "ERROR: ZW_SEND_DATA could not be delivered to Z-Wave stack" and "CAN received...triggering resend" which hints at either a bad case of de-synchronization between controller and OZW (serial protocol level) or something really weird going on with that controller.

I can deduce from the COM4 in your log, you are on Windows and you use Domiticz. That other case I was working was a Intel NUC running ubuntu or debian and Home Assistant. Not much in common! I have to check if you're using the same controller.

Note to self: I know on linux/mac the COM port does not get and exclusive lock (it is complicated), which can cause issues if some software sends bytes or reads bytes. I have to check if this can happen on Windows but I think the O/S always gets an exclusive lock. So that cannot be it.

akamming commented 4 years ago

i have the same issue, and also using domoticz (latest beta). i used the workaround to upgrade to ozw 1.6 (giving the refresh node info command on every node which was not initialised correctly)

i am using a pretty default domoticz installation (on a rpi 3b, debian stretch which is up-2-date and a zwave.me uzb stick)

To make sure it was not a domoticz issue, i compiled the OZW control panel and let it initialise, which gave me the same results. I suspect a lot of users have this issue (if i read the forum on people with issues on domoticz/ozw 1.6, which all fix the issue with doing a lot of refreshing of the nodes). So i agree with @RicardP this should be fixed to enable users to do an upgrade from ozw 1.4 to ozw 1.6 without any major issues.

Let me know if you it is useful to have another case with debug logging. Cause in this case i will setup the same situation (OZW Control panel doing an initialisation) again and create a debug log file

petergebruers commented 4 years ago

Let me know if you it is useful to have another case with debug logging. Cause in this case i will setup the same situation (OZW Control panel doing an initialisation) again and create a debug log file

Yes, please, if you find the time, build OZW and ozwcp from source.

If you already have done that, remove your ozwcache file, the one in the folder with the ozwcp executable I mean. Let ozwcp run without doing anything, for about 1 minute per 3 devices. If you have 60 devices, wait 20 minutes. Then ZIP the OZW_log.txt and your ozwcache and drag and drop it here. Please do not paste as text.

Keep in mind we are not talking about battery powered devices, if you think you have trouble with interviewing them as well, please wake them up (or some of them) so their interview gets included in that log.

At the end of the test, copy the name of your serial device, refresh the browser to clear the page, then paste the name of the serial device an click "Initialize". That should populate the device list again. Please keep in mind that "ozwcp" is in maintenance mode and is not at all guaranteed to be bug free I would not jump to conclusions... But it is interesting to use it to generate that OZW_log.txt.

As an alternative, if you are not interested in a nice GUI and battery devices, you might as well run "MinOZW" in your open-zwave root to record that OZW_Log.txt

akamming commented 4 years ago

Here you go!

Controlpanellogs.tar.gz

as an example: Look at node 11. At the end of the node interviews has current values of switch, kwh, kvah and W and is not usable.

When i give the refresh node info (also captured in this logging) it has the normal values and is controllable again.

A suggestion to fix. I'm not an expert, but i see a lot of "CAN Received, triggering resend" in the logging, which does not appear when just 1 node is refreshed. There is not other controller on the system (i checked: Domoticz was definitely down) so my guess is the commands and responses to the different nodes get mixed up sometimes. I know it's not a final solution (you have to find the root cause, and i think OZW should be be able to check itself that something went wrong instead of reporting a ready state), but if the initialisation routine can be changed to only 1 node at a time being queried, you probably would probably already have a much better results.

Fishwaldo commented 4 years ago

Hi, The issue here is related to the interview order that I'm working on in the Dev Branch. In essence, the Device is sending reports before the interview is complete (in particular, before we get the Version of the CC that device supports) so the initial values exposed might not be upto spec with regards to what the device actually exposed.

A suggestion to fix. I'm not an expert, but i see a lot of "CAN Received, triggering resend" in the logging,

Thats indicating network congestion or interference. Just looking at the end of your Log, you have some very "Chatty" devices (Node 15, 17, 47 etc - Basically what has the Meter CommandClass) and sending multiple updates per minute. You need to turn down the frequency of reports on these devices. Z-Wave, at best, is 100K bandwidth and drops off with Non-ZW+ devices, of when you have multiple hops/retries etc

RicardP commented 4 years ago

@Fishwaldo Sorry for late reply. There seems not being a difference between the nodes in terms of model or checking the node reported application version (FW?) in control panel.

I have experienced issues with those Fibaro plugs for a long time, after 2-3 weeks of up time they can start behave weird and causing instability of the whole ZWave/OZW, causing OZW unresponsive for a while, like we talk about as well in this thread... After a power cycle of the "hanged" plug it works normal again for 2-3 few weeks.

Below I provide some material that might say you something about the plugs 67 and 78 in my network.

Fibaro Wall Plugs FGWPEF.zip

ozwcache_xxxx.zip

Mesh

RicardP commented 4 years ago

@petergebruers sorry for late repsonse.. Yes running Windows -10 on a two year i3-NUC Asus, SSD and 4GB ram. Only running Domoticz, Run as Service. So it should hardly idling... :) However as you suggest it is likely a serial port issue/bottle neck, controller or windows... Ive seen the "ERROR: ZW_SEND_DATA could not be delivered to Z-Wave stack" in my log frequently since forever, never known if it was a "normal" error :) So, wondering how to pinpoint this very serious interface issue? I am not a coder so I cannot be of much help like that but if you might want to "TeamViewer" my system I can arrange login credentials at an agreed time.

Cheers!

Ittiz commented 4 years ago

Just FYI I have the same issue, the vast majority of the time OZW never completes the interview process of the devices, this only started to happen within the last week or two. I have a large complicated network like RichardP's. Mine is a mix of old Z-wave and Z-wave+ devices.

On a side note my sever is a Debian server, Dell T420 with dual CPUs and 64gbs of ram. That should be plenty enough horse power. I also use Domoticz.

akamming commented 4 years ago

@Fishwaldo

Just looking at the end of your Log, you have some very "Chatty" devices (Node 15, 17, 47 etc - Basically what has the Meter CommandClass) and sending multiple updates per minute. You need to turn down the frequency of reports on these devices.

How do i do that? the pattern is that alle 3 devices you mentioned are NEO Coolcam powerplugs. Current parameter settings are an update every 300 seconds (setting 2) or a change of 30% to the reported power value (setting 6). So that should not be a lot of updates.

Fishwaldo commented 4 years ago

How do i do that? the pattern is that alle 3 devices you mentioned are NEO Coolcam powerplugs. Current parameter settings are an update every 300 seconds (setting 2) or a change of 30% to the reported power value (setting 6). So that should not be a lot of updates.

usually via a configuration option for each device, but if you look at the Log files for METER recieved lines, you will see its a LOT more often than 300 seconds.

Fishwaldo commented 4 years ago

@Ittiz how do you determine its not completed? Mains powered devices should quickly progress through to QueryStage_Complete, and Battery Powered Devices should sit at either QueryStage_Probe or QueryStage_CacheLoad until they wakeup to complete the interview process.

Fishwaldo commented 4 years ago

@RicardP without logs I can't tell anything. The only thing I see is that the first node has WATT's option, and the second doesn't. That matches up with the same reports as @akamming reported that the dev branch aims to fix.

Ittiz commented 4 years ago

@fishwaldo I use domoticz and it warns me it hasn't initialized all the nodes yet. On top of that some nodes don't seem to communicate any more. Nothing changed in my zwave network, I updated domoticz, which had the newest version of openzwave in it.

RicardP commented 4 years ago

@Fishwaldo there is as well a log... :)

https://github.com/OpenZWave/open-zwave/issues/1935#issuecomment-529108411

I have performed Refresh Node Info on both 67 and 78 but still 67 does not show the KWh - totally consumed power. If you cannot see anything unexpected in the log I guess there is an issue with 67 firmware/some corrupt internal data. Could try reset it and re-include later on to see if that helps.

Fishwaldo commented 4 years ago

@Ittiz I have no idea how domoticz works in this case. Maybe @petergebruers can chime in, otherwise log files are needed.

Fishwaldo commented 4 years ago

@RicardP As mentioned its a known issue and its been worked on in Dev branch. No ETA when I'll have the fix as its rather intrusive change.

petergebruers commented 4 years ago

I have to be honest... Not to create false illusions... I do not know where to start looking. Let me explain. I am not a newbie to Z-Wave and I have tried to help about 20 users in 5 years on the Fibaro forum, all in the category of "unexplained delays and communication failures". Some I could help, in many cases I could reduce the symptoms. But certainly not all of them are happy and a few changed to another type of gateway. Most of them have 50, probably 100 nodes and some even more.

To begin with, not all of your problems might share the same cause.

I would like to stress what @Fishwaldo already said: "You need to turn down the frequency of reports on (these) devices.". It is always a good idea to reduce the traffic, and I would say "Power Report" is certainly one to look at.

I have tried to reproduce your issues but my OpenZWave "production" network is small, and I see no issues. Here is a list of my devices:

012 (0x0c)  FIBARO System FGKF601 Keyfob+   FIBARO System
011 (0x0b)  FIBARO System FGWPE/F Wall Plug FIBARO System
010 (0x0a)  FIBARO System FGWPE/F Wall Plug FIBARO System
009 (0x09)  FIBARO System FGD211 Universal Dimmer 500W  FIBARO System
008 (0x08)  FIBARO System FGWPE/F Wall Plug Gen5+   FIBARO System
007 (0x07)  FIBARO System FGWPE/F Wall Plug FIBARO System
006 (0x06)  Shenzhen Neo Electronics Co Ltd NAS-WR01ZE Power plug 12A+  Shenzhen Neo Electronics Co Ltd
005 (0x05)  FIBARO System FGS221 Double Relay Switch 2x1.5kW    FIBARO System
004 (0x04)  FIBARO System FGRM222 Roller Shutter Controller 2   FIBARO System
003 (0x03)  Qubino (Goap) ZMNHIA2 Flush on/off thermostat   Qubino (Goap)
001 (0x01)  Controller  Z-Wave.Me ZME_UZB1 USB Stick    Z-Wave.Me

There is one Neo Wall Plug and it does report often...

You might think I did not test properly, but because I was debugging Roller Shutter 2 support on Domoticz, I have restarted and even deleted ozwcache a lot. Dozens of times.

Not only that but...

Interview never failed. Maybe the only thing worth noting is the controller, it is not "the usual Aeotec" but a ZME_UZB1.

That is why I say... "I have to be honest... Not to create false illusions... I do not know where to start looking."

Fishwaldo commented 4 years ago

@petergebruers looking at the log files, we are getting METER reports in before the interview process for the METER Class completes (in particular the Version Check). The Meter Class is creating ValueID's based on Version 1 of the Meter Spec, and then later we are getting the Version Reports and Capabilities but its all getting confused.

In this case, if the users were able to start up OZW without a device sending a METER report before the Version Check is done, then the issues wouldn't appear.

Hence, this is another case of the same issue I've tried to fix in Dev branch.

I hope I can get back to it this week. I did try another approach last week, but totally screwed it up again!

petergebruers commented 4 years ago

@petergebruers looking at the log files, we are getting METER reports in before the interview process for the METER Class completes (in particular the Version Check). The Meter Class is creating ValueID's based on Version 1 of the Meter Spec, and then later we are getting the Version Reports and Capabilities but its all getting confused.

Okay, a variant on the "multilevel CC", I understand. But that only happens right after first interview (only after inclusion, refresh node, or delete cache), the problem should no longer be there after a restart, correct? I think @RicardP still has issues after a restart, I would expect the Version to be written by that time. Maybe that is a different problem, or maybe I am just confused.

I did try another approach last week, but totally screwed it up again!

Thank you for trying... I am sure it is not an easy task and the process is fragile as well...

BTW I will have another look at the available log files, to see for example if any time-outs cause mixup of data.

Fishwaldo commented 4 years ago

I think in this case, it’s writing out a bogus Value to the config file in addition to the “valid” data and this getting confused. Hence why a restart is not helping. (Or, the internal state is a bit messed up)

petergebruers commented 4 years ago

Might be able to test that if I can find a node in my spares with that version. Or make or simulate one. On the other hand, the problem might be fixed after your update... I'll wait and do something else.

@Fishwaldo while going through the log of RicardP I noticed 48 times "Dumping queued log messages"

Why so many and what is its purpose? I have been looking at the LogImpl.cpp and it does seem to print the same information as the lines above that, minus the node ID but printing a thread ID instead.

For instance, when you see this:

Dumping queued log messages

2019-09-07 14:45:47.375 8292 
2019-09-07 14:45:47.375 8292 Sending (Query) message (Callback ID=0xba, Expected Reply=0x04) - MultiChannel Encapsulated (instance=2): SensorAlarmCmd_SupportedGet (Node=21): 0x01, 0x0d, 0x00, 0x13, 0x15, 0x06, 0x60, 0x0d, 0x01, 0x02, 0x9c, 0x03, 0x25, 0xba, 0x9c
2019-09-07 14:45:47.375 8292 Encrypted Flag is 0
...

A few lines above you'll find

2019-09-07 14:45:47.375 Detail, 
2019-09-07 14:45:47.375 Info, Node021, Sending (Query) message (Callback ID=0xba, Expected Reply=0x04) - MultiChannel Encapsulated (instance=2): SensorAlarmCmd_SupportedGet (Node=21): 0x01, 0x0d, 0x00, 0x13, 0x15, 0x06, 0x60, 0x0d, 0x01, 0x02, 0x9c, 0x03, 0x25, 0xba, 0x9c
2019-09-07 14:45:47.375 Info, Node021, Encrypted Flag is 0

Is it to facilitate multiple loggers? Is that thread ID there to debug potential threading issues?

Ittiz commented 4 years ago

Just a note, since @Fishwaldo noted the issue could be caused by the meter command class I did a test. I only have one device that uses that command class (a Qubino switch) that is sadly connected to the mains. Anyway to test to see if the issue would resolve I turned off the circut breaker to the switch. Shut down Domoticz (shuts down OZW too) then deleted the cache file and restarted everything. The network was then able to initialize. I then turned on the breaker to the switch again. Besides all the other issues I've been having since Domoticz updated to OZW 1.6 it works again.

gizmocuz commented 4 years ago

@Fishwaldo , just to chime in on the previous question how 'Domoticz' is checking if all nodes are queried, it is doing so by expecting one of the following notifications to be received (and if not, not all nodes are queried)

case OpenZWave::Notification::Type_AwakeNodesQueried:
case OpenZWave::Notification::Type_AllNodesQueried:
case OpenZWave::Notification::Type_AllNodesQueriedSomeDead:
Fishwaldo commented 4 years ago

@gizmocuz each individual devices "interview" stage is represented by the QueryStage. Do you display that at all?

That can sometimes help determine which device may still be sleeping/dead or stuck in a part of the interview process.

gizmocuz commented 4 years ago

@Fishwaldo , do you mean the notifications Code_Awake, Code_Sleep, Code_Dead, Code_Alive, Code_Timeout ? Yes i do handle them

Fishwaldo commented 4 years ago

No. I’m talking about this method: http://openzwave.com/dev/classOpenZWave_1_1Manager.html#a30fd64eb8fa178c24a7dc7c2d4d3714f

gizmocuz commented 4 years ago

@Fishwaldo , via the control panel this information is available yes

image

Ittiz commented 3 years ago

I tried to upgrade again today and I can confirm this bug still exists. OZW hangs before it can complete the interview. I'm still way back on OZW 1.4. It's been preventing me from upgrading for year an and a half. I'm sure others have the issue as well. Below is the OZWlog from attempting to start up:

2020-11-14 11:58:13.324 Always, OpenZwave Version 1.6-1545-g1c7b6202 Starting Up 2020-11-14 11:58:13.332 Always, Using Language Localization 2020-11-14 12:00:35.656 Always, *** 2020-11-14 12:00:35.656 Always, * Cumulative Network Statistics *** 2020-11-14 12:00:35.656 Always, * General 2020-11-14 12:00:35.656 Always, Driver run time: . . . 0 days, 0 hours, 2 minutes 2020-11-14 12:00:35.656 Always, Frames processed: . . . . . . . . . . . . . . . . . . . . 105 2020-11-14 12:00:35.656 Always, Total messages successfully received: . . . . . . . . . . 105 2020-11-14 12:00:35.656 Always, Total Messages successfully sent: . . . . . . . . . . . . 14 2020-11-14 12:00:35.656 Always, ACKs received from controller: . . . . . . . . . . . . . 13 2020-11-14 12:00:35.656 Always, * Errors 2020-11-14 12:00:35.656 Always, Unsolicited messages received while waiting for ACK: . . 2 2020-11-14 12:00:35.656 Always, Reads aborted due to timeouts: . . . . . . . . . . . . . 0 2020-11-14 12:00:35.656 Always, Bad checksum errors: . . . . . . . . . . . . . . . . . . 0 2020-11-14 12:00:35.656 Always, CANs received from controller: . . . . . . . . . . . . . 1 2020-11-14 12:00:35.656 Always, NAKs received from controller: . . . . . . . . . . . . . 0 2020-11-14 12:00:35.656 Always, Out of frame data flow errors: . . . . . . . . . . . . . 0 2020-11-14 12:00:35.656 Always, Messages retransmitted: . . . . . . . . . . . . . . . . . 1 2020-11-14 12:00:35.656 Always, Messages dropped and not delivered: . . . . . . . . . . . 0 2020-11-14 12:00:35.656 Always, ***** 2020-11-14 12:00:35.656 Error, Node021, ERROR: Dropping command, expected response not received after 1 attempt(s). Command: "NoOperation_Set (Node=21): 0x01, 0x09, 0x00, 0x13, 0x15, 0x02, 0x00, 0x00, 0x25, 0x0c, 0xdb" 2020-11-14 12:00:35.656 Always, 2020-11-14 12:00:35.656 Always, Dumping queued log messages 2020-11-14 12:00:35.656 Always, 2020-11-14 12:00:35.656 Always, 2020-11-14 12:00:35.656 Always, End of queued log message dump 2020-11-14 12:00:35.656 Always,