OpenZWave / open-zwave

a C++ library to control Z-Wave Networks via a USB Z-Wave Controller.
http://www.openzwave.net/
GNU Lesser General Public License v3.0

Idea regarding the startup times for OZW #1850

Closed kdschlosser closed 5 years ago

kdschlosser commented 5 years ago

I ran a test with 13 nodes, and it takes 10 seconds for startup, and these are all line-powered devices and no failed nodes. I have a total of about 80 devices; it would take close to a minute and a half to load if everything replied as it should and no packets got dropped. With just 13 nodes I have a packet drop occasionally; this extends the time by 5 seconds. With 80 nodes I would expect to have 3-4 dropped packets each startup, extending the startup to almost 2 minutes.

There seems to be a 5 to 6 second timeout for devices that are asleep or for failed nodes. The number of retries can be adjusted, but it should optimally be set to 3. So 3 retries for a failed or sleeping node adds 15 seconds to the time, and if you have 10 of those devices, that's an additional 2 minutes 30 seconds.
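To make the arithmetic explicit, here is the back-of-envelope estimate spelled out (the timeout, retry count and node count are just the figures quoted above, not measured constants):

timeout_s = 5          # observed per-attempt timeout for a sleeping or failed node
retries = 3            # suggested retry count
slow_nodes = 10        # nodes that are asleep or failed

extra_delay = slow_nodes * retries * timeout_s
print(extra_delay, "seconds")   # 150 s, i.e. the extra 2 min 30 s mentioned above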

I do know that the communications are half duplex. But because of the mesh nature of the network, is it not possible to have the controller send out "discovery" packets on a timed basis and handle the responses as they come in, instead of sending a discovery and stalling the program until either a reply is received or the timeout period expires?

There is also the user config file, which is supposed to hold a cached copy of the network. I do understand the initial setup of a network will take a while, but the information that is received is then cached into that config file. You would think this would make the startup process almost instant. This is, however, not the case.

Based on the logs, this is a simplification of what I see the startup process as. Once the DriverReady notification is sent, the network is operable, to a point: anything related to the controller or the network itself, and not to nodes on the network, works. After that notification, all of the nodes are added, and the values for each node are also added. This is fantastic; it all happens really fast. But after the nodes and values are finished loading (this is only cached data), the program then begins to query the nodes one at a time to do whatever it needs to do, node validation I would think. This is when the slowdown occurs.

My thought is this: load the cached data and place the validation of the nodes in a queue. If a query is made against an unvalidated node, move that node to the front of the line to be validated; once it is validated, perform the query. If the validation fails, raise an exception or return something indicating that validation failed. If the node is validated and information needs to be updated or a query has to be performed, place the query at the front of the line. When that is done, go back to validating the rest of the nodes.

This would make the network usable right after the cached information is loaded. The queries would be slower to some extent, but at least the user is not sitting there for a really long while.
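A minimal sketch of the proposed scheme, purely hypothetical (this is not how OZW's queues work today); validate_node and run_query are placeholders for the real device interview and the real get/set:

from collections import deque

class StartupScheduler:
    """Hypothetical sketch: validate nodes lazily, promoting the ones the user touches."""

    def __init__(self, cached_node_ids, validate_node, run_query):
        self.pending = deque(cached_node_ids)   # nodes loaded from cache, not yet validated
        self.validated = set()
        self.validate_node = validate_node      # placeholder: talks to the real device
        self.run_query = run_query              # placeholder: the actual get/set

    def query(self, node_id, request):
        if node_id not in self.validated:
            # Move the requested node to the front of the validation queue.
            if node_id in self.pending:
                self.pending.remove(node_id)
            if not self.validate_node(node_id):
                raise RuntimeError(f"node {node_id} failed validation")
            self.validated.add(node_id)
        return self.run_query(node_id, request)

    def background_step(self):
        # Called when the network is idle: validate the next cached node.
        if self.pending:
            node_id = self.pending.popleft()
            if self.validate_node(node_id):
                self.validated.add(node_id)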

Documentation for the front-end authors would let them know not to make any queries to a node from a NodeAdded notification. If the user selects a node to display information, then it would be OK to ask for it. Once one of the AllNodesQueried notifications comes in, the application can update any information in the UI it wants.

The missing pieces in this are

Things that are good to go.

This is an oversimplification of what would need to be done, I am sure. It's a basic outline/idea. I do not know if this is possible without a huge rewrite of the program.

Just a thought on how to give the illusion of a faster startup

petergebruers commented 5 years ago

Hi! You're delving into a tricky part of Z-Wave and a lot of it has to do with Routing and dead nodes.

I can tell you, it is interesting to see how Z/IP handles "non responsive nodes" and I'll come to that.

I ran a test with 13 nodes, and it takes 10 seconds for startup, and these are all line-powered devices and no failed nodes. I have a total of about 80 devices; it would take close to a minute and a half to load if everything replied as it should and no packets got dropped. With just 13 nodes I have a packet drop occasionally; this extends the time by 5 seconds. With 80 nodes I would expect to have 3-4 dropped packets each startup, extending the startup to almost 2 minutes.

Reports on fora vary wildly; it depends a lot on the NOP sending and whether you've changed (reduced) the default 10 s timeout (options.xml).

There seems to be a 5 to 6 second timeout for devices that are asleep or for failed nodes.

That is what you observe, but it can be longer, and it depends on how you send packets.

Right now OZW enables 3 options: "lwr", "routing" and "explorer". Each option leads to a certain number of tries and thus delays, and the "routing" part depends on the number of neighbors. You'll get about 5 seconds with that option enabled and 1 neighbor IIRC, but it can be a lot longer!

All this is handled by the protocol and hidden from OZW, but if your controller supports "extended status information" then you'll see LOG lines like this, exposing quite a bit of detail about routing.

Node029, Extended TxStatus: Time: 30, Hops: 0, Rssi:  -75 --- --- --- ---, AckChannel: 0, TxChannel: 0, RouteScheme: Last Working Route, Route: 0 0 0 0, RouteSpeed: 100K, RouteTries: 1, FailedLinkFrom: 0, FailedLinkTo: 0

During that TX time, the network is busy.

It is a very common misconception that when the network seems to "stall", nothing is going on and you might as well start up another transmission. In fact, with Zniffer you can see sending to a dead node causes an avalanche of packets at different speeds and through different repeaters.

I can send you a PM with the routing algorithm, it is too long to post here.

But because of the mesh nature of the network, is it not possible to have the controller send out "discovery" packets on a timed basis and handle the responses to them as they come in, instead of sending a discovery and stalling the program until either a reply is received or the timeout period expires?

No, but you can do it like Z/IP does. This requires a "fast" and a "slow" queue, and I bet it is not a panacea. Under normal operation you could enable only the TX option "LWR" and disable routing and explorer. Unless conditions change (e.g. people move nodes or unplug a repeating node), the Last Working Route should be valid, right? Doing it like this guarantees a "fast fail", because the protocol will try LWR/NLWR, possibly direct; I have to check the details, but in any case it is guaranteed to fail "fast". How fast? I would say in < 0.5 s, but I'd have to check.

Then... if it failed... you reschedule it on the "slow" queue, and on this queue you do enable routing and explorer. If it still fails, it is really dead. If it does not, you've got a new LWR/NLWR and it can be handled by the fast queue again.
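A rough sketch of that two-queue policy, just to make the flow concrete; send_with_options is a placeholder for "hand the message to the protocol with these TX options", not an OZW API call:

import queue

fast_q = queue.Queue()   # LWR/NLWR only: guaranteed to fail fast on a dead node
slow_q = queue.Queue()   # retried with routing and explorer enabled

def mark_node_dead(msg):
    print("giving up on", msg)          # placeholder for real dead-node handling

def pump_once(send_with_options):
    # send_with_options(msg, routing=..., explorer=...) -> True if the frame was ACKed.
    if not fast_q.empty():
        msg = fast_q.get()
        if not send_with_options(msg, routing=False, explorer=False):
            slow_q.put(msg)             # fast path failed: reschedule with full routing
    elif not slow_q.empty():
        msg = slow_q.get()
        if not send_with_options(msg, routing=True, explorer=True):
            mark_node_dead(msg)         # explorer also failed, the node is really dead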

I don't know all the details about these queues and I still have some questions, e.g. does it intentionally delay the packets on the slow queue? If it doesn't, having 2 queues may not solve anything; it depends on how many commands the gateway software schedules per unit of time.

Also, let's not forget devices report state changes and e.g. power consumption after a state change, and if that happens during a long TX from the controller to a dead node, the network is busy. Lots of lovely details...

You would think this would speed up the startup process to be almost instant.

Don't think "instant" is possible without paying a price. It depends on how "reliable" you want your gateway software to be. Suppose you have shut down your controller for maintenance, then toggled a few light switches, then turned on the gateway... I expect it, at a minimum, to query the state of all devices and get the actual on/off state.

My thought is this: load the cached data and place the validation of the nodes in a queue. If a query is made against an unvalidated node, move that node to the front of the line to be validated; once it is validated, perform the query. If the validation fails, raise an exception or return something indicating that validation failed.

This is what I call "the law of conservation of misery" - it is equivalent to the well known "law of conservation of energy".

The law says: "the amount of misery is constant, but you change it from one form to another".

What I mean is, at some point, you are going to hit a malfunctioning node. When that happens, you'll have "a stall". You can postpone the "get" of data, but when is the ideal time to do it? When you do it, it takes time.

Suppose you have a scene "turn off the light if it is on for 10 minutes". Do you expect OZW to get the on/off state at the start of OZW? Or only when something requests the state? This kind of scene might not ever "query" the node, it might trigger on value change only. Of course, timing 10 minutes after a reboot of your gateway is also incorrect, you don't have a clue how long the gateway software was down.

I am not saying there is no room for improvement, but at the same time I think changing queues in OZW or any other library is hard.

This is an oversimplification of what would need to be done, I am sure. It's a basic outline/idea. I do not know if this is possible without a huge rewrite of the program.

It is not easy; changing the order of execution will break things.

My main system is an HC2, and a power-on to "operational" takes about 2 minutes, for about 45 mains-powered nodes and 45 sleeping nodes. After that it still queries the network, but does not show large delays. It is hard to tell what choices they have made or how their TX queue works.

Just a thought on how to give the illusion of a faster startup

I appreciate that.

Although I've worked with a handful of very clever guys (on the Fibaro forum, before I started working on OZW), some owning the SDK and building their own Z-Wave devices, you are the first to look at the communication protocol in such detail...

Fishwaldo commented 5 years ago

OK. I'll start my reply with two goals of OpenZWave (I've stated them before): remove the "how it works" aspect of Z-Wave from the application, and, although I'm interacting with developers, write it with end users in mind (i.e. users of the HA applications that use OZW).

So - you're thinking as a developer and not a user. It's OK. I get frustrated with OZW as a developer as well sometimes :)

Why do I say this? How often does an end user (after setting up his home automation) restart his app? Once a week? Once a month? It's not dozens of times a day or hour, that's for sure.

As a developer I can understand that the startup times seem like something that should be fixed.

The counter to all that (that I've said all along) - I value accuracy over startup times. And honestly, for every 1 user that complains about startup times, there are 10 users complaining about incorrect results.

This is why we only go so far with the cache when loading it, and then go on to query the devices for the remainder of the values.

If you look at the OZW code, you will see we have things like dynamic and static value states.

Static values are generally "command class configuration" items, such as the number of UserCodes, number of Scenes etc. that are supported. These values rarely, if ever, change on a device.

Dynamic values are anything that represents a state of the device: the light switch, sensor, individual user code etc. This is what most applications/users are interested in.

Currently, in OZW, we cache everything static (and the Dev branch moves a few things around there as well, such as Neighbors and Association Group Members etc). When you start an app with an ozwcache present, you will notice it "skips" most of the static queries and goes on to query the dynamic values. Depending upon the complexity of the devices you have, this can save a significant amount of time. I have the Aeotec Doorbell: it has 8 endpoints (including root) and each endpoint has the SoundSwitch CC. Each SoundSwitch CC has 30(!) different chimes you can choose from. Interviewing that device means at least 500+ packets just for SoundSwitch alone. Turn on Security and that goes to a few thousand. It will typically take 2-3 minutes, with the network flat out, to interview that single device.

With an ozwcache present, our interview of the SoundSwitch upon startup gets done in maybe 30 packets and maybe 20 seconds.

Regarding timeouts and retries - we used to retry in OZW long ago. In 99% of cases, subsequent retries did not succeed, so it just "prolonged" the startup time of OZW. Of course, as Peter mentioned, the protocol itself does retries, so there is no need for us to also do them. Adjusting timeouts - you can alter that via the Options class - we used to use 40(!) seconds. 10 is the sweet spot when we take into account potentially having to do multiple hops and the Z-Wave protocol's built-in retry mechanisms. You can play with it to fine-tune your network.
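For anyone who wants to experiment with that, the timeout can be changed in options.xml (or through the equivalent Options call at startup); a minimal sketch, assuming the option is named RetryTimeout and takes a value in milliseconds:

<!-- options.xml sketch: lower the per-message timeout from the 10 s default. -->
<!-- Assumption: the option is named RetryTimeout and the value is in milliseconds. -->
<Options>
  <Option name="RetryTimeout" value="7000" />
</Options>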

A general comment on "dropped" packets etc. - if you are getting the same ones every time you start your network, it's most likely that the device doesn't support it (or is buggy). The fix is not to play with timeouts or retries, but to update the config file for that device (usually GETSUPPORTED will fix things like that). In those cases the problem is not at the network level, but at the device level.

And finally - as noted above about loading everything from cache - I use this extreme example: if I shut down my HA software and it saves an ozwcache with all the values (let's say a Temp Sensor at 16 degrees, which triggers my HVAC to warm up the place), then restart the next day and load up all the values from the cache - it's now 30 degrees, but the HA software sees 16 in the cached value and triggers my HVAC to start heating... you can imagine the complaints from the wife and family, right? Even if it's only for a few seconds until we manage to actually refresh that value from the device, I might cycle my HVAC.

Introduce complex automations like sensors and door locks, and the results could be worse - a power failure, a restart, and a sensor triggers the door to unlock while I'm not at home, because the sensor had an "old" value.

Hence - Accuracy over speed :)

As mentioned, the Dev tree moves the interview process around a bit and tries to cache more, but I see complaints on the horizon - "Why is my network map outdated/empty?" or "Why is the association I just added using OZWCP not showing up in HASS now?" (simple answer - because HASS didn't request a refresh of that info - but the end user is still grumpy).

Fishwaldo commented 5 years ago

Just a comment on having multiple packets "in flight" rather than waiting for ACKs etc.: 1) You're dealing with end devices running ancient (except for the 700 series) micros that are probably clocked in the single-digit MHz range. 2) You're running on a network that tops out at 100 Kbps half duplex.

You're not going to get much more "speed" out of Z-Wave.

Additionally - if you look at the logs of a heavily loaded Z-Wave network, you will see lots of CAN messages - that means a collision. I suspect we would hit that issue trying to send multiple transactions at a time (and then have to retry anyway!):

1) Send packet 1 - wait (or don't wait) for the controller ACK, but don't wait for the RF ACK message.
2) Send packet 2 - which happens just as the device sends a reply to packet 1 - CAN message. We miss the reply and we have to resend packet 2.
3) The device (I assume) also has to retry its reply to packet 1.

kdschlosser commented 5 years ago

I am trying my best to look at this from a user perspective. It is very hard to do being a developer (especially when booting over and over again). I do agree there is a balance between speed and accuracy, and I understand that there are "dynamic" values on the nodes. I was almost certain that you would already have all of the "static" information on a node cached; that makes sense as a way to reduce the number of packets that are sent. The speed the Z-Wave network runs at I do not believe will ever change. It is a balance between range and power consumption: faster speeds mean higher frequencies, and achieving the same range would require higher power consumption. So for the sake of battery-operated devices everything is going to stay where it is, unless they decide to bump up the power consumption on the line-voltage devices and run 2+ frequencies.

What I was trying to get at was this: make the static information available to the user/front end immediately after the cache loads. This would allow the front end to populate its list of devices and show the user the available devices and the information that is not dynamic. The front end can then move along and do whatever else it does. In most cases the front end is not only for Z-Wave; it does a whole variety of things. The startup is typically done on the main thread or a single thread, and the main thread waits for the startup to finish before letting the user interact with the front end. I know ideally we never want to have to reboot the thing, but the reality is that it will hopefully just need to be rebooted less often. There are cases where a glitch in the program causes an issue that requires more frequent reboots. MiCasaVerde Vera is a great example of such a glitch; their glitch is the whole program, and frequent reboots are needed.

I have not come across a use case with OZW that would require a restart of OZW; maybe you can shed some light on that. If there is a need to restart OZW while a front end is running, getting the devices to populate quickly and returning control to the user as fast as possible would be a good thing.

This is kind of what I am envisioning...

Say you start a front end and you are normally given a list of devices that shows the name, location and some kind of image of the device. That image will normally depict the state of the device in some manner.

With the current way things run, a user would have to wait 3, 4, 5+ minutes for OZW to fully load before being given access to this list of devices. My question would be: why? To display this list there is only one bit of information that is not available, which is the state. The name and location are static pieces of information, as is the device type needed to pick the icon to display. So add a new icon with a busy symbol on it to indicate that the network is working, and when that device has loaded fully, remove the busy icon. If one of the icons is clicked, to either change state or open some form of dialog, any dynamic information could be populated at that point. If a failure occurs in updating the node, the front end can display the failure instead of the information. This would require some mechanism to intervene in the order in which OZW updates nodes on startup.

I am all for accuracy, just as you are, and that accuracy can still be had while giving control to the user. A great example of the type of setup I am talking about is the Microsoft Windows startup: even though control has been given to the user, there are things that are still loading, and if the user requests a piece that is still loading, that piece gets priority. I know, I know, Windows sucks and most of you do not use it, but for the sake of an example don't bash me too hard :smile:

This is the reason I said the "illusion" of a faster startup. The startup really takes the same amount of time; that we cannot change, for the reasons you pointed out. But to the user, the release of control to them, even if only to see a list of busy icons (LOL), would be sufficient.

An example of a "stalled" startup would be HASS. When HASS starts up, I believe it is done in a single thread, and each thing gets loaded consecutively. So when the Z-Wave portion is loaded, the program waits for python-openzwave to send a notification that the network is ready, which happens when one of the AllNodesQueried notifications is received from OZW.

Because the cached data for values is not loaded until after the NodeAdded event takes place, any cached value data cannot be accessed until after a node has been queried for all of its dynamic information. I do not know how much information in the values could be considered static; maybe the labels and IDs would be all. IDK, I have not really looked into it.

Even if adding some form of queue that can be manipulated based on user interaction is not possible, or if you feel it should not be done, moving the notification for the node being added until after the cached values have been added, and possibly adding a notification after all cached information for all nodes has finished loading, would allow the front end to collect a very basic set of information and carry on, giving that "illusion" of a faster startup. At that point the front end would no longer have to wait for the startup of OZW to finish.

kdschlosser commented 5 years ago

That means a collision. I suspect we would hit that issue trying to send multiple transactions at a time (and then have to retry anyway!)

I have a question regarding this. I thought the purpose of the mesh and the neighbors was that if a collision happened, or if a node cannot be contacted, an alternate route would be used. Am I wrong in this understanding of what the mesh network is supposed to do?

If I am incorrect, maybe you can offer some information on what the purpose of the mesh network actually is. If the network as a whole can only have a single packet active on it at a time, then a mesh network topology would really serve no purpose.

I did also say to send the "discovery" packets at intervals, not back to back - add a delay between sends, something on the order of 100 ms. This can be fine-tuned for optimization.

Fishwaldo commented 5 years ago

What I was trying to get at was this. make the static information available to the user/front end immediately after the cache loads.

It is, if it makes sense (sleeping nodes are different, I'll get to that in a minute).

We progressively introduce the node. You have the following events issued as the interview progresses: Type_NodeAdded, Type_NodeProtocolInfo, Type_NodeNaming, Type_EssentialNodeQueriesComplete, Type_NodeQueriesComplete.

If an application is waiting for Type_AwakeNodesQueried or Type_AllNodesQueriedSomeDead or Type_AllNodesQueried before displaying anything, that's an application issue.

Also - as we create each new ValueID, you get the Type_ValueAdded signal as well, which can happen as soon as NodeAdded.

ozwcp and the ozw-admin app (if you are game to compile work in progress) demonstrate the above. We don't wait for the AllNodesQueried signals to display anything.

I thought the purpose of the mesh and neighbors was that if a collision happened, or if a node cannot be contacted, an alternate route would be used.

RF (and in this case, specifically half duplex) means that only 1 node can be transmitting on the spectrum at any one time. If two nodes transmit at the same time (even if they are not in direct range, but can interfere with each other), you get a collision.

Mesh networks don't solve this. They are used for routing and bouncing messages around, not "retransmissions".

Fishwaldo commented 5 years ago

Regarding sleeping nodes - the reason they sit at CacheLoad after restarting is that we are waiting for a few packets to verify they are the original node that was there when we saved the cache. So this goes back to the previous point about accuracy rather than speed.

petergebruers commented 5 years ago

I have a question regarding this. I thought the purpose of the mesh and neighbors was that if a collision happened, or if a node cannot be contacted, an alternate route would be used. Am I wrong in this understanding of what the mesh network is supposed to do?

Z-Wave implements "listen before talk" to avoid collisions.

All the posts you'll find on the internet give a very simplistic view of Z-Wave routing, often saying not much more than "it uses mesh technology".

There are 3 ways to send packets: direct, routed and "explorer". Z-Wave is source routed and can use up to 4 routes in case the packet does not get ACKed. Those routes come from the controller or SUC/SIS, based on "neighbors" info. If routing fails, it can try "explorer", which uses broadcasting.

Routing also tries different speeds, because the repeaters can be 40 k / 100 k, or 9.6 if they are ancient.

Also, please do not confuse "serial ACK" and "serial CAN/NAK" with the protocol ACK; those are different things. The serial protocol implements its own synchronization mechanism.
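For reference, a quick sketch of the two layers; the byte values are the standard serial framing markers between the host and the controller stick (they also appear in OZW's source), while the protocol ACK is a radio frame the host only sees indirectly:

# Serial (UART) level framing bytes exchanged with the controller stick.
SOF = 0x01   # start of a serial data frame
ACK = 0x06   # serial-level "frame received intact"
NAK = 0x15   # serial-level "bad checksum, please resend"
CAN = 0x18   # serial-level "your frame collided with one I was sending"

# The protocol (RF) ACK is something else entirely: it is sent over the air by the
# destination node, and the host only learns about it through the transmit-complete
# callback of ZW_SendData.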

If the network as a whole can only have a single packet active on it at a time then a mesh network topology would really serve no purpose.

Z-Wave does not have channels like WiFi. If a packet is traveling "through the mesh", the radio channel is in use and you cannot start a second transmission.

Also, repeaters and devices have no buffers to store and forward packets; they are not like internet routers.

I did also say to send the "discovery" packets at intervals, not back to back - add a delay between sends, something on the order of 100 ms. This can be fine-tuned for optimization.

There is no "discovery", and there is no way to trigger discovery. You tell the controller "send this packet to node 6":

4.3.3.1 ZW_SendData

HOST->ZW: REQ | 0x13 | nodeID | dataLength | pData[ ] | txOptions | funcID

The so-called "mesh" functions are completely hidden, yet you can control a part of it through txOptions:

TRANSMIT_OPTION_NO_ROUTE = "Only send this frame directly, even if a response route exists"
TRANSMIT_OPTION_AUTO_ROUTE = "Request retransmission via repeater"
TRANSMIT_OPTION_EXPLORE = "Transmit frame as an explore frame if everything else fails."
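To make that concrete, here is a small sketch of the transmit-option bits (numeric values as defined in OZW's Defs.h) and how the "fast fail" versus "try everything" profiles discussed earlier could map onto them; treat it as illustration, not a recipe for driving the stick directly:

# Transmit-option bit flags passed to ZW_SendData (values as in OZW's Defs.h).
TRANSMIT_OPTION_ACK        = 0x01   # request an RF-level acknowledgement
TRANSMIT_OPTION_LOW_POWER  = 0x02
TRANSMIT_OPTION_AUTO_ROUTE = 0x04   # allow calculated routes via repeaters
TRANSMIT_OPTION_NO_ROUTE   = 0x10   # force a direct transmission only
TRANSMIT_OPTION_EXPLORE    = 0x20   # fall back to an explorer frame if everything else fails

# "Fast fail" profile: ACK only, so the protocol tries APR/LWR/NLWR/direct but
# never calculates new routes and never explores.
fast_tx = TRANSMIT_OPTION_ACK

# "Slow queue" profile: let the protocol try everything it has.
slow_tx = TRANSMIT_OPTION_ACK | TRANSMIT_OPTION_AUTO_ROUTE | TRANSMIT_OPTION_EXPLORE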

About 99.9% of all the people I've met invariably think "routing" is like the stuff on the internet, and they conclude from that all sorts of invalid things like "oh, my router is congested". You don't fall into that trap because we have discussed this... To begin with: no, there is no such thing as a "router", it is a "repeater", and no, it is not congested, because Z-Wave can handle only "one thing at a time". There is nothing parallel, you cannot have 2 packets "in flight", and there is no such thing as different channels that can be used in parallel (on a Z-Wave network)... If you say something is congested, then the complete network is congested.

ROUTING is handled by “the protocol”. There is a small impact by the gateway software, but for sake of simplicity you can assume that the Z-Wave chips handle this. It does not matter if you run Hass, Domoticz or Fibaro.

If the “protocol” decides “X” is a repeater, then it has used “a sensible algorithm” to come to that conclusion.

And, please do not look only at the “neighbors”, that is not the real routing algorithm. Neighbors are only CANDIDATEs and you don’t even get to see what the node has selected!

You'll probably want a Zniffer to see how that works on your network; that is the only tool. Well, in OZW 1.6 the "NOP" tester with extended status info is integrated, so that version is able to tell you which repeaters were involved...

The following text might cause a headache...

The routing algorithm in controllers stores information about successful attempts to reach a destination node, avoiding repetition of previously failed attempts.

This applies to a controller, and most parts also apply to a device.

The LWR list comprises 232 destination nodes having up to two routes each, called the LWR and the Next to Last Working Route (some older slaves do not have 232 destinations, but only 5 or so, I have to look it up).

If an LWR fails, it is 'exiled' to become the NLWR, and the current NLWR (if present) is tried as the next route. If an NLWR fails, it is removed.

The application can, instead of having both the LWR and the NLWR entries for a destination node, overload the LWR with an Application Priority Route (APR) -> this sounds interesting, but I don't know of any commercial gateway that can actually set APRs. The "PC Controller" published by Sigma/Silabs is able to do that, though.

The routing attempts depend on the Z-Wave library and transmit options used in the node; for details refer to section 3.10. Without going too much into detail, you can expect Z-Wave Plus devices to use the options "direct" (which is exactly what it says, no routing at all), "routed" (which we'll see later) and "explorer frames", which are a special kind of broadcast.

Also bear in mind, the routing algorithm will try different speeds in certain cases: 9.6/40/100 kbit/s

The source routing algorithm does not alter the topology due to failed attempts or store any statistics regarding link quality.

I guess you have figured that out, but to be on the safe side, the 3 types of routing have different frames

“direct” has no room for routing information.

“routed” has room for 4 repeaters (hops) and the payload. Hence it is called “source routed” because the source == sending node sets the route upfront. That’s why we call the hops “repeaters” and not “routers”. If users on a public forum call the hops “routers” then you know 99% for sure they are clueless about Z-Wave routing.

“explorer” has room for 4 repeaters (hop) and status bits and the payload. It uses the same routing principle but can discover routes. It uses “broadcast” to propagate through the network. It is mandatory on Z-Wave Plus but predates it!

Contrary to what you may think, although routing and “explorer frames” are powerful, Silabs recommends NOT to enable them on a controller by default, only enable them when a node has issues and then do that on a separate, low priority queue. The reason is that if a node goes dead all those alternatives take a lot of time and can easily block your network for 5 seconds and more (3.5 s for explorer alone)…

# The routing attempts done by a static controller to reach the destination node are as follows
# (note APR = Application Priority Route, AFAIK not used by any controller but settable through eg Silabs PC Controller):
#  If APR, LWR and NLWR all are non-existing and TRANSMIT_OPTION_ACK set. Try direct when neighbors with retries.
# If APR exist and TRANSMIT_OPTION_ACK set. Try the APR. If APR fails then try LWR if it exist and if it also fails
#  then remove the LWR and try direct if neighbor.
# If APR do not exist, LWR exist and TRANSMIT_OPTION_ACK set. Try the LWR. In case the LWR fails,
#  ‘exile’ it to become NLWR and try old NLWR if it exist. if the NLWR also fails, remove it and try direct if neighbor.
# If APR do not exist, LWR do not exist, NLWR exist and TRANSMIT_OPTION_ACK set. Try the NLWR.
#  In case the NLWR fails remove it and try direct if neighbor.
#  If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_AUTO_ROUTE are set then calculate up to two routing
#  attempts per entry/repeater node. If enough routes exist try a settable maximum number of tries.
# If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_NO_ROUTE are set, then direct with retries.
# If TRANSMIT_OPTION_ACK and TRANSMIT_OPTION_EXPLORE are set then issue an explore frame as last resort.
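Paraphrasing that list as code may make the precedence easier to follow; this is only a simplified reading of the quoted rules, not the actual controller firmware logic:

def controller_route_attempts(apr, lwr, nlwr, has_neighbor,
                              tx_ack, tx_auto_route, tx_no_route, tx_explore):
    """Simplified reading of the quoted routing rules; returns the attempts in order."""
    attempts = []
    if not tx_ack:
        return attempts                       # without ACK requested there is nothing to retry
    if tx_no_route:
        return ["direct (with retries)"]      # NO_ROUTE forces a direct attempt only
    if apr:
        attempts.append("APR")
        if lwr:
            attempts.append("LWR (removed on failure)")
    elif lwr:
        attempts.append("LWR (exiled to NLWR on failure)")
        if nlwr:
            attempts.append("old NLWR (removed on failure)")
    elif nlwr:
        attempts.append("NLWR (removed on failure)")
    if has_neighbor:
        attempts.append("direct")
    if tx_auto_route:
        attempts.append("up to two calculated routes per repeater entry")
    if tx_explore:
        attempts.append("explorer frame (last resort)")
    return attempts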

On a controller, you have this large table of possible routes, the one users turn into a (rather misleading imho) map. Only a controller (that is the Z-Wave chip, the main controller is node 1) can answer the question “how to get from node A to B”!

I prefer a matrix representation of the data, because that is what it is. Suppose you have 4 nodes, this gives you “neighbors” and routing info:

  1 2 3 4
1 * * *
2 * * *
3 * * * *
4     * *

This tells you node 4 is far away and the only possible route from the controller is 1 <-> 3 <-> 4
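Just to illustrate what answering that question amounts to, here is a toy route search over the neighbor matrix above (purely illustrative; the real route calculation inside the controller firmware is not exposed):

from collections import deque

# Neighbor matrix from the example above: 1-2, 1-3, 2-3, 3-4.
neighbors = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}

def find_route(src, dst):
    """Toy breadth-first search; the controller uses its own (hidden) algorithm."""
    todo, seen = deque([[src]]), {src}
    while todo:
        path = todo.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors[path[-1]] - seen:
            seen.add(nxt)
            todo.append(path + [nxt])
    return None

print(find_route(1, 4))   # [1, 3, 4] - the only way to reach node 4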

Slaves do not store neighbor information like that; they depend on the CONTROLLER feeding them up to 4 possible routes. This is what happens when you "update the mesh on a node": first, the node reports its neighbors to the controller; then the controller calculates 4 possible routes and sends them to the slave.

So in this case, the slave would receive "1 <-> 3 <-> 4", and that is the only route it has and will try in order to reach the controller.

You cannot see those routes, and AFAIK Zniffer can capture those "routing hints", as they are sometimes called, but it does not decode them.

This is what it prints after a mesh update, when the controller sends routing info to node 3 (it is a small test network):

PROTOCOL, SR Cache assignment: targetID: 1, cacheIndex: 1, SRlen: 1, hops: 05, selection byte: 10

It says: for node 1 (the controller) as destination, in the 1st (of 4 possible) route cache slots, set this route: "node 5".

I have not yet reverse engineered the meaning of the selection byte.

I know this information is correct because I positioned (and powered) my nodes in a way that node 3 and 5 would be neighbors.

BTW, don't buy a Zniffer unless you like the technical details and want to spend a vast amount of time learning about Z-Wave. It comes with zero documentation.

kdschlosser commented 5 years ago

@petergebruers

OK so correct me if I am wrong in my assessment of how this works.

If you plunked the Z-Wave protocol into the OSI model, I am going to discuss layer 3 at this point. Z-Wave is a broadcast network protocol even though there are ACK frames: any node within receiving distance of the sender gets the packet, then opens it up and sees whether it is addressed to it or not.

You did state that there is "routing" information contained within the packet. I think... LOL, a lot of the information above I am still digesting. OK, so back to the routing information. If there is routing information within the packet, then the intended destination would be at the end. Therefore the devices are routers in the loosest sense of the term: they do not decide the route, they process it, carry out its instruction and repeat the packet to its intended target. Because the packets are repeated, but with a destination other than the original, it can be considered a route.

I think you may have misinterpreted my use of the term "discovery". I am not using that term as in trying to discover a new device; sending a packet to a destination to see if it is alive, awake, or otherwise can be classified as a discovery packet. The program sends out some kind of packet to be able to determine the node's power state and availability.

All of that information makes complete sense knowing that the network can have only a single packet in flight at a time. I do believe there can still be congestion; this would fall into the realm of a whole slew of nodes waiting for the network to become available. I am sure there is a timeout for how long a node will wait, and after that timeout occurs it simply does not send the information.

Thank you to both of you; with the detail I have gotten, it explains why the controller sits and waits for a response: that is the only way to avoid collisions and to avoid having a whole mess of nodes waiting and ultimately not responding to the query that was sent. The Z-Wave protocol is like the kid riding the yellow short bus while wearing a helmet and licking the window. But seriously, the protocol is being used way past its original design. The complexity added over the years has given it a pretty robust API and allowed a plethora of devices with a massive amount of ability; the problem is the network has been pushed past what it was originally designed for. It is "maxed out" + some.

@Fishwaldo Thank you for taking the time to help me gain a better understanding of how this whole process works. With the information you have given, I will be able to produce a notification for a Python application to know when a node and all its values have been cached. This is going to make python-openzwave easier to integrate and will help remove the slow startups. It is painful to say this, but I do have some more questions.

So my next question is this: is the order in which the notifications are fired static? With python-openzwave the values are added one at a time to the Python representation of a node. Ideally it would be best to notify the user that the node has been completely loaded, instead of having them try to figure out on their own when the node is fully loaded from the cached data. In python-openzwave there is a notification that gets sent called SIGNAL_NETWORK_READY; this occurs when OZW fires the AllNodesQueried notification. The name of that notification in Python is very misleading: it would make one think they have to wait until that event gets fired to be able to do anything. This is not the case, hence the slow startups for applications running python-openzwave. I want to make python-openzwave really easy to understand and also to use, and I believe that adding a few notifications on the Python side of things can really clear up some of these issues. This is what I would like to do/see, and I am open to input on this; the more ideas and suggestions the better.

Add notifications for when a node is fully cached, including all values. Then a user would be able to iterate over the node and obtain any static information needed.

There is a notification for the node query being completed; add a new one with a different name to indicate that the node has been contacted and is available.

Add a notification for when all nodes have been loaded using their cached information.

There is a notification already for all of the nodes being queried; this notification causes pyOZW to send NETWORK_READY. Maybe we need to rename that notification. "Ready" implies that the network is not ready, and therefore not usable, before that notification gets fired, which I now know is not 100% correct. I believe this is the line of thinking that is taking place; I may be wrong. Input from any authors would be a big help in determining whether that is the case.

I am trying to identify the notifications that can be used to indicate that a node has completely loaded all of its cached data.

This is what I am seeing in the log.

2019-06-18 01:45:19.465 Detail, Node001, Notification: NodeAdded
2019-06-18 01:45:19.471 Detail, Node001, Notification: NodeProtocolInfo
2019-06-18 01:45:19.472 Detail, Node001, Notification: EssentialNodeQueriesComplete
2019-06-18 01:45:19.472 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.473 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.474 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.475 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.475 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.476 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.477 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.478 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.478 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.479 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.480 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.481 Detail, Node001, Notification: ValueAdded
2019-06-18 01:45:19.481 Detail, Node001, Notification: NodeNaming

If the order of the notifications is static, then I can use the NodeNaming notification, if and only if that notification gets fired for every node. If it does not, then I would have to use the next NodeAdded to indicate that the previous node (if any) has loaded completely. The only issue with that is there is no notification identifying when the last node being added has completed until after the query for the first node has completed; if there happens to be an issue in communications with that node, it can be a while before that notification takes place. Not ideal.

I also do not have a way of knowing that all of the cached data has been loaded until the first node has been queried, so generating a notification informing the user of this can get stalled by the scenario above. Not ideal.

I think there is a lot of confusion, and not enough documentation, in pyOZW about which notifications are fired, when they are fired and what exactly is available when each one fires. I believe that if this gets a small facelift to make it easier for an author to implement, and the documentation shows correct use, we will find that the overall performance of applications increases, leading to a significant increase in happy users.

So here goes yet another question; I know I have to be killing you guys over there, LOL. When OZW loads, the information used to create the nodes and values comes from the cache. Once a value has been added to the node, am I able to alter that value before the physical node has been queried by the controller? If so, what takes place when this happens? Does it wait to send the new information until after the Type_NodeQueriesComplete for that node, or does it wait until all nodes have been queried and then send it? Or does the change simply get ignored? Or maybe an error occurs?

I did want to mention that I have been updating the documentation portion of pyOZW, providing code examples, and doing some rewording and changing the verbiage used, in an attempt to give a better understanding. But as you can see, I myself need an education to be able to do this, hence all of the questions. I want to make sure that I fully understand it, so I am sorry you have to repeat yourselves; explaining something in more than a single way will ensure that my understanding is correct. It's how my brain works, I guess.

Thank you both for taking the time to go over this with me. I do appreciate it very much.

kdschlosser commented 5 years ago

There is a whole slew of notifications available - too many, in fact. It's confusing to me, and I have been staring at them for the past week!

The notifications should not be removed and should remain in place for advanced integration. In the documentation I want to focus on a core set of notifications:

node notifications

network notifications

value notifications

These are the notifications that would be mainly used in an application to correctly integrate with pyOZW.
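For context, this is roughly how an application hooks those signals in python-openzwave today; the signal and option names below are from memory of the 0.4.x API and may differ slightly between versions, and the serial device path is just an example:

# Sketch of wiring up python-openzwave notifications (names assumed, check your version).
from pydispatch import dispatcher
from openzwave.option import ZWaveOption
from openzwave.network import ZWaveNetwork

options = ZWaveOption("/dev/ttyUSB0", config_path="config", user_path=".", cmd_line="")
options.set_console_output(False)
options.lock()

def on_node_added(network, node):
    print("node added from cache:", node.node_id)    # static/cached data is usable here

def on_node_queried(network, node):
    print("node fully queried:", node.node_id)       # dynamic values refreshed

def on_network_ready(network):
    print("all nodes queried, %d nodes" % network.nodes_count)

network = ZWaveNetwork(options, autostart=False)
dispatcher.connect(on_node_added, ZWaveNetwork.SIGNAL_NODE_ADDED)
dispatcher.connect(on_node_queried, ZWaveNetwork.SIGNAL_NODE_QUERIES_COMPLETE)
dispatcher.connect(on_network_ready, ZWaveNetwork.SIGNAL_NETWORK_READY)
network.start()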

petergebruers commented 5 years ago

@kdschlosser I am always very happy to talk to anyone who wants to understand routing. And to help me, because I still do not have all the details. I would like to apply this knowledge to situations that seem to function "less than optimally", but that is still a bit difficult.

To an end user it is pure magic, and in 99% of all cases it simply works. When things go wrong, without knowing how routing works, people jump to conclusions because they understand how the internet works. It is not their fault; there is no document at an intermediate level explaining all this, and they have no tools. Luckily, if you have between 5 and 20 nodes, I'd say it rarely causes trouble. It is the bigger networks, and the really small ones, that do.

I can highly recommend having a look at G.9959, the Z-Wave specification for MAC and PHY. It is a free download, but it is not an easy read.

https://www.itu.int/rec/T-REC-G.9959-201501-I/en

Z-Wave is a broadcast network protocol even though there are ACK frames. Any node within receiving distance of the sender gets the packet, then opens it up and sees if it is addressed to it or not.

I would argue "broadcast" is ambiguous in this context. If you mean by broadcast that it uses (non-directional) RF, so every node in reach can "hear" the transmission, then yes, it is a broadcast network. If you mean the protocol uses "packets destined to all nodes", then I'd say those are rarely used, though it is possible to broadcast to address 255 and also to do multicast.

You did state that there is "routing" information contained within the packet.

To be more precise, there are 3 different packet types (for data transfer; there are actually more). The difference is the length and the information contained, and the standard packet carries no routing info. Normal packets are fully described in G.9959. The protocol can tell the difference between the types because of a few bits in the header.

1) Direct, as the name implies, has NO routing information.
2) Routed has almost the same header, then adds the hop count, the current hop and up to 4 slots for repeaters. The data is still at the end. Because the ACK has to travel back through the network there is a "routed ACK", and because intermediate hops can fail there is also a "routed fail packet" to signal that.
3) Explorer again has almost the same header, but adds timers and hops and uses broadcast. Zniffer can decode them, but I haven't found a full description of them yet. Explorer frames contain the payload at the end. NWI adds capabilities to explorer, to transport "a foreign HomeID", because that is what is needed to include a device (a blank device has a random HomeID and NodeID 0 and gets both assigned by the controller when you include it). The description of the explorer algorithm is terse; it is in INS13954 Z-Wave 500 Series Appl. Programmers Guide v6.81.0x.

3.4 Z-Wave Routing Principles

"The explorer mechanism builds on AODV routing with adjustments for source routing and memory footprint. Explorer frames implement managed multi-hop broadcast forwarding and returns a working route to sender as result. The application payload piggybacks on explorer frame to reduce latency."

Notice the word "broadcast" in this context means "sent to all nodes", so not only in the RF sense.

So you asked about sending "discovery" packets at intervals, and there is no such thing, because explorer frames get used when routing has failed (if you set the transmit options, see ZW_SendData discussed before), and the "exploration" also carries the data, because that avoids having to ACK and then resend the packet using normal routing. At the end of this, both nodes will have learned a new Last Working Route. LWR and NLWR are stored in NVM, but powering off a device can force it to try a direct connection first.

If by "discovery" you meant "discover possible routes the old way", then that has been tried and has caused issues for some, though I do not 100% understand the cause. To get updates of the normal routes (we are not talking about "route discovery" in the mesh/explorer sense), you would initiate a "neighbor update". This is a command sent from the controller that asks the node to "scan the neighbors" by sending them packets, at reduced power, and recording the responses. That information is collected by the controller to build the communication matrix (see previous post) and provide the device with up to 4 possible routes for the non-explorer routing mechanism.

Neighbor discovery takes time and depends on the number of nodes in the network. It actually causes a stall of the network. And it also erases the LWR, so far-away nodes need a routing attempt after doing that.

IMHO it still makes sense to do this in a controlled way, on one or two nodes, e.g. because something got rearranged and they may have selected longer, non-optimal routes. Last week I scanned my network and saw a 3-hop count on a node that is not really out of reach. I updated neighbors and it is using a direct connection now. I've talked to a few people, but nobody was able to explain why we sometimes get these "long routes" in our networks.

I highly recommend that chapter, "3.4 Z-Wave Routing Principles"; I've based my previous explanation about source routing on that part of the docs (+ testing and observation).

Therefore the devices are routers in the loosest sense of the term. They do not decide the route; they process it, carry out its instruction and repeat the packet to its intended target. Because the packets are repeated, but with a destination other than the original, it can be considered a route.

Exactly, but please do call them "repeaters". There are no "routers as you know them" in Z-Wave. There are "routed packets" and "routing slaves", which means "they can send routed packets", but there are no routers in the internet sense and there is no routing table as in an internet router. There is, in fact, a small list of routes in each device, and the route itself is in the packet, hence the name "source routed".

Because the packets are repeated but with a destination other than the original, it can be considered a route.

I will often say it like this: if node 6 is far away and needs to go through node 3 to reach the controller, I say that node 3 forwards the packet to the controller. I don't say "routes", because it does not make any routing decisions (node 6 has prepared the packet and it has node 3 in it). It just looks at the hop count, adds one, recalculates the checksum and sends the packet to the ID in that slot. The numbers "6, 3, 1" form "a route".

I think you may have misinterpreted my use of the term "discovery". I am not using that term as in trying to discover a new device; sending a packet to a destination to see if it is alive, awake, or otherwise can be classified as a discovery packet. The program sends out some kind of packet to be able to determine the node's power state and availability.

Indeed; "discovery" sounds like "trying to discover routes", and Z-Wave has neighbors and explorer for that.

Discovering whether a node is alive is possible. But for battery devices, you can only check the time it was last awake or last sent data. IIRC the spec says (recommends) that if you don't hear from a node within 3x its wake-up interval, you can assume it is dead.

For mains-powered devices you can send a NOP, and I would say that is not normally needed, though there is one use case.

This was a real question from a user: "How can I detect if the cleaning lady unplugged a Wall Plug and forgot to put it back?" Since Z-Wave does not have keep-alives, it would make sense to send a NOP at regular intervals in this case. It depends on what the plug is used for. If it is a fridge (that would be a real case), a NOP per 15 minutes would work, I think. In this example you can also simply send "turn on" and check if the node turns dead.
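A small sketch of those two heuristics (3x the wake-up interval for battery devices, a periodic NOP for mains devices); send_nop and the timings are placeholders, not an OZW API:

import time

WAKEUP_INTERVAL_S = 4 * 3600      # example wake-up interval of a battery device
NOP_INTERVAL_S    = 15 * 60       # example NOP period for a mains device (the fridge case)

def battery_node_presumed_dead(last_heard_ts, wakeup_interval_s=WAKEUP_INTERVAL_S):
    # Rule of thumb quoted above: silent for more than 3x its wake-up interval -> assume dead.
    return (time.time() - last_heard_ts) > 3 * wakeup_interval_s

def watch_mains_node(send_nop, node_id):
    # send_nop(node_id) -> bool is a placeholder for "send a NOP and wait for the RF ACK".
    while True:
        if not send_nop(node_id):
            print(f"node {node_id} did not ACK its NOP - maybe unplugged?")
        time.sleep(NOP_INTERVAL_S)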

I do believe there can still be congestion; this would fall into the realm of a whole slew of nodes waiting for the network to become available, and I am sure there is a timeout for how long a node will wait, and after that timeout occurs it simply does not send the information.

I also believe congestion can happen... it is very hard to test. On my HC2 network I do two things to avoid the problem: when switching many devices, put a 150-250 ms delay between commands, and lower the number of reports (power, lux, humidity) by tuning the parameters. Metered devices report state and energy when you switch them, so they cause more traffic than the older and dumber devices.

I think it depends on the device (how the programmer wrote it, and possibly the device itself), but IIRC the spec forbids devices to "keep trying". This is annoying if a "breach" from a motion sensor controls the lights. When people report these issues, the timestamps and RTT in the OZW log can help, but you don't always know how long that packet was "in flight". A device might be slow, or low on battery.

The complexity added over the years has given it a pretty robust API and allowed a plethora of devices with a massive amount of ability; the problem is the network has been pushed past what it was originally designed for. It is "maxed out" + some.

At the moment, having mostly 500 series (or 300 series) devices, you can use as a rule of thumb "50 data packets per second max", because per 10 ms you can send a packet and then 10 ms later have the ACK. Turning on a simple relay, for example, will do that. But a newer device will report state and meter values. Adding security multiplies the number of packets... If you have routed connections, then because of the repeater action and the routed ACK it drops again.
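Spelling out the arithmetic behind that rule of thumb (best case: direct connection, no routing, no security):

frame_ms, ack_ms = 10, 10                          # ~10 ms data frame + ~10 ms RF ACK
print(1000 // (frame_ms + ack_ms), "exchanges/s")  # -> 50, the quoted ceiling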

It really depends on a few factors, but I fully agree: you cannot say "let's install a 100-node network, accept the defaults, write some automation" and expect it to run 100% smoothly. And it's not all Z-Wave's fault. If you write 30,000 events per day, all temperature and power reports, and never purge, you'd better use something like InfluxDB or your database performance will become a bottleneck.

@Fishwaldo @kdschlosser As far as I know, HASS devices are available before the network is ready, so I think it correctly implements the idea of showing data to the user ASAP, however...

Quite a few people, including me, run this simple script:

  alias: Z-Wave network is ready
  trigger:
  - event_data: {}
    event_type: zwave.network_ready
    platform: event
  action:
  - data:
      message: Z-Wave network is ready...
      notification_id: zwave
      title: Z-Wave
    service: persistent_notification.create

You would refrain from controlling a Z-Wave device, before seeing that notification.

The reason for this? If you try to control a device before "zwave.network_ready" the response can be sluggish, and in fact you'll probably toggle a switch a few times before it responds. Leading to frustration.

Partly this is because earlier versions of HASS needed restarts to change some settings. This is being worked on, and many things have been moved to either GUI editors or can be reloaded without restarting HASS. Also, HASS has a 14-day update cycle. I think those 2 elements may explain why HASS users complain about the startup times more than users of other products based on OZW.

They expect a lot from the Z-Wave-to-MQTT component. I have no opinion yet; I do understand how that would solve their "restart" problem, but I have no clue whether there are any downsides to the MQTT approach.

kdschlosser commented 5 years ago

The reason for this? If you try to control a device before "zwave.network_ready" the response can be sluggish, and in fact you'll probably toggle a switch a few times before it responds. Leading to frustration.

Now, to complicate this a bit more... is the response sluggish because the node has not yet been queried and a command is to be sent? Or has the node been queried, and it is slow because other nodes are being queried before the command is sent?

This may be something worth investigating, if it has not yet been done. If it is slow because the node has not been queried yet, but substantially faster after it has been (even while other nodes are still being queried), we can use a notification from OZW to set a flag on the node that the application can watch for: keep the setting/getting of dynamic data on a node grayed out until it has been queried.

These are the things I am trying to get a better clarification/understanding of, in order to squeeze out every drop of performance we can while providing high accuracy and the best possible user experience. This is a hard task with so many hands in the cookie jar and no one person writing the code for all components. With constructive communication between the parties involved, and people like yourself (people who do not have an issue with keying out a book in order to provide great detail), the best possible outcome can be had.

I am not sure exactly why HASS would add another layer to their Z-Wave support. I would have thought they would fix the problems that currently exist and not add onto something that is not working optimally. It is what it is. Personally, I would fix the current Z-Wave integration and then extend/add on to it.

petergebruers commented 5 years ago

Now, to complicate this a bit more... is the response sluggish because the node has not yet been queried and a command is to be sent? Or has the node been queried, and it is slow because other nodes are being queried before the command is sent?

To be honest... I have not investigated this. Not that I do not want to, but I only have a very small test network, and I wanted to understand all the "query stages" in Node.cpp, because that is one part; I also wanted to be certain of how OZW responds when it sends a NOP to a dead or sleeping node, and much more. I don't want to jump to conclusions. Sending a NOP to a dead node will take a long time to time out, because the protocol tries all those different routing and speed options. In fact, it can take longer to time out than the 10,000 ms used in OZW. There is a lovely state diagram in one of the docs that explains the possible outcomes of ZW_SendData, if you are interested; it says the unlikely, but possible, worst-case timeout is a whopping 65 s. See Justin's remark about changing and reducing that from the 40 s used in previous versions.

If I am allowed to guess, I would say the perceived delays are due to "a mix of factors". I don't know if HASS code adds to that, it is possible, I have not investigated anything yet.

These are the things I am trying to get a better clarification/understanding of, in order to squeeze out every drop of performance we can while providing high accuracy and the best possible user experience.

I have been trying to understand this, with enough precision, since August 2018, and I am still learning. I am an old man, I am not as fast as I used to be. I keep trying.

If I ever have to move my 90 nodes from Fibaro HC2 (because it dies or some other compelling reason) I would like to be prepared and hope it runs smoothly with OZW :)

I am not sure exactly why HASS would add another layer to their Z-Wave support..

It is mandatory. The HASS designers want a "component" part that is included with the main Home Assistant code (repo). That part must not interface with anything directly; it has to interface with a package that is published on PyPI. The idea is that the HASS part is stable (and does not block due to I/O, for instance), and package maintainers can publish improvements independently.

Fishwaldo commented 5 years ago

so my next question is this: is the order in which the notifications are fired static?

No guarantees on the order they are fired, except for ProtocolInfo -> EssentialNodeQueries -> NodeQueriesComplete. In 99% of cases, during startup, you get a NodeAdded (and if you are missing a cache, maybe a NodeNew). NodeNaming depends upon the Node Naming CommandClass.

ValueAdded notifications are fired by individual command classes as they are added to a Node and its capabilities etc. are discovered. They can happen any time after ProtocolInfo.

Ideally it would be best to notify the user that the node has been completely loaded

NodeQueriesComplete is what you want then. Everything (except the Configuration CC) is read. (The reason we don't do the Configuration CC is that some nodes have dozens of configuration options and it would slow things down.) You have to call RequestAllConfigParams to get that (or refresh individual ValueIDs for the Configuration CC).
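From the Python side, a minimal sketch of pulling those configuration values on demand could look like the following (assuming the python-openzwave wrapper exposes Manager::RequestAllConfigParams and Manager::RefreshValue under the usual lowerCamelCase names; verify against the wrapper you are running):

# Sketch only: fetch the Configuration CC for one node on demand instead of
# during the interview. The replies come back later as ValueChanged
# notifications, so the data is not available the moment these calls return.

def load_all_config_params(network, node_id):
    # Queues one get per known configuration parameter on the device.
    network.manager.requestAllConfigParams(network.home_id, node_id)

def refresh_one_config_param(network, value_id):
    # Refreshes a single ConfigurationCC ValueID instead of all of them.
    network.manager.refreshValue(value_id)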

add notifications for when a node is fully cached

That happens at NodeQueriesComplete.

If the order of the notifications is static then I can use the NodeNaming notification.

Nope. Not all devices support it. Plus, you're missing the NodeQueriesComplete notification in your list there.

If it does not then I would have to use the next node added to indicate that the previous node (if any) has loaded completely

Nope - can't do that either... sorry. Example: a battery device goes to sleep halfway through the interview process (it happens). We move on to the next node after timing out, but that doesn't mean the battery device is complete yet.

if there happens to be an issue in communications with that node it can be a while before that notification takes place.. Not ideal.

We can't speed up timeouts!

You have AllNodesQueriedSomeDead, meaning some devices are dead. You also have AwakeNodesQueried, and the goal, AllNodesQueried, to know when it's all OK.

From your list of notifications, here is the mapping:

NodeCachedDataLoaded -> NodeQueriesComplete
AllNodeCachedDataLoaded -> AllNodesQueriedSomeDead or AwakeNodesQueried
NodeContactSuccess -> Type_Notification with NotificationCode = NoOperation
NodeContactFailed -> Type_Notification with NotificationCode = Dead (or Asleep)
NetworkStarted -> DriverLoaded
NetworkStartupComplete -> AllNodesQueriedSomeDead / AwakeNodesQueried
NetworkFailed -> DriverFailed
NetworkStopped -> DriverUnloaded

(see the Notification Class documentation)

once a value has been added to the node am I able to alter that value before the physical node has been queried by the controller?

Yes - with caveats. You can change the value, and it will be queued up to be sent after anything else that's in the queue before it (from the interview phases). It has a lower priority than some types of packets though, so if it was sitting in the queue and something else with a higher priority (say a Nonce_Report message) was triggered after it, the Nonce_Report would be sent beforehand.

You can't rely on the value being present until you get a NodeQueriesComplete notification. (Or you could "fudge" it and wait for a ValueChanged notification - during the interview stage, once you get a ValueChanged you know the value is fresh... but I foresee some "hectic" logic to track all of that. :)

The only things that are probably "safe" prior to NodeQueriesComplete are the node "information" methods in the Manager, things like GetNodeType etc. They are tied to the EssentialNodeQueries complete phase.

Or does the change simply get ignored? or maybe an error occurs?

The only time we "drop" a packet is for sleeping devices, when there is an identical packet already queued up to go out. That's why, if you sit there toggling your battery powered light on and off dozens of times, when the device finally wakes up it only toggles once.

These are the notifications that would be mainly used in an application to correctly integrate with pyOZW.

Haha.. I think you will find there are plenty more: Awake/Asleep notifications, ControllerCommands (Add/Remove Nodes), NodeReset, Association notifications (Groups etc.).

Now, if your head is still on your shoulders, let me introduce you to QueryStages. This is a much more fine-grained report of the interview status of a node. A list of the possible "stages" we go through during the interview process is here. A lot of apps (it wasn't my intention either) display that QueryStage to the user as feedback on how the interview is progressing. If a device is brand new (and no ozwcache exists) then we start at QueryStage_None. If there is an ozwcache and the node is in it, then we start at CacheLoad. Powered devices quickly move on to the other stages; sleeping devices sit at CacheLoad waiting for the node to send us a WakeUp message. (Ignore the deprecated notice on associations, the documentation generator screwed up there.)
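To make the QueryStage idea concrete, here is a rough sketch of how an application on top of python-openzwave might surface it (the attribute names query_stage and is_ready are assumptions about the Python wrapper and may differ between versions):

# Sketch: report per-node interview progress, e.g. for a UI status column.
# The stage names (None, CacheLoad, ..., Complete) come from Node.cpp in OZW.

def describe_interview_progress(network):
    for node_id, node in network.nodes.items():
        stage = getattr(node, 'query_stage', 'unknown')
        ready = getattr(node, 'is_ready', False)
        if ready:
            status = 'interview complete'
        elif stage == 'CacheLoad':
            status = 'cached data only (likely a sleeping device)'
        else:
            status = 'interviewing ({0})'.format(stage)
        print('node {0:>3}: {1}'.format(node_id, status))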

@petergebruers

For mains, you can send a NOP and I would say that is not normally needed, though there is one use case.

A NOP can also quickly trigger dead node detection in the controller, so it is useful not only for battery powered devices but also for mains devices.

If you write 30,000 events per day, all temperature and power reports, and never purge, you'd better use something like InfluxDB or your database performance will become a bottleneck.

You often have to remind users that Z-Wave is low bandwidth as well... More often than not, reports of sluggish behavior on their network are because they are polling all their light switches or meters every 30 seconds (and maybe, to make things worse, it's all via S0 encryption - so 1 packet becomes 3 at a minimum!) and the network is at capacity.

Is the response sluggish because the node has not yet been queried and a command is to be sent? Or has the node been queried and it is slow because other nodes are being queried before the command is sent?

If it's happening during the interview phase, it's because there might be dozens of other packets in the queue in front of it.

These are the things I am trying to get a better clarification/understanding of, in order to squeeze out every drop of performance we can while providing high accuracy and the best possible user experience.

Honestly, the bottleneck here is not OZW, pyOZW, or the application. It's the 9.6/40/100 kbit/s bandwidth on the RF side.

While we can criticize Z-Wave for not updating and getting faster, I think one of its biggest benefits is that a device from vendor A bought 9 years ago works with a device from vendor B bought yesterday. For that to happen, it has to work at the lowest common denominator, so you have to trade something for it.

I can't say any of the other protocols that have lived for so long (except maybe X10) have managed to achieve that goal. Zigbee is a great example of "let's adopt the latest technology" where backwards compatibility is an afterthought. (I have a Philips Hue, plus a Philips Wireless Dimmable Bulb - both advertise Zigbee as a feature, but do they work together? Nope. And that's from the same vendor.)

(Although I will say, Zigbee is learning the hard way and correcting its ways more recently, but the lack of certifications and a standards body ensuring everyone follows the rules means it's always going to be the Wild West in Zigbee land for years to come.)

petergebruers commented 5 years ago

@Fishwaldo "amen to that"...

kdschlosser commented 5 years ago

LOL @ wild west.

@Fishwaldo

I think we are having a clash of terms to some degree, and that is making this even more challenging. I have attached a log file of my test network. I outlined what I think is taking place at specific points in the startup process; perhaps you would be willing to adjust my thinking using that log file.

This might be a better way of conveying what I am thinking, and maybe it simply cannot be done, although I can't see why it couldn't. LOG.txt

kdschlosser commented 5 years ago

OK, so I ran another test. I used the NoOperation notification to signal that a node's cache is loaded. I do not know if this notification appears for all nodes or not; in my test setup it does, so that is what I used. This works because all of the cached data has been loaded for a node, including its values, even if the values themselves do not contain accurate information. I set a flag when a node is created that indicates whether the data returned is cached data or not; this can be used by an application to display whatever message it needs to. Once the NodeQueriesComplete notification happens, I change that flag to report that the information is no longer cached.
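Stripped down, the flag I am describing is nothing more than this sketch (the handler names are mine; you would wire them to whatever NoOperation and NodeQueriesComplete notifications your binding hands you):

import threading

class NodeDataState(object):
    # Tracks, per node, whether the data we hold is cached or live.
    # Sketch only: on_cache_loaded / on_node_queries_complete are hypothetical
    # hook points called from the application's notification handlers.

    def __init__(self):
        self._lock = threading.Lock()
        self._cached = {}  # node_id -> True while we only have cached data

    def on_cache_loaded(self, node_id):
        # The cached node and value data has been loaded (NoOperation seen).
        with self._lock:
            self._cached[node_id] = True

    def on_node_queries_complete(self, node_id):
        # OZW finished interviewing the node; the data is now live.
        with self._lock:
            self._cached[node_id] = False

    def is_cached(self, node_id):
        # An application can gray out dynamic values while this returns True.
        with self._lock:
            return self._cached.get(node_id, True)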

I also tried setting the level of the lights for one of my switches while the data was still cached data. The light changed almost instantly. However, I am not able to obtain the new value. I believe that sending commands works perfectly, but if there is a reply, the reply is lost, perhaps because OZW is not waiting for it. I do not know. Only after NodeQueriesComplete happens am I able to get non-cached data from the values. Even if I refresh the node I am still not able to get correct data.

@petergebruers I have a highly modified version of python-openzwave. If you are willing to give it a go and test it out, you will see what I am talking about. The time between when the network starts and when I am able to view static data and control a device is < 2 seconds. I do not know if I can send files over Discord or not; if I can, and if you want, I can send you my experimental copy of pyOZW to test. This version also has all of the 1.6 updates in it.

The whole signaling (notification) system works differently (and more easily), methods/properties are only available on the nodes that support them (based on command class), and values can be accessed by index name. I did have to change the names somewhat from what they are in OpenZWave, because the names need to be unique between command classes. An example is switch binary: it has an index name of level, and so does switch multilevel, yet these do two completely different things. So I changed the label in pyOZW to state. On a dimmer switch, to access the binary on/off value you would use node.values.state, and to access the level of the dimmer it would be node.values.level. Because there is only a single values container per node, we cannot have conflicting names. If you want to access the actual index number you would use node.values.COMMAND_CLASS_SWITCH_BINARY.state or node.values.state.index, if the value actually exists. The network can also be iterated over to provide the nodes:

for node in network:
    print(node.name)

The same goes for a node:

for node in network:
    print(node.id, ':', node.name)
    for value in node:
        print('   ', value.index, ':', value.label)

I also added location sorting; this is done alphabetically:

for location in network.locations:
    print(location.id, ':', location.name)
    for node in location:
        print('   ', node.id, ':', node.name)

kdschlosser commented 5 years ago

OK, so here is an output of log information as well as the information needed to populate an application. It also includes the duration from the time the network starts until we can populate an application with a node's data.

At the end it shows how long an application currently waits until it can populate the nodes. It also shows when the last node was loaded in my example, as well as the percentage of time savings.

This would be a proof of concept and nothing more. If there is some problem that can arise from doing something like this, please let me know; you guys would know better than I what OZW can and cannot do. This seems to work fine on my test setup. I am even able to set a light's level in the middle of the startup, and it does it right away. However, I am not able to get the current state. I think the Set commands work but the Get commands do not. I do not know if there is something that can be tweaked on the OZW end that would allow asking a node for data, like the state of a switch, while the startup process is still running.

EXAMPLE_LOAD.txt

kdschlosser commented 5 years ago

OK, so when I stated before that my 13 nodes were taking about 10 seconds to load, that was incorrect. I was using the NodeAdded notification to come up with that number, so that was not when the network was finished doing its thing.

I found a MAJOR slowdown in python-openzwave.

As we all know, Python is a high-level programming language, so in terms of speed it falls into the slug or snail category. What I think may have been overlooked is that any code run within the callback chain for a notification makes OpenZWave stop doing its thing until the callback returns. Now, if that callback ends up running Python code... we have a HUGE drop in performance, and I mean in every aspect of the program, because OZW is notification driven and relies almost 100% on callbacks to get information to an application.

After I ran tests over and over again I was averaging about 50 seconds startup time for 13 nodes with 1 dropped packet. I am using the startup because it is pretty heavy with notifications, so it is a good baseline; you will see the same behavior in the rest of the code that you see in the startup.

I made some changes.. well actually a lot of changes. I am pretty sure they are for the better.

This is what I have come up with

A startup is defined as the time it takes from starting the network until one of the AllNodesQueried notifications is fired.

I have turned off all logging; no information is printed to the screen until after one of the AllNodesQueried notifications.

50 startups using the original code: best time 45 seconds, worst time 90 seconds, average time 54 seconds.

50 startups with my modified code: best time 11 seconds, worst time 25 seconds, average time 20 seconds.

I did two things. I created a thread worker class to handle any outgoing information to OZW; this keeps any queries in the order in which they were created.

I have a different thread worker that handles the incoming notifications. This allows slowpoke Python to handle the notifications at its own pace, and it releases the callback so OZW can continue doing what it is supposed to do.

The way it was designed before, with a slew of thread locking and unlocking, and with the signaling of notifications on the Python side not being especially swift, those things by themselves were causing a performance problem. Now toss in the fact that applications are probably processing their GUI-related application code in the same thread the callback was called from. What do you end up with? OZW moving at the speed of an elderly person with a walker.

I knew I would eventually locate the bottleneck. I just needed to keep on testing and reading code.

You can run a very rudimentary test by adding this code block above class ZWaveNetwork in openzwave/network.py


def run_in_thread(func):
    """Decorator: run the wrapped function in a throw-away daemon thread.

    This releases the OZW notification callback immediately instead of
    holding the driver thread while Python code runs.
    """
    import threading
    import traceback

    def wrapper(*args, **kwargs):

        def do(fnc, arg, karg):
            try:
                fnc(*arg, **karg)
            except Exception:
                # log instead of letting the exception die silently in the thread
                logger.error(traceback.format_exc())

        t = threading.Thread(
            target=do,
            name='thread-' + func.__name__,
            args=(func, args, kwargs),
        )
        t.daemon = True
        t.start()

    wrapper.__doc__ = func.__doc__
    return wrapper

and then adding this line

    @run_in_thread

before this line in openzwave/network.py

    def zwcallback(self, args):

This is not the fix, but a proof of concept. The fix is going to be more involved. It is not ideal to spawn a thread for each notification, but it will show similar results.
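To give an idea of what the more involved fix could look like: a single worker thread with a queue keeps the notifications in order and still returns the OZW callback immediately, without spawning a thread per notification. A rough sketch only, not the final code:

import queue
import threading
import traceback

class NotificationWorker(object):
    # Moves notification handling off the OZW callback (driver) thread while
    # preserving the order the notifications arrived in.

    def __init__(self, handler):
        self._handler = handler  # e.g. the existing zwcallback-style handler
        self._queue = queue.Queue()
        self._thread = threading.Thread(
            target=self._run, name='ozw-notification-worker')
        self._thread.daemon = True
        self._thread.start()

    def dispatch(self, args):
        # Called from the OZW callback: enqueue and return so OZW is released.
        self._queue.put(args)

    def _run(self):
        while True:
            args = self._queue.get()
            if args is None:  # sentinel pushed by stop()
                break
            try:
                self._handler(args)
            except Exception:
                traceback.print_exc()

    def stop(self):
        self._queue.put(None)
        self._thread.join()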

Fishwaldo commented 5 years ago

If you haven't figured it out yet, you should be aware that notification callbacks run in a different thread than the thread that creates the Manager class (they run on the driver thread), so internally in the wrapper you will want to ensure proper locking between your main thread and the notification thread callbacks.

Fishwaldo commented 5 years ago

Sorry - Hit Comment too quickly.

Just be aware of the following scenario. Say a sensor sends an On packet and then an Off packet to OZW. OZW processes the On packet and raises a ValueChanged notification. Your app processes the ValueChanged notification but queues up (somehow) the actual Manager::GetValueAsBool call to be executed on your main thread, and then returns control to OZW. OZW then processes the Off packet and raises another ValueChanged notification. Again, your app processes the ValueChanged notification and signals to your main thread that it has changed.

If your main thread was busy during this time, when it finally makes the first Manager::GetValueAsBool call it gets the "off" result (as the Off packet has already been processed by OZW and written to the ValueID). The second call also returns the "off" result.

Not really an issue for pure GUI applications, but it could be a problem for automation, where it "misses" the On value.

So you should ensure you get the "details" of the event prior to returning from the notification callback, otherwise you have a possibility of "missing" events.
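In Python terms, that means copying whatever you need out of the notification while you are still inside the callback, and only handing the copy to the worker or main thread. A sketch, assuming a python-openzwave style callback where args is a dict and the value payload lives under a 'valueId' key (adjust the key names to whatever your wrapper actually delivers):

import copy

def on_notification(args, work_queue):
    # Runs on the OZW driver thread: snapshot the details now, then return.
    event = {
        'type': args.get('notificationType'),
        'node_id': args.get('nodeId'),
        # deep-copy the payload so a later packet cannot overwrite what we saw
        'value': copy.deepcopy(args.get('valueId')),
    }
    work_queue.put(event)  # the worker/main thread consumes the snapshot

That way the On and the Off each reach the application as separate snapshots, in order, even if the main thread is busy for a while.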

petergebruers commented 5 years ago

@kdschlosser I am sure you have found a bottleneck and I much appreciate your help in that direction. I would like to point out that when I look at people's OZW log files they have at least one Notification - TimeOut, and often many more, during startup. That means at minimum a 10 s delay, so that is also a contributor to the perceived slowness... To explain why users are in such a situation you'd have to analyze their setup, but one possibility is that they reset a device (locally) instead of deleting/excluding it properly. The controller keeps a reference to that node ID but it is dead, always. Apart from that, I am reasonably certain that OpenZWave has no significant bottlenecks.

kdschlosser commented 5 years ago

The problem was not in OZW, it is in pyOZW, and again the above is not the solution, simply a test. I am aware of the synchronization that has to take place so things do not get missed. I am coding up a thread worker that will exist for each node, and it is only going to run when a notification arrives. If a second notification arrives while the first is still being processed (due to application code), then any notifications for that node get queued up. It allows OZW to be released to go back to doing whatever it needs to do.
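As a sketch of that per-node ordering (building on the single-queue idea above): one FIFO and one worker per node, so notifications for the same node stay in order while different nodes do not block each other. Error handling and shutdown are left out on purpose:

import queue
import threading

class PerNodeDispatcher(object):
    # One queue + worker thread per node. dispatch() is called from the
    # notification callback and returns immediately.

    def __init__(self, handler):
        self._handler = handler  # handler(node_id, event)
        self._queues = {}
        self._lock = threading.Lock()

    def dispatch(self, node_id, event):
        with self._lock:
            q = self._queues.get(node_id)
            if q is None:
                q = queue.Queue()
                self._queues[node_id] = q
                worker = threading.Thread(target=self._drain, args=(node_id, q))
                worker.daemon = True
                worker.start()
        q.put(event)

    def _drain(self, node_id, q):
        while True:
            self._handler(node_id, q.get())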

kdschlosser commented 5 years ago

I also have an idea on how to handle dead/failed nodes. My question is this: how accurate is OZW at determining whether a node is dead/failed rather than a battery device that is simply asleep? And what is the interval at which a battery device will awaken and report to the controller?

I am also curious to know whether the code above, used in pyOZW on a HASS installation, improves overall performance. I have done some code exploring on the HASS side of things, and specifically the node creation and value creation done during the startup process does not happen in a separate thread.

Here is the code from the zwave component in HASS that registers the callbacks for tasks done during the startup of OZW:

    dispatcher.connect(
        value_added, ZWaveNetwork.SIGNAL_VALUE_ADDED, weak=False)
    dispatcher.connect(
        node_added, ZWaveNetwork.SIGNAL_NODE_ADDED, weak=False)
    dispatcher.connect(
        network_ready, ZWaveNetwork.SIGNAL_AWAKE_NODES_QUERIED, weak=False)
    dispatcher.connect(
        network_complete, ZWaveNetwork.SIGNAL_ALL_NODES_QUERIED, weak=False)
    dispatcher.connect(
        network_complete_some_dead,
        ZWaveNetwork.SIGNAL_ALL_NODES_QUERIED_SOME_DEAD, weak=False)

and here are the callbacks from HASS:

    if use_debug:  # pragma: no cover
        def log_all(signal, value=None):
            """Log all the signals."""
            print("")
            print("SIGNAL *****", signal)
            if value and signal in (ZWaveNetwork.SIGNAL_VALUE_CHANGED,
                                    ZWaveNetwork.SIGNAL_VALUE_ADDED,
                                    ZWaveNetwork.SIGNAL_SCENE_EVENT,
                                    ZWaveNetwork.SIGNAL_NODE_EVENT,
                                    ZWaveNetwork.SIGNAL_AWAKE_NODES_QUERIED,
                                    ZWaveNetwork.SIGNAL_ALL_NODES_QUERIED,
                                    ZWaveNetwork
                                    .SIGNAL_ALL_NODES_QUERIED_SOME_DEAD):
                pprint(_obj_to_dict(value))

            print("")

        dispatcher.connect(log_all, weak=False)

    def value_added(node, value):
        """Handle new added value to a node on the network."""
        # Check if this value should be tracked by an existing entity
        for values in hass.data[DATA_ENTITY_VALUES]:
            values.check_value(value)

        for schema in DISCOVERY_SCHEMAS:
            if not check_node_schema(node, schema):
                continue
            if not check_value_schema(
                    value,
                    schema[const.DISC_VALUES][const.DISC_PRIMARY]):
                continue

            values = ZWaveDeviceEntityValues(
                hass, schema, value, config, device_config, registry)

            # We create a new list and update the reference here so that
            # the list can be safely iterated over in the main thread
            new_values = hass.data[DATA_ENTITY_VALUES] + [values]
            hass.data[DATA_ENTITY_VALUES] = new_values

    component = EntityComponent(_LOGGER, DOMAIN, hass)
    registry = await async_get_registry(hass)

    def node_added(node):
        """Handle a new node on the network."""
        entity = ZWaveNodeEntity(node, network)

        def _add_node_to_component():
            if hass.data[DATA_DEVICES].get(entity.unique_id):
                return

            name = node_name(node)
            generated_id = generate_entity_id(DOMAIN + '.{}', name, [])
            node_config = device_config.get(generated_id)
            if node_config.get(CONF_IGNORED):
                _LOGGER.info(
                    "Ignoring node entity %s due to device settings",
                    generated_id)
                return

            hass.data[DATA_DEVICES][entity.unique_id] = entity
            component.add_entities([entity])

        if entity.unique_id:
            _add_node_to_component()
            return

        @callback
        def _on_ready(sec):
            _LOGGER.info("Z-Wave node %d ready after %d seconds",
                         entity.node_id, sec)
            hass.async_add_job(_add_node_to_component)

        @callback
        def _on_timeout(sec):
            _LOGGER.warning(
                "Z-Wave node %d not ready after %d seconds, "
                "continuing anyway",
                entity.node_id, sec)
            hass.async_add_job(_add_node_to_component)

        hass.add_job(check_has_unique_id, entity, _on_ready, _on_timeout,
                     hass.loop)

    def network_ready():
        """Handle the query of all awake nodes."""
        _LOGGER.info("Zwave network is ready for use. All awake nodes "
                     "have been queried. Sleeping nodes will be "
                     "queried when they awake.")
        hass.bus.fire(const.EVENT_NETWORK_READY)

    def network_complete():
        """Handle the querying of all nodes on network."""
        _LOGGER.info("Z-Wave network is complete. All nodes on the network "
                     "have been queried")
        hass.bus.fire(const.EVENT_NETWORK_COMPLETE)

    def network_complete_some_dead():
        """Handle the querying of all nodes on network."""
        _LOGGER.info("Z-Wave network is complete. All nodes on the network "
                     "have been queried, but some nodes are marked dead")
        hass.bus.fire(const.EVENT_NETWORK_COMPLETE_SOME_DEAD)

Now you can see here that the network_complete_some_dead, network_complete, and network_ready callbacks fire a signal internal to HASS. Because HASS is async, I would imagine the firing onto the bus occurs in a different thread and would release OZW.

Now look at node_added and value_added: these create GUI-related objects in the same thread the notification from OZW came in on, so until that code has finished running, OZW does not get released.
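For illustration, the value_added path could hand the heavy discovery/entity work to HASS's own scheduler instead of doing it inline. A sketch only, written as if it lived inside the same setup function as the callbacks above (so hass comes from that scope); _handle_discovered_value is a hypothetical helper, and per Fishwaldo's earlier warning anything volatile is snapshotted first:

    def value_added(node, value):
        # Runs on the OZW notification thread; keep it short.
        value_snapshot = value.data  # capture before another packet changes it

        def _process():
            # The discovery/entity work from the original callback would go
            # here, executed by HASS's scheduler rather than the OZW thread.
            _handle_discovered_value(node, value, value_snapshot)

        hass.add_job(_process)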

I have not done a completely in-depth analysis of the timeouts during startup and the notifications, so I cannot give any numbers. But I will say that there appear to be fewer packet timeouts when releasing OZW in pyOZW before the notification conversion takes place. I do not know if there is something that may cause some kind of ripple effect through OZW, originating from within pyOZW, if OZW is not released in a timely fashion. There could be some kind of timing issue between the C module and the Python interpreter that causes a problem. Strictly speaking, when a pyd file gets built for Python it has a bunch of Python attachments compiled into it along with any code from OZW. I do not know how these attachments function and whether problems can ripple through them. Just because the components that handle communications with the controller run in a different thread than the notifications does not mean that an issue in one thread cannot cause problems in the other. I know a person I can ask who may have the answer. I do not know if the code in a Python module gets run exactly the same as if it were run without being attached to Python.

We would have to come up with some kind of generic bench test where we could spit out code execution times in OZW, whether it was run "standalone" or as a Python module, and see if there is a variation between the two. That is going to be a very hard nut to crack on a live network, because changes in the network can skew the results. We would have to design some kind of test suite that mimics exactly what a controller does, so OZW would not know the difference. Then we could control the "fake" network so the input side of the test would be identical between runs, and we could compare the results. That would be the only way to know.
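One way to fake the controller side of such a bench test, at least on Linux/macOS, is a pseudo-terminal: hand OZW the slave end as its serial device and replay pre-recorded frames from the master end. This is only a sketch of the plumbing; the bytes below are placeholders, not valid Z-Wave Serial API traffic, and a real harness would replay frames captured from a live controller:

import os
import pty

def make_fake_controller():
    # Create a pty pair; OZW would open the slave path as if it were a USB stick.
    master_fd, slave_fd = pty.openpty()
    device_path = os.ttyname(slave_fd)
    print('point the OZW driver at:', device_path)
    return master_fd, device_path

def replay(master_fd, canned_frames):
    # Feed identical, pre-recorded controller responses on every test run.
    for frame in canned_frames:
        os.write(master_fd, frame)

if __name__ == '__main__':
    fd, path = make_fake_controller()
    replay(fd, [b'\x01\x02\x03'])  # placeholder bytes only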

Here is another thought: when Python loads a C DLL (pyd) it might do so into some kind of sandbox-type environment and not create things like threads in the way they would be created when the code runs outside of Python. I can see this happening.

Has anyone looked at where the issues are stemming from? Are the majority of the issues coming from projects that are using pyOZW rather than projects that are using OZW directly as C++ code? If that is the case, then the Python-to-C connections need to be examined, and some research needs to be done to see how the C code gets executed from within Python.

From the one discovery that a simple change like the code example in the previous post can cause an 80% reduction in code execution time, there is definitely some code optimization that can be done. Again, I am not stating that the code in the previous post is the right approach or the correct way to go about it. It was only to show that very simple changes can produce HUGE improvements in performance. The performance increases seen there come only from the conversion of the C notifications to Python signals and nothing more. There is no code being executed from a Python signal other than the code within the signal itself; no callbacks get run, so application code is not even tossed into the mix. If you then add the execution times of the application code, that time difference is going to increase by a very large amount.

kdschlosser commented 5 years ago

OK, so I ran some more tests. This time I used HASS and made the exact modification I posted above. I ran 3 tests modified and 3 unmodified. I set the log level to debug and got the start and stop times from the OZW log, clearing the log between each test. The stop time is the AllNodesQueried notification log entry.

Without modification

Start: 00:03:48.935 AllNodesQueried: 00:04:17.367 Duration: 28.432

Start: 00:14:05.877 AllNodesQueried: 00:14:38.662 Duration: 32.785

Start: 00:18:46.914 AllNodesQueried: 00:19:07.009 Duration: 20.095

With modification

Start: 00:08:44.097 AllNodesQueried: 00:09:00.667 Duration: 16.57

Start: 00:10:40.152 AllNodesQueried: 00:10:48.871 Duration: 8.719

Start: 00:12:14.711 AllNodesQueried: 00:12:23.932 Duration: 9.221

The two high times for the unmodified code and the one high time for the modified code are because of network timeouts; I believe two happened on each of them. The low times had no timeouts.

Time savings:

Best time from each: 56.6%. Worst time from each: 49.6%.
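To make the arithmetic explicit, those percentages are the relative reduction between the paired best and worst runs listed above:

# Relative reduction in startup time, computed from the durations above.
best_saving = (20.095 - 8.719) / 20.095 * 100    # roughly 56.6 %
worst_saving = (32.785 - 16.570) / 32.785 * 100  # roughly 49.5 %
print(round(best_saving, 1), round(worst_saving, 1))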

I cannot tell you how many times I had to run the tests for the unmodified code in order to get a run with no timeouts in it; it was a lot, I can tell you that. HASS also froze about 15% of the time during startup (a "freeze" meaning the program does nothing for > 20 seconds).

The modified code was quite the opposite: I had to run it a whole slew of times before a run with the two timeouts occurred, and it did not freeze a single time.

I am not sure there is any other type of test I can perform that will demonstrate this; the folks using HASS or python-openzwave will have to verify my results.

There appears to be no issue at all with HASS running with this change. I do not have any kind of complex setup or scenes to check against; simple on and off of the lights is all.

@Fishwaldo I checked the code in pyOZW: it does grab the data from the value in the notification thread. At the moment I do not believe that information is passed along in the Python signal, but that is very easily fixed, so this would handle the issue mentioned above.

Also, when the program runs I do not know in what thread the network gets started, and no data is knowingly passed to this thread. pyOZW can be divided into 3 distinct sections.

I am trying to set up a system where the notifications and operations are kept in sync by pyOZW, so an application does not have to worry about doing this. The current system basically only lets one thing take place at a time, and it will sit there and wait for a lock to free up if an application is using multiple threads. This is counterproductive, because the commands get queued by OZW, and OZW is what should hold up the application if a command requires specific data to be returned. If there is no specific data to be returned, then why hold up that command? Let it put the command into the OZW queue and be on its way. But that is not what is happening, because of the locks that are used.

That is the reason for issue #1852: if a node id were included in the notifications, then attaching whatever the result is to the node it applies to could be done without needing the locks that run the commands one at a time. But there needs to be a way to know which node a command notification is for; that is the whole reason for the series of locks currently in place in pyOZW, because there is no way to know. pyOZW is set up to print a log message and return False if the lock is set. The problem with this is that there is no validation of, say, the node_id that is being used. So if the lock is not in place, the call to the manager function can return False indicating there is an issue, but how does the application know what the False actually means? If the application thinks the False is happening because of the controller lock, it could be set up to keep on trying endlessly. Not a good design, I think. If we could remove the locks altogether, the application would know that False means "do not try again, validate the data you are passing to the function", and an application would never have to come back and try later because the controller is busy waiting for an earlier command to finish.

I feel that letting OZW do what it is designed to do is the best thing in that case. Creating a thread worker would only be to keep incoming notifications that are targeted at a specific node in order, to be processed in order by the application, and there would be one for any notifications that are not targeted at a specific node. I do believe that, because of the translation taking place between the C code and the Python side of things and the wildly different execution times between them, there is some kind of ripple effect causing hangs in the program and also the large number of timeouts. All I can do is provide the numbers to show the differences; what I am seeing is some kind of breakdown between the two. I do not have the knowledge to tell you exactly why it is happening, I can only provide a possible solution. You would need to locate someone who has knowledge of the low-level workings of Python and also of the connection between Python and C modules.