OpenZWave / open-zwave

a C++ library to control Z-Wave Networks via a USB Z-Wave Controller.
http://www.openzwave.net/
GNU Lesser General Public License v3.0

Driver Retry timeout #1988

Closed kdschlosser closed 4 years ago

kdschlosser commented 4 years ago

I have been trying to learn more about the internal workings of OZW and I have been watching the logs and noticed that something seems a bit off.

I have been looking through the message queue and the sending portions of the driver, and I cannot find where any explicit timeout values are set. What seems to happen is a really long pause or stall whenever a packet fails to get delivered or a response from a node fails to come back. There should not be more than a 1.5 second pause; that is the timeout for a complete expected return of data. When re-sending, the wait before the resend should be 1.5 seconds * n + 0.1 seconds, where n is the retry number. So on the first failure there should be a 1.5 second pause and then a 1.6 second wait until the resend happens. I am not sure if this can be done, but I do not see why not: nothing in the specification says the program has to sit there and do nothing during that 1.6 second wait. It should be able to move on to the next command in the queue and keep processing the queue until the 1.6 seconds has expired, then retransmit the failed packet. The wording they used, I believe, is "at least" 1.5 * n + 0.1, so waiting longer than that should not be an issue.

I don't think it is done this way in openzwave, because I sometimes get a stall of what seems like 6 or 7 seconds, and there should never be a stall of more than 1.5 seconds, as this is the timeout period for an incoming packet.

If anyone can shed a little light on this, that would be really helpful. I have one node that appears to be pretty handicapped: it doesn't seem to like responding to the first command that is sent to it. I do not know why it behaves this way, and the manufacturer is less than helpful and will not provide any in-depth information about the device. I would have guessed some kind of routing problem, but the node has 7 or so neighbors and the controller is one of them. I have tried resetting the network multiple times and also moved the device to see if that would solve the issue. Everything I have tried has failed to correct it. It is OK that the problem exists, but the long pauses with the network doing nothing are less than ideal.

petergebruers commented 4 years ago

That is not easy to answer but here are a few things:

petergebruers commented 4 years ago

BTW if you are serious about Z-Wave and want to diagnose that kind of trouble, you can build a Zniffer, e.g. based on a Z-Wave.me UZB1 dongle. Instructions are on the Fibaro forum; you'll have to register, I think, but registration is free.

[Tutorial] Z-wave diagnostics with PC Controller and Zniffer

https://forum.fibaro.com/topic/29923-tutorial-z-wave-diagnostics-with-pc-controller-and-zniffer/

EDIT: bear in mind, Zniffer comes with absolutely no explanation of what to expect, so you'll have to read and learn about routing.

kdschlosser commented 4 years ago

OK so now I am even more confused because in the serial API it does not state anything about routing and it taking >10 seconds or anything like that.

page 3 of https://www.silabs.com/documents/login/user-guides/INS12350-Serial-API-Host-Appl.-Prg.-Guide.pdf defines the "host" as the device on the opposite end of the serial/USB as the controller.

The ACK frame indicates that the receiving end received a valid Data frame. The host MUST wait for an ACK frame after transmitting a Data frame to the Z-Wave chip. In case of transmission errors or race conditions, the host may receive other frames or no frames at all. The host MUST be robust towards such events. The host SHOULD queue up requests for processing once the expected ACK frame has been received or timed out. The host MUST wait for a period of 1500ms before timing out waiting for the ACK frame.

This quote is from page 7 of the document linked above.

From the time the command gets sent, if an ACK is NOT received within 1.5 seconds the packet is considered undelivered. Plain and simple. No routing involved. No knowing what the network is actually doing on a low level. That specification defines what the host should be doing, and it is very plain and clear. I can see from watching the logs in real time that the application stalls after a command is sent, and this stall is much longer than 1.5 seconds. The command gets sent and then openzwave just sits there for a really long time. It then "times out", sits and waits some more, then resends, waiting another huge amount of time if no ACK is delivered.

The wait-to-resend timeout is defined on page 12 of the document linked above. It states that a host should not try to resend a packet more than 3 times, so that is the number we will use here. The total amount of time an application should actually stall would be 6 seconds: 1.5 seconds waiting for an ACK for each of the 4 transmissions (the initial send and then the 3 retries). These stalls should not be consecutive, but they are cumulative. They would appear as a "hiccup" instead of a stall. The wait intervals between the retries are 1.5 seconds * the retry number + 0.1, which makes for wait-to-resend intervals of 1.6, 3.1 and 4.6 seconds (9.3 seconds in total). The application does not have to sit and do nothing during these intervals; it is able to send other packets to other nodes. The total amount of time before an application should give up is therefore 6 + 9.3 = 15.3 seconds. This is not what I see. I am seeing times of greater than 20 seconds, and that is with getting a response on the first retry.
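The arithmetic in this comment can be written out as a short sketch. It uses the figures exactly as stated above (1.5 s ACK timeout, 3 retries, a resend wait of 1.5 s * n + 0.1 s); those figures are this comment's reading of the guide and have not been checked against it here. The function names are made up for the illustration, and times are kept in whole milliseconds to avoid floating-point rounding.

```python
ACK_TIMEOUT_MS = 1500  # wait for an ACK after each transmission (figure from the comment)
MAX_RETRIES = 3        # retries after the initial send (figure from the comment)


def resend_wait_ms(retry_number: int) -> int:
    """Wait before retry n, using the comment's formula 1.5 s * n + 0.1 s."""
    return ACK_TIMEOUT_MS * retry_number + 100


def worst_case_budget_ms(max_retries: int = MAX_RETRIES) -> dict:
    """Total time before giving up if no ACK ever arrives."""
    transmissions = 1 + max_retries  # initial send plus retries
    ack_waits = ACK_TIMEOUT_MS * transmissions
    resend_waits = [resend_wait_ms(n) for n in range(1, max_retries + 1)]
    return {
        "ack_waits_ms": ack_waits,
        "resend_waits_ms": resend_waits,
        "total_ms": ack_waits + sum(resend_waits),
    }


budget = worst_case_budget_ms()
print(budget)
```

With these inputs the ACK waits add up to 6000 ms and the resend waits to 9300 ms, which gives the 15.3 seconds quoted above.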

Part of the issue is that during the wait until resend, openzwave sits there waiting. There is no need to sit there and do nothing; another packet to a different node can be sent during this wait-to-resend period. I am betting the reason the wait-to-resend timeout is a multiple of 1.5 seconds is that if the next packet being transmitted runs the full timeout waiting for an ACK, the resend would then be up next.

Now, as far as the callback IDs are concerned: OZW does not seem to use these. It will still sit there and wait for a packet to be responded to before moving on. Correct me if I am wrong, but a callback ID is an identifier that is used to match a command with a response, so an application does not have to wait for a response. It can continue and send the next command packet, and when a response comes in, the callback ID is used to determine which command the response is for.
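The matching described here can be sketched as a small lookup table. This only illustrates the idea as stated in this comment; it is not how OZW is implemented, and the class and method names are hypothetical. The one detail taken from the thread is that a callback ID of 0 means "no callback", so the sketch never hands out 0.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PendingTable:
    """Remembers which request each outstanding callback ID belongs to."""
    _next_id: int = 1
    _pending: Dict[int, str] = field(default_factory=dict)

    def register(self, description: str) -> int:
        """Reserve a callback ID for a request that is about to be sent."""
        callback_id = self._next_id
        # IDs are one byte; cycle through 1..255 and skip 0 ("no callback").
        self._next_id = self._next_id % 255 + 1
        self._pending[callback_id] = description
        return callback_id

    def resolve(self, callback_id: int) -> Optional[str]:
        """Return the request a callback refers to, or None if it is unknown."""
        return self._pending.pop(callback_id, None)


table = PendingTable()
first = table.register("SET to node 8")
second = table.register("GET to node 8")
print(table.resolve(second))  # the GET, even though it was registered last
```

Whether it is safe to have more than one request outstanding at the controller is a separate question from the bookkeeping shown here.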

I am not sure if this assessment is correct, but it appears there is a single thread that handles sending and receiving, and this thread also handles some of the notifications. If this is the case, then this is the spot that is causing the application to "stop" while it waits for a response. There should be separate threads for sending, receiving and notifications. It is also impossible to deal with packet wait times from a single thread. The message that gets added to the send queue would be responsible for communicating received information back to the place that created the message so the data can be dealt with. It would also be responsible for starting a timer thread when the data gets sent, and if that timer expires it would place itself at the front of the send queue to be resent. If information is to be returned in a response from a node, then it would be responsible for collecting the proper response and, once again, for putting itself at the front of the queue.

I would imagine that this is the design purpose of the callback IDs and how the timeouts are supposed to be handled. The serial controller is going to have some kind of buffer and be able to handle this type of arrangement. This is how I would think it is done, simply to allow an application to not sit there and do nothing while waiting for responses.

If the design was to issue a command and sit there and wait for a response then there would be absolutely no purpose to having callback ids.

I am not denying the fact that there is some kind of an issue with my device. I personally think the device's software does not properly handle incoming and outgoing messages. Combined with how openzwave works, this causes long delays in sending, and if the receiving thread is the same thread that sends (and I believe it is), no data is going to be received either.

petergebruers commented 4 years ago

That's exactly what I've said: the serial protocol defines these frames: DATA (SOF), ACK, NAK, CAN - people often confuse the "ACK" with the ACK used by the RADIO protocol... They are completely different concepts, they are in different documents and use different names...

When talking about devices, ... that is in INS13954, "Z-Wave 500 Series Appl. Programmers Guide v6.81.0x"

See "3.4 Z-Wave Routing Principles" and "4.3.3.1 ZW_SendData"

When transmission ends, you'll get a "callback" and it defines (See "Callback function Parameters") txStatus as TRANSMIT_COMPLETE_OK, TRANSMIT_COMPLETE_NO_ACK, TRANSMIT_COMPLETE_FAIL

If you have a dongle with a recent SDK (and some options enabled) like the UZB1, you'll also get "extended" status (see "txStatusReport"), which will tell you wTransmitTicks; a tick is 10 ms... For example, if wTransmitTicks is 2 you've got a fast, direct connection...

Also worth studying is "Figure 9. Application state machine for ZW_SendData"

There is no mention of a 10 second timeout; that was an arbitrary choice made some time ago, because originally it was 40 seconds.

There is no good document exactly describing how long a device tries to transmit. As a rule of thumb it depends on the number of possible routes (direct, LWR, NLWR, 4 routes and theoretically an application route, all described in "3.4 Z-Wave Routing Principles") times "retries" (up to three) + backoff timer in case of collision + 3.5 seconds for "explorer" frame. I mean the worst case scenario, which is you trying to send to a dead device; that will cause a very long sequence of tries and can definitely take longer than 10 seconds.

kdschlosser commented 4 years ago

My thing is that OZW is stalling on the order of 15 or so seconds from the time it sends the first packet until it gets a response, which comes on the 1st retry. 15 seconds of stall for a single retry is a really long time, much greater than the 1.5 seconds before a fail plus the 1.6 seconds until the packet is resent.

If it was 20 seconds for all 4 packet sends then I could understand the 20 seconds. this is not the case.

kdschlosser commented 4 years ago

My point is that the serial specification, which deals with the application-to-UZB communications, states that the application should wait 1.5 seconds for an ACK and then observe a wait time before the packet gets sent again. I do not see anywhere in the OZW code where it times out after 1.5 seconds, and I also do not see anything defined for waiting before a retry or for sending a different packet to a different node during this retry wait period.

petergebruers commented 4 years ago

1) Please double check that 15 second stall in the log, because by default OZW considers a transmission failed after 10 seconds. You can change that by setting a parameter in options.xml. Try for example 5 seconds. But...
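The reply does not name the parameter. As an assumption to verify against the OZW documentation, the sketch below uses the option name `RetryTimeout` with a value in milliseconds, and the XML namespace found in the stock options.xml. The helper `set_option` and the sample file contents are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace as used by the stock OZW options.xml.
OZW_NS = "http://code.google.com/p/open-zwave/"
ET.register_namespace("", OZW_NS)


def set_option(options_xml: str, name: str, value: str) -> str:
    """Return the XML text with <Option name=... value=...> updated or added."""
    root = ET.fromstring(options_xml)
    tag = f"{{{OZW_NS}}}Option"
    for option in root.findall(tag):
        if option.get("name") == name:
            option.set("value", value)
            break
    else:
        # Option not present yet, so append it.
        ET.SubElement(root, tag, {"name": name, "value": value})
    return ET.tostring(root, encoding="unicode")


SAMPLE = """<Options xmlns="http://code.google.com/p/open-zwave/">
  <Option name="logging" value="true" />
</Options>"""

# Assumption: "RetryTimeout" is the option meant above, in milliseconds.
updated = set_option(SAMPLE, "RetryTimeout", "5000")
print(updated)
```

The same effect can of course be had by editing the file by hand; the point is only to show what kind of entry is involved.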

2) Because OZW does not implement ZW_SendDataAbort that won't always work because your controller does not stop trying. It depends on what's happening on your network and what OZW schedules next.

It is not possible for me to diagnose the issue, I can only guess you have a device issue based on your observations. I can only give you pointers to docs and a tool like Zniffer...

kdschlosser commented 4 years ago

Also, that document is not what should be used for the serial API as a whole. Some of the API may be forwarded over the serial connection, but that document outlines the Z-Wave chip communications. If an application is talking directly to the Z-Wave chip, then yes, it can be used. In openzwave's case there is a component between the zwave chip and openzwave and that is the USB bridge. This USB bridge is not 100% in alignment with the Z-Wave serial API either. This component is manufacturer specific. While most manufacturers do use the one made by silabs, so it falls close to the specification, there are varying firmwares for them, as you noted yourself in the link to the Fibaro site about flashing a Gen 5 with the Zniffer firmware and not being able to go back, so loss of functionality will occur. That loss of functionality is for the "extras" like the button and the LED.

So the end game is this: unless you know that the manufacturer of the dongle is not altering any of the behavior that is outlined in that specification, it cannot be used as the way it should be done. Because the specification only outlines the basic serial communications, this is what the manufacturer must adhere to as a minimum. There is no guarantee that information contained in the specification you provided is going to behave as outlined in that specification once it has been passed to the application; the manufacturer of the dongle can manipulate its behavior. So unless an API specification is obtained from Aeotec for the serial communication from host to dongle, it would be bad to assume that the commands sent from host to dongle are going to work as outlined in that specification.

Now, as you just stated: 10 seconds for a timeout. That is simply the timeout for the ACK, which should be no greater than 1.5. So that is an issue. If a command needs to be sent in order to stop the zwave chip from attempting to send the thing, and OZW does not support this command, then the length of time until the command is considered a failed transmit is dictated by the manufacturer of the controller. And we all know how they get things right.. BAH! So 10 seconds before the timeout for receiving an ACK; that is a whole lot greater than the 1.5 that is in the specification. But it is not out of the realm of 15 seconds for a single retry and getting a response, is it? How long does openzwave wait until retrying the packet?

And as I have stated, I do know there is an issue with the device; that is apparent. I am not trying to diagnose or repair the issue with the device. I am addressing the stalling that is stemming from openzwave.

petergebruers commented 4 years ago

In openzwave's case there is a component between the zwave chip and openzwave and that is the USB bridge.

A series 500 chip can implement serial over USB. The serial protocol does apply.

so unless an API specification is obtained from Aeotec for the serial communication from host to dongle it would be bad to assume that the commands sent from host to dongle are going to work as outlined in that specification.

It is not so much about the protocol, it is about the antenna amplifier and no possibility to go back... As mentioned on the Fibaro site... Do NOT flash an Aeotec dongle indeed! You can flash the original Silabs UZB or the Z-Wave.me UZB1 and if you want it to become a normal controller you could flash it with the reference controller image provided in the 6.81 SDK, but I do not recommend that. I'd say once you want it to be a Zniffer, it stays a Zniffer...

the timeout for the ACK which should be no greater than 1.5

You are confusing the serial protocol with device communication. Please check if the ACK is missing... I bet it is there...

Sorry to repeat myself but device communication is "Figure 9. Application state machine for ZW_SendData" - When you send a packet to a device there is no 1.5 second rule...

I am addressing the stalling that is stemming from openzwave.

This is not an OZW problem, it is the way Z-Wave implements routing. Other gateways will have the same "stall". I can tell you, if your network consists of only 2 nodes, and you send data to a dead node, the network will stall for about 5 seconds because that is how long the dongle tries to contact the dead device.

kdschlosser commented 4 years ago

This should not be a routing issue because the node is connected directly to the controller. It is a single hop to the destination. There is no ACK received: it sits there and waits after sending the command, then there is a "no ACK received" entry, and then it retries, and most of the time the node will respond. Sometimes it will have to retry again. There is also a list of "transmission times" included with the embedded SDK. It should not be that difficult to map a route and calculate how long it should take for a transmission to get where it needs to go and for a response to get back. These are measured in something like 10 milliseconds for each hop. I cannot see why a single timeout should take 10 seconds. It should be able to try every single route in less than 1 second.

I am not sure why there are 2 Serial API documents. One is more detailed as to the format of the packets, and the other is a very basic frame overview. I am thinking that the one I am referencing is supposed to outline all communications between the host application and the dongle, whereas the other covers the more specific details of communications with the internal UART in the zwave chip, and some of that flows over into the document I am referencing. I am going to ask silabs for some clarification on this. I can tell you this tho: the document I am speaking of specifically states serial communications over USB, whereas the one you are referencing specifically states the internal UART of the zwave chip. These are 2 completely different connections. I do not know how much of the document you are referencing flows over into the USB API.

The other thing is that the document you are using outlines the use of the functions that are included in the SDK. The SDK is written to only use the Keil compiler, which is specifically for embedded platforms, where an application would be directly accessing the internal UART of the zwave chip and would not be communicating over a serial USB bridge. This is where I believe your confusion is. The Serial API in the document you are referencing is not the API for the communications used between a PC controller application and a UZB stick. The document I am referencing is supposed to be used as an "overlay", if you will, and those mechanics should be implemented along with some of the ones in the document you are referencing.

This is from section 2.1 from INS13954-Instruction-Z-Wave-500-Series-Appl-Programmers-Guide-v6_81_0x

The Application Programming Guide gives guidance for developing Z-Wave application programs, which use the Z-Wave application programming interface (API) to access the Z-Wave Protocol services and 500 Series SoC resources. For host processor application development using the serial API refer also to [2]

I want to point out the last line of that quote.

and here is the footnote for [2]

[2] SD, INS12350, Instruction, Serial API Host Appl. Prg. Guide.

and this is the document I keep on referencing. It appears as tho the information contained in the document I am referencing should be used in conjunction with the document you are using. This would make the 1.5 second timeouts and the wait to retry very much applicable and the application developer is responsible for implementing it.

I am going to ask the folks over at silabs how this is supposed to be done.

Fishwaldo commented 4 years ago

Kevin, you're beating a dead horse. There are two ACKs for each packet. The first one is from the serial API, saying that the controller received the packet. (Not sent on the RF layer.)

That is the 1.5 second timeout you reference. Yes, we don’t handle that differently in OZW. If the packet from the software to the controller doesn’t make it, you got a lot of bigger problems.

The second ACK is from the target node. How that packet is routed to the node is completely under the discretion of the sending device (the controller). It may try direct, but the cat walks by and the packet gets lost. Then it may try via another node on the mesh, and someone turns on the microwave, etc etc etc till eventually the node tries an explorer frame. All of these attempts have timeouts tied to them, and each timeout is in the 10 second range.

Additionally, the base protocol has no way to differentiate between failure to deliver a packet and a node not supporting the packet. If a node does not support a command sent to it, there is no failure, nor ACK sent. There is no way at the SerialAPI level to determine this. That adds complexity to the whole discussion. (The Supervision CC addresses this to a degree) - and I suspect this is what you are seeing - it’s not a RF failure, it’s a poorly implemented device for whatever reason.

In OZW, we used to default to 40 second timeouts and 3 retries. Practical experience and countless logs (via the Log Analyzer on the website) showed that it was extremely rare for a packet to make it after 10 seconds. As retries are handled at the MAC/Transport Layer, I never saw a subsequent retry succeed.

Unfortunately the timeouts are specified all over the place: at the ITU MAC layer, in the serial API documents, in the transport documents and even in some of the CC documentation. You're concentrating on one level and Peter is trying to tell you that it’s at all different layers, even at the SDK levels.

As for sending commands in parallel - it wouldn’t work in OZW’s implementation. We are a transactional engine that works on state. As you are probably aware, OZW usually sends a GET after a SET to verify the change was taken by the node. If we were sending both the SET and GET at the same time to the node, you're either going to end up with collisions (CAN messages) or the GET is replied/dropped before the SET is processed. (You can see examples of this on older devices with S0 enabled. I assume the decryption of the packet consumes the micro, and GET messages get dropped as it’s too slow)

And as we have talked before, Zwave is half duplex low bandwidth (slower than the speed that the Serial comms between the software and stick operate at in many cases).

petergebruers commented 4 years ago

Additionally, the base protocol has no way to differentiate between failure to deliver a packet and a node not supporting the packet.

@Fishwaldo you beat me to it... I wanted to say "maybe the node does not support a certain command eg a 'get' " - sometimes this is documented in the config files

@kdschlosser for example Aeotec Minimote, snippet from config:

  <!-- COMMAND_CLASS_WAKE_UP. This class is in the list reported by the Minimote, but it does not
  respond to requests.  It still needs to be supported so that wake up notifications are handled. -->
  <CommandClass id="132">
    <Compatibility>
      <CreateVars>false</CreateVars>
    </Compatibility>
  </CommandClass>
  <!-- COMMAND_CLASS_ASSOCIATION. This class is in the list reported by the Minimote, but it does not respond to requests -->
  <CommandClass id="133">
...

If we were sending both the SET and GET at the same time to the node, you're either going to end up with collisions (CAN messages) or the GET is replied/dropped before the SET is processed.

I think quite a few people purely look at timestamps and conclude "hey, there is a gap, I can do something during that gap" but usually that is not the case. Either the network is busy, or the node is busy, and you cannot start something while the first "thing" hasn't finished yet. That would either cause CAN as @Fishwaldo says or it could do something weirder, like sending a packet but then corrupting the callback (*). There have been a few attempts at sending data in "gaps" and afaik they all failed, and that is actually documented in the spec. The very short version is this: if you do ZW_SendData, the dongle will tell you when it is ready to accept the next command. And secondly, if the dongle confirms "I have sent the 'get'", as @Fishwaldo points out and the Minimote demonstrates, you might not get a reply at all. We could argue that while waiting on a reply, you could do something else, but there is a certain risk. If you ask for sensor data, you expect an answer with sensor data, but when you start transmitting to other devices you'll keep the network busy...

(*) EDIT: I tested that long time ago, calling ZW_SendData twice, without waiting on the callback, really does weird things...
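The "wait for the callback before sending the next ZW_SendData" rule can be sketched as a queue that keeps at most one request in flight. This is a simplified illustration with hypothetical names, not OZW's actual driver code, and it leaves out timeouts and retries.

```python
from collections import deque
from typing import Callable, Deque, Optional, Tuple


class OneInFlightQueue:
    """Sends queued frames one at a time, gated on the previous callback."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send
        self._queue: Deque[Tuple[int, bytes]] = deque()
        self._in_flight: Optional[int] = None  # callback ID we are waiting for

    def submit(self, callback_id: int, frame: bytes) -> None:
        self._queue.append((callback_id, frame))
        self._pump()

    def on_callback(self, callback_id: int) -> None:
        """Call when the controller reports a callback for callback_id."""
        if self._in_flight == callback_id:
            self._in_flight = None
            self._pump()
        # Callbacks for anything else are ignored in this sketch.

    def _pump(self) -> None:
        if self._in_flight is None and self._queue:
            callback_id, frame = self._queue.popleft()
            self._in_flight = callback_id
            self._send(frame)


sent = []
queue = OneInFlightQueue(sent.append)
queue.submit(1, b"SET")
queue.submit(2, b"GET")   # held back until callback 1 arrives
queue.on_callback(1)      # now the GET goes out
print(sent)
```

A real driver would also need a timer for the case where the callback never arrives.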

Fishwaldo commented 4 years ago

I should also point out - on top of timeouts, at the RF layer, it will retry at different speeds in certain circumstances 100K -> 40k -> 9.6K.

And on top of it all, things like FLiRS/Beaming also add onto the overhead.

kdschlosser commented 4 years ago

I did also want to mention the fact that the ZWave PC controller application available on the silabs website also includes the source code. If you look at the code it in fact does institute a timeout.

It is located in /Source/Libraries/ZWaveDll/BasicApplication/Operations/SetRFReceiveModeOperation.cs on line 37, and it is a hard-coded timeout of _iterationDelay * _maxAttempts + 1000, where _iterationDelay = 200 and _maxAttempts = 2 (so 200 * 2 + 1000 = 1400).

I am unable to locate anything in the code that makes any kind of an adjustment to this.

petergebruers commented 4 years ago

While you're investigating that, have a look at the Z/IP source code too. You'll see some different choices, and I would argue that while the "PC Controller" is only a demo app, the Z/IP code is actually production code... Not that it would explain what you see, but it is educational. Sometimes there are interesting comments like:

/*TODO consider to use an exponential backoff, and do not backoff until our own framehandler is idle. Also
 * the magnitude of the backoff seem very large... this is to be analyzed. */

🙂

petergebruers commented 4 years ago

So maybe in your case it is a device that does not respond to some command... but... when it is an RF or routing issue... Did I mention you'll need Zniffer?

A while ago I noticed that a delay of several seconds sometimes happened on my network. A far-away node has a flaky connection; I am not sure what happened here, but it is definitely in a "blind spot". You'll see the controller (node 1) doing a meter Get on node 165. But the node does not receive it, or does not respond, and when it does respond the answer seems to get lost as well. This causes an avalanche of packets, and this screenshot is only a part of it... There is a seconds-long burst of "explorer frames" after this as well. So the network is blocked for about 5 seconds before it all ends.

[Screenshot: Failed_com]

In the log of the controller all you'll see is 2 lines, one saying "get" then 5 seconds later "report".

kdschlosser commented 4 years ago

When I do a man in the middle on the serial communications between OZW and the UZB stick, there is only a single ACK that I am able to see. I am going to assume this is the ACK from the controller stating that it received the command, and that it does not mean the node has sent an ACK.

Now this is captured data from the serial communications between OZW and the UZB. I just captured this data...

01 09 00 13 08 02 27 02 25 51 BE
06
{not sure how to interpret this}
01 04 
01 13
01 E8
{-------}

06
01 07 00 13 51 00 00 02 B8
06 
01 09 00 04 00 08 03 27 03 FF 22 
06 
01 09 00 13 08 02 73 02 25 52 E9 
06 

{or this}
01 04 
01 13 
01 E8 
{----------}

06 
01 07 00 13 52 00 00 02 BB
06 
01 0A 00 04 00 08 04 73 03 00 00 8D 
06 
01 09 00 13 08 02 26 02 25 69 87 
06 

{or this}
01 04 
01 13
01 E8 
{--------}

06 
01 07 00 13 69 00 00 02 80 
06 

I understand what you are saying about there being 2 ACKs, but I am only seeing a single ACK in the captures. If you are referring to TRANSMIT_COMPLETE_OK and TRANSMIT_COMPLETE_NO_ACK, this would be sent from the controller to the application as a standard frame starting with 0x01, correct? And one of these should be visible for each of the commands that get sent, correct?

The node I am having a problem with is < 15 feet from 3 other nodes, < 20 feet from 4 other nodes, and approx 25 feet from the controller. There is only a single wall between any of the nodes and the device... The device is a multi channel device and it is also a controller; I do not know where this would play into it.

You did mention the Zniffer and I will be ordering a UZB. What brand would you recommend purchasing? And would I need the Zniffer to be mobile, so plugged into a laptop vs a desktop?

kdschlosser commented 4 years ago

a packet is

SOF, packet length, request/response, function, node id/broadcast id , command/response, checksum

Is this correct?
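One way to check the outer layout against the captures above is to split a frame and verify its checksum. The checksum rule used here (XOR of every byte from the length byte through the last data byte, seeded with 0xFF) is an assumption based on the serial API guide, but it does reproduce the checksum bytes in the captured frames. The function names are made up for the sketch. Everything after the function byte is returned as an opaque payload, because its meaning depends on the function.

```python
def checksum(body: bytes) -> int:
    """XOR of all bytes from the length byte to the last data byte, seeded with 0xFF."""
    value = 0xFF
    for byte in body:
        value ^= byte
    return value


def split_frame(frame: bytes) -> dict:
    """Split a serial Data frame into its outer fields and verify it."""
    if frame[0] != 0x01:
        raise ValueError("not a Data frame (missing SOF 0x01)")
    length = frame[1]
    # The length byte counts itself and the data, but not SOF or checksum.
    if len(frame) != length + 2:
        raise ValueError("length byte does not match frame size")
    if checksum(frame[1:-1]) != frame[-1]:
        raise ValueError("bad checksum")
    return {
        "length": length,
        "type": "REQ" if frame[2] == 0x00 else "RES",
        "function": frame[3],
        "payload": frame[4:-1],
    }


request = split_frame(bytes.fromhex("01 09 00 13 08 02 27 02 25 51 BE"))
response = split_frame(bytes.fromhex("01 04 01 13 01 E8"))
print(request)
print(response)
```

For the first captured frame this yields a REQ for function 0x13 with payload 08 02 27 02 25 51, and for the short frame a RES for function 0x13 with a one-byte payload.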

kdschlosser commented 4 years ago

Would the data that I have marked be the TRANSMIT_COMPLETE_OK or TRANSMIT_COMPLETE_NO_ACK responses from the controller?

so this data 01 04 01 13 01 E8

would actually be 01 04 01 13 01 E8

meaning TRANSMIT_COMPLETE_NO_ACK for this command 01 09 00 13 08 02 26 02 25 69 87

petergebruers commented 4 years ago

I'll have more time later but let me quickly point you to how to read the docs to find that TRANSMIT_COMPLETE_OK byte in a serial stream...

Please keep reading till the end... Then reread... I know this might sound confusing.

So suppose you want to issue a certain CC and you have prepared it as a packet. You'll have to send it by wrapping that CC in ZW_SendData then wrap that in a serial Data Frame. You are correct, serial data is SOF, packet len, request/response.. and so on. Take a NOP, a very simple CC.

So CC goes in ZW_SendData frame goes in a Serial Frame

Start reading the doc:

INS13954-7 Z-Wave 500 Series Appl. Programmers Guide v6.81.0x

4.3.3.1 ZW_SendData

Scroll down a few pages and you'll see

HOST->ZW: REQ | 0x13 | nodeID | dataLength | pData[ ] | txOptions | funcID
ZW->HOST: RES | 0x13 | RetVal

If either (funcID == 0) OR (RetVal == FALSE) -> no callback

If (funcID != 0) AND (RetVal == TRUE) then callback returns with: ZW->HOST: REQ | 0x13 | funcID | txStatus

You'll notice that type of line at the end of each chapter describing a function available as "Serial API". So by searching for "HOST->ZW" in that pdf you can quickly find out what functions are available over the serial link.

Your CC is that pData array, you'll have to add the other stuff and the doc tells you the serial frame starts with REQUEST then 0x13... BTW: REQ(UEST) = 0x00 and RES(PONSE) = 0x01

So 0x01, serial packet length byte, 0x00 (REQ), 0x13, ...

Example: Node003, Sending (NoOp) message (Callback ID=0x0a, Expected Reply=0x13) - NoOperation_Set (Node=3): 0x01, 0x09, 0x00, 0x13, 0x03, 0x02, 0x00, 0x00, 0x25, 0x0a, 0xcb
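The wrapping described above (CC inside ZW_SendData inside a serial frame) can be sketched as a small builder. The field order follows the HOST->ZW line quoted earlier; the checksum rule (XOR seeded with 0xFF) is an assumption taken from the serial API guide that happens to reproduce the logged bytes. The function name and parameter names are made up for the sketch.

```python
def build_send_data(node_id: int, command: bytes, tx_options: int, callback_id: int) -> bytes:
    """Wrap a command class payload in ZW_SendData (0x13) and a serial Data frame."""
    REQUEST = 0x00
    FUNC_ID_ZW_SEND_DATA = 0x13
    body = (
        bytes([REQUEST, FUNC_ID_ZW_SEND_DATA, node_id, len(command)])
        + bytes(command)
        + bytes([tx_options, callback_id])
    )
    # The length byte counts itself plus the body, but not SOF or checksum.
    framed = bytes([len(body) + 1]) + body
    check = 0xFF
    for byte in framed:
        check ^= byte
    return bytes([0x01]) + framed + bytes([check])


# NoOperation to node 3, with the txOptions and callback ID from the log line above.
frame = build_send_data(node_id=3, command=b"\x00\x00", tx_options=0x25, callback_id=0x0A)
print(frame.hex(" "))
```

With those inputs the builder produces the same eleven bytes as the logged NoOp frame.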

And the serial API responds with 0x01, serial packet length byte, 01, 0x13, RetVal

The RetVal is TRUE if the dongle accepted the packet and it can transmit, or FALSE if the chip's transmit queue overflows. It should come very fast; it only means "can be sent", it is not an ACK in any way.

Continuing that NOP example, OZW logs:

Received: 0x01, 0x04, 0x01, 0x13, 0x01, 0xe8

The last 0x01 means OK, going to send the packet...

Now we're waiting on a "callback"

Example:

Node003, Received: 0x01, 0x18, 0x00, 0x13, 0x0a, 0x00, 0x00, 0x02, 0x00, 0xbe, 0x7f, 0x7f, 0x7f, 0x7f, 0x01, 0x01, 0x03, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01, 0x00, 0x00, 0x42

It follows the same format but we know it is coming from the controller so it is a callback. Let's analyze

0x01 SOF
0x18 number of bytes in the serial packet
0x00 REQUEST
0x13 ZW_SendData aka FUNC_ID_ZW_SEND_DATA, see Defs.h
0x0a callback ID, it matches the second last byte of our request
0x00 TRANSMIT_COMPLETE_OK

Because my dongle has a recent SDK and is "IMA enabled" (see the text after "SerialAPI targets supporting IMA" in the doc), it actually supports more bytes, but up to that status byte, TRANSMIT_COMPLETE_OK, the format is actually the same.

But in my case:

ZW->HOST: REQ | 0x13 | funcID | txStatus | wTransmitTicksMSB | wTransmitTicksLSB | bRepeaters | rssi_values.incoming[0] | rssi_values.incoming[1] | rssi_values.incoming[2] | rssi_values.incoming[3] | rssi_values.incoming[4] | bACKChannelNo | bLastTxChannelNo | bRouteSchemeState | repeater0 | repeater1 | repeater2 | repeater3 | routespeed | bRouteTries | bLastFailedLink.from | bLastFailedLink.to

So let's continue the decode a bit...

0x00 wTransmitTicksMSB
0x02 wTransmitTicksLSB
0x00 bRepeaters

So TX took 2 ticks = 20 ms (OZW decodes it for you) and used 0 repeaters aka direct connection.
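The rest of the callback can be decoded the same way. Below is a small sketch that walks the IMA field list quoted above, using the frame from my log. The function and key names are invented for the example. The 10 ms per tick comes from the 2 ticks = 20 ms reading; interpreting the RSSI bytes as signed 8-bit values is an assumption on my part, so check the doc before relying on it.

```python
import struct


def decode_send_data_callback(frame: bytes) -> dict:
    """Hypothetical decoder for 'REQ | 0x13 | funcID | txStatus | ...' callbacks."""
    if frame[0] != 0x01 or frame[2] != 0x00 or frame[3] != 0x13:
        raise ValueError("not a ZW_SendData callback")
    if len(frame) != frame[1] + 2:
        raise ValueError("length byte does not match frame size")
    payload = frame[4:-1]  # drop SOF, length, type, function ID and checksum
    report = {"callback_id": payload[0], "tx_status": payload[1]}
    if len(payload) >= 21:  # extended report, only on IMA-enabled targets
        ticks, repeaters = struct.unpack_from(">HB", payload, 2)
        report.update(
            tx_time_ms=ticks * 10,  # 1 tick = 10 ms, as in the example above
            repeaters=repeaters,
            # Assumption: RSSI bytes are signed 8-bit values.
            rssi_incoming=list(struct.unpack_from("5b", payload, 5)),
            ack_channel=payload[10],
            last_tx_channel=payload[11],
            route_scheme_state=payload[12],
            repeater_nodes=list(payload[13:17]),
            route_speed=payload[17],
            route_tries=payload[18],
            last_failed_link=(payload[19], payload[20]),
        )
    return report


callback = bytes([
    0x01, 0x18, 0x00, 0x13, 0x0A, 0x00, 0x00, 0x02, 0x00, 0xBE, 0x7F, 0x7F, 0x7F,
    0x7F, 0x01, 0x01, 0x03, 0x00, 0x00, 0x00, 0x00, 0x02, 0x01, 0x00, 0x00, 0x42,
])
report = decode_send_data_callback(callback)
print(report)
```

On a dongle without the IMA fields the payload is only funcID and txStatus, so the sketch simply returns those two values.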

I recommend the UZB1 as a Zniffer because it is inexpensive, and if you brick it by writing the wrong firmware to it, you might be able to carefully open the plastic housing, solder wires to RX, TX, GND and RESET and connect it to a cheap CP2102 to reprogram it... But better double check the filename before flashing... The pads are small. I had to do that once ;)

https://z-wave.me/uzb/

No need to buy a license, the license is only needed for running their Z-Way software.

petergebruers commented 4 years ago

So far we've talked about controller -> device but eventually the device can report back, or send unsolicited data... That enters as "0x04" aka FUNC_ID_APPLICATION_COMMAND_HANDLER (fourth byte of a serial frame, including the SOF marker)

See "4.3.1.5 ApplicationCommandHandler (Not Bridge Controller library)"

ZW->HOST: REQ | 0x04 | rxStatus | sourceNode | cmdLength | pCmd[] | rxRSSIVal | securityKey

For example

Node004, Received: 0x01, 0x10, 0x00, 0x04, 0x00, 0x04, 0x08, 0x72, 0x05, 0x01, 0x0f, 0x03, 0x01, 0x10, 0x01, 0xbc, 0x00, 0x31

So the CC is in the 8th byte, it is 0x72 aka "manufacturer specific" and it is used to identify the device and look up its config file:

Queuing Lookup on 1001.0301.010f.db.openzwave.com for Node 4

The rxStatus is a bit mask and it can tell you, for example, the received frame was "explorer" - RECEIVE_STATUS_TYPE_EXPLORE
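Here is a short sketch that splits the Node004 frame above into the fields of the ApplicationCommandHandler format and rebuilds the lookup name from the log. The function names are hypothetical. The split of the report into manufacturer ID, product type and product ID is my reading of the Manufacturer Specific report rather than something stated in this thread, but the result matches the "Queuing Lookup" line.

```python
FUNC_ID_APPLICATION_COMMAND_HANDLER = 0x04


def parse_application_command(frame: bytes) -> dict:
    """Hypothetical parser for 'REQ | 0x04 | rxStatus | sourceNode | cmdLength | pCmd[] | ...'."""
    if frame[0] != 0x01 or frame[2] != 0x00 or frame[3] != FUNC_ID_APPLICATION_COMMAND_HANDLER:
        raise ValueError("not an ApplicationCommandHandler frame")
    if len(frame) != frame[1] + 2:
        raise ValueError("length byte does not match frame size")
    rx_status, source_node, cmd_length = frame[4:7]
    cmd_end = 7 + cmd_length
    return {
        "rx_status": rx_status,
        "source_node": source_node,
        "cmd": frame[7:cmd_end],  # pCmd[]: command class, command, parameters
        "rx_rssi": frame[cmd_end],
        "security_key": frame[cmd_end + 1],
    }


def lookup_name(cmd: bytes) -> str:
    """Build the lookup name seen in the OZW log from a Manufacturer Specific report."""
    if cmd[0] != 0x72:
        raise ValueError("not the manufacturer specific CC")
    # Assumed layout after CC and command: three big-endian 16-bit values.
    manufacturer_id = int.from_bytes(cmd[2:4], "big")
    product_type = int.from_bytes(cmd[4:6], "big")
    product_id = int.from_bytes(cmd[6:8], "big")
    return f"{product_id:04x}.{product_type:04x}.{manufacturer_id:04x}.db.openzwave.com"


received = bytes([
    0x01, 0x10, 0x00, 0x04, 0x00, 0x04, 0x08, 0x72, 0x05,
    0x01, 0x0F, 0x03, 0x01, 0x10, 0x01, 0xBC, 0x00, 0x31,
])
parsed = parse_application_command(received)
print(parsed["source_node"], lookup_name(parsed["cmd"]))
```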

kdschlosser commented 4 years ago

they sure did make this portion of the API unnecessarily complex didn't they..

kdschlosser commented 4 years ago

I am thinking about purchasing the zwave 700 development kit. This would be a great way for me to dive into the low level aspects of the protocol. and it includes a zniffer and some other USB devices with it. I have to read up on it more. The only thing that is kind of crappy is I am going to have to install eclipse and their IDE in order to download the SDK.

petergebruers commented 4 years ago

> I am thinking about purchasing the zwave 700 development kit. This would be a great way for me to dive into the low level aspects of the protocol.

Yes but... SDK 7 does not contain info about the serial protocol (very likely because gateway builders should (must?) use Z/IP instead). So when I referred to the old SDK 6.81 (INS13954), that is not because I forgot about SDK 7... If you install Simplicity Studio and download the SDK (I think that is the only way to download the SDK) you'll see what I mean.

> The only thing that is kind of crappy is I am going to have to install eclipse and their IDE in order to download the SDK.

I don't remember having to install eclipse separately, iirc the installer is all-in-one.

kdschlosser commented 4 years ago

I know that simplicity studio is based on eclipse. I wasn't sure if it was a plugin or extension or if it is like Atmel studio where it comes already built into eclipse. But from what it sounds like it is already built into it and is a single installer.

The reason I am thinking about the development board is because it comes with a zniffer and a bunch of other things with it. like a UZB7 so I am looking at it this way.. Zniffer = 40.00 ish USD shipped. UZB7 = ??? none released yet.. thinking probably somewhere about 70.00 USD when it gets released.

So I am almost already a 3rd of the way to the total cost of the development kit. If I get the development kit I also would not have to fiddle fart around with flashing zniffer firmware, and the possibility of doing something wrong and bricking the thing goes away. So you have to add something in there for peace of mind.. so call it another 40.00 for bricking one of the things LOL. So now I am close to 1/2 way there.. may as well fork out the last 150 and get the whole kit.

The other thing is the kit is 2 development packages really. It has 2 mainboards, 2 radios and 2 expansion boards. So if I look at it this way I am getting 2 100% programmable zwave devices for 150 or 75 each.. that's not a bad price all in all. I could make a zwave fireplace controller or completely ditch my thermostats and interface with my furnaces using rs485 use the switch multilevel command class and get the full variable control of the fans and the modulating johnson valve (btu output) because rs485 is a "network" serial connection and the protocol used is ClimateTalk which is a network protocol for HVAC devices. 4 wires and you can connect furnaces, air conditioners, humidity control systems, thermostats all on a single bus. I would be able to offer control of my 3 furnaces using a single development board by using the multichannel command class.

Looking at the datasheet, the board comes with a plethora of input and output pins and communication types. You can use one of these things to automate just about any electronic device in your home.. ZWave enabled Microwave anyone?? LOL

petergebruers commented 4 years ago

UZB7 is listed on mouser for < 20 EUR, it is known as "SLUSB001A"

Edit: but I do agree the 700 series kit is a lot more value for money than the series 500. I don't own one, I only have the UZB7

kdschlosser commented 4 years ago

ok so probably 30 USD.

That really doesn't affect a whole lot, because even at around 100 USD for each development setup that is still a good price. And I would be able to automate things that may never get added, or if they do it will be 10 years from now.

petergebruers commented 4 years ago

Yes, but an ESP32 board costs about 5-10 $ including shipping and it is every bit as capable as the series 700 except for low power applications. You already have wifi network. You also have wifi on your phone. The SDK is free, the sniffer is free (Wireshark) and some people say, when you run HASS, this is also awesome: https://esphome.io And if you are familiar with http/mqtt/IP/REST you don't have to learn the Z-Wave serial protocol 🙂 I know this is the OpenZWave bug tracker but I like to put things in perspective.

kdschlosser commented 4 years ago

I would never use wifi for my automation.. either Ethernet or ZWave. For the most part ZWave is stable. I have not had a device not do what it was told to do.

kdschlosser commented 4 years ago

I would also need something with a decent amount of ram as well. so boards like the Arduino Uno and Max are out of the question.

kpishere commented 4 years ago

@kdschlosser Have you ever looked at or interacted with the ClimateTalk protocol? Not trivial. It is 'open' but not really. I would like to but it is a challenge.

kdschlosser commented 4 years ago

I have the protocol API and I have also done a little bit of tinkering with it. You are correct that it is an "open" API, and I do use that term loosely. ClimateTalk appears to be a dead protocol, yet it is still used by many manufacturers. These manufacturers (like Rheem/Ruud) have changed/updated the protocol to suit their needs, but the core functionality is mostly the same as it was. When I purchased my 3 furnaces, which are Rheem/Ruud, I intentionally purchased the ClimateControl2 furnaces and not the EcoNet furnaces. The ClimateControl2 furnaces have the ClimateTalk protocol, whereas the EcoNet have something else. I think EcoNet is still ClimateTalk, but a very modified version of it.

It is something I plan on diving into in a few months. One of the furnaces I have installed serves a pretty sizeable room, 3 of whose walls are exterior walls, plus the kitchen and hallway. I have no suitable place to mount the thermostat where it would not be affected by another zone or the kitchen stove, or be mounted on an exterior wall. So I need to build something that has an external wifi temp sensor. Plus there is not a single thermostat made that will allow me to adjust all 3 from a single thermostat, which is another feature I would like to have. Since the furnaces internally control the cooldown runtimes and things like that, I will not have to worry about coding in consideration for delta-t float. I simply need to convey the current temperature to the furnace and the furnaces will handle when they turn on and off.

There is also the fact that there is no thermostat made that handles modulating gas valves and EC blowers beyond offering 2 additional stages of heat and cool, whereas my furnaces are adjustable anywhere from 20K btu output to 60K btu, and the blower from 20% to 100%, all done in 1% increments of that range. This allows for much finer grained control and also being able to display the status.

A lot of the ClimateTalk protocol is geared at error reporting if there is an issue. There really is no need to display this information at the thermostat other than for convenience, because there is a readout on the furnace or AC that will also report the problem there. I can code in the ability to display that information but it is not 100% needed.

I also have 2 electric radiant heat floors. one in my office and one in the master bath. These I would also like to have the ability to control from the same control point as the furnaces.

kpishere commented 4 years ago

@kdschlosser Well, sounds like you've got some good scenarios for testing. In my case, I've got a Goodman one (the better brand name, I forget what it is) but have no thermostat etc. I bought a mini-split AC unit, hacked the internal part and put it in the furnace plenum. I get SEER 20 this way with quiet AC for cheap. It needs the controller to modulate the furnace fan though. So I have no ClimateTalk thermostat and am starting with modeling that etc. :) Entirely from the spec. I've started but it is a daunting task. The start of it is here: https://github.com/kpishere/Net485. I'm paused on the master negotiation right now but plan to get at it soon enough. Maybe this is a good start for yourself? If we collaborate, great! Check out here to see what was done to the AC unit: https://github.com/kpishere/homie_heatPump/wiki

Fishwaldo commented 4 years ago

Closing this. Please take the discussion somewhere more appropriate than an issue tracker.