fairecasoimeme / ZiGate

ZiGate is a universal Zigbee gateway
http://zigate.fr

Added build configuration and updates to support 200 device network. #317

Closed: schrodingersket closed this pull request 3 years ago

schrodingersket commented 4 years ago

Tested with 200 physical devices. The configuration also allows commissioning of up to 30 end devices with no routers present.

I did have to disable registration of endpoints other than ZLO in order to keep the RAM usage within the limits of the 5168 chip; I left that change out of this pull request in anticipation of that being resolved by PR #315 and Issue #310 .

pipiche38 commented 4 years ago

My understanding was to get up to 40 devices directly connected to the gateway. If we are moving down to 30, I don't see the benefit. Also, there is no feature inside the gateway to force provisioning on routers before itself.

schrodingersket commented 4 years ago

@pipiche38 This pull request adds support for a different build configuration, which supports 30 devices connected directly to the gateway and up to 200 devices total allowed in the network when ZigBee routers are included in the network. It does this by utilizing the host to manage EEPROM, and it is adjacent to the default configuration which utilizes internal EEPROM; this build configuration is not meant to replace the current configuration, as that would require users to re-commission their network anyway to use the new memory management capability. This pull request simply adds support for a different build type to support networks which are comprised primarily of routers.

I can try to push the number of devices that can join directly to the coordinator in this configuration to 40, but the RAM limitations are currently a little tight for that. Pushing the total node count to 200 already took some finagling.

schrodingersket commented 4 years ago

> Just a comment. You have updated PDM_RECORD_LENGTH from 256 to 244. Any reason for that? @fairecasoimeme and I had indeed worked toward the 256.

Sure - that was to keep the entire message length (with all fields - record id, block id, etc.) under the MAX_BUFFER_SIZE which sets the maximum serial message length. I may be understanding incorrectly, but I'd expect that if the message length is too high, bytes will be lost when saving/loading messages.

> Could you also clarify the change on the PDM save ("Removed PDMSaveMessage confirmation loop in order to bring per"). What do you mean? What did you change? Is the logic impacted?

The logic isn't impacted at all here - the current behavior is to wait for a message of type 0x8200 for each and every record and block that is to be saved. This caused a pretty significant performance hit and significantly increased the amount of data being sent over the serial line for very little gain. This pull request removes the requirement that a 0x8200 message is received before sending the next block to the host to be saved.
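The difference between the two save flows can be sketched as follows. This is an illustration only: `send_block` and `wait_for_ack` are hypothetical stand-ins for the real serial-protocol helpers, and `0x8200` is the save-confirmation message type from this discussion.

```python
# Illustrative sketch only: send_block and wait_for_ack are hypothetical
# stand-ins for the real serial-protocol helpers in the host code.

def save_record_with_per_block_ack(blocks, send_block, wait_for_ack):
    """Previous behaviour: one 0x8200 round trip for every block."""
    for block in blocks:
        send_block(block)
        wait_for_ack(0x8200)  # blocks the sender until the host confirms

def save_record_streaming(blocks, send_block):
    """Behaviour after this change: stream every block without waiting
    for a per-block confirmation from the host."""
    for block in blocks:
        send_block(block)
```

The streaming variant halves the serial traffic per block and removes one round-trip delay per block, which is where the performance gain described above comes from.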

pipiche38 commented 4 years ago

As for PDM_RECORD_LENGTH, my understanding was that it was increased along with MAX_BUFFER_SIZE, so that we also optimized the Save/Load part by reducing the number of iterations in the loops.

For the 0x8200, that means the host has to remove that wait. However, I would suggest keeping it for when the last block is received and the record is complete. This will be consistent with the LoadPDM part.

schrodingersket commented 4 years ago

> As for PDM_RECORD_LENGTH, my understanding was that it was increased along with MAX_BUFFER_SIZE, so that we also optimized the Save/Load part by reducing the number of iterations in the loops.

Absolutely! PDM_RECORD_LENGTH should definitely be as high as possible for exactly the reason you mention - in fact, the original source code from NXP that I used to write pdm_host.c used 128 as the record length, so we're already increasing from that value. I think that 246 is the limit though (updating this PR for 246 instead of 244; reasoning is below), as more than that causes the message body to exceed MAX_BUFFER_SIZE, which is 256. Here's how I got the number:

| Bytes  | Description            | Size [bytes] |
|--------|------------------------|--------------|
| 0-1    | Record ID              | 2            |
| 2-3    | Total Size             | 2            |
| 4-5    | Total Number of Blocks | 2            |
| 6-7    | Current Block          | 2            |
| 8-9    | Block Size             | 2            |
| 10-255 | Record Data            | 0-246        |

I think I dropped a 2 in my initial math - so PDM_RECORD_LENGTH should actually be 246 instead of the current 244. Let me know if you agree with this logic. If not, please help me understand what I'm missing (:
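The arithmetic can be checked directly; the constants below are taken from the table above (five 2-byte header fields ahead of the record data, all fitting inside a 256-byte serial message):

```python
# Checking the record-length arithmetic from the table above.
MAX_BUFFER_SIZE = 256  # maximum serial message length

# Five 2-byte header fields: record id, total size, total number of
# blocks, current block, block size.
HEADER_BYTES = 5 * 2

PDM_RECORD_LENGTH = MAX_BUFFER_SIZE - HEADER_BYTES
assert PDM_RECORD_LENGTH == 246
```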

> For the 0x8200, that means the host has to remove that wait. However, I would suggest keeping it for when the last block is received and the record is complete. This will be consistent with the LoadPDM part.

That's fine by me. The main performance hit was in waiting for acknowledgement on each individual block, so an acknowledgement on the full record set shouldn't impact performance very much.

pipiche38 commented 4 years ago

On the 0x8200, I think there is a potential issue. I need to look deeper, but for me the firmware needs to request each block; otherwise, there is a risk of flooding the HW.

schrodingersket commented 4 years ago

Can you elaborate on the issue you're considering?

EDIT: Also, my Win10 virtual machine is currently bricked, so I'm going to be a little slow adding updates to this pull request until that's resolved.

EDIT 2: No longer bricked. I can actually be useful now!

schrodingersket commented 4 years ago

I'm now remembering why 244 instead of 246 - the latter doesn't work when loading the PDM back in from a warm start.

pipiche38 commented 4 years ago

On the 0x8200, I might have mixed up Load and Save. So indeed, I agree with you: the 0x8200 from the host to the ZiGate is not needed, and removing it will indeed save some time. On the other side, there is no flow control at the HW/SW level to manage the serial line, and if we remove the 0x8200, I suspect that after having sent the first 0x0200 save recordId/blockId, we will get the next one, and so on. If the host is not processing at the right speed, there is a risk of saturation and loss of data. Lastly, if we have a CRC error on the host side while receiving a block, we won't have any way to tell the ZiGate that this block must be re-transmitted.

Short question @schrodingersket: is your host based on a Python lib, or are you on native C/C++ code? What speed are you using for the serial line, 115200?

cc: @fairecasoimeme

schrodingersket commented 4 years ago

> On the other side, there is no flow control at the HW/SW level to manage the serial line, and if we remove the 0x8200, I suspect that after having sent the first 0x0200 save recordId/blockId, we will get the next one, and so on. If the host is not processing at the right speed, there is a risk of saturation and loss of data. Lastly, if we have a CRC error on the host side while receiving a block, we won't have any way to tell the ZiGate that this block must be re-transmitted.

You are spot on with this assessment; if the host isn't able to keep up with the incoming PDM save messages, they are lost. That said, once the coordinator is up and running, the only tables that really change are the device tables, which are saved as full records fairly frequently. Thus, a missed record will be re-transmitted upon the next save. In practice, this seems to happen for normal networks every ~10-20s from my experience. So missing blocks seems to be more of a theoretical problem than it is a practical problem.
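That tolerance can be sketched with a minimal (hypothetical, not the actual host code) reassembly routine: incoming blocks are keyed by block index per record, and a record that never completes is simply discarded until the next periodic full save replaces it.

```python
def assemble_record(blocks, total_blocks):
    """Reassemble one PDM record from its received blocks.

    blocks: dict mapping block index -> payload bytes for one record.
    Returns the record payload, or None if any block was lost on the
    serial line; in that case the record is dropped and will simply
    arrive again on the next periodic full save.
    """
    if len(blocks) != total_blocks:
        return None  # incomplete record: wait for the next full save
    return b"".join(blocks[i] for i in range(total_blocks))

# A save with a dropped middle block yields nothing yet...
assert assemble_record({0: b"aa", 2: b"cc"}, 3) is None
# ...and the next complete save supplies the full record.
assert assemble_record({0: b"aa", 1: b"bb", 2: b"cc"}, 3) == b"aabbcc"
```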

> Short question @schrodingersket: is your host based on a Python lib, or are you on native C/C++ code? What speed are you using for the serial line, 115200?

I'm using pyserial at 115200 baud to keep compatibility with ZiGate defaults; speed hasn't been an issue at all so far. I've seen a few bytes get lost here and there, but it doesn't seem to impact overall operation for the reasons listed above.

badzz commented 4 years ago

Hi @schrodingersket, and thanks. Could you break this into smaller chunks so that @fairecasoimeme and I can review and modify if/when needed? Start by creating a PR for adding the new device only, as it is the most critical (and simplest) part.

Moreover: 1/ did you create the new files yourself, or do they come from NXP? You put yourself as author but kept the copyright from NXP. 2/ Is the new device sending a new ZiGate API message? If not, and it is only attribute report/read/write, why do we need new device files? The firmware already deals with this regardless of the device type. Thx

schrodingersket commented 4 years ago

> Hi @schrodingersket, and thanks. Could you break this into smaller chunks so that @fairecasoimeme and I can review and modify if/when needed? Start by creating a PR for adding the new device only, as it is the most critical (and simplest) part.

I assume you are referring to the Develco VOC cluster support here - is that correct? If so, I opened a pull request for that a while back, but branched off of that for this work. Each commit in this pull request should consist of a small, functional chunk of work that's easier to review.

> Moreover: 1/ did you create the new files yourself, or do they come from NXP? You put yourself as author but kept the copyright from NXP.

It's a bit of both. I reached out to NXP for support on building this out initially; I started with the file from the JN-AN-1223 appnote, which contained support for Host PDM management for the old APIs, but modified that quite heavily to support the new PDM APIs utilized in JN-AN-1216. I wanted to update the author so that there wouldn't be confusion as to who to reach out to for questions that arise in that implementation.

> 2/ Is the new device sending a new ZiGate API message? If not, and it is only attribute report/read/write, why do we need new device files? The firmware already deals with this regardless of the device type. Thx

I don't think I added any new devices to this pull request. Which one(s) are you referring to? I did update the clusters on the coordinator, but again, that's due to this pull request being a branch off of another feature branch which added support for a couple of manufacturer-specific air quality clusters.

schrodingersket commented 3 years ago

@badzz @pipiche38 - it looks like device-specific clusters are dropped in favor of propagating the 0x8002 message; since that's the case, will this code change be accepted if this pull request is closed and re-opened with just the large network/Host PDM modifications?

schrodingersket commented 3 years ago

Closing in favor of implementing in https://github.com/nimbus9inc/ZiGate.