I'd guess you got it right. I just misunderstood/got confused.
It very much seems like SDK 6 is the cause.
What came to my mind:
What controllers do you all have? Are there RAM size differences between the CC1352 and the CC2652?
I have the CC1352P2 Launchpad
CC2652 here, and stable
ZHA?
Z2mqtt
It seems to affect only ZHA
Maybe the way ZHA talks to the coordinator causes issues with the increased routing table. Maybe ping the ZHA maintainer as well? 🤔
Did that @ https://github.com/zigpy/zigpy-znp/issues/165 Still, I believe the root cause is in the firmware
I found one change that I forgot to revert which was the number of retries:
// Increase frame retries
#define ZMAC_MAX_FRAME_RETRIES 7
#define NWK_MAX_DATA_RETRIES 4
@dumpfheimer can you check with: fws.zip
sure, thanks!
Crashed again after apparently ~14h
I think I am able to kind of reliably crash the controller. I wrote a python script that is meant to keep the main thread busy. I am using the velux integration for this. What it does is:
It crashed the controller once instantly and once after a minute or two.
I reenabled my "dedicated thread test" and it seems like this prevents the controller from locking up with the same mechanism. I do still get quite a few asyncio.exceptions.TimeoutError but it stays up.
https://github.com/dumpfheimer/zigpy-znp/tree/dumpfheimer/dedicated-thread
Perhaps we should report this bug to Texas Instruments? While I understand there is some timing problem either with ZHA's serial handling or with the Z-Stack runtime configuration, it should not be possible to completely crash the firmware like this.
@puddly before we can do that we first need more information about what really goes wrong. If someone can do some debugging using a CC1352/CC2652 Launchpad + SimpleLink SDK, that would be great.
@TheJulianJES Do you have smart plugs with energy reading? Or do you have routers that are regularly unplugged?
I do have quite a few smart plugs with energy measurement. They also get polled at the same time (I think every 30 seconds), as they use HA polling. Only ZHA lights get polled in somewhat randomized intervals (and thus not at the same time). (Maybe the "polling flooding" helps for causing the "crash"?)
However, no routers that I have should get unplugged at all.
Interesting. Do you also have Tuya TS011F?
I thought maybe it could have to do with route discovery (every 30s) when the devices are unplugged. But if yours are always plugged in that's not quite it.
I only have one Tuya TS011F plug (where, additionally to the standard EM cluster attributes that always get polled, smartenergy current_summ_delivered also gets polled)
It's a BW-SHP16 plug if I'm correct. Other than that, I have some BW-SHP13 Tuya plugs with the better/older firmware
I ordered a CC1352P7 which should have more RAM. Maybe that will help.. Would you accept a PR with the diff for the P7 or is that out of scope?
P7 is out of scope for the moment, unless it becomes widely available (excluding dev boards)
Were any of the changes in your firmware patch memory related or should I be able to/am I supposed to use them all with the P7?
@dumpfheimer yes, some are, e.g. NV pages and heap size, but I think you should be able to take them all.
Thanks!
When I try to restore a network backup it works up until it says "Waiting for NIB to stabilize" but the NIB never updates. Do you have an idea where I could start to look?
@dumpfheimer that sounds something ZHA specific, I suggest asking the ZHA developers
Apparently my NVOCMP_NVPAGES was too small to contain all routing tables and stuff
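To illustrate what I mean, a minimal preinclude.h sketch (the page count is just an example that happened to be enough for my table sizes, not a recommendation):
// Reserve more flash pages for NVOCMP non-volatile storage so the enlarged
// network/security tables fit, e.g. when restoring a backup
#define NVOCMP_NVPAGES 4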
Hey @Koenkk could you maybe explain the reason for this addition in z-stack? I removed it and everything seems to work perfectly.
On a side note: I am seeing huge performance improvements with the P7 and a few parameter changes. Hope the P2 has enough RAM to handle the changes.
@dumpfheimer you can find the reason for this change here:
I'm very interested in what params you changed
I made quite a few changes, also some in zigpy. So without further testing it will be difficult for me to isolate the best performing changes.
I do have a personal "prime suspect", though:
I set NWK_MAX_DEVICE_LIST and ZDSECMGR_TC_DEVICE_MAX to 130, which is a bit over the number of devices I have. Also, I set the aging of source routes to 150s (and maybe even disabled it in the meantime) and reduced the concentrator interval to 60s. My assumption was that the concentrator routing information will be used before the source route, so the source route would only be a fallback anyway. BUT my next test is to reduce the source routing table to a small size and expire entries very soon, because my next idea was that source routing might only be interesting for ACKs?
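For reference, roughly what that looks like in preinclude.h terms. These are my own numbers, sized for my ~120-130 device network, not a general recommendation; the macro names are how I believe they appear in the firmware's preinclude.h, and CONCENTRATOR_DISCOVERY_TIME in particular is my guess for the "concentrator interval":
// Table sizes bumped to a bit above my device count
#define NWK_MAX_DEVICE_LIST         130
#define ZDSECMGR_TC_DEVICE_MAX      130
// Let source routes age out after ~150 s
#define SRC_RTG_EXPIRY_TIME         150
// Broadcast many-to-one route requests every 60 s
#define CONCENTRATOR_DISCOVERY_TIME 60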
Anyway, the latency of switching my living room with ~20 lamps has gone down significantly. Also, the lights stay responsive. Before, they started getting slow if they were not used for some time.
I was advised not to increase the NWK_MAX_DEVICE_LIST: https://e2e.ti.com/support/wireless-connectivity/zigbee-thread-group/zigbee-and-thread/f/zigbee-thread-forum/1119300/simplelink-cc13x2-26x2-sdk-znwktablefull
Route ageing may indeed be increased (SRC_RTG_EXPIRY_TIME), I suggest also checking with 255. But from what I understood, even with a low expiry time the routes stay until more room is needed.
It would be interesting to see what happens if you revert your zigpy changes, do the delays come back?
Not sure about the internals but I would argue that it is better to have more traffic on the network than having to look up the route of a lamp when it is supposed to toggle.
One thought that came up when reading https://www.ti.com/lit/an/swra650b/swra650b.pdf was that they were optimizing their network for sensors, i.e. devices sending messages to the coordinator. They did not optimize the latency from the coordinator to a router. In that case, for example, the source routing table could be a couple dozen entries in size. If they had 400 lamps which they wanted to reach with low latency, they would have had to keep a route to every device in memory, even if it causes more "idle" traffic.
Hope I am not completely off
I assumed you were referring to this comment: "MAX_RTG[_SRC]_ENTRIES and ZDSECMGR_TC_DEVICE_MAX are important as well, although these appear to already be accounted for. NWK_MAX_DEVICE_LIST and MAX_NEIGHBOUR_ENTRIES should not be too large as they could cause instability on the ZNP/ZC given the excessive amount of traffic caused by directly neighboring/associated devices. If possible, you could debug the ZNP to determine which Z-Stack function returns the error. Regards,"
I was referring to that indeed, your reasoning makes sense. I'm still interested in what changes make the real difference here.
I'll create a repo soon, sorry for the delay.
I think one key change was setting MAC_CFG_RX_MAX to double the TX buffer size (I now have 32/64). This allows the answers to be buffered, plus additional incoming traffic. My settings are probably excessive, though. If the RX buffer is smaller than the TX buffer, it might easily overflow with ACKs or similar, causing retransmissions and delays; at least that was my idea.
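A sketch of that buffer split in preinclude.h terms. MAC_CFG_RX_MAX is the macro I mentioned; I am assuming MAC_CFG_TX_MAX is the matching TX-side macro, and 32/64 are just my current values, not tuned recommendations:
// Keep the receive pool larger than the transmit pool so replies/ACKs
// from many routers can be buffered instead of being dropped
#define MAC_CFG_TX_MAX 32
#define MAC_CFG_RX_MAX 64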
Created the repo now. I removed the ti_* files and the led command because they seem device specific.
But the params are where you put them: https://github.com/dumpfheimer/ZNP-Firmware/blob/master/source/znp_LP_CC1352P7_4_tirtos_ticlang/Stack/Config/preinclude.h
Thanks!
ZNP_UART_BAUD change; I guess you also changed the baudrate on the z2m side?
ZMAC_MAX_FRAME_RETRIES; I expect that MAC frame retries are much faster compared to z2m retries. Earlier tests by Sonoff produced more reliability by increasing this.
NWK_MAX_BINDING_ENTRIES; since we don't bind the coordinator to anything, I don't see a reason to increase it.
MAX_RTG_SRC_ENTRIES; it was recommended by TI to always have this larger than MAX_RTG_ENTRIES.
DIS_GPRAM; I cannot find this macro mentioned in the source code; where did you get this from?
Hereby the recommendations + rationale I received from TI in April 2019:
Sorry, I had written an answer but it must have gotten lost.
ZNP_UART_BAUD change; I guess you also changed the baudrate on the z2m side?
yes I updated zigpy too to match the baud rate. Have had no issues with ~1M baud rate
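In the firmware that is just one define. The 1,000,000 value is what I am running, and I am assuming the macro takes the raw baud rate; the host side (zigpy's serial settings in my case) has to be set to the same value:
// assumption: plain baud rate, must match the host's serial configuration
#define ZNP_UART_BAUD 1000000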
ZMAC_MAX_FRAME_RETRIES; I expect that MAC frame retries are much faster compared to z2m retries. Earlier tests by sonoff produced more reliability by increasing this
True! I expected the network to be very reliable, and a route to fail not because of a transient error but because of one that would likely persist for the foreseeable future, which would make the retransmission very likely to fail as well. Nevertheless, I set it back to 3 because network throughput does not seem to be the issue.
NWK_MAX_BINDING_ENTRIES since we don't bind the coordinator to anything I don't see a reason to increase it
Thanks! I was not sure what this is. Set it back to 1.
MAX_RTG_SRC_ENTRIES it was recommended by TI to always have this larger than the MAX_RTG_ENTRIES
IMHO it is fine to have this set to the maximum number of devices you want to support in the network. Unless there is a memory issue more should be better?
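A hypothetical sizing that follows that advice (example numbers for a ~130-device network, not something I have verified):
// Keep the source-route table strictly larger than the routing table,
// as recommended by TI above
#define MAX_RTG_ENTRIES     130
#define MAX_RTG_SRC_ENTRIES 200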
DIS_GPRAM I cannot find this macro mentioned in the source code; where did you get this from?
It seems like the CC1352 has a part of memory dedicated to caching, which takes away some RAM but speeds up computation. I was playing around with this, but with the P7 memory is not an issue anymore. 0 is the default mode, which needs some memory but speeds up computation. On the P2 this might be more interesting.
All in all I would say the buffers are most likely the reason for my network stability boost. Especially having the RX buffer set to about triple the TX buffer makes responses much more reliable and reduces the "time in flight" from a zigpy perspective, as well as the load on the network caused by retransmissions. I have set them to TX 128 and RX 128*3 right now.
Another thing that speeds things up is using DataRequest asynchronously. You can check out the discussion I started over at zigpy-znp
BTW I set the repository to private because I woke up some night and was afraid of legal stuff... Will take a look at that soon
Could you invite me to the repo (just read-only access)?
Done
@dumpfheimer thanks!
I have an issue with this coordinator FW as well. I see a lot of "NWK_TABLE_FULL" errors, which has the effect that some actions do not reach the target device. Example: I want to switch on a light with a button, but I need to press the button twice until the light turns on. Afterwards, the light is responsive. If I do not trigger the light for some time, I see the effect again. It looks like the route to this device is swapped out in favour of another route entry and restored only by the failing trigger.
I currently have 131 devices and a Sonoff Plus Stick P with a TI chip.
I'm ready to test FW builds to help identify the problem (based on CC1352P2_CC2652P_launchpad_coordinator).
My problem could be linked to the one reported by dumpfheimer.
Tried the new dev build on my network. With ZHA, it still crashes. (After about 1 hour this time)
Reverted back to CC1352P2_CC2652P_launchpad_coordinator_20220507.hex (6.10 SDK), as that seems to be stable (for me at least).
My stick has been stable since I upgraded (immediately after the new dev build became available), but it was already stable before. I have seen no more NWK_TABLE_FULL errors so far.
The new dev release works great here too. But I've noticed with the latest 2 dev releases that I can only update 1 light at a time, and even then it will fail sometimes. There are quite a lot of timeouts during updates, and the network will be unresponsive for a bit. Previously I remember updating 3 or more at a time without issues. It could of course be something else in my deployment causing this, but I just thought I'd mention it in case someone else has had similar experiences.
When flashing to ZZH I'm getting this error:
Opening port /dev/tty.usbserial-10, baud 500000
Reading data from CC2652R_coordinator_20220928.zip
Cannot auto-detect firmware filetype: Assuming .bin
Connecting to target...
CC1350 PG2.0 (7x7mm): 352KB Flash, 20KB SRAM, CCFG.BL_CONFIG at 0x00057FD8
Primary IEEE Address: XX:XX:XX:XX:XX:XX (there was my MAC)
Performing mass erase
Erasing all main bank flash sectors
Erase done
Writing 176954 bytes starting at address 0x00000000
ERROR: Invalid data size: 176954. Size must be a multiple of 4.
Reading data from CC2652R_coordinator_20220928.zip
You're flashing the .zip file itself. Unzip it and try the enclosed .hex file.
My fault. So basic :S Thank you!
Updated today along with latest stable z2m - ZigStar LAN gateway so CC2652 - 136 devices - stable so far.
After seeing in the changelog that the routing table sizes have increased I wanted to test the latest DEVELOPMENT firmware.
I am having issues which I believe are caused by the firmware update.
It seems to me that the firmware crashes after a few hours / a certain number of requests. Unfortunately I cannot provide detailed feedback, but I am glad to try with some guidance.
The first time it got stuck I did not pay much attention and simply restarted everything. The second time I unplugged and replugged the coordinator and things recovered without any issues worth mentioning. The logs were full of messages as shown below (1). Later they changed to other error messages (2).
On the positive side: I do feel like the larger routing table might have had a positive effect on my environment. I have ~120 Zigbee devices, of which probably 2/3 are routers. Especially when toggling a bunch of lights at the same time, I feel like there are fewer "hiccups".
My environment: I am using a CC1352P2 Launchpad with zigpy/ZHA/Home Assistant. The firmware in use was https://github.com/Koenkk/Z-Stack-firmware/blob/develop/coordinator/Z-Stack_3.x.0/bin/CC1352P2_CC2652P_launchpad_coordinator_20220724.zip
Error message 1:
Error message 2: