Feedback development firmware 2022/07

dumpfheimer commented 2 years ago

After seeing in the changelog that the routing table sizes have increased I wanted to test the latest DEVELOPMENT firmware.

I am having issues which I believe are caused by the firmware update.

It seems to me that the firmware crashes after a few hours or a certain number of requests. Unfortunately I cannot provide detailed feedback, but I am glad to try with some guidance.

The first time it got stuck I did not pay a lot of attention and simply restarted everything. The second time I un- and replugged the coordinator and things recovered without any issues worth mentioning. The logs were full of messages as shown below (1). Later it changed to other error messages (2).

On the positive side: I do feel like the larger routing table might have had a positive effect on my environment. I have ~120 Zigbee devices, of which probably 2/3 are routers. Especially when toggling a bunch of lights at the same time, I feel like it has fewer "hiccups".

My environment: I am using a CC1352P2 launchpad with zigpy/zha/home assistant. The firmware in use was https://github.com/Koenkk/Z-Stack-firmware/blob/develop/coordinator/Z-Stack_3.x.0/bin/CC1352P2_CC2652P_launchpad_coordinator_20220724.zip

Error message 1:

2022-07-26 01:06:59 ERROR (MainThread) [homeassistant.helpers.entity] Update for sensor.server_electricity_power fails
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/entity.py", line 514, in async_update_ha_state
    await self.async_device_update()
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/entity.py", line 709, in async_device_update
    raise exc
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/sensor.py", line 297, in async_update
    await super().async_update()
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/entity.py", line 250, in async_update
    await asyncio.gather(*tasks)
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/channels/homeautomation.py", line 100, in async_update
    result = await self.get_attributes(attrs, from_cache=False, only_cache=False)
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/channels/base.py", line 460, in _get_attributes
    read, _ = await self.cluster.read_attributes(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/zcl/__init__.py", line 441, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/quirks/__init__.py", line 233, in read_attributes_raw
    results = await super().read_attributes_raw(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/device.py", line 291, in request
    radio_result, msg = await self._application.request(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 302, in request
    return await self._send_request(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 1161, in _send_request
    response = await self._send_request_raw(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 1047, in _send_request_raw
    self._znp.request_callback_rsp(
AttributeError: 'NoneType' object has no attribute 'request_callback_rsp'

Error message 2:


2022-07-26 01:10:04 ERROR (MainThread) [zigpy_znp.zigbee.application] Failed to reconnect
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 652, in _skip_bootloader
    result = await responses.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 886, in _reconnect
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 694, in connect
    self.capabilities = (await self._skip_bootloader()).Capabilities
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 651, in _skip_bootloader
    async with async_timeout.timeout(CONNECT_PROBE_TIMEOUT):
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-07-26 01:10:19 ERROR (MainThread) [zigpy_znp.zigbee.application] Failed to reconnect
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 652, in _skip_bootloader
    result = await responses.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

dumpfheimer commented 2 years ago

I'd guess you got it right. I just misunderstood/got confused.

It very much seems like SDK 6 is the cause.

dumpfheimer commented 2 years ago

What came to my mind:

What controllers do you all have? Are there RAM size differences between the CC1352 and the CC2652?

I have the CC1352P2 Launchpad

Wireheadbe commented 2 years ago

Cc2652 here, and stable

dumpfheimer commented 2 years ago

ZHA?

Wireheadbe commented 2 years ago

Z2mqtt

dumpfheimer commented 2 years ago

It seems to affect only ZHA

Wireheadbe commented 2 years ago

Maybe the way ZHA talks to the coordinator causes issues with the increased routing table. Maybe ping the maintainer of ZHA as well? 🤔

dumpfheimer commented 2 years ago

Did that at https://github.com/zigpy/zigpy-znp/issues/165. Still, I believe the root cause is in the firmware.

Koenkk commented 2 years ago

I found one change that I forgot to revert which was the number of retries:

// Increase frame retries
#define ZMAC_MAX_FRAME_RETRIES 7
#define NWK_MAX_DATA_RETRIES 4

@dumpfheimer can you check with: fws.zip

dumpfheimer commented 2 years ago

sure, thanks!

dumpfheimer commented 2 years ago

Crashed again after apparently ~14h

dumpfheimer commented 2 years ago

I think I am able to kind of reliably crash the controller. I wrote a python script that is meant to keep the main thread busy. I am using the velux integration for this. What it does is:

Reset power to the velux gateway
In a loop toggle light groups and then send commands to the velux gateway (which should keep the main thread busy, as the connection is dead and it wants to reconnect)

It crashed the controller once instantly and once after a minute or two.
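The reproduction described above can be sketched roughly as follows. This is not the original script: `toggle_light_groups` and `send_gateway_command` are hypothetical stand-ins for the ZHA group toggles and the Velux commands, and the `time.sleep` call only simulates a call that stalls the event loop.

```python
import asyncio
import time


async def toggle_light_groups(log):
    # Hypothetical stand-in for toggling several Zigbee light groups via ZHA.
    log.append("toggle")
    await asyncio.sleep(0)


def send_gateway_command(log, stall_seconds):
    # Hypothetical stand-in for a command to the unreachable Velux gateway.
    # time.sleep() blocks the event loop on purpose to simulate a busy main thread.
    log.append("gateway")
    time.sleep(stall_seconds)


async def stress(iterations, stall_seconds=0.01):
    # Alternate Zigbee traffic with blocking calls, as in the description above.
    log = []
    for _ in range(iterations):
        await toggle_light_groups(log)
        send_gateway_command(log, stall_seconds)
    return log


if __name__ == "__main__":
    print(asyncio.run(stress(3)))
```

The point of the sketch is only the interleaving: Zigbee requests are issued while something else repeatedly holds up the thread that also services the coordinator's serial connection.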

dumpfheimer commented 2 years ago

I reenabled my "dedicated thread test" and it seems like this prevents the controller from locking up with the same mechanism. I do still get quite a few asyncio.exceptions.TimeoutError but it stays up.

https://github.com/dumpfheimer/zigpy-znp/tree/dumpfheimer/dedicated-thread
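The general idea behind a "dedicated thread" can be illustrated with a private asyncio event loop running on its own thread. This is a minimal sketch under that assumption and not the code from the linked branch; `DedicatedLoopThread` and `read_frame` are hypothetical names.

```python
import asyncio
import threading


class DedicatedLoopThread:
    """Runs a private asyncio event loop on its own daemon thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()

    def submit(self, coro):
        # Schedule a coroutine on the private loop.
        # Returns a concurrent.futures.Future that other threads can wait on.
        return asyncio.run_coroutine_threadsafe(coro, self._loop)

    def stop(self):
        # Stop the loop from outside its thread, then clean up.
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


async def read_frame():
    # Hypothetical stand-in for time-critical serial I/O with the coordinator.
    await asyncio.sleep(0.01)
    return threading.current_thread().name


if __name__ == "__main__":
    worker = DedicatedLoopThread()
    print(worker.submit(read_frame()).result(timeout=1))
    worker.stop()
```

With this layout, a stalled main thread delays only the code that consumes results, while the coroutine doing the I/O keeps running on the worker's loop.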

puddly commented 2 years ago

Perhaps we should report this bug to Texas Instruments? While I understand there is some timing problem either with ZHA's serial handling or with the Z-Stack runtime configuration, it should not be possible to completely crash the firmware like this.

Koenkk commented 2 years ago

@puddly before we can do that we first need more information about what really goes wrong. If someone can do some debugging using a CC1352/CC2652 launchpad + SimpleLink SDK, that would be great.

dumpfheimer commented 2 years ago

@TheJulianJES Do you have smart plugs with energy reading? Or do you have routers that are regularly unplugged?

TheJulianJES commented 2 years ago

I do have quite a few smart plugs with energy measurement. They also get polled at the same time (I think every 30 seconds), as they use HA polling. Only ZHA lights get polled in somewhat randomized intervals (and thus not at the same time). (Maybe the "polling flooding" helps for causing the "crash"?)

However, no routers that I have should get unplugged at all.
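The difference between synchronized and randomized polling mentioned above can be made concrete with a small sketch. `poll_offsets` is a hypothetical helper, not how Home Assistant or ZHA schedule polls; only the 30-second interval is taken from the comment.

```python
import random


def poll_offsets(device_count, interval, jitter, rng=None):
    """Return each device's first poll time (seconds) within one cycle.

    jitter=0 reproduces synchronized polling (every device at t=0);
    jitter=1 spreads the polls over the whole interval.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, interval * jitter) for _ in range(device_count)]


if __name__ == "__main__":
    print(poll_offsets(5, 30, 0))  # all plugs polled at the same instant
    print(poll_offsets(5, 30, 1))  # polls spread across the 30 s interval
```

With `jitter=0`, all requests hit the coordinator in one burst, which is the "polling flooding" scenario suspected above.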

dumpfheimer commented 2 years ago

Interesting. Do you also have Tuya TS011F?

I thought maybe it could have to do with route discovery (every 30s) when the devices are unplugged. But if yours are always plugged in that's not quite it.

TheJulianJES commented 2 years ago

I only have one Tuya TS011F plug (where, in addition to the standard EM cluster attributes that always get polled, smartenergy current_summ_delivered also gets polled).

It's a BW-SHP16 plug if I'm correct. Other than that, I have some BW-SHP13 Tuya plugs with the better/older firmware

dumpfheimer commented 2 years ago

I ordered a CC1352P7 which should have more RAM. Maybe that will help.. Would you accept a PR with the diff for the P7 or is that out of scope?

Koenkk commented 2 years ago

I ordered a CC1352P7 which should have more RAM. Maybe that will help.. Would you accept a PR with the diff for the P7 or is that out of scope?

P7 is out of scope for the moment, unless it becomes widely available (excluding dev boards)

dumpfheimer commented 2 years ago

Were any of the changes in your firmware patch memory related or should I be able to/am I supposed to use them all with the P7?

Koenkk commented 2 years ago

@dumpfheimer yes some are, e.g. nvpages and heap size but I think you should be able to take them all.

dumpfheimer commented 2 years ago

Thanks!

dumpfheimer commented 2 years ago

When I try to restore a network backup it works up until it says "Waiting for NIB to stabilize" but the NIB never updates. Do you have an idea where I could start to look?

Koenkk commented 2 years ago

@dumpfheimer that sounds like something ZHA-specific, I suggest asking the ZHA developers.

dumpfheimer commented 2 years ago

Apparently my NVOCMP_NVPAGES was too small to contain all routing tables and stuff

dumpfheimer commented 2 years ago

Hey @Koenkk could you maybe explain the reason for this addition in z-stack? I removed it and everything seems to work perfectly.

https://github.com/Koenkk/Z-Stack-firmware/blob/94ff2b02ef7ac7dac48e16e20b47e659841f96c6/coordinator/Z-Stack_3.x.0/firmware.patch#L517-L523

On a side note: I am seeing huge performance improvements with the P7 and a few parameter changes. Hope the P2 has enough RAM to handle the changes.

Koenkk commented 2 years ago

@dumpfheimer you can find the reason of this change here:

I'm very interested in what params you changed

dumpfheimer commented 2 years ago

I made quite a few changes, also some in zigpy. So without further testing it will be difficult for me to isolate the best performing changes.

I do have a personal "prime suspect", though:

I set NWK_MAX_DEVICE_LIST, ZDSECMGR_TC_DEVICE_MAX and NWK_MAX_DEVICE_LIST to 130, which is a bit over the number of devices I have. Also, I set the aging of source routes to 150s (maybe even disabled it in the meantime) and reduced the concentrator interval to 60s. My assumption was that the concentrator routing information will be used before the source route, so this would be a fallback anyway. But my next test is to reduce source routing to a small number and expire it very soon, because my next idea was that source routing might only be interesting for ACKs?

Anyway, the latency of switching my living room with ~20 lamps has gone down significantly. Also, the lights stay responsive. Before, they started getting slow if they were not used for some time.
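For readers who want to try similar values, the settings listed above can be written down as data and rendered as preprocessor lines. This is a sketch, not the actual patch: `SRC_RTG_EXPIRY_TIME` is the name used elsewhere in this thread for source route ageing, while `CONCENTRATOR_DISCOVERY_TIME` is an assumed macro name for the concentrator interval and should be checked against the SDK. The values are the experiment's values, not recommendations.

```python
# Values reported in the comment above. Names marked "assumed" do not appear in the thread.
OVERRIDES = {
    "NWK_MAX_DEVICE_LIST": 130,
    "ZDSECMGR_TC_DEVICE_MAX": 130,
    "SRC_RTG_EXPIRY_TIME": 150,         # source route ageing in seconds
    "CONCENTRATOR_DISCOVERY_TIME": 60,  # assumed name for the concentrator interval
}


def render_defines(overrides):
    """Render overrides as C preprocessor lines for a preinclude-style header."""
    return "\n".join(f"#define {name} {value}" for name, value in overrides.items())


if __name__ == "__main__":
    print(render_defines(OVERRIDES))
```

Keeping the overrides in one table makes it easier to revert them one at a time when trying to isolate which change actually matters.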

Koenkk commented 2 years ago

I was advised not to increase the NWK_MAX_DEVICE_LIST: https://e2e.ti.com/support/wireless-connectivity/zigbee-thread-group/zigbee-and-thread/f/zigbee-thread-forum/1119300/simplelink-cc13x2-26x2-sdk-znwktablefull

Route ageing may indeed be increased (SRC_RTG_EXPIRY_TIME), I suggest also checking with 255. But from what I understood even with a low expiry time the routes stay until more room is needed.

It would be interesting to see what happens if you revert your zigpy changes, do the delays come back?

dumpfheimer commented 2 years ago

Not sure about the internals but I would argue that it is better to have more traffic on the network than having to look up the route of a lamp when it is supposed to toggle.

One thought that came up when reading https://www.ti.com/lit/an/swra650b/swra650b.pdf was that they were optimizing their network for sensors, i.e. devices sending messages to the coordinator. They did not optimize the latency from coordinator to a router. In that case, for example, the source routing table could be a couple dozen in size. If they had 400 lamps which they wanted to reach with low latency, they must have kept a route to every device in memory, even if it causes more "idle" traffic.

Hope I am not completely off

I assumed you were referring to this comment: "MAX_RTG[_SRC]_ENTRIES and ZDSECMGR_TC_DEVICE_MAX are important as well, although these appear to already be accounted for. NWK_MAX_DEVICE_LIST and MAX_NEIGHBOUR_ENTRIES should not be too large as they could cause instability on the ZNP/ZC given the excessive amount of traffic caused by directly neighboring/associated devices. If possible, you could debug the ZNP to determine which Z-Stack function returns the error.

Regards,"

Koenkk commented 2 years ago

I was referring to that indeed, your reasoning makes sense. I'm still interested in what changes make the real difference here.

dumpfheimer commented 2 years ago

I'll create a repo soon, sorry for the delay.

I think one key change was setting MAC_CFG_RX_MAX to double the transmit buffer size (I now have 32/64). This allows the answers to be buffered plus additional incoming traffic. My settings are probably excessive, though. My idea is that if the RX buffer is smaller than the TX buffer, it might easily overflow with ACKs or similar, causing retransmissions and delays.
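The buffer reasoning above can be illustrated with a deliberately simple toy model. It assumes that every in-flight transmission produces its reply before the host drains anything, which is a worst case chosen for illustration and not a description of how the Z-Stack MAC actually behaves; `dropped_frames` is a hypothetical helper.

```python
def dropped_frames(tx_in_flight, rx_buffers, replies_per_tx=1, unsolicited=0):
    """Toy model: inbound frames that do not fit into the RX buffers.

    Assumes all replies and unsolicited frames arrive before any is drained.
    """
    inbound = tx_in_flight * replies_per_tx + unsolicited
    return max(0, inbound - rx_buffers)


if __name__ == "__main__":
    # Equal TX/RX sizes leave no room for traffic beyond the replies.
    print(dropped_frames(32, 32, unsolicited=10))
    # The 32/64 setting from the comment absorbs the same burst.
    print(dropped_frames(32, 64, unsolicited=10))
```

In this model any dropped frame would have to be retransmitted, which is where the extra delay in the argument above comes from.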

dumpfheimer commented 2 years ago

Created the repo now. I removed the ti_* files and the led command because they seem device specific.

But the params are where you put them: https://github.com/dumpfheimer/ZNP-Firmware/blob/master/source/znp_LP_CC1352P7_4_tirtos_ticlang/Stack/Config/preinclude.h

Koenkk commented 2 years ago

Thanks!

ZNP_UART_BAUD change; I guess you also changed the baudrate on the z2m side?
ZMAC_MAX_FRAME_RETRIES; I expect that MAC frame retries are much faster compared to z2m retries. Earlier tests by sonoff produced more reliability by increasing this
NWK_MAX_BINDING_ENTRIES since we don't bind the coordinator to anything I don't see a reason to increase it
MAX_RTG_SRC_ENTRIES it was recommended by TI to always have this larger than the MAX_RTG_ENTRIES
DIS_GPRAM I cannot find this macro mentioned in the source code; where did you get this from?

Hereby the recommendations + rationale I received from TI in April 2019:

dumpfheimer commented 2 years ago

Sorry, I had written an answer but it must have gotten lost.

ZNP_UART_BAUD change; I guess you also changed the baudrate on the z2m side?

yes I updated zigpy too to match the baud rate. Have had no issues with ~1M baud rate

ZMAC_MAX_FRAME_RETRIES; I expect that MAC frame retries are much faster compared to z2m retries. Earlier tests by sonoff produced more reliability by increasing this

true! I expected the network to be very reliable and the route to fail not because of a transient error but because of one that would likely persist for the foreseeable future. Which would make the retransmission very likely to fail. Nevertheless I set it back to 3 because the network throughput seems to not be the issue.

NWK_MAX_BINDING_ENTRIES since we don't bind the coordinator to anything I don't see a reason to increase it

Thanks! was not sure what this is. Set back to 1

MAX_RTG_SRC_ENTRIES it was recommended by TI to always have this larger than the MAX_RTG_ENTRIES

IMHO it is fine to have this set to the maximum number of devices you want to support in the network. Unless there is a memory issue more should be better?

DIS_GPRAM I cannot find this macro mentioned in the source code; where did you get this from?

It seems like the CC1352 device has a part of memory dedicated to caching something which reduces memory but speeds up computation. I was playing around with this but with the P7 memory is not an issue anymore. 0 is the default mode which needs some memory but speeds up computation. On the P2 this might be more interesting.

All in all I would say the buffers are most likely the reason for my network stability boost. Especially having the RX buffer set to about triple the TX buffer makes responses much more reliable and reduces both the "time in flight" from a zigpy perspective and the load on the network caused by retransmissions. I have set them to TX 128 and RX 128*3 right now.

Another thing that speeds things up is using DataRequest asynchronously. You can check out the discussion I started over at zigpy-znp
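One reading of "using DataRequest asynchronously" is that several requests are in flight at once instead of each one waiting for the previous confirmation. The sketch below only illustrates that general effect with asyncio; `data_request` is a hypothetical stand-in with a simulated round trip, not the zigpy-znp API, and the details of the actual proposal are in the zigpy-znp discussion.

```python
import asyncio
import time


async def data_request(device, latency=0.05):
    # Hypothetical stand-in for one request/confirm round trip to a device.
    await asyncio.sleep(latency)
    return device


async def send_sequential(devices):
    # Wait for each confirmation before sending the next request.
    return [await data_request(d) for d in devices]


async def send_concurrent(devices):
    # Put all requests in flight, then collect the confirmations in order.
    return list(await asyncio.gather(*(data_request(d) for d in devices)))


if __name__ == "__main__":
    lamps = [f"lamp{i}" for i in range(5)]
    for sender in (send_sequential, send_concurrent):
        start = time.perf_counter()
        asyncio.run(sender(lamps))
        print(sender.__name__, round(time.perf_counter() - start, 2))
```

Whether the coordinator can actually keep that many requests in flight depends on its buffer sizes, which ties back to the TX/RX settings discussed above.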

dumpfheimer commented 2 years ago

BTW I set the repository to private because I woke up some night and was afraid of legal stuff... Will take a look at that soon

Koenkk commented 2 years ago

BTW I set the repository to private because I woke up some night and was afraid of legal stuff... Will take a look at that soon

Could you invite me to the repo (just read-only access)?

dumpfheimer commented 2 years ago

Done

Koenkk commented 2 years ago

@dumpfheimer thanks!

artist67 commented 2 years ago

I have an issue with this coordinator FW as well. I get a lot of "NWK_TABLE_FULL" errors, with the effect that some actions do not reach the target device. Example: I want to switch on a light with a button, but I need to press the button twice before the light turns on. Afterwards, the light is responsive. If I do not trigger the light for some time, I see the effect again. It looks like the route to this device is swapped out in favour of another route entry and restored only by the failing trigger.

I currently have 131 devices and a Sonoff Plus Stick P with TI chip.

I'm ready to use test FW builds to help identify the problem (based on CC1352P2_CC2652P_launchpad_coordinator).

My problem could be linked to the one reported by dumpfheimer.
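The suspected behaviour (a route being swapped out and only restored by a failing first attempt) matches what a fixed-size cache with eviction would do. The sketch below assumes least-recently-used eviction purely for illustration; the thread does not state which replacement policy Z-Stack uses, and `RouteTable` is a hypothetical class.

```python
from collections import OrderedDict


class RouteTable:
    """Fixed-size route cache that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._routes = OrderedDict()

    def lookup(self, dest):
        """Return True on a hit; on a miss, 'discover' the route and return False."""
        if dest in self._routes:
            self._routes.move_to_end(dest)  # mark as recently used
            return True
        if len(self._routes) >= self.capacity:
            self._routes.popitem(last=False)  # evict the least recently used route
        self._routes[dest] = "next-hop"  # placeholder for a discovered route
        return False


if __name__ == "__main__":
    table = RouteTable(capacity=2)
    table.lookup("light")          # first use: miss, route gets discovered
    table.lookup("sensor_a")
    table.lookup("sensor_b")       # table full: "light" is evicted
    print(table.lookup("light"))   # miss again, like the first button press
    print(table.lookup("light"))   # hit, like the second button press
```

With more devices than table entries, an idle device's route is the first candidate for eviction, which would explain why the effect returns after the light has not been used for a while.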

Koenkk commented 2 years ago

Just released it: https://github.com/Koenkk/Z-Stack-firmware/tree/develop/coordinator/Z-Stack_3.x.0/bin

TheJulianJES commented 2 years ago

Tried the new dev build on my network. With ZHA, it still crashes. (After about 1 hour this time) Reverted back to CC1352P2_CC2652P_launchpad_coordinator_20220507.hex (6.10 SDK), as that seems to be stable (for me at least).

artist67 commented 2 years ago

My stick has been stable since I upgraded (immediately after the new dev build became available), but it was already stable before. I have had no more NWK_TABLE_FULL errors so far.

io53 commented 2 years ago

The new dev release works great here too. But I've noticed with the latest 2 dev releases that I can only update 1 light at a time, and even then it will sometimes fail. There are quite a lot of timeouts during updates, and the network will be unresponsive for a bit. Previously I remember updating 3 or more at a time without issues. It could of course be something else in my deployment causing this, but I just thought I'd mention it in case someone else has had similar experiences.

agsola commented 2 years ago

When flashing to ZZH I'm getting this error:

Opening port /dev/tty.usbserial-10, baud 500000
Reading data from CC2652R_coordinator_20220928.zip
Cannot auto-detect firmware filetype: Assuming .bin
Connecting to target...
CC1350 PG2.0 (7x7mm): 352KB Flash, 20KB SRAM, CCFG.BL_CONFIG at 0x00057FD8
Primary IEEE Address: XX:XX:XX:XX:XX:XX (there was my MAC)
    Performing mass erase
Erasing all main bank flash sectors
    Erase done
Writing 176954 bytes starting at address 0x00000000
ERROR: Invalid data size: 176954. Size must be a multiple of 4.

puddly commented 2 years ago

Reading data from CC2652R_coordinator_20220928.zip

You're flashing the .zip file itself. Unzip it and try the enclosed .hex file.
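The unzip step can also be scripted. The following is a minimal sketch using only the Python standard library; `extract_hex` is a hypothetical helper, and it assumes the archive contains at least one file ending in `.hex`.

```python
import zipfile
from pathlib import Path


def extract_hex(zip_path, dest_dir):
    """Extract the first .hex member of a firmware archive and return its path."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if name.lower().endswith(".hex"):
                return Path(archive.extract(name, dest_dir))
    raise ValueError(f"no .hex file found in {zip_path}")


if __name__ == "__main__":
    # Archive name taken from the log above; adjust the path as needed.
    print(extract_hex("CC2652R_coordinator_20220928.zip", "."))
```

The returned path is the file to pass to the flashing tool instead of the archive itself.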

agsola commented 2 years ago

Reading data from CC2652R_coordinator_20220928.zip

You're flashing the .zip file itself. Unzip it and try the enclosed .hex file.

My fault. So basic :S Thank you!

cpuks commented 2 years ago

Updated today along with latest stable z2m - ZigStar LAN gateway so CC2652 - 136 devices - stable so far.

Koenkk / Z-Stack-firmware

Feedback development firmware 2022/07 #383