Koenkk / Z-Stack-firmware

Compilation instructions and hex files for Z-Stack firmwares
MIT License
2.41k stars 651 forks source link

Feedback development firmware 2022/07 #383

Closed dumpfheimer closed 1 year ago

dumpfheimer commented 2 years ago

After seeing in the changelog that the routing table sizes have increased I wanted to test the latest DEVELOPMENT firmware.

I am having issues which I believe are caused by the firmware update.

It seems to me that the firmware crashes after a few hours / an amout of requests. Unfortunately I cannot provide detailed feedback, but am glad to try with some guidance.

The first time it got stuck I did not pay a lot of attention and simply restarted everything. The second time I un- and replugged the coordinator and things recovered without any issues worth mentioning. The logs were full of messages as shown below (1). Later it changed to other error messages (2).

On the positive side: I do feel like the larger routing table might have had a positive effect on my environment. I have ~120 zigbee devices of which probably 2/3 are routers. Especially when toggling a bunch of lights at the same time I feel like it has less "hickups"

My environment: I am using a CC1352P2 launchpad with zigpy/zha/home assistant. The firware in use was https://github.com/Koenkk/Z-Stack-firmware/blob/develop/coordinator/Z-Stack_3.x.0/bin/CC1352P2_CC2652P_launchpad_coordinator_20220724.zip

Error message 1:

2022-07-26 01:06:59 ERROR (MainThread) [homeassistant.helpers.entity] Update for sensor.server_electricity_power fails
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/entity.py", line 514, in async_update_ha_state
    await self.async_device_update()
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/entity.py", line 709, in async_device_update
    raise exc
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/sensor.py", line 297, in async_update
    await super().async_update()
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/entity.py", line 250, in async_update
    await asyncio.gather(*tasks)
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/channels/homeautomation.py", line 100, in async_update
    result = await self.get_attributes(attrs, from_cache=False, only_cache=False)
  File "/srv/homeassistant/lib/python3.10/site-packages/homeassistant/components/zha/core/channels/base.py", line 460, in _get_attributes
    read, _ = await self.cluster.read_attributes(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/zcl/__init__.py", line 441, in read_attributes
    result = await self.read_attributes_raw(to_read, manufacturer=manufacturer)
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/quirks/__init__.py", line 233, in read_attributes_raw
    results = await super().read_attributes_raw(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy/device.py", line 291, in request
    radio_result, msg = await self._application.request(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 302, in request
    return await self._send_request(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 1161, in _send_request
    response = await self._send_request_raw(
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 1047, in _send_request_raw
    self._znp.request_callback_rsp(
AttributeError: 'NoneType' object has no attribute 'request_callback_rsp'

Error message 2:


2022-07-26 01:10:04 ERROR (MainThread) [zigpy_znp.zigbee.application] Failed to reconnect
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 652, in _skip_bootloader
    result = await responses.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 886, in _reconnect
    await self.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 111, in connect
    await znp.connect()
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 694, in connect
    self.capabilities = (await self._skip_bootloader()).Capabilities
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 651, in _skip_bootloader
    async with async_timeout.timeout(CONNECT_PROBE_TIMEOUT):
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/srv/homeassistant/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-07-26 01:10:19 ERROR (MainThread) [zigpy_znp.zigbee.application] Failed to reconnect
Traceback (most recent call last):
  File "/srv/homeassistant/lib/python3.10/site-packages/zigpy_znp/api.py", line 652, in _skip_bootloader
    result = await responses.get()
  File "/usr/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError
TheJulianJES commented 2 years ago

Also running 20220724 with ZHA here. Had a lockup after a few hours of uptime and needed to replug the stick. After the replug, it's been fine for ~18 hours now. I can try 20220726/https://github.com/Koenkk/Z-Stack-firmware/commit/704d0e7372756932bc766afe336d8b57a8dd82d8 later but it doesn't look like it would solve the issue(?)

Edit: Had another lockup after ~22 hours

io53 commented 2 years ago

I've been running it for ~12h without issues on both my networks, running z2m with Sonoff cc2652p. Will report if things change.

TheJulianJES commented 2 years ago

Update: 20220726 crashed after 3 hours for me

Koenkk commented 2 years ago

Currently running the 20220726 but this runs fine for me with z2m. Lets monitor it further.

dumpfheimer commented 2 years ago

I am able to reproduce this when I call the light.toggle service with a few seconds transition (12) with 2 areas selected that total in 12 light bulbs. Wild guess: Maybe the flood of incoming messages causes memory issues?

dumpfheimer commented 2 years ago

Correction: This happens as described above ONLY WHEN I have the Home Assistant Companion App open...? I then get a bunch of these warnings and my network freezes: 2022-07-27 20:00:19 WARNING (MainThread) [asyncio] Executing <Handle _async_track_state_change_event.<locals>._async_state_change_dispatcher(<Event state_...920903+02:00>>) at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/event.py:272 created at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/core.py:456> took 0.416 seconds

@dmulcahey I believe once said that one of the ZHA issues is that it is limited to the HA event loop. Could it be, that..?:

  1. A bunch of messages are received from lamps reporting their transition (reporting their brightness repetedly)
  2. The messages pile up on the controller/firmware side, because zigpy is not able to read them, because the event loop from HA is blocked
MattL0 commented 2 years ago

Running it since 4-5 days without any issue.

I have a 82 node network wich 30 are routers.

TheJulianJES commented 2 years ago

@MattL0 You're running Zigbee2MQTT, right? (and not Home Assistant's ZHA integration)

MattL0 commented 2 years ago

Yes I am on zigbee2mqtt

dumpfheimer commented 2 years ago

It seems like this issue only causes the controller to lock up with ZHA.

Is there a way to test if the controller locks up when there is some kind of backing up of messages?

digiblur commented 2 years ago

No issues on Z2M here since this firmware dropped on a USB CC2652P2 coordinator with a mixture of about 45 devices on mains and battery.

Koenkk commented 2 years ago

Also still running fine in my case. @dumpfheimer I suggest to open an issue in the ZHA issue tracker

puddly commented 2 years ago

I believe this may be caused by zigpy-znp registering callbacks for all known ZDO commands at startup, which increases its runtime RAM usage beyond Z2M's: https://github.com/zigpy/zigpy-znp/blob/695ebf50a86117daeec92ac08c256fc1bfd11e60/zigpy_znp/zigbee/application.py#L212-L218

dumpfheimer commented 2 years ago

I will. But there still seems to be some kind of regression with the firmware. IMHO it should not freeze like this and I have never had this issue before updating the firmware.

puddly commented 2 years ago

@dumpfheimer if you have a second, try applying this patch to see if it still crashes. Only five of the thirty defined ZDO commands will be registered:

diff --git a/zigpy_znp/zigbee/application.py b/zigpy_znp/zigbee/application.py
index 354edb7..382120e 100644
--- a/zigpy_znp/zigbee/application.py
+++ b/zigpy_znp/zigbee/application.py
@@ -214,6 +214,14 @@ class ControllerApplication(zigpy.application.ControllerApplication):
             # Ignore outgoing ZDO requests, only receive announcements and responses
             if cluster_id.name.endswith(("_req", "_set")):
                 continue
+            elif cluster_id not in (
+                zdo_t.ZDOCmd.Node_Desc_rsp,
+                zdo_t.ZDOCmd.Simple_Desc_rsp,
+                zdo_t.ZDOCmd.Active_EP_rsp,
+                zdo_t.ZDOCmd.Mgmt_Lqi_rsp,
+                zdo_t.ZDOCmd.Mgmt_Permit_Joining_rsp,
+            ):
+                continue

             await self._znp.request(c.ZDO.MsgCallbackRegister.Req(ClusterId=cluster_id))

It won't be possible to bind/unbind things or properly join new devices since those all use additional ZDO commands but this should be enough to "run" ZHA once it's set up.

dumpfheimer commented 2 years ago

I'll try as soon as possible, thanks!

dumpfheimer commented 2 years ago

So, a first quick test did not crash the coordinator.

This to me seems promising but not conclusive. I believe in Hass 2022.8 the state updates while transitioning are ignored which could prevent the loop to be busy. Ill keep you posted.

TheJulianJES commented 2 years ago

@dumpfheimer The attribute reports during light transitions are ignored on the HA side (but still sent by the light, so they still travel through the network). (It can be disabled in the ZHA settings too.)

Also, before/after you’ve applied the patch, you could also try to do a topology scan. (puddly mentioned that could cause higher memory usage and possibly the crash.)

dumpfheimer commented 2 years ago

The amount of messages might only be part of the issue. If memory on the controller is the issue it might be fine as long as zigpy is reading them in the same speed they are received. But my best guess was that something like this is happening:

  1. A bunch of attribute changes (like brightness during transitioning) are received.
  2. Zigpy reads an Attribute change
  3. Tbe state change runs through state change callbacks (eg update via websocket) that block the asyncio event loop preventing zigpy from reading the serial port
  4. The messages on the controller back up and cause memory errors
  5. The controller fails to handle the memory issue correctly and freezes

If the brightness during transition does not cause Hass to update the state the websocket does not block the loop and the messages are read in time for the controller to not block up.

There are a few wild guesses here but the fact that it only happens when my Hass companion is open causing warning messages that the events took long and that it seems to be some kind of memory issue makes me believe this just might be the case.

I am however a complete noob with python/asyncio @puddly do you have deeper insight into the event loop? Could a "rogue" task in the event loop cause zigpy to hang?

dumpfheimer commented 2 years ago

I tried it with a topology scan. I did get a lot of similar error messages: 2022-08-04 21:20:06.295 WARNING (MainThread) [asyncio] Executing <Handle _async_track_state_change_event.<locals>._async_state_change_dispatcher(<Event state_...058055+02:00>>) at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/event.py:272 created at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/core.py:456> took 0.116 seconds

But it seems like the interruptions went from .4 down to .1 seconds. However it did not crash the controller. ( With puddly's test fix)

TheJulianJES commented 2 years ago

Mhm right, puddly's test-fix also breaks topology scans then. Might be worth to see if a topology scan without the fix triggers the crash.

dumpfheimer commented 2 years ago

Right now I cannot make it crash deliberately by transitions or topology scans. (Without test fix)

Wireheadbe commented 2 years ago

Running 20220726 here on an LAUNCHXL-CC26X2R1 and it's been very stable (81 Zigbee devices, groups,..) Topology scans don't crash,...

edit: still rocksolid. In the past, some of my LED2005r5 would stop reporting/fail to respond to direct commands (group kept working!), so far, this hasn't occured on this firmware

edit2: after a week, still solid, no timeouts from reporting. Everything seems to behave better.

alexruffell commented 2 years ago

Trying to figure out why I keep having so many zigbee issues I updated my coordinator to this release of the firmware. Something new now started... my devices seem to stop updating. I have to reload ZHA and they work for a bit then issues happen all over again. In the logs I found this:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1023, in request_callback_rsp
    await self.request(request, **response_params)
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 984, in request
    response = await response_future
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1022, in request_callback_rsp
    async with async_timeout.timeout(timeout):
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-07 21:15:20.750 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1023, in request_callback_rsp
    await self.request(request, **response_params)
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 960, in request
    async with self._sync_request_lock:
  File "/usr/local/lib/python3.10/asyncio/locks.py", line 14, in __aenter__
    await self.acquire()
  File "/usr/local/lib/python3.10/asyncio/locks.py", line 114, in acquire
    await fut
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1022, in request_callback_rsp
    async with async_timeout.timeout(timeout):
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-07 21:15:20.912 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1025, in request_callback_rsp
    return await callback_rsp
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1022, in request_callback_rsp
    async with async_timeout.timeout(timeout):
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError

I don't know whether this is a ZHA issue, or coordinator issue. I see it is zigpy erroring out but I assume an issue with the coordinator could lead to that too. Before I go there and get blamed for the issue because I am using a development firmware, I'd like to rule this out. Thanks!

dumpfheimer commented 2 years ago

I am completely uneducatedly convinced that this is primarily a firmware issue (bad memory management?) and probably needs to be fixed by Texas Instruments. But in a similar fashion I believe there could be things done in zigpy/ZHA that could prevent this bug from happening. Lastly, I also think that the zigpy/ZHA developers are aware of the "issue".

I have been playing around with python event loops/threads which might be the reason this is happening with ZHA but not with z2m. I am too inexperienced with python and zigpy to do anything productive myself but I might start a discussion over at zigpy to get some guidance if and how it could be possible to decouple zigpy from the Home Assistant event loop which in the worsed case should decrease delays and in the best case might fix this bug from being triggered.

salopette commented 2 years ago

I also have error messages with the Dev FW

Coordinator Typ zStack3x0 Coordinator Version 20220726 LAUNCHXL-CC1352P-2

2022-08-15 17:42:18Publish 'set' 'state' to '0x001788010b4b1466' failed: 'Error: Command 0x001788010b4b1466/11 genLevelCtrl.moveToLevelWithOnOff({"level":0,"transtime":10}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> ZDO - extRouteDisc - {"dstAddr":28838,"options":0,"radius":30}' failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))'

Shred99 commented 2 years ago

I'll add my experience to the thread:

I upgraded three days ago and haven't had any issues at all.

alexruffell commented 2 years ago

After the issues in my post above 9 days ago, I wiped the stick and reformed a network on ch25. It has been working great since then. My coordinator is now also linking directly to 40ish devices instead of being limited to 20ish as before this latest network reforming.

Using this CC1352P2_CC2652P_launchpad_coordinator_20220726 on my Sonoff ZBDongle-P

dumpfheimer commented 2 years ago

I finally got zigpy_znp working in a dedicated thread. Will let you know if it has any impact.

Koenkk commented 2 years ago

Can the people using ZHA check if the issue is fixed with the following fw? fws.zip

Changes in this fw compared to 20220726:

HEAPMGR_SIZE   0xc000 -> 0x6000
SRC_RTG_EXPIRY_TIME 2 -> 255
dumpfheimer commented 2 years ago

Thanks for your time.

It's flashed & I will let you know

TheJulianJES commented 2 years ago

Still crashed after about an hour for me

dumpfheimer commented 2 years ago

Unfortunately mine crashed too over night

Koenkk commented 2 years ago

Can you try with this fw: fws.zip ?

This fw is exactly the same as 20220219 except that stack has been updated from 5.40.00.40 to 6.20.00.29

TheJulianJES commented 2 years ago

Will try later. Dev firmware 20220507 (based on SDK 6.10.00.29) works completely fine btw.

dumpfheimer commented 2 years ago

20220507 seems to be stable for me too. I flashed the latest fws.zip firmware and it has been in use for a few hours now.

@TheJulianJES do you have the same error messages as I do? and if so, do you happen to use markdown cards in HA frontend? I found out that my warnings were caused by 2 templates I made. Both iterate over the ZHA devices. One was listing me all unavailable devices, the other listed all lights that were on. I removed both templates and the error messages stopped.

However I did not try if this has any effect on controller firmware getting stuck

Example error message: 2022-08-18 12:07:54.588 WARNING (MainThread) [asyncio] Executing <Handle _async_track_state_change_event.<locals>._async_state_change_dispatcher(<Event state_...453616+02:00>>) at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/event.py:272 created at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/core.py:456> took 0.133 seconds

dumpfheimer commented 2 years ago

Firmware from 2 days ago (in the zip) seems to be stable

dumpfheimer commented 2 years ago

Well my controller did not like me saying that. It just froze

dumpfheimer commented 2 years ago

Died again over night

Koenkk commented 2 years ago

Then I think something it is something in the SDK itself or some specific things in ZHA which cause it to break.

dumpfheimer commented 2 years ago

So, I am now flashing the latest stable firmware from feb, because I had two freezes with the dev firmware from may

jerrm commented 2 years ago

Just saw this and was about to test.

Am I reading the thread correctly - the problems seem to be ZHA only (or at least not z2m)?

Has anyone tried a build with the prior SDK, but with all other table and retry build option changes?

dumpfheimer commented 2 years ago

Yes that is how it seems. I think no such firmware has been built and published yet. I was thinking about setting up a build environment to test that but have not had the time for it yet.

Wireheadbe commented 2 years ago

Just saw this and was about to test.

Am I reading the thread correctly - the problems seem to be ZHA only (or at least not z2m)?

Has anyone tried a build with the prior SDK, but with all other table and retry build option changes?

Seems to be the case, for me it's been rockstable, and a lot better than previous firmwares. I used to have nodes that stopped reporting, not anymore

alexruffell commented 2 years ago

Seems to be the case, for me it's been rockstable, and a lot better than previous firmwares. I used to have nodes that stopped reporting, not anymore

I am on the 072022 release and it doesn't seem to be crashing but I often have to reload ZHA or restart HA as sensors randomly stop reporting one or all their values.

Koenkk commented 2 years ago

@jerrm

Has anyone tried a build with the prior SDK, but with all other table and retry build option changes?

Yes, see https://github.com/Koenkk/Z-Stack-firmware/issues/383#issuecomment-1218341311

dumpfheimer commented 2 years ago

Wait, so it's not just SimpleLink SDK >= 6 causing the lock ups? It's either SimpleLink 6 or the routing table increase?

jerrm commented 2 years ago

Looks like SDK if you are failing with the fws.zip @Koenkk linked in this post: https://github.com/Koenkk/Z-Stack-firmware/issues/383#issuecomment-1218341311

Looks like that should be older February build options with new SDK.

Opposite of the way I asked my question, but theoretically yields the same result.

dumpfheimer commented 2 years ago

So, there is no firmware with SDK 5 but larger routing table at the moment, correct?

Koenkk commented 2 years ago

I will do a double check on https://github.com/Koenkk/Z-Stack-firmware/issues/383#issuecomment-1218341311 (check if the only change is the SDK version)