Closed dumpfheimer closed 1 year ago
Also running 20220724
with ZHA here. Had a lockup after a few hours of uptime and needed to replug the stick. After the replug, it's been fine for ~18 hours now.
I can try 20220726
/https://github.com/Koenkk/Z-Stack-firmware/commit/704d0e7372756932bc766afe336d8b57a8dd82d8 later but it doesn't look like it would solve the issue(?)
Edit: Had another lockup after ~22 hours
I've been running it for ~12h without issues on both my networks, running z2m with Sonoff cc2652p. Will report if things change.
Update: 20220726
crashed after 3 hours for me
Currently running the 20220726 but this runs fine for me with z2m. Lets monitor it further.
I am able to reproduce this when I call the light.toggle service with a few seconds transition (12) with 2 areas selected that total in 12 light bulbs. Wild guess: Maybe the flood of incoming messages causes memory issues?
Correction: This happens as described above ONLY WHEN I have the Home Assistant Companion App open...? I then get a bunch of these warnings and my network freezes:
2022-07-27 20:00:19 WARNING (MainThread) [asyncio] Executing <Handle _async_track_state_change_event.<locals>._async_state_change_dispatcher(<Event state_...920903+02:00>>) at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/event.py:272 created at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/core.py:456> took 0.416 seconds
@dmulcahey I believe once said that one of the ZHA issues is that it is limited to the HA event loop. Could it be, that..?:
Running it since 4-5 days without any issue.
I have a 82 node network wich 30 are routers.
@MattL0 You're running Zigbee2MQTT, right? (and not Home Assistant's ZHA integration)
Yes I am on zigbee2mqtt
It seems like this issue only causes the controller to lock up with ZHA.
Is there a way to test if the controller locks up when there is some kind of backing up of messages?
No issues on Z2M here since this firmware dropped on a USB CC2652P2 coordinator with a mixture of about 45 devices on mains and battery.
Also still running fine in my case. @dumpfheimer I suggest to open an issue in the ZHA issue tracker
I believe this may be caused by zigpy-znp registering callbacks for all known ZDO commands at startup, which increases its runtime RAM usage beyond Z2M's: https://github.com/zigpy/zigpy-znp/blob/695ebf50a86117daeec92ac08c256fc1bfd11e60/zigpy_znp/zigbee/application.py#L212-L218
I will. But there still seems to be some kind of regression with the firmware. IMHO it should not freeze like this and I have never had this issue before updating the firmware.
@dumpfheimer if you have a second, try applying this patch to see if it still crashes. Only five of the thirty defined ZDO commands will be registered:
diff --git a/zigpy_znp/zigbee/application.py b/zigpy_znp/zigbee/application.py
index 354edb7..382120e 100644
--- a/zigpy_znp/zigbee/application.py
+++ b/zigpy_znp/zigbee/application.py
@@ -214,6 +214,14 @@ class ControllerApplication(zigpy.application.ControllerApplication):
# Ignore outgoing ZDO requests, only receive announcements and responses
if cluster_id.name.endswith(("_req", "_set")):
continue
+ elif cluster_id not in (
+ zdo_t.ZDOCmd.Node_Desc_rsp,
+ zdo_t.ZDOCmd.Simple_Desc_rsp,
+ zdo_t.ZDOCmd.Active_EP_rsp,
+ zdo_t.ZDOCmd.Mgmt_Lqi_rsp,
+ zdo_t.ZDOCmd.Mgmt_Permit_Joining_rsp,
+ ):
+ continue
await self._znp.request(c.ZDO.MsgCallbackRegister.Req(ClusterId=cluster_id))
It won't be possible to bind/unbind things or properly join new devices since those all use additional ZDO commands but this should be enough to "run" ZHA once it's set up.
I'll try as soon as possible, thanks!
So, a first quick test did not crash the coordinator.
This to me seems promising but not conclusive. I believe in Hass 2022.8 the state updates while transitioning are ignored which could prevent the loop to be busy. Ill keep you posted.
@dumpfheimer The attribute reports during light transitions are ignored on the HA side (but still sent by the light, so they still travel through the network). (It can be disabled in the ZHA settings too.)
Also, before/after you’ve applied the patch, you could also try to do a topology scan. (puddly mentioned that could cause higher memory usage and possibly the crash.)
The amount of messages might only be part of the issue. If memory on the controller is the issue it might be fine as long as zigpy is reading them in the same speed they are received. But my best guess was that something like this is happening:
If the brightness during transition does not cause Hass to update the state the websocket does not block the loop and the messages are read in time for the controller to not block up.
There are a few wild guesses here but the fact that it only happens when my Hass companion is open causing warning messages that the events took long and that it seems to be some kind of memory issue makes me believe this just might be the case.
I am however a complete noob with python/asyncio @puddly do you have deeper insight into the event loop? Could a "rogue" task in the event loop cause zigpy to hang?
I tried it with a topology scan.
I did get a lot of similar error messages: 2022-08-04 21:20:06.295 WARNING (MainThread) [asyncio] Executing <Handle _async_track_state_change_event.<locals>._async_state_change_dispatcher(<Event state_...058055+02:00>>) at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/event.py:272 created at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/core.py:456> took 0.116 seconds
But it seems like the interruptions went from .4 down to .1 seconds. However it did not crash the controller. ( With puddly's test fix)
Mhm right, puddly's test-fix also breaks topology scans then. Might be worth to see if a topology scan without the fix triggers the crash.
Right now I cannot make it crash deliberately by transitions or topology scans. (Without test fix)
Running 20220726 here on an LAUNCHXL-CC26X2R1 and it's been very stable (81 Zigbee devices, groups,..) Topology scans don't crash,...
edit: still rocksolid. In the past, some of my LED2005r5 would stop reporting/fail to respond to direct commands (group kept working!), so far, this hasn't occured on this firmware
edit2: after a week, still solid, no timeouts from reporting. Everything seems to behave better.
Trying to figure out why I keep having so many zigbee issues I updated my coordinator to this release of the firmware. Something new now started... my devices seem to stop updating. I have to reload ZHA and they work for a bit then issues happen all over again. In the logs I found this:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1023, in request_callback_rsp
await self.request(request, **response_params)
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 984, in request
response = await response_future
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1022, in request_callback_rsp
async with async_timeout.timeout(timeout):
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-07 21:15:20.750 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1023, in request_callback_rsp
await self.request(request, **response_params)
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 960, in request
async with self._sync_request_lock:
File "/usr/local/lib/python3.10/asyncio/locks.py", line 14, in __aenter__
await self.acquire()
File "/usr/local/lib/python3.10/asyncio/locks.py", line 114, in acquire
await fut
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1022, in request_callback_rsp
async with async_timeout.timeout(timeout):
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
2022-08-07 21:15:20.912 ERROR (MainThread) [homeassistant] Error doing job: Task exception was never retrieved
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1025, in request_callback_rsp
return await callback_rsp
asyncio.exceptions.CancelledError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 1022, in request_callback_rsp
async with async_timeout.timeout(timeout):
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
self._do_exit(exc_type)
File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
I don't know whether this is a ZHA issue, or coordinator issue. I see it is zigpy erroring out but I assume an issue with the coordinator could lead to that too. Before I go there and get blamed for the issue because I am using a development firmware, I'd like to rule this out. Thanks!
I am completely uneducatedly convinced that this is primarily a firmware issue (bad memory management?) and probably needs to be fixed by Texas Instruments. But in a similar fashion I believe there could be things done in zigpy/ZHA that could prevent this bug from happening. Lastly, I also think that the zigpy/ZHA developers are aware of the "issue".
I have been playing around with python event loops/threads which might be the reason this is happening with ZHA but not with z2m. I am too inexperienced with python and zigpy to do anything productive myself but I might start a discussion over at zigpy to get some guidance if and how it could be possible to decouple zigpy from the Home Assistant event loop which in the worsed case should decrease delays and in the best case might fix this bug from being triggered.
I also have error messages with the Dev FW
Coordinator Typ zStack3x0 Coordinator Version 20220726 LAUNCHXL-CC1352P-2
2022-08-15 17:42:18Publish 'set' 'state' to '0x001788010b4b1466' failed: 'Error: Command 0x001788010b4b1466/11 genLevelCtrl.moveToLevelWithOnOff({"level":0,"transtime":10}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> ZDO - extRouteDisc - {"dstAddr":28838,"options":0,"radius":30}' failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))'
I'll add my experience to the thread:
I upgraded three days ago and haven't had any issues at all.
After the issues in my post above 9 days ago, I wiped the stick and reformed a network on ch25. It has been working great since then. My coordinator is now also linking directly to 40ish devices instead of being limited to 20ish as before this latest network reforming.
Using this CC1352P2_CC2652P_launchpad_coordinator_20220726 on my Sonoff ZBDongle-P
I finally got zigpy_znp working in a dedicated thread. Will let you know if it has any impact.
Can the people using ZHA check if the issue is fixed with the following fw? fws.zip
Changes in this fw compared to 20220726:
HEAPMGR_SIZE 0xc000 -> 0x6000
SRC_RTG_EXPIRY_TIME 2 -> 255
Thanks for your time.
It's flashed & I will let you know
Still crashed after about an hour for me
Unfortunately mine crashed too over night
Can you try with this fw: fws.zip ?
This fw is exactly the same as 20220219
except that stack has been updated from 5.40.00.40
to 6.20.00.29
Will try later. Dev firmware 20220507 (based on SDK 6.10.00.29) works completely fine btw.
20220507 seems to be stable for me too. I flashed the latest fws.zip firmware and it has been in use for a few hours now.
@TheJulianJES do you have the same error messages as I do? and if so, do you happen to use markdown cards in HA frontend? I found out that my warnings were caused by 2 templates I made. Both iterate over the ZHA devices. One was listing me all unavailable devices, the other listed all lights that were on. I removed both templates and the error messages stopped.
However I did not try if this has any effect on controller firmware getting stuck
Example error message:
2022-08-18 12:07:54.588 WARNING (MainThread) [asyncio] Executing <Handle _async_track_state_change_event.<locals>._async_state_change_dispatcher(<Event state_...453616+02:00>>) at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/helpers/event.py:272 created at /srv/homeassistant/lib/python3.10/site-packages/homeassistant/core.py:456> took 0.133 seconds
Firmware from 2 days ago (in the zip) seems to be stable
Well my controller did not like me saying that. It just froze
Died again over night
Then I think something it is something in the SDK itself or some specific things in ZHA which cause it to break.
So, I am now flashing the latest stable firmware from feb, because I had two freezes with the dev firmware from may
Just saw this and was about to test.
Am I reading the thread correctly - the problems seem to be ZHA only (or at least not z2m)?
Has anyone tried a build with the prior SDK, but with all other table and retry build option changes?
Yes that is how it seems. I think no such firmware has been built and published yet. I was thinking about setting up a build environment to test that but have not had the time for it yet.
Just saw this and was about to test.
Am I reading the thread correctly - the problems seem to be ZHA only (or at least not z2m)?
Has anyone tried a build with the prior SDK, but with all other table and retry build option changes?
Seems to be the case, for me it's been rockstable, and a lot better than previous firmwares. I used to have nodes that stopped reporting, not anymore
Seems to be the case, for me it's been rockstable, and a lot better than previous firmwares. I used to have nodes that stopped reporting, not anymore
I am on the 072022 release and it doesn't seem to be crashing but I often have to reload ZHA or restart HA as sensors randomly stop reporting one or all their values.
@jerrm
Has anyone tried a build with the prior SDK, but with all other table and retry build option changes?
Yes, see https://github.com/Koenkk/Z-Stack-firmware/issues/383#issuecomment-1218341311
Wait, so it's not just SimpleLink SDK >= 6 causing the lock ups? It's either SimpleLink 6 or the routing table increase?
Looks like SDK if you are failing with the fws.zip @Koenkk linked in this post: https://github.com/Koenkk/Z-Stack-firmware/issues/383#issuecomment-1218341311
Looks like that should be older February build options with new SDK.
Opposite of the way I asked my question, but theoretically yields the same result.
So, there is no firmware with SDK 5 but larger routing table at the moment, correct?
I will do a double check on https://github.com/Koenkk/Z-Stack-firmware/issues/383#issuecomment-1218341311 (check if the only change is the SDK version)
After seeing in the changelog that the routing table sizes have increased I wanted to test the latest DEVELOPMENT firmware.
I am having issues which I believe are caused by the firmware update.
It seems to me that the firmware crashes after a few hours / an amout of requests. Unfortunately I cannot provide detailed feedback, but am glad to try with some guidance.
The first time it got stuck I did not pay a lot of attention and simply restarted everything. The second time I un- and replugged the coordinator and things recovered without any issues worth mentioning. The logs were full of messages as shown below (1). Later it changed to other error messages (2).
On the positive side: I do feel like the larger routing table might have had a positive effect on my environment. I have ~120 zigbee devices of which probably 2/3 are routers. Especially when toggling a bunch of lights at the same time I feel like it has less "hickups"
My environment: I am using a CC1352P2 launchpad with zigpy/zha/home assistant. The firware in use was https://github.com/Koenkk/Z-Stack-firmware/blob/develop/coordinator/Z-Stack_3.x.0/bin/CC1352P2_CC2652P_launchpad_coordinator_20220724.zip
Error message 1:
Error message 2: