Koenkk / Z-Stack-firmware

Compilation instructions and hex files for Z-Stack firmwares
MIT License
2.29k stars 639 forks source link

SONOFF USB 3.0 stick running 20221226 stops responding if multiple Zigbee devices lose power #440

Closed pfak closed 1 year ago

pfak commented 1 year ago

I have a SONOFF Zigbee 3.0 stick on a Raspberry Pi 3 accessed over ser2net with ~164 devices running Z-Stack 3.30+ (build 20221226).

If I power off a bunch of Zigbee devices (i.e. breaker or power failure) I can get the SONOFF stick into a state where it doesn't respond over serial, and kernel messages on the machine show:

Apr  4 01:15:55 zha-coordinator kernel: [1383198.903388] WARN::dwc_otg_hcd_urb_dequeue:639: Timed out waiting for FSM NP transfer to complete on 3
Apr  4 01:16:10 zha-coordinator kernel: [1383213.915668] WARN::dwc_otg_hcd_urb_dequeue:639: Timed out waiting for FSM NP transfer to complete on 7
Apr  4 01:16:25 zha-coordinator kernel: [1383228.923033] WARN::dwc_otg_hcd_urb_dequeue:639: Timed out waiting for FSM NP transfer to complete on 6
Apr  4 01:17:25 zha-coordinator kernel: [1383288.965764] WARN::dwc_otg_hcd_urb_dequeue:639: Timed out waiting for FSM NP transfer to complete on 0
Apr  4 01:17:40 zha-coordinator kernel: [1383303.974544] WARN::dwc_otg_hcd_urb_dequeue:639: Timed out waiting for FSM NP transfer to complete on 5
Apr  4 01:18:25 zha-coordinator kernel: [1383349.005392] WARN::dwc_otg_hcd_urb_dequeue:639: Timed out waiting for FSM NP transfer to complete on 5

I can only fix it by restarting the machine or re-plugging the USB device, and then restarting HomeAssistant or reloading the integration. ZHA just responds with the following if I don't restart or re-plug the USB device:

2023-04-03 18:48:33.065 WARNING (MainThread) [homeassistant.components.zha.core.gateway] Couldn't start ZNP = Texas Instruments Z-Stack ZNP protocol: CC253x, CC26x2, CC13x2 coordinator (attempt 1 of 3)
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 665, in _skip_bootloader
    result = await responses.get()
  File "/usr/local/lib/python3.10/asyncio/queues.py", line 159, in get
    await getter
asyncio.exceptions.CancelledError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/src/homeassistant/homeassistant/components/zha/core/gateway.py", line 185, in async_initialize
    self.application_controller = await app_controller_cls.new(
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 171, in new
    await app.startup(auto_form=auto_form)
  File "/usr/local/lib/python3.10/site-packages/zigpy/application.py", line 145, in startup
    await self.connect()
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/zigbee/application.py", line 107, in connect
    await znp.connect()
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 707, in connect
    self.capabilities = (await self._skip_bootloader()).Capabilities
  File "/usr/local/lib/python3.10/site-packages/zigpy_znp/api.py", line 664, in _skip_bootloader
    async with async_timeout.timeout(CONNECT_PROBE_TIMEOUT):
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 129, in __aexit__
    self._do_exit(exc_type)
  File "/usr/local/lib/python3.10/site-packages/async_timeout/__init__.py", line 212, in _do_exit
    raise asyncio.TimeoutError
asyncio.exceptions.TimeoutError
pfak commented 1 year ago

Since this is an RPi 3 1v.2 I am going to try using dtcoverlay=dwc2 which will use the dwc2 USB contoller instead of dwc_otg and report back.

pfak commented 1 year ago

Switching to dwc2 on the RPi did not fix my issue, I can reproduce every time I flip a bunch of power on lights:

Apr  9 19:24:26 zha-coordinator kernel: [439986.981446] cp210x ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
Apr  9 19:24:26 zha-coordinator kernel: [439986.981603] cp210x ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
Apr  9 19:24:26 zha-coordinator kernel: [439987.009525] usb 1-1.4: USB disconnect, device number 4
Apr  9 19:24:26 zha-coordinator kernel: [439987.010025] cp210x ttyUSB0: failed set request 0x12 status: -19
Apr  9 19:24:26 zha-coordinator kernel: [439987.010054] cp210x ttyUSB0: failed set request 0x0 status: -19
Apr  9 19:24:26 zha-coordinator kernel: [439987.010829] cp210x ttyUSB0: cp210x converter now disconnected from ttyUSB0
Apr  9 19:24:26 zha-coordinator kernel: [439987.018592] cp210x 1-1.4:1.0: device disconnected
Apr  9 19:24:27 zha-coordinator kernel: [439987.348531] usb 1-1.4: new full-speed USB device number 5 using dwc2
Apr  9 19:24:27 zha-coordinator kernel: [439987.452150] usb 1-1.4: New USB device found, idVendor=10c4, idProduct=ea60, bcdDevice= 1.00
Apr  9 19:24:27 zha-coordinator kernel: [439987.452193] usb 1-1.4: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Apr  9 19:24:27 zha-coordinator kernel: [439987.452216] usb 1-1.4: Product: Sonoff Zigbee 3.0 USB Dongle Plus
Apr  9 19:24:27 zha-coordinator kernel: [439987.452237] usb 1-1.4: Manufacturer: ITead
Apr  9 19:24:27 zha-coordinator kernel: [439987.452256] usb 1-1.4: SerialNumber: a6d9bdd3aac9eb11848a8f4f1d69213e
Apr  9 19:24:27 zha-coordinator kernel: [439987.453812] cp210x 1-1.4:1.0: cp210x converter detected
Apr  9 19:24:27 zha-coordinator kernel: [439987.457775] usb 1-1.4: cp210x converter now attached to ttyUSB0
Koenkk commented 1 year ago

Can you test the new fw and provide feedback here? https://github.com/Koenkk/Z-Stack-firmware/issues/439

pfak commented 1 year ago

hi @Koenkk:

I am using the new firmware in #439 now (20230410). I can no longer reproduce this issue that I reported.

However, with 20230410 I am seeing 1) Latency on commands, and 2) The stick becoming unresponsive as the network was being rebuilt after upgrading, 3) Great difficulty pairing Inovelli VZM31-SN switches, it is taking >5 attempts to pair them.

pfak commented 1 year ago

Removing RPi from the mix, and ser2net:

20230410 = Spewing lots of max concurrency errors, very slow response and finally the stick stops responding 20221226 = Network immediately comes back up and is responsive

Koenkk commented 1 year ago

Spewing lots of max concurrency errors

Where do you see these errors, can you provide the log of it?

pfak commented 1 year ago

On 20230410, in home assistant logs with ZHA debugging enabled. After upgrading the firmware, I immediately get high requests, which is expected, but they do not level off:

2023-04-13 20:08:34.297 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (290 enqueued)
2023-04-13 20:08:34.332 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (291 enqueued)
2023-04-13 20:08:34.360 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (292 enqueued)
2023-04-13 20:08:34.440 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (293 enqueued)
2023-04-13 20:08:42.023 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (278 enqueued)
2023-04-13 20:08:42.050 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (279 enqueued)
2023-04-13 20:08:42.059 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (280 enqueued)
2023-04-13 20:08:42.105 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (281 enqueued)
2023-04-13 20:08:42.212 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (282 enqueued)
2023-04-13 20:08:42.219 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (283 enqueued)
2023-04-13 20:08:42.244 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (284 enqueued)
2023-04-13 20:08:42.264 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (285 enqueued)
2023-04-13 20:08:42.303 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (286 enqueued)
2023-04-13 20:08:42.370 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (287 enqueued)
2023-04-13 20:08:42.374 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (288 enqueued)
2023-04-13 20:08:42.407 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (289 enqueued)
2023-04-13 20:08:42.416 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (290 enqueued)
2023-04-13 20:08:42.444 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (291 enqueued)
2023-04-13 20:08:42.458 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (292 enqueued)
2023-04-13 20:08:42.467 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (293 enqueued)
2023-04-13 20:08:44.701 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (294 enqueued)
2023-04-13 20:08:50.721 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (0 enqueued)
2023-04-13 20:08:50.722 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (1 enqueued)
2023-04-13 20:08:50.730 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (2 enqueued)
2023-04-13 20:08:50.733 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (3 enqueued)
2023-04-13 20:08:50.735 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (4 enqueued)
2023-04-13 20:08:50.736 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (4 enqueued)
2023-04-13 20:08:50.738 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (5 enqueued)
2023-04-13 20:08:50.739 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (6 enqueued)
2023-04-13 20:08:50.741 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (7 enqueued)
2023-04-13 20:08:50.746 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (7 enqueued)
2023-04-13 20:08:50.747 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (8 enqueued)
2023-04-13 20:08:50.759 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (7 enqueued)
2023-04-13 20:08:50.760 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (8 enqueued)
2023-04-13 20:08:50.760 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (9 enqueued)
2023-04-13 20:08:50.778 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (8 enqueued)
2023-04-13 20:08:50.781 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (9 enqueued)
2023-04-13 20:08:50.785 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (9 enqueued)
2023-04-13 20:08:50.786 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (10 enqueued)
2023-04-13 20:08:50.787 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (11 enqueued)
2023-04-13 20:08:50.810 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (9 enqueued)
2023-04-13 20:08:50.831 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (8 enqueued)
2023-04-13 20:08:50.832 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (9 enqueued)
2023-04-13 20:08:50.836 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (10 enqueued)
2023-04-13 20:08:50.847 DEBUG (MainThread) [zigpy.application] Max concurrency (16) reached, delaying request (10 enqueued)

For 2023-03-13, which is a mix of 20230410 and 20221226:

$ grep -E '^2023-04-13.*Max concurrency' home-assistant.log | wc -l
13081

As soon as I downgraded to 20221226 the max concurrency messages are very little. 2023-04-14 I have only been running 20221226:

$ grep -E '^2023-04-14.*Max concurrency' home-assistant.log | wc -l
63

Can you let me know what details you'd like? I have a complete log of me upgrading, downgrading and trying various things but it's a couple GiB in size.

Koenkk commented 1 year ago

I'm not familiar with the ZHA logging, is there any way to see responses from the adapter?

pfak commented 1 year ago

@Koenkk Is this helpful? puddly suggested I include zigpy_znp.api to narrow down to Z-Stack itself.

Complete output: complete-output-20230410-firmware.txt Filtered output: truncated-zigpy-output-20230410-firmware.txt

But being a layman all I see is that the stick has stopped responding and then comes back?

See attached.

Koenkk commented 1 year ago

I see this a lot of times in the log:

2023-04-13 20:08:49.745 DEBUG (MainThread) [homeassistant.components.zha.core.channels.base] [0x1ECA:11:0x0008]: failed to get attributes '['current_level']' on 'level' cluster: Expected SRSP response AF.DataRequestExt.Rsp(Status=<Status.SUCCESS: 0>), got AF.DataRequestExt.Rsp(Status=<Status.INVALID_PARAMETER: 2>)

Especially <Status.INVALID_PARAMETER: 2> is interesting, it seems ZHA sends invalid request to the coordinator, this should be fixed first in ZHA

pfak commented 1 year ago

@Koenkk

The original issue (USB stick crashing) with 20221226 was resolved by removing ser2net and RPi 3 from the equation. I can verify that without ser2net in the mix, the controller works reliably under conditions where I flip a breaker and a bunch of devices lose power.

I am hard pressed to understand how if 20221226 is working fine, that the issues with 20230410 that are 100% reproducible when switching are related to ZHA, since as soon as I downgrade to 20221226 the issues resolve.

puddly mentioned they can't do much if the USB coordinator disappears out from under them, as they can't communicate with them anymore.

We can close this issue out and I can provide further discussion on the other issue if you believe it is still a ZHA issue.