Koenkk / Z-Stack-firmware

Compilation instructions and hex files for Z-Stack firmwares
MIT License
2.36k stars 645 forks source link

ZZH stops working after a while #205

Closed JoseAlcerreca closed 3 years ago

JoseAlcerreca commented 4 years ago

I migrated yesterday from a cheap CC2530 (which worked for months) to ZZH but after ~8 hours it stopped working:

dmesg

[Thu Sep 10 03:03:32 2020] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32

zigbee2mqtt's log (with log_level = debug) stopped showing activity but it was running (it was saving state):

debug 2020-09-10 02:53:27: Received Zigbee message from '0x0017880102c1783e', type 'commandQueryNextImageRequest', cluster 'genOta', data '{"fieldControl":0,"manufacturerCode":4107,"imageType":265,"fileVersion":1107314102}' from endpoint 2 with groupID 0
debug 2020-09-10 02:53:37: Received Zigbee message from '0x0017880102c1783e', type 'commandQueryNextImageRequest', cluster 'genOta', data '{"fieldControl":0,"manufacturerCode":4107,"imageType":265,"fileVersion":1107314102}' from endpoint 2 with groupID 0
debug 2020-09-10 02:55:56: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-10 03:00:56: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-10 03:05:56: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-10 03:10:56: Saving state to file /opt/zigbee2mqtt/data/state.json
...

Zigbee2mqtt was able to receive commands but they were timing out. Sending a command from home assistant:

info  2020-09-10 07:34:41: MQTT publish: topic 'zigbee2mqtt/bridge/log', payload '{"type":"zigbee_publish_error","message":"Publish 'set' 'state' to 'outlet-A-ender' failed: 'Error: Command 0x000b3cfffef78d01/1 genOnOff.off({}, {\"timeout\":10000,\"disableResponse\":false,\"disableDefaultResponse\":false,\"direction\":0,\"srcEndpoint\":null,\"reservedBits\":0,\"manufacturerCode\":null,\"transactionSequenceNumber\":null}) failed (SRSP - AF - dataRequest after 6000ms)'","meta":{"friendly_name":"outlet-A-ender"}}'

I restarted the zigbee2mqtt service and everything started working again (did not unplug the zzh).

Firmware is https://github.com/Koenkk/Z-Stack-firmware/blob/master/coordinator/Z-Stack_3.x.0/bin/CC26X2R1_20200805.zip

RPi 3 B+ running DietPI with RPi power supply.

Koenkk commented 4 years ago

Could you provide the herdsman debug logging at the moment it stops receiving/sending?

To enable herdsman debug logging, see https://www.zigbee2mqtt.io/information/debug.html#zigbee-herdsman-debug-logging

JoseAlcerreca commented 4 years ago

It happened again, the zigbee network is down:

Z2M log:

info  2020-09-11 06:48:12: MQTT publish: topic 'zigbee2mqtt/0xec1bbdfffee4bf62', payload '{"battery":80,"voltage":2700,"linkquality":72,"temperature":26.8,"humidity":34.3}'
debug 2020-09-11 06:48:13: Received Zigbee message from '0xec1bbdfffee4bf62', type 'attributeReport', cluster 'genPowerCfg', data '{"batteryPercentageRemaining":160,"batteryVoltage":27}' from endpoint 1 with groupID 0
info  2020-09-11 06:48:13: MQTT publish: topic 'zigbee2mqtt/0xec1bbdfffee4bf62', payload '{"battery":80,"voltage":2700,"linkquality":72,"temperature":26.8,"humidity":34.3}'
debug 2020-09-11 06:48:21: Received Zigbee message from '0xec1bbdfffee4bf62', type 'attributeReport', cluster 'genBasic', data '{"65506":19,"appVersion":255}' from endpoint 1 with groupID 0
debug 2020-09-11 06:48:21: No converter available for 'TS0201' with cluster 'genBasic' and type 'attributeReport' and data '{"65506":19,"appVersion":255}'
debug 2020-09-11 06:48:21: Received Zigbee message from '0xec1bbdfffee4bf62', type 'attributeReport', cluster 'genBasic', data '{"65506":19,"appVersion":65}' from endpoint 1 with groupID 0
debug 2020-09-11 06:48:21: No converter available for 'TS0201' with cluster 'genBasic' and type 'attributeReport' and data '{"65506":19,"appVersion":65}'
debug 2020-09-11 06:50:41: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-11 06:55:41: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-11 07:00:41: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-11 07:05:41: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-11 07:07:27: Received Zigbee message from 'outlet-A-ender', type 'attributeReport', cluster 'genOnOff', data '{"onOff":1}' from endpoint 1 with groupID 0
info  2020-09-11 07:07:27: MQTT publish: topic 'zigbee2mqtt/outlet-A-ender', payload '{"state":"ON","linkquality":135,"update_available":true}'
debug 2020-09-11 07:10:41: Saving state to file /opt/zigbee2mqtt/data/state.json
debug 2020-09-11 07:15:41: Saving state to file /opt/zigbee2mqtt/data/state.json
(repeated lines until now)

dmesg saw something at 16:42 that didn't affect anything and again at 7:09

[Thu Sep 10 16:42:49 2020] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[Fri Sep 11 07:09:28 2020] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32

zigbee-herdsman log:

Everything seems fine until:

Sep 11 07:12:25 DietPi npm[4908]: 2020-09-11T05:12:25.507Z zigbee-herdsman:adapter:zStack:unpi:writer --> frame [254,5,37,54,15,252,255,254,0,228]
Sep 11 07:12:31 DietPi npm[4908]: (node:4924) UnhandledPromiseRejectionWarning: Error: SRSP - ZDO - mgmtPermitJoinReq after 6000ms
Sep 11 07:12:31 DietPi npm[4908]:     at Timeout._onTimeout (/opt/zigbee2mqtt/node_modules/zigbee-herdsman/dist/utils/waitress.js:46:35)
Sep 11 07:12:31 DietPi npm[4908]:     at listOnTimeout (internal/timers.js:551:17)
Sep 11 07:12:31 DietPi npm[4908]:     at processTimers (internal/timers.js:494:7)
Sep 11 07:12:31 DietPi npm[4908]: (Use `node --trace-warnings ...` to show where the warning was created)
Sep 11 07:12:31 DietPi npm[4908]: (node:4924) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by r
ejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_
unhandled_rejections_mode). (rejection id: 1)

Which is repeated dozens of times until now. Full log since 6am: https://pastebin.com/jqXvYtDD

omerk commented 4 years ago

Hmm, looks like the USB port is the culprit here, can you post a full dmesg from around the time you notice comms issues? I am interested in seeing any potential power related messages from the kernel.

Do you have any power hungry devices connected to the USB ports and do you have a decent power supply for your Raspberry Pi?

Looking around, this appears to be a common problem with Raspberry Pi boards, there is even some FAQs mentioning it: https://www.klipper3d.org/FAQ.html#i-keep-getting-random-lost-communication-with-mcu-errors and a whole selection of threads mentioning that message on RPi.

More importantly, as a temporary test, can you perhaps move zzh to another computer (not a RPi) and see if you can replicate the issue? We can then work backwards to try and solve the problem on the Pi.

Another good step would be to ensure your Pi is fully up to date, in case this is a driver/kernel issue...

JoseAlcerreca commented 4 years ago

The dmesg line is the only thing there is around that time. It's important to note that that message happened yesterday at 16:42 as well, but nothing happened, the network worked fine until this morning.

Is the urb stopped message related to power problems? I can find multiple threads mentioning it but not specifically about power.

There is an SSD connected to the Pi and I'm using an official power supply for the 3B+.

Another good step would be to ensure your Pi is fully up to date, in case this is a driver/kernel issue...

Moved to DietPi a month ago Linux DietPi 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020 aarch64 GNU/Linux

In any case, zigbee2mqtt is not crashing so, would it be possible to catch the error and continue working? I can restart zigbee2mqtt and everything works again so the hardware seems fine.

omerk commented 4 years ago

Is the urb stopped message related to power problems? I can find multiple threads mentioning it but not specifically about power.

Seen mentions of it so I was curious to see if you had any power related messages there. Seems like there aren't any?

There is an SSD connected to the Pi and I'm using an official power supply for the 3B+.

SSDs can draw a fair amount of current, in theory you should be fine id you're using the official power supply though...

Linux DietPi 5.4.51-v8+ #1333 SMP PREEMPT Mon Aug 10 16:58:35 BST 2020 aarch64 GNU/Linux

I haven't used DietPi before but that certainly is a recent enough kernel :-)

I have not seen this error before but it doesn't sound like a zzh hardware issue given that you can communicate with the stick (and even after seeing that error message as well). If you can try replicating this on another non-Pi computer that would allow us to rule this possibility out completely.

JoseAlcerreca commented 4 years ago

It failed again after 8:30 hours. It's surprisingly consistent. It takes 3-4 minutes from the dmesg error to the UnhandledPromiseRejectionWarning.

Looks to me that this should be indeed a warning and continue working.

This is very similar to https://github.com/Koenkk/zigbee2mqtt/issues/3345 Which was fixed in 1.13.3 and I'm using 1.14.2 and it doesn't seem related at all with hardware. :shrug:

JoseAlcerreca commented 4 years ago

@Koenkk Happy to try dev builds if you have an idea of how to ignore the error.

My workaround at the moment is cron restarting the service every 8 hours but sometimes it only lasts 6 :)

Koenkk commented 4 years ago

@JoseAlcerreca while the error message might be the same, the cause of this is really different. In your case there goes something wrong with the driver/kernel so I don't see how we can fix that from Zigbee2MQTT. I recommend that you first try the suggestion from @omerk to rule out this is a hardware or zigbee2mqtt issue:

If you can try replicating this on another non-Pi computer that would allow us to rule this possibility out completely.

JoseAlcerreca commented 4 years ago

It looks like there's a hardware issue, I agree. I'm just saying that if restarting the Zigbee2MQTT service immediately fixes it, there's a potential fix in software that would make the daemon a bit more reliable.

For now, I disabled permit_join since the trace shows mgmtPermitJoinReq. No idea what I'm doing though.

omerk commented 4 years ago

@JoseAlcerreca For me to rule out if this is a zzh hardware issue and/or @Koenkk to propose either a solution or a recommendation, we really need you to try and replicate this on another computer. Without establishing the actual cause of the problem, making random changes to firmware/software is not going to help anyone and it can potentially cause problems for others.

I'm afraid until we have more data, we are not going to be able to resolve this issue properly.

JoseAlcerreca commented 4 years ago

I switched to another Pi without anything else connected and it's been working for a week now.

That said, I still think the problem is recoverable in software (with a very loud warning) since restarting zigbee2mqtt solved the issue temporarily.

Thanks, feel free to close the issue.

coldfire84 commented 4 years ago

Running my zzh! adapter on a Pi 3B (using official PSU, running Ubuntu), with a SSD attached via USB and have exactly the same issues (have been chasing my tail on this for a few days before I found this post).

Also running latest controller firmware (tried stable and dev releases), and up-to-date (at time of writing) zigbee2mqtt.

Same failed (SRSP - AF - dataRequest after 6000ms) errors, dmesg shows the same output as @JoseAlcerreca :

dmesg | grep ttyUSB
[   43.010395] usb 1-1.2: ch341-uart converter now attached to ttyUSB0
[ 5300.697255] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[19177.836177] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[75907.066063] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[80406.244388] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32

I'd noted the following from the documentation here:

Are you using any virtualisation/container pass-through for the USB device? There have been reports of these potentially causing problems so for debugging purposes try communicating with the adapter directly from the host OS it is connected to (i.e: see if it works without the bypass)

I can confirm that the issues occurs irrespective of running it bare-metal or via Docker with USB pass-thru. If this is a power related issue (going to buy a powered USB hub to troubleshoot) then its probably worth adding something to the docs flagging this as a potential issue.

omerk commented 4 years ago

@coldfire84 As a test, would you be able to temporarily unplug your SSD and see if the issue persists? Thanks!

coldfire84 commented 4 years ago

Hi @omerk , unfortunately it is the boot drive (and only drive attached to the system), used in-place of micro-sd card for reliability/ longevity.

I'll setup another Pi 3B and configure zigbee2mqtt running on an micro-sd card.

omerk commented 4 years ago

@coldfire84 Ah, removing the SSD could indeed be problematic :-) Another test would be to see if moving zzh to another USB port (if it is on the same set of 2 connectors as the SSD) and/or using a USB extension cable to see if that fixes things. The hypothesis there is interference that the SSD might be causing (instead of lack of power, although SSDs are known to generate spikes).

coldfire84 commented 4 years ago

@omerk , can confirm:

coldfire84 commented 3 years ago

Just an update on this - I moved the zzh! stick to a x86 PC running containerized zigbee2mqtt and, in 6 days, have had no further failed (SRSP - AF - dataRequest after 6000ms) issues.

At some point I'll test with a Pi that isn't using a USB SSD, but for now this fixes my issues and is running on a device that is on 24/7 anyway.

juslex commented 3 years ago

Running my zzh! adapter on a Pi 3B (using official PSU, running Ubuntu), with a SSD attached via USB and have exactly the same issues (have been chasing my tail on this for a few days before I found this post).

Also running latest controller firmware (tried stable and dev releases), and up-to-date (at time of writing) zigbee2mqtt.

Same failed (SRSP - AF - dataRequest after 6000ms) errors, dmesg shows the same output as @JoseAlcerreca :

dmesg | grep ttyUSB
[   43.010395] usb 1-1.2: ch341-uart converter now attached to ttyUSB0
[ 5300.697255] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[19177.836177] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[75907.066063] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[80406.244388] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32

I'd noted the following from the documentation here:

Are you using any virtualisation/container pass-through for the USB device? There have been reports of these potentially causing problems so for debugging purposes try communicating with the adapter directly from the host OS it is connected to (i.e: see if it works without the bypass)

I can confirm that the issues occurs irrespective of running it bare-metal or via Docker with USB pass-thru. If this is a power related issue (going to buy a powered USB hub to troubleshoot) then its probably worth adding something to the docs flagging this as a potential issue.

Same issue here.

chrisdaloa commented 3 years ago

Is it solved?

JoseAlcerreca commented 3 years ago

It looks like a power/usb issue. Someone needs to confirm if a powered hub is enough.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

magnusnordlander commented 3 years ago

I've had this issue for a while (same errors, same device etc), and bought a powered hub today to see if that helps. I'll try to report back in a few weeks regardless, but if you haven't heard anything, that probably means that it's working and I've forgotten about this.

magnusnordlander commented 3 years ago

Well, that was fast. No, a powered hub didn't help in my case.

Some more info about my setup:

Adapter: ZZH Machine: HPE Microserver Gen 10 Zigbee2MQTT version: 1.16.2 (Running through Docker with the image koenkk/zigbee2mqtt:1.16.2) Firmware: CC26X2R1_20201026

I've also had the same issue with a TI LAUNCHXL-CC1352P-2 dev board.

Zigbee2MQTT logs are normal until things stop working. The logs start showing a mix of errors like the following (first line is the actual first errors, later are just a sample of errors):

Zigbee2MQTT:error 2020-12-29 17:01:33: Publish 'set' 'color_temp' to 'Kitchen spotlights 3' failed: 'Error: Command 0x00178801043134a2/11 lightingColorCtrl.moveToColorTemp({"colortemp":272,"transtime":100}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null}) failed (SRSP - AF - dataRequest after 6000ms)'
Zigbee2MQTT:error 2020-12-29 17:01:37: Publish 'set' 'color' to 'Bedroom ceiling lamp 2' failed: 'Error: Command 0x000d6ffffe30c2be/1 lightingColorCtrl.moveToColor({"transtime":100,"colorx":28901,"colory":24248}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null}) failed (Data request failed with error: 'Timeout' (9999))'
Zigbee2MQTT:error 2020-12-29 17:02:27: Publish 'set' 'brightness' to 'Bedroom floor lamp 2' failed: 'Error: Command 0xccccccfffe8e6cdf/1 genLevelCtrl.moveToLevelWithOnOff({"level":101,"transtime":0}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null}) failed (SRSP - ZDO - extRouteDisc after 6000ms)'

I do have the following errors in my dmesg, but they don't appear when the adapter stops working, but rather when I physically disconnect the device:

[416190.289537] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[416190.290013] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
[416190.467784] usb 3-1.4.1: USB disconnect, device number 17
[416190.470244] usb 3-1.4.1: failed to send control message: -19
[416190.470651] ch341-uart ttyUSB0: ch341-uart converter now disconnected from ttyUSB0
[416190.470703] ch341 3-1.4.1:1.0: device disconnected
github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

HydrelioxGitHub commented 2 years ago

I have exactly the same issue. Try a lot of things :

It seems zigbee2mqtt does not work on a pi3B with ssd attached.

HydrelioxGitHub commented 2 years ago

Maybe this issue should be kept open ?

m4tek commented 2 years ago

After over a year of stable work I'm getting the same issues just recently (kernel update?!) I cannot link it to any particular software change and there were no hardware change to my setup whatsoever.

I'm running RPI3 with branded high quality USB power supply providing up to 3A. USB ports are occupied with USB Drive and zzh module. There are no under-voltage errors reported either.

Apr 13 19:37:01 rpi3 kernel: [83794.325647] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
Apr 13 19:37:01 rpi3 kernel: [83794.325767] ch341-uart ttyUSB0: usb_serial_generic_read_bulk_callback - urb stopped: -32
Apr 13 19:37:01 rpi3 kernel: [83794.366114] usb 1-1.5: USB disconnect, device number 6
Apr 13 19:37:01 rpi3 kernel: [83794.366676] usb 1-1.5: failed to send control message: -19
Apr 13 19:37:01 rpi3 kernel: [83794.367261] ch341-uart ttyUSB0: ch341-uart converter now disconnected from ttyUSB0
Apr 13 19:37:01 rpi3 kernel: [83794.661109] usb 1-1.5: new full-speed USB device number 7 using dwc_otg
Apr 13 19:37:01 rpi3 kernel: [83794.794312] usb 1-1.5: New USB device found, idVendor=1a86, idProduct=7523, bcdDevice= 2.64
Apr 13 19:37:01 rpi3 kernel: [83794.794350] usb 1-1.5: New USB device strings: Mfr=0, Product=2, SerialNumber=0
Apr 13 19:37:01 rpi3 kernel: [83794.794371] usb 1-1.5: Product: USB Serial
Apr 13 19:37:01 rpi3 kernel: [83794.795551] ch341 1-1.5:1.0: ch341-uart converter detected
Apr 13 19:37:01 rpi3 kernel: [83794.797958] usb 1-1.5: ch341-uart converter now attached to ttyUSB1
Apr 13 20:34:59 rpi3 kernel: [87272.639173] usb 1-1.5: USB disconnect, device number 7
Apr 13 20:34:59 rpi3 kernel: [87272.639778] ch341-uart ttyUSB1: ch341-uart converter now disconnected from ttyUSB1
Apr 13 20:35:02 rpi3 kernel: [87275.497705] usb 1-1.5: new full-speed USB device number 8 using dwc_otg
Apr 13 20:35:02 rpi3 kernel: [87275.650788] usb 1-1.5: New USB device found, idVendor=1a86, idProduct=7523, bcdDevice= 2.64
Apr 13 20:35:02 rpi3 kernel: [87275.650818] usb 1-1.5: New USB device strings: Mfr=0, Product=2, SerialNumber=0
Apr 13 20:35:02 rpi3 kernel: [87275.650828] usb 1-1.5: Product: USB Serial
Apr 13 20:35:02 rpi3 kernel: [87275.651676] ch341 1-1.5:1.0: ch341-uart converter detected
Apr 13 20:35:02 rpi3 kernel: [87275.653735] usb 1-1.5: ch341-uart converter now attached to ttyUSB0

This does not seam to be zigbee2mqtt issues frankly, but chiming in to show scale of the issue.

m4tek commented 2 years ago

To supplement post with a patch that I applied to get going.

I have my zigbee2mqtt started from systemd and it keeps on restarting when the USB dongle is lost. Problem I had is that (as shown above) device comes back as a different ttyUSBX so was not picked up. My config was statically pointing to /dev/ttyUSB0.

Following guide in https://www.domoticz.com/wiki/PersistentUSBDevices I checked my zzh details:

#  sudo lsusb -v | grep 'idVendor\|idProduct\|iProduct\|iSerial'
  idVendor           0x1a86 QinHeng Electronics
  idProduct          0x7523 CH340 serial converter
  iProduct                2 USB Serial
  iSerial                 0
(... snip ...)

and crafted udev rule:

$ cat  /etc/udev/rules.d/99-usb-serial.rules
ACTION=="add", SUBSYSTEM=="tty", ATTRS{idVendor}=="1a86", ATTRS{idProduct}=="7523", SYMLINK+="ttyUSB-zzh"

and updated zigbee2mqtt config:

$ cat /opt/zigbee2mqtt/data/configuration.yaml  | grep -A 2 serial
serial:
  port: /dev/ttyUSB-zzh
  adapter: auto

This is still an ugly patch, as my network of 57 device almost never re-converges fully after such an event and reports number of offline nodes a while.

m4tek commented 2 years ago

Oh this is actually a known issue (https://electrolama.com/radio-docs/troubleshooting/#why-does-my-usb-rf-stick-keep-disconnecting-and-connecting-again), however my setup did not change so hard to understand what is happening.