Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
11.77k stars 1.64k forks source link

90% of the Zigbee devices are offline #23571

Open mrespin opened 1 month ago

mrespin commented 1 month ago

What happened?

At some point, the Zigbee network becomes unstable, and devices do not respond. From 117 Zigbee devices, about 10% only responded.

The collection of error messages extracted from the log: failed with status '(0x02: INVALID_PARAM) zh:controller:device: Read response to failed zh:controller: Failed to keep permit join alive: Error: SRSP - AF - dataRequestExt after 6000ms zh:controller:device: Request Queue (): default expiration timeout set to 0 z2m: Error: SRSP - UTIL - getDeviceInfo after 6000ms zh:controller: Failed to keep permit join alive: Error: SRSP - ZDO - mgmtPermitJoinReq after 6000ms zh:controller:device: Handling of poll check-in from z2m: Publish 'set' 'state' to '0xa4c1383e0c9e6596' failed: 'Error: ZCL command 0xa4c1383e0c9e6596/1 genOnOff.on({}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SRSP - AF - dataRequest after 6000ms)' zh:controller:device: Read response to 0x385b44fffe3c15b8 failed zh:controller: Failed to keep permit join alive: Error: SRSP - ZDO - mgmtPermitJoinReq after 6000ms zh:controller:device: Handling of poll check-in from 0x0cae5ffffeab4ace failed z2m: Failed to ping '0xa4c138dc2060e358' (attempt 1/2, ZCL command 0xa4c138dc2060e358/1 genBasic.read(["zclVersion"], {"timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 62511 - 1 - 113 - 0 - 1 after 10000ms)) z2m: Exception while calling fromZigbee converter: Cannot read properties of undefined (reading 'hasOwnProperty')} zh:zstack: Data confirm error (0xa4c1383ee7809253:39187,205,0) zh:controller:endpoint: Error: ZCL command 0xa4c1383ee7809253/1 genBasic.read(["zclVersion"], {"timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Data request failed with error: 'No network route' (205)) Failed to configure '0xa4c138edf9f23915', attempt 1 (Error: ZCL command 0xa4c138edf9f23915/1 genBasic.read(["manufacturerName","zclVersion","appVersion","modelId","powerSource",65534], {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> ZDO - extRouteDisc - {"dstAddr":20355,"options":0,"radius":30}' failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)')) zh:controller:device: Handling of poll check-in from 0x0cae5ffffeab4ace failed zh:controller: Failed to keep permit join alive: Error: Cannot request when znp has not been initialized yet z2m: Exception while calling fromZigbee converter: Cannot read properties of undefined (reading 'hasOwnProperty')} z2m: Publish 'set' 'state' to '0xa4c138489987e95a' failed: 'Error: ZCL command 0xa4c138489987e95a/1 genOnOff.on({}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Data request failed with error: 'MAC no ack' (233))' zh:controller: Interview failed for '0xa4c13866d68d2656 with error 'Error: Interview failed because can not get active endpoints ('0xa4c13866d68d2656')' z2m: Failed to interview '0xa4c13866d68d2656', device has not successfully been paired

While testing, I ran HackRF One with Airspy to see RF transmission (and hear it as I turn on the sound in Airspy) and a Zigbee sniffer with Wireshark. It all looks perfect, with no or minimal background noises and no corruption in the Zigbee network. However, the Zigbee network is busy as hell, with 117 devices talking.

What did you expect to happen?

All ZigBee devices are to be responsive and have no errors.

How to reproduce it (minimal and precise)

It was like a Pandora's box; nothing could stop the issue as it happened. Without a difference, I tried replacing the SONOFF Zigbee 3.0 USB Dongle Plus-P with another one, firmware 20240716 and SLZB-06P7, and only 10% of the devices are still responding. I downgraded Z2M to 1.36.0 with some help, as I can reach 80% of the devices regardless of testing with one of the three coordinators.

By the way, I did not suspect that changing coordinators (back and forth) was so easy - thank you, Koenkk, for a great job.

Re-pairing an unresponsive device shows information in Z2M that the device left the network (what?) before pairing again. However, with so many devices hidden in the wall, re-pairing all was not an option.

Finally, Availability: I could not change Availability over Z2M GUI as it shows errors: "should be boolean should be object should match exactly one schema in oneOf Checks whether devices are online/offline etc" So, I removed Availability from the config file: Availability: active: timeout: 10 passive: timeout: 1500 After removing the Availability, all devices and the network work like charms on Z2M 1.39.1 without any error.

Zigbee2MQTT version

1.39.1

Adapter firmware version

20230507 or 20240716

Adapter

SONOFF Zigbee 3.0 USB Dongle Plus-P or SLZB-06P7

Setup

Runs on docker on x86-64 as a container.

Debug log

No response

Koenkk commented 1 month ago

Could you provide the debug log of this with the 20240716 fw?

See this on how to enable debug logging.

mrespin commented 1 month ago

I have separate logs for SLZB-06P7 with 20240716 and mixed logs (20230507 or 20240716) for Sonoff USB Dongle Plus-P due to testing two coordinators with different firmware versions SLZB-06P7_z2m_N02.zip SonOff.zip

mrespin commented 1 month ago

Hi Koen, I have a 55MB pcapng file from Wireshark 800K decoded Zigbee packets; let me know if you need them.

astan00 commented 1 month ago

I noticed same problem today after power failure. My setup:
z-stack v 20230507 z2m: 1.39.1

Network randomly puts devices offline. Repairing helps, but often these newly paired goes offline too. Most battery-powered are offline since power failure.

I use identical setup elsewhere, with no issues.

Debug logs attached. log.zip

Koenkk commented 1 month ago

@mrespin I need the debug logging from starting z2m until the errors. The sniff would also be nice to have

See this on how to enable debug logging.

@astan00

Network randomly puts devices offline. Repairing helps, but often these newly paired goes offline too. Most battery-powered are offline since power failure.

Battery powered (end) devices are pinged, they are marked offline if no message has been received from them in 25 hours. I guess for some reason they disconnected, I expect that they also don't generate any events when triggering them?

astan00 commented 1 month ago

This happened to routers too. Several light bulbs became offline.

I also noticed routers being marked online, but last seen date is hours ago. These are not reachable too. Repair helps, but also happened that few moments later the problem occurs again for just repaired routers.

For the battery powered sensor, yesterday I was not able to get an event from the occupancy sensor until the physical button was pushed once. Regular event which should send data from sensor did not work.

I really cannot see a pattern here.

mrespin commented 1 month ago

When I turned off Availability, all devices started to generate traffic in Z2M. The problem seems to have returned; the most recent "Last seen" is 12 hours. I have attached old logs for the SonOff coordinator, and the second is SLZB-06P7 (two instances of Z2M with the same data; only the coordinator is different). Z2m_SonOff.zip Z2M_SLZB-06P7.zip

mrespin commented 1 month ago

Wireshark data. Zigbee_N02.zip

astan00 commented 1 month ago

Yesterday I noticed I have messages' parsing errors in Appdaemon (controllerx app). These showed even when no controllers were used.

Debug of this is far above my skills, so I just tried to remove all retained messages from mqtt, assuming I have something malformed there (homeasistant and zigbee2mqtt topics)

And it seems fixed now. Still need to repair devices affected, but issues, including Appdaemon parsing errors, are gone.

Will look for further problems.

mrespin commented 1 month ago

I let it run (did not restart/reboot), and the minimal Last seen increased to 1 day. Eventually, Z2M started producing errors and kept restarting (by watchdog?) z2m: Error: Failed to connect to the adapter (Error: SRSP - SYS - ping after 6000ms) and restart (by watchdog?) I checked, and the SLZB-06P7 showed that it is connected with Z2M. Therefore, I rebooted the adapter, and the error changed to: z2m: Error: AREQ - ZDO - activeEpRsp after 10000ms, Then, I restarted the Z2M container, and after a few minutes, about 50% of the devices had been last seen for less than 1 day. Errors (Z2M continue to run) zh:zstack:znp: Error while parsing to ZpiObject 'Error: CommandID '159' from subsystem '5' not found Now logs include "info". Zigbee2MQTT version 1.39.1 commit: e132316 Coordinator type zStack3x0 Coordinator revision 20240716 Frontend version 0.7.4 zigbee-herdsman-converters version 19.72.0 zigbee-herdsman version 0.55.3

Event2.zip

mrespin commented 1 month ago

I let it run again and have new symptoms: Most recent last seen 1 day. When I action anything (directly from "exposes" in Z2M), I have an error immediately: error 2024-08-14 20:43:31z2m: Publish 'set' 'power_on_behavior' to '0xa4c13899e4e73e35' failed: 'Error: ZCL command 0xa4c13899e4e73e35/1 manuSpecificTuya.dataRequest({"seq":1,"dpValues":[{"dp":14,"datatype":2,"data":[0,0,0,2]}]}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> AF - dataRequest - {"dstaddr":35344,"destendpoint":1,"srcendpoint":1,"clusterid":61184,"transid":79,"options":0,"radius":30,"len":13,"data":{"type":"Buffer","data":[17,210,0,1,0,14,2,0,4,0,0,0,2]}}' failed with status '(0x11: BUFFER_FULL)' (expected '(0x00: SUCCESS)'))' error 2024-08-14 20:43:32z2m: Publish 'set' 'power_on_behavior' to '0xa4c13899e4e73e35' failed: 'Error: ZCL command 0xa4c13899e4e73e35/1 manuSpecificTuya.dataRequest({"seq":1,"dpValues":[{"dp":14,"datatype":2,"data":[0,0,0,0]}]}, {"timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> AF - dataRequest - {"dstaddr":35344,"destendpoint":1,"srcendpoint":1,"clusterid":61184,"transid":80,"options":0,"radius":30,"len":13,"data":{"type":"Buffer","data":[17,211,0,1,0,14,2,0,4,0,0,0,0]}}' failed with status '(0x11: BUFFER_FULL)' (expected '(0x00: SUCCESS)'))'

I checked, and the SLZB-06P7 showed it is connected with Z2M. Therefore, I rebooted the adapter, and the Z2M stopped. I re-checked, and the SLZB-06P7 showed it is NOT connected with Z2M. When the Z2M watchdog kicked in, Z2M restarted, and the devices started to update their Last seen and SLZB-06P7 showing it is connected with Z2M. Logs attached Event3.zip

mrespin commented 1 month ago

Since restarting 30 minutes ago, there are only a few errors - extracted from the Log file: failed (Data request failed with error: 'No network route' (205))' - 4 times, network healing? failed (Timeout - 64685 - 1 - 8 - 6 - 11 after 10000ms)' - once failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))' - once failed (Data request failed with error: 'MAC channel access failure' (225))' - once Device Last seen (for less than a day) is about 50%, and it is growing up. The Last seen 1 day devices: after sending anything from exposes, the device Last seen becomes right now. All seems to be fine now.

mrespin commented 4 weeks ago

Hi, Is there anything else I can do to help fix the problem? Thank you.

Koenkk commented 3 weeks ago

@mrespin looking at your logs, it seems there are some very spammy devices. This could lead to a DDOS on the network, especially 0xbc026efffea7f1e4 is spamming a lot, try removing it from the network.

mrespin commented 3 weeks ago

Thank you Koen, Your input helps significantly by showing me how to sort the network problems correctly. Moreover, It shows that devices are guilty, not your software, which works perfectly - thank you again. Where to look for suggestions, like you did, in Z2M logs or Wireshark logs? I am a bit confused as this device (0xbc026efffea7f1e4) shows as Last seen one week ago. If I look at MQTT Explorer for spammy devices, will I see suspects (devices that send too often)? How often can a device send data and be treated as a DDOS attack (how many messages per minute will be a safe distinction between a bit-spammy device and possible DDOS)? If I force delete the offline device and block it from pairing again, will it unpair itself when it comes online, or do I need to wait for the device to become online to unpair it (and block it from pairing again to remove the device from the network)? I am grateful for all your help.

Koenkk commented 3 weeks ago

If you look at log2024-08-05.14-40-391.log and search for 0xbc026efffea7f1e4, you will get a lot of results:

info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":50,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.557Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.6}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":50,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.598Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.6}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":50,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.642Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.6}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xa4c1384453a39a76', payload '{"countdown":0,"device":{"applicationVersion":72,"dateCode":"","friendlyName":"0xa4c1384453a39a76","hardwareVersion":1,"ieeeAddr":"0xa4c1384453a39a76","manufacturerID":4417,"manufacturerName":"_TZ3000_tqlv4ug4","model":"TS0001_switch_module","networkAddress":1097,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"last_seen":"2024-08-06T05:52:45.665Z","linkquality":91,"power_on_behavior":"previous","state":"ON","switch_type":"toggle"}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":50,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.687Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.6}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.733Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.6}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.774Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.821Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:45: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:45.866Z","linkquality":94,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:46: MQTT publish: topic 'z2m/0xa4c1385120db76f1', payload '{"countdown":0,"device":{"applicationVersion":72,"dateCode":"","friendlyName":"0xa4c1385120db76f1","hardwareVersion":1,"ieeeAddr":"0xa4c1385120db76f1","manufacturerID":4417,"manufacturerName":"_TZ3000_tqlv4ug4","model":"TS0001_switch_module","networkAddress":14233,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"last_seen":"2024-08-06T05:52:46.214Z","linkquality":91,"power_on_behavior":"on","state":"OFF","switch_type":"toggle"}'
warn  2024-08-06 06:52:47: Failed to ping '0xa4c138298f1a755e' (attempt 1/1, Read 0xa4c138298f1a755e/1 genBasic(["zclVersion"], {"timeout":10000,"disableResponse":false,"disableRecovery":true,"disableDefaultResponse":true,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Data request failed with error: 'No network route' (205)))
info  2024-08-06 06:52:48: MQTT publish: topic 'z2m/0x385b44fffed7ede6', payload '{"current":3.78,"device":{"applicationVersion":69,"dateCode":"","friendlyName":"0x385b44fffed7ede6","hardwareVersion":1,"ieeeAddr":"0x385b44fffed7ede6","manufacturerID":4098,"manufacturerName":"_TZ3000_cayepv1a","model":"TO-Q-SY2-163JZT","networkAddress":3470,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":13.55,"indicator_mode":"off/on","last_seen":"2024-08-06T05:52:48.024Z","linkquality":91,"over_current_breaker":"ON","over_current_threshold":65,"over_voltage_breaker":"ON","over_voltage_threshold":265,"power":840,"power_breaker":"ON","power_outage_memory":"restore","power_threshold":13,"state":"ON","temperature":33.03,"temperature_breaker":"ON","temperature_threshold":100,"under_voltage_breaker":"ON","under_voltage_threshold":75,"update":{"installed_version":-1,"latest_version":-1,"state":null},"update_available":null,"voltage":226}'
info  2024-08-06 06:52:48: MQTT publish: topic 'z2m/0x385b44fffed7ede6', payload '{"current":3.78,"device":{"applicationVersion":69,"dateCode":"","friendlyName":"0x385b44fffed7ede6","hardwareVersion":1,"ieeeAddr":"0x385b44fffed7ede6","manufacturerID":4098,"manufacturerName":"_TZ3000_cayepv1a","model":"TO-Q-SY2-163JZT","networkAddress":3470,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":13.55,"indicator_mode":"off/on","last_seen":"2024-08-06T05:52:48.029Z","linkquality":83,"over_current_breaker":"ON","over_current_threshold":65,"over_voltage_breaker":"ON","over_voltage_threshold":265,"power":840,"power_breaker":"ON","power_outage_memory":"restore","power_threshold":13,"state":"ON","temperature":33.03,"temperature_breaker":"ON","temperature_threshold":100,"under_voltage_breaker":"ON","under_voltage_threshold":75,"update":{"installed_version":-1,"latest_version":-1,"state":null},"update_available":null,"voltage":226}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xa4c13899e4e73e35', payload '{"backlight_mode":"normal","brightness":254,"countdown":0,"device":{"applicationVersion":64,"dateCode":"","friendlyName":"0xa4c13899e4e73e35","hardwareVersion":1,"ieeeAddr":"0xa4c13899e4e73e35","manufacturerID":4417,"manufacturerName":"_TZE200_3p5ydos3","model":"TS0601_dimmer","networkAddress":35344,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"last_seen":"2024-08-06T05:52:50.384Z","light_type":"led","linkquality":91,"max_brightness":254,"min_brightness":20,"power_on_behavior":"off","state":"ON"}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.630Z","linkquality":87,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.670Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.711Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.752Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.798Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.8}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.840Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.7}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.882Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.7}'
info  2024-08-06 06:52:50: MQTT publish: topic 'z2m/0xbc026efffea7f1e4', payload '{"ac_frequency":49.98,"current":0,"device":{"applicationVersion":68,"dateCode":"","friendlyName":"0xbc026efffea7f1e4","hardwareVersion":1,"ieeeAddr":"0xbc026efffea7f1e4","manufacturerID":4098,"manufacturerName":"_TZE200_qhlxve78","model":"SPM01V2","networkAddress":43853,"powerSource":"Mains (single phase)","stackVersion":0,"type":"Router","zclVersion":3},"energy":71.11,"last_seen":"2024-08-06T05:52:50.924Z","linkquality":91,"power":0,"power_factor":100,"produced_energy":0,"voltage":225.7}'

I would remove it from power and see if your network stabilises after that.

mrespin commented 1 week ago

Thank you, Koen, I unpaired three suspected devices and blocked them from joining again. PS: When the problem occurs, many devices spam the network. This seems to be an effect, not the cause. I re-paired most devices (7 did not pair after un-pair). I am still monitoring the issue (I have a Zigbee sniffer connected to a Wireshark running on a VM to keep a log). The network is less busy, but the problem still comes back. I admit there are many Tuya devices (77 now), but the three noisy ones are unpaired from the network. As mentioned, I removed availability in Z2M and see only the "last seen". Removing this future helped recover the network (Z2M restart works), but it is not the solution to my problem. I noticed today (5/9/24) at 8:26+ that the Zigbee network stopped. I changed the coordinator power from 10dBm to 5dBm and restarted both Z2M and the controller—restarting helped. I have attached the log. Can you please help me find out what is wrong? log2024-09-02.12-14-59.log For a test, I changed Last seen to disable (default).

Koenkk commented 1 week ago
mrespin commented 1 week ago

I only have two more files without a start, as I had a rolling log (rolling now disabled). I do not have any external converters. I noticed that I had one device whose Wireshark Zigbee messages speed increased from 10 to 100's per second when connected to power. When the device was disconnected, the rate did not change. To fix this, I need to restart the coordinator and Z2M. After replacing the device (with a brand new one), the network problem returned within minutes. The starting log from the previous restart (the same network) is in the following message.

mrespin commented 1 week ago

The starting log from the previous restart (the same network): log2024-09-02.12-08-52.log

Koenkk commented 1 week ago

Could you see if the crash still happens with the following:

docker exec -it CONTAINER_NAME sh
apk add nano
nano /app/node_modules/zigbee-herdsman/dist/controller/model/device.js
# Change `const isDefaultResponse = frame.header.isGlobal && frame.command.name === 'defaultRsp';` to `const isDefaultResponse = false;`
# save and exit nano
exit
docker restart CONTAINER_NAME
mrespin commented 1 week ago

Thank you Koen, It stopped after 17 minutes after the change was implemented, and docker restarted. Log: log.log

My idea is to split the network, using two instances of the Zigbee2MQTT with two coordinators (on different channels). Am I on the right track?

mrespin commented 1 week ago

I reran it with a debug log. Worked for 1 h 14 min. I have Wireshark as well (without Zigbee decryption). log2.zip

mrespin commented 1 week ago

log2B log2 including Coordinator restart. log3 new log - stopped working after 4 minutes log2B.zip log3.log

Koenkk commented 1 week ago

ooks perfect, with no or minimal background noises and no corruption in the Zigbee network. However, the Zigbee network is busy as hell, with 117 devices talking.

mrespin commented 1 week ago

Thank you Koen,

  1. Logs are always intact; however, sometimes I click something to wake up the device. PS When just start Z2M, most of the devices: Last seen is in days, when click anything that usually changes Last seen to now.
  2. I use SLZB-06P7, which has more RAM.
  3. I use PoE SLZB-06P7 over LAN. I just connected the second coordinator and splitting the network.
  4. I quite often recreate container (to make sure I use the newest Z2M) - easy in Portainer. Do I need to keep your change to the device.js?
mrespin commented 1 week ago

I experienced the issue with both instances after splitting the network. Log4N1 log with the original network Zibee channel 25 Log4N2 log with the second network Zigbee channel 21 log4N1.log log4N2.log

mrespin commented 1 week ago

I redeployed both containers with the latest stable version (before it was Dev) and reapplied your code, and so far it is working ok (50 minutes).

mrespin commented 1 week ago

The problem is still there. Another log Network 1 log5N1.log

mrespin commented 1 week ago

Interestingly, in this Log, the error "failed with status '(0x11: BUFFER_FULL)' (expected '(0x00: SUCCESS)'))" was repeated 155 times. log6N1.log

Koenkk commented 6 days ago

Do you by any chance have another coordinator laying around? (e.g. an USB one). If yes, can you check if that coordinator gives the same behaviour?

mrespin commented 6 days ago

Strangely, as of yesterday, the network is running (it's been 15 hours now). I just wanted to let you know that your fix is still in place. However, a few devices are still not responding. Currently, there are 86 devices in Network 1; 24 devices were last seen in less than a minute, 43 in less than an hour, and 15 in less than a day. When you read the start, I tried swapping three different coordinators (two USBs and one PoE), and they did not make any changes. When it happens again, I will swap between two connected PoE coordinators.

Floyd-Schramm commented 6 days ago

Same problem here. I have Zigbee2MQTT installed in a privileged LXC on a ProxMox host. My SONOFF Zigbee 3.0 USB Dongle Plus (CC2652P + CP2102N) is connected via an extension cord and placed about 1,5m away from the ProxMox host.

Due to the NWK_TABLE_FULL errors on dongle version 20230507 i already updated to a prerelease version of the 20240710 version on the 8th of August. Since then, everything was working perfectly without any issues until a few days ago. Now i have to replug the dongle multiple times a day and restart Zigbee2MQTT to get everything up and running again. After a few hours all devices stop responding and go offline. When trying to do anything i get the error 0x11 BUFFER_FULL

Today i flashed the dongle with the official release of 20240710 -> BUFFER_FULL I also installed version 20221102 like mentioned from @Koenkk in another thread -> No specific error message but devices timeout and are not responding

Here is the log including Zigbee2MQTT start with prerelease version of 20240710, errors with BUFFER_FULL and restarting Zigbee2MQTT. log.log

Floyd-Schramm commented 6 days ago

Like suggested i also flashed another identical Sonoff Zigbee Stick with the same firmware. And again, problem is not resolved like you are able to see in the attached log: log.log

Floyd-Schramm commented 6 days ago

Update: another crash with original Zigbee sick and latest official firmware after almost exactly one hour (start at 15:40 and crash at ~16:36): image

The log files are split into log1-x from folder '2024-09-07.15-39-01': log1-1.log log1-2.log

and log2-x from folder '2024-09-07.15-46-30': log2-1.log log2-2.log log2-3.log

Would you be able to provide a firmware version with a larger buffer for testing @Koenkk ?

Koenkk commented 5 days ago

@Floyd-Schramm your network has a lot of spamming devices, can you first disable the default responses? See https://github.com/Koenkk/zigbee2mqtt/issues/23571#issuecomment-2332495941

Floyd-Schramm commented 5 days ago

@Koenkk I've installed Zigbee2MQTT via a ProxMox helper Script (https://tteck.github.io/Proxmox/#zigbee2mqtt-lxc) and it's not running via Docker. Like mentioned in the comment, I changed the one line in devices.js under /opt/zigbee2mqtt/node_modules/zigbee-herdsman/dist/controller/model/ and restarted Zigbee2MQTT. How can I check if the change was implemented successfully? Here is a current log: log.log

The Zigbee network is now also very slow, and the few devices that are online respond only sometimes, and sometimes report timeouts. With the beta firmware, it was running for weeks and weeks without any issues, new devices or any changes at all, and now suddenly everything falls apart... EDIT: The UI also became very slow and mostly does not respond to mouse clicks for navigating to a device page

mrespin commented 5 days ago

Till now, it works fairly well. Few 0x11: BUFFER_FULL. I changed the baud rate to 460800, and after a day, SLZB-06P7 stopped working. Network 1 returned to the USB SONOFF Plus Dongle (fw:20230507).

mrespin commented 4 days ago

For Network 2, I changed the baud rate to 230400 for the second PoE controller, which stopped working the next morning. Therefore, I changed the baud rate back to 115200. For Network 1, I loaded another USB SONOFF Dongle Plus (fw: 20221226). I reapply @Koenkk patch "disable the default responses". Keep monitoring. Koen, I have over 2GB of Z2M logs and some Wireshark matching data (while the problem started). Would you like them? I got the impression that the Zigbee network increases traffic considerably when the problem happens (as I said before, spammy devices seem to be an effect, not the cause). I rechecked all spammy devices elsewhere, and they are working correctly.

Floyd-Schramm commented 4 days ago

And again... log.log

mrespin commented 3 days ago

The SONOFF USB Dongle Plus (fw: 20221226) works fine. Is there firmware version 20221226 for P7, as Network 2 still freezes (SLZB-06P7)?

Koenkk commented 1 day ago

@Floyd-Schramm can you try the 20221226 firmware?

@mrespin do you get BUFFER_FULL errors or SRSP - AF - dataRequest after 6000ms errors?

Floyd-Schramm commented 1 day ago

@Koenkk Interesting update. I've reinstalled Zigbee in Docker LXC, applied your fix and installed FW 20221226. As I can remember, this did not really fix the problem as the network was very slow and after a few minutes unresponsive. I however noticed that my Canon Wifi printer was in a sort of shut down loop, which I fixed by unplugging the printer completly. This may be absolutly unrelated but after this, the network worked and it now seems to be running sort of smooth. A few devices keep dropping off and a few others where unable to recover from the downtime, so a device reset was necessary. Some of my smart plugs also started spamming messages every second further slowing down the network, which could be fixed by unplugging and plugging them back in. Currently I still see few errors, sometimes even when the initial action did go trough and the light for example did turn on: 'No network route' (205) 'MAC no ack' (233) Timeout I am still trying to find more spammy devices and get all other devices up and running. If you are interested I can provide further logs and zigbee sniffs if you are interested.

mrespin commented 17 hours ago

Network 1 (USB 20221226, 92 devices) works without touching. Since the last restart (increased power from 5dBm to 10), running for 48h, errors:

Since using (USB 20221226), I had no BUFFER_FULL errors or SRSP-AF-dataRequest after 6000ms errors.

FYI, In Network 2 (SLZB-06P7, 20240716, 28 devices), errors in last 48h (increased power from 5dBm to 10): SRSP - AF - dataRequest after 6000ms repeated 292 times, No error BUFFER_FULL. PS @Koenkk patch applied to both Networks.