Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0

Some devices randomly become unresponsive - consistent across installations - HA and z2m up to date - Sonoff stick - VM #15944

Closed freaked1234 closed 1 year ago

freaked1234 commented 1 year ago

What happened?

Several different devices in my network become unresponsive at random times. Switches and lights turn on/off delayed or not at all but they still show up as connected with good reception. The z2m log is pretty quiet most of the time.

Previously I had HA running on a RasPi 4 using a ConBee II, and I noticed the same behavior that I am facing on my current setup.

Current setup:
- HA + z2m latest version
- VM with enough resources
- Sonoff USB CC2652P flashed to latest FW, tried with and without extension on USB 2.0 port
- Very strong mesh, 41 devices in total, around 15 routers

z2m config:

homeassistant: true
mqtt:
  server: mqtt://core-mosquitto:1883
  user: xxx
  password: xxx
serial:
  port: /dev/ttyUSB0
frontend:
  port: 8099
advanced:
  homeassistant_legacy_entity_attributes: false
  legacy_api: false
  legacy_availability_payload: false
device_options:
  legacy: false
devices: devices.yaml
groups: groups.yaml

What did you expect to happen?

Ofc devices to react when asked to.

How to reproduce it (minimal and precise)

Start HA. Sometimes it happens immediately, sometimes it takes hours; reboots don't seem to do much.

Zigbee2MQTT version

1.29.0-1

Adapter firmware version

CC1352P2_CC2652P_launchpad_coordinator_20220219

Adapter

SONOFF Zigbee 3.0 USB Dongle Plus ZigBee 3.0 TI CC2652P + CP2102N Coordinator CC2652

Debug log

log(1).txt

sdalu commented 1 year ago

It seems I have the same problem (Sonoff USB CC2652P). It was working fine on 1.28.4, problems occurred once upgraded to 1.29.0

Koenkk commented 1 year ago

Can you provide the herdsman debug log of this?

See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.
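
Because this output only goes to STDOUT, it has to be captured by whatever launches Zigbee2MQTT. A minimal sketch, assuming a bare-metal install that is started with `npm start` and the `DEBUG=zigbee-herdsman*` variable described on the linked debug page; the helper name `capture_herdsman_log`, the log file name and the install path in the example are hypothetical, and Docker or the Home Assistant add-on need their own equivalent from the same page.

```shell
# Hypothetical helper: run a command with herdsman debug logging enabled
# and copy everything it prints (STDOUT and STDERR) into a file.
# $1 = log file to write, remaining arguments = command that starts Zigbee2MQTT
capture_herdsman_log() {
  log_file="$1"
  shift
  # The pattern is quoted so the shell does not expand the `*`.
  DEBUG='zigbee-herdsman*' "$@" 2>&1 | tee "$log_file"
}

# Example (not executed here; the path is an assumption):
#   cd /opt/zigbee2mqtt && capture_herdsman_log herdsman.log npm start
```

The resulting file can then be attached to the issue instead of copy-pasting from the console.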

freaked1234 commented 1 year ago

> Can you provide the herdsman debug log of this?
>
> See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.

Hey, ty for helping. I did not get how to download the full log in my use case (HA OS), so I saved a few copy-pastes to a file, hoping this helps: herdlog.txt. Looks like there is a lot going on, mostly caused by the climate sensors.

edit: what I find most strange is that some devices of the exact same model always seem to work flawlessly while others always act up.

example 1: I use several mains-powered switches to control dumb lights and ceiling fans. While the kitchen lights always work, one of the ceiling fans is among the devices that always act up, even though it is the same switch and they are pretty close.

example 2: I have Zigbee 3.0 LED controllers in all rooms - same model, same FW. While most of them act up when the system enters the "unresponsive state", it is pretty random which of them stops working first.

The kitchen lights (mains dumb switch) are my go-to check to see if the whole network is down, because this switch seems to always work...

trackhacs commented 1 year ago

I think I have the same, just that the whole network gets lost. It has happened since I upgraded from 1.24 to 1.28. Reaction times get slower and slower until nothing happens at all anymore.

Used docker compose to test 1.29 -> same result, except that the web interface is still reachable (with 1.28 the web interface would also get lost).

Tried to activate the herdsman log, but the console just freezes - looks like my little µSSD doesn't cope with that load.

trackhacs commented 1 year ago

It starts to hang on:

zigbee2mqtt  | Zigbee2MQTT:error 2023-01-05 09:57:09: Publish 'set' 'state' to 'powerDesk' failed: 'Error: Command 0x50325ffffe531b17/2 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (sendZclFrameToEndpointInternal error)'
zigbee2mqtt  | Zigbee2MQTT:debug 2023-01-05 09:57:09: Error: Command 0x50325ffffe531b17/2 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (sendZclFrameToEndpointInternal error)
zigbee2mqtt  |     at EZSPAdapter.sendZclFrameToEndpointInternal (/app/node_modules/zigbee-herdsman/src/adapter/ezsp/adapter/ezspAdapter.ts:465:19)
zigbee2mqtt  |     at processTicksAndRejections (node:internal/process/task_queues:95:5)
zigbee2mqtt  |     at Queue.executeNext (/app/node_modules/zigbee-herdsman/src/utils/queue.ts:32:32)

Koenkk commented 1 year ago

@freaked1234 this is not the herdsman debug logging, I also don't see any errors in your log.

See https://www.zigbee2mqtt.io/guide/usage/debug.html on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.
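
To check a saved log for failures like the `Publish 'set' 'state' to 'powerDesk' failed` line quoted earlier in this thread, the error lines can be tallied per device. A sketch assuming that log format; `failed_publishes_by_device` is a hypothetical helper, not part of Zigbee2MQTT.

```shell
# Count "Publish ... failed" error lines per device in a Zigbee2MQTT log file.
# $1 = path to the log file
failed_publishes_by_device() {
  # Keep only error lines, extract the device name between "to '" and "' failed",
  # then count identical names and list the most frequent first.
  grep 'Zigbee2MQTT:error' "$1" |
    sed -n "s/.*Publish '[^']*' '[^']*' to '\([^']*\)' failed.*/\1/p" |
    sort | uniq -c | sort -rn
}

# Example (not executed here): failed_publishes_by_device log.txt
```

Each output line is a count followed by a device name, which makes it easier to see whether the same few devices account for most failures.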

trackhacs commented 1 year ago

I verified that 1.29 completely breaks the Sonoff Stick Plus-E - I am going to open a new ticket.

How is it supposed to look? Like this:

zigbee2mqtt  | Zigbee2MQTT:debug 2023-01-05 10:30:48: Received MQTT message on 'zigbee2mqtt/bridge/request/permit_join' with data '{"device":null,"time":254,"transaction":"oofkg-1","value":true}'
zigbee2mqtt  | Zigbee2MQTT:info  2023-01-05 10:30:48: Zigbee: allowing new devices to join.
zigbee2mqtt  | Zigbee2MQTT:debug 2023-01-05 10:30:48: Received Zigbee message from 'Coordinator', type 'commandNotification', cluster 'greenPower', data '{"data":[25,3,2,11,254,0],"type":"Buffer"}' from endpoint 242 with groupID null, ignoring since it is from coordinator
zigbee2mqtt  | Zigbee2MQTT:info  2023-01-05 10:30:48: MQTT publish: topic 'zigbee2mqtt/bridge/response/permit_join', payload '{"data":{"time":254,"value":true},"status":"ok","transaction":"oofkg-1"}'
zigbee2mqtt  | Zigbee2MQTT:debug 2023-01-05 10:31:45: Received MQTT message on 'zigbee2mqtt/bridge/request/permit_join' with data '{"device":null,"time":254,"transaction":"oofkg-2","value":false}'
zigbee2mqtt  | Zigbee2MQTT:info  2023-01-05 10:31:45: Zigbee: disabling joining new devices.
zigbee2mqtt  | Zigbee2MQTT:error 2023-01-05 10:31:45: Request 'zigbee2mqtt/bridge/request/permit_join' failed with error: 'Connection not initialized'
zigbee2mqtt  | Zigbee2MQTT:debug 2023-01-05 10:31:45: Error: Connection not initialized
zigbee2mqtt  |     at Ezsp.execCommand (/app/node_modules/zigbee-herdsman/src/adapter/ezsp/driver/ezsp.ts:505:19)
zigbee2mqtt  |     at Driver.permitJoining (/app/node_modules/zigbee-herdsman/src/adapter/ezsp/driver/driver.ts:621:26)
zigbee2mqtt  |     at Object.func (/app/node_modules/zigbee-herdsman/src/adapter/ezsp/adapter/ezspAdapter.ts:249:35)
zigbee2mqtt  |     at Queue.executeNext (/app/node_modules/zigbee-herdsman/src/utils/queue.ts:32:42)
zigbee2mqtt  |     at /app/node_modules/zigbee-herdsman/src/utils/queue.ts:21:18
zigbee2mqtt  |     at new Promise (<anonymous>)
zigbee2mqtt  |     at Queue.execute (/app/node_modules/zigbee-herdsman/src/utils/queue.ts:19:16)
zigbee2mqtt  |     at EZSPAdapter.permitJoin (/app/node_modules/zigbee-herdsman/src/adapter/ezsp/adapter/ezspAdapter.ts:234:34)
zigbee2mqtt  |     at Controller.permitJoinInternal (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:274:32)
zigbee2mqtt  |     at Controller.permitJoin (/app/node_modules/zigbee-herdsman/src/controller/controller.ts:234:9)
zigbee2mqtt  | Zigbee2MQTT:info  2023-01-05 10:31:45: MQTT publish: topic 'zigbee2mqtt/bridge/response/permit_join', payload '{"data":{},"error":"Connection not initialized","status":"error","transaction":"oofkg-2"}'

cloudbr34k84 commented 1 year ago

im having lots of issues - https://github.com/Koenkk/zigbee2mqtt/issues/15973

RubenKelevra commented 1 year ago

@freaked1234 did the fix released in 1.29.2 fix your issue? :)

freaked1234 commented 1 year ago

> @freaked1234 did the fix released in 1.29.2 fix your issue? :)

Nope, after installing and rebooting, the issue persists right after startup. I followed the steps to get the herdsman log, but after entering the SSH command I don't know where to find the .txt file.

I really need to find a fix for this issue fast or else I will have to exchange my whole setup. At this point HA and z2m make my life worse and it is a constant annoyance. All my lights and a lot of other devices are controlled via zigbee and at this point I need to take my phone out and do 5 manual clicks just to dim my living room lights and I need at least 3 manual clicks to turn them off...

I am willing to pay 50€ to anyone who can help me fix this!

edit: i think i finally got the log: log.txt

Main issue description again:
- "turn on livingroomlight" --> HA shows lights are on; lights remain off
- I need to switch lights off, wait 5 sec, switch them on again and hopefully it works
- Same for dimming
- It is only a few switches that work reliably

freaked1234 commented 1 year ago

(did not edit so new info gets noticed) At least I got some good news - after the update the reported issues seem to happen less often, so I would guess that you are on the right track. Sadly it still happens at a frequency that can't be overlooked.

Within the first 15 min after a fresh reboot it is buggy and laggy as ever, but then it gets better. I also noticed that HA uses a lot more RAM than it used to. I remember it settling at around 3 GB, but now it was using 100% of the 4 GB I allocated. I added 2 GB and some CPU power now. (Glad I am not using a RasPi anymore.) I will update when I know the new RAM usage.

New log after improvement here.

log.txt

freaked1234 commented 1 year ago

Final status after update: The VM settles in at 2.8 GB RAM now that I allocated 6 GB, even though after the first 2 reboots it hit 100% load at 4 GB... The improvement is definitely coming from the update. Before, I had a 10% chance that lights would turn on/off at first try. Now it is at 50-60%. As you see, it is still too bad to live with, and I either need a fix or have to swap my complete setup for something that doesn't make my life worse, BUT at least it looks like you are on the right track. I still offer 50€ for a solution!!

fresnoboy commented 1 year ago

I was also seeing the same issue with some devices periodically becoming unreachable and timing out with the 1.29.x releases. Going back to 1.28.4 worked just fine however. 1.29.2 was better than 1.29.1, but still had issues.

freaked1234 commented 1 year ago

@Koenkk Any official news on this issue? Every day I come closer to just trash this stupid stuff... at this point it is nothing more than a nuisance. Replacing 50 switches/sensors may not be cheap but i see no other option. Just now I was sitting in the bedroom, reading a book, when the lights turned off and would not turn back on. So I had to stumble my way across the room to get to my mobile just to play around with the switch for one minute straight just to turn the lights back on! fuck this! I am so over it....

Is there any recommendation? Get a different stick? Different firmware? Anything at all?

fresnoboy commented 1 year ago

@Freaked1234 Reverting to 1.28.4 worked just fine to restore stability. Is it still having an issue after reverting?

freaked1234 commented 1 year ago

@fresnoboy problem is that I just made a clean install when I upgraded to a VM from my Pi a few weeks ago, so I dont have a Backup that far back. I searched the web and it seems that there is no reasonable way to install older versions on HA OS as far as I can see. If there is I would like to try!

fresnoboy commented 1 year ago

@freaked1234 This is not really a z2m issue. Best to cover this in the HA forums. Depending on your HA environment, you just need to do a github pull for version 1.28.4 instead of latest, but the specifics really matter depending on how you installed it.

cloudbr34k84 commented 1 year ago

> @Koenkk Any official news on this issue? Every day I come closer to just trash this stupid stuff... at this point it is nothing more than a nuisance. Replacing 50 switches/sensors may not be cheap but i see no other option. Just now I was sitting in the bedroom, reading a book, when the lights turned off and would not turn back on. So I had to stumble my way across the room to get to my mobile just to play around with the switch for one minute straight just to turn the lights back on! fuck this! I am so over it....
>
> Is there any recommendation? Get a different stick? Different firmware? Anything at all?

I'm hearing you. I have 140 devices, and devices are just dropping on and off the network. Also, my network table is full.

In the process of trying to re-pair my devices in a better order.

tracetechnical commented 1 year ago

@freaked1234 agreed. I'm actively trying to move away from z2m to something more stable; it is bringing nothing but issues right now.

cloudbr34k84 commented 1 year ago

> @freaked1234 agreed. I'm actively trying to move away from z2m to something more stable; it is bringing nothing but issues right now.

What are you moving to?? Zha??

tracetechnical commented 1 year ago

> > @freaked1234 agreed. I'm actively trying to move away from z2m to something more stable; it is bringing nothing but issues right now.
>
> What are you moving to?? Zha??

Still trying to find something at the moment. May even write my own barebones solution with the way things are going. I'm not using HomeAssistant, so won't be using ZHA.

Z2M is great when it works, atrocious when it doesn't.

Koenkk commented 1 year ago

@freaked1234 looking at your logs I can suggest two possible improvements:

sdalu commented 1 year ago

I upgraded the Z-Stack firmware to 20221226; now I seem to have the same problem when running z2m 1.28.4, which was working fine before with 202202xx.

I upgraded z2m to 1.30.0 and I still get unresponsive devices, for example a Philips Hue remote.

fresnoboy commented 1 year ago

@sdalu Did you change the device TX power when you changed the firmware? The older version had no ability to set TX power. If TX power is set too high, devices will think they can hear the coordinator, but the coordinator cannot hear the device. Even though the Sonoff-P can go to 20 dBm, I keep mine at 5 dBm, so the stick is forced to use relay paths instead of going direct. That resulted in a more reliable mesh for me.

sdalu commented 1 year ago

@fresnoboy My house is not that big, and everything was working fine before. But I will try to lower TX power to 5 dBm and see what happens. Should I also enable CTS/RTS?

fresnoboy commented 1 year ago

@sdalu I don't have CTS/RTS checked. I seem to remember the hardware supporting it but not the driver, but that may have changed since they first shipped. The Z-Stack page seems to indicate you need to flip 2 DIP switches on the PCB to enable it.

freaked1234 commented 1 year ago

@Koenkk I installed the latest z2m update and the new coordinator FW and the whole network seems more stable now.

After still having issues with my LED controllers, I installed a switch that power-cycles one of my (GLEDOPTO) controllers once a day. For now (it has just been a short time though) everything seems to work flawlessly except the LED controllers that don't get a daily reset. They are very common RGBCWWW 3.0 LED controllers in a black case that are sold under different names (GL-C-008P).

2 Questions:

Will report back within a few days when I know whether the network remains stable and the power cycle solves the unresponsiveness.

Koenkk commented 1 year ago

@freaked1234 If I remember correctly, the Gledopto light controllers are not of very good quality. They use the famous CC2531 chip, which cannot handle large networks (it will crash). As Zigbee is a mesh, this impacts your network performance. I can recommend the TuYa LED controllers, e.g. https://www.zigbee2mqtt.io/devices/TS0501B.html

> Is there a way to limit the update rate on the temp/hum/co2/voc sensors to reduce the spam?

No (not for these TuYa devices)

freaked1234 commented 1 year ago

@Koenkk OK replacing the climate sensors is not a big deal but for now the one LED-controller that gets power cycled once a day seems to work fine, so I am not willing to invest another few hundred € to replace all controllers. Just opened up a spare controller and it contains a ZS3L module so as far as I understand this should already be the improved version. https://developer.tuya.com/en/docs/iot/zs3l?id=K97r37j19f496

If there is a way to reset the devices using z2m, I could try and report back. Since 24V RGB CCT controllers are pretty expensive as is, this could help many others. (May I also suggest adding your info on these devices to the device description on the z2m website?)

Koenkk commented 1 year ago

@freaked1234 you can try by sending to zigbee2mqtt/DEVICE_ID/set payload {"factory_reset": ""}

freaked1234 commented 1 year ago

@Koenkk when I do this and switch to the z2m UI tab, I get a red error pop-up: No converter available for 'factory_reset' ("")

Koenkk commented 1 year ago

That's strange, this converter should be available for all devices out-of-the-box. Alternatively you can trigger it via the z2m frontend -> device -> dev console:

[Image: Screenshot 2023-02-05 at 15 48 13]

freaked1234 commented 1 year ago

@Koenkk Ty, that seems to work and for now it also seems to bring back the controllers to a responsive state. Need to test for a few days though.

Any idea how I could schedule this, since the command doesn't seem to work? I'd have to manually input this for 6 devices once a day.

this is what I put in terminal: mosquitto_pub -t zigbee2mqtt/0x5c...../set´ -m{ "factory_reset": "" }´

this is what I put in "dev-tools" --> "services":
- Service: MQTT:Publish
- Topic: zigbee2mqtt/0x5c0272fffefff5d9/set
- [x] payload: { "factory_reset": "" }

fresnoboy commented 1 year ago

@freaked1234 If the issue is the LED strips controller is only unreliable when you have a large number of zigbee devices on the PAN, why not run 2 different zigbee networks, one for the LED strips and any needed repeaters, and another for all the other zigbee devices? I did this, even on the same zigbee channel, during a conversion from deconz to z2m, but you could run two different z2m instances on 2 different devices, or even on the same device with some tweaks. Or run zha on one and z2m on the other.

You'd need two different radios of course - but the sonoff units are pretty cheap.

Koenkk commented 1 year ago

@freaked1234 to trigger this via mqtt publish the payload: {"cluster": 0, "command": 0} to zigbee2mqtt/DEVICE_ID/set
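
For several controllers, this publish can be wrapped in a small script and run from a scheduler such as cron. A sketch, assuming `mosquitto_pub` from the Mosquitto clients is installed and the broker accepts the connection without further options; `reset_devices`, `MQTT_PUB`, `MQTT_HOST` and the placeholder device IDs are hypothetical. The topic and payload are exactly the ones given in the comment above.

```shell
# Command used to publish; set MQTT_PUB=echo for a dry run.
MQTT_PUB="${MQTT_PUB:-mosquitto_pub}"
# Broker host; an assumption, adjust to your setup.
MQTT_HOST="${MQTT_HOST:-localhost}"

# Publish the payload from the comment above to every device given as an argument.
reset_devices() {
  for device in "$@"; do
    "$MQTT_PUB" -h "$MQTT_HOST" \
      -t "zigbee2mqtt/${device}/set" \
      -m '{"cluster": 0, "command": 0}'
  done
}

# Example (not executed here), with placeholder device IDs:
#   reset_devices DEVICE_ID_1 DEVICE_ID_2
```

Called once a day from cron or an HA automation, this would replace typing the command by hand for each of the 6 devices; brokers that require authentication would additionally need the `-u` and `-P` options of `mosquitto_pub`.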

freaked1234 commented 1 year ago

Update: OK, the latest updates to z2m and coordinator FW seem to have fixed most issues. I also found out that power cycling my LED controllers every other day helps to keep them responsive. Would be great if you could find a fix so I don't need to have 6 extra switches to do the power cycling though. On top of that, I am sure that there are lots of people out there using the same LED controllers and having the same issues without realizing what the matter is.

Koenkk commented 1 year ago

@freaked1234 I cannot fix that the Gledopto controllers crash; this is an issue of the device firmware, not of z2m. Other users might have less traffic going through their Gledopto devices, so they might not trigger this issue.

tracetechnical commented 1 year ago

@Koenkk and there is no way to blacklist a router device or otherwise provide route weighting?

Koenkk commented 1 year ago

@tracetechnical no that is not possible in Zigbee. The hop (router) decides what the next path is.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days

maxime1992 commented 1 year ago

Up

VincentSC commented 1 year ago

Maybe related: every time node-red-contrib-zigbee2mqtt requests an updated state from Z2M, the system becomes unresponsive for a minute. I've written down my experiences and observations in https://github.com/andreypopov/node-red-contrib-zigbee2mqtt/issues/114

It seems that Z2M queues up all kinds of checks triggered by an external call (Node-Red or HA). And the delay gets larger with more (problematic) devices around.

maxime1992 commented 1 year ago

:wave:

VincentSC commented 1 year ago

What I found out it is a combination of:

I've improved it, by moving all wifi-devices away from the coordinator, removing all devices I did not use for 2 months, and moving to a much faster device.

What I also found is:

Solutions I still want to try:

VincentSC commented 1 year ago

Rereading the initial issue. I think I have a separate issue. Sorry for hijacking :(

tracetechnical commented 1 year ago

@VincentSC i think i have a similar issue to you, so when you raise your issue, please link it here. I'm having a similar issue with occasional slowness, but have just realised my High-gain zigbee antenna on my co-ordinator is very near a wireless AP....

VincentSC commented 1 year ago

Ok, I will leave my research notes here until it's very clear that these are different issues.

@tracetechnical Do you use the same plugin on nodered? If not, can you share some statistics of your environment/setup?

VincentSC commented 1 year ago

Big progress. It was the 2.4 GHz Wi-Fi. As almost everything in our house is Wi-Fi 5 and 6, I put it to channel 1.

So a channel change (without needing to re-pair everything) would be really helpful. Not sure why it's on channel 11 by default.
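
Why Wi-Fi channel 1 and Zigbee channel 11 collide can be checked with a little arithmetic. A sketch, assuming the standard center frequencies (Zigbee channel k in 11..26 at 2405 + 5 * (k - 11) MHz, 2.4 GHz Wi-Fi channel n in 1..13 at 2407 + 5 * n MHz) and channel widths of roughly 22 MHz for Wi-Fi and 2 MHz for Zigbee; real-world interference also depends on signal strength and distance, and the function names are hypothetical.

```shell
# Center frequency in MHz of Zigbee channel $1 (11..26).
zigbee_center_mhz() { echo $(( 2405 + 5 * ($1 - 11) )); }
# Center frequency in MHz of 2.4 GHz Wi-Fi channel $1 (1..13).
wifi_center_mhz() { echo $(( 2407 + 5 * $1 )); }

# Succeeds (exit 0) if Wi-Fi channel $1 overlaps Zigbee channel $2.
channels_overlap() {
  wifi=$(wifi_center_mhz "$1")
  zigbee=$(zigbee_center_mhz "$2")
  distance=$(( wifi - zigbee ))
  if [ "$distance" -lt 0 ]; then distance=$(( -distance )); fi
  # Half of 22 MHz plus half of 2 MHz: centers closer than 12 MHz overlap.
  [ "$distance" -lt 12 ]
}

# Print the Zigbee channels that Wi-Fi channel $1 does not overlap.
free_zigbee_channels() {
  for ch in 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26; do
    if ! channels_overlap "$1" "$ch"; then printf '%s ' "$ch"; fi
  done
  printf '\n'
}
```

Under these assumptions `channels_overlap 1 11` succeeds (2412 MHz vs. 2405 MHz), while `free_zigbee_channels 1` lists channels 15 through 26, which is consistent with the Wi-Fi on channel 1 disturbing a Zigbee network left on channel 11.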
