Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0

Ember not working properly #23096

Open Onepamopa opened 2 months ago

Onepamopa commented 2 months ago

What happened?

Long story short - I flashed the most-recent firmware for the coordinator (zbdongle-e) - https://github.com/darkxst/silabs-firmware-builder/tree/main/firmware_builds/zbdonglee --> ncp-uart-hw-v7.4.2.0-zbdonglee-230400.gbl

Before you start with "you should've flashed 115200" - I already did, I had the same problems I'm having @ 230400. At the moment 230400 is running with "ezsp" and baudrate set to 230400 without any problems - all devices / routers are in the network, meshed, talking, etc.

As soon as I switch to ember (that's the only config change I'm doing) - half the routers aren't linking, most devices aren't linking as well, meshing is ... well non-existent (at least not on the map).

Here's the map @ ezsp: image

And here's the map @ ember (after 15 hours uptime): image

As you can see - 3 of the sockets (routers) aren't even linking, there's no meshing, half the end-devices aren't linking as well.

As soon as I switch back to ezsp - everything starts working perfectly. And just to clarify - the only change I'm making in the config is ezsp <-> ember. One works, the other doesn't.
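
For readers following along, the single setting being toggled lives under `serial` in Zigbee2MQTT's `configuration.yaml`. A minimal sketch, assuming the dongle shows up at `/dev/ttyUSB0` (the port path is a placeholder, not taken from this report); the baudrate matches the 230400 firmware build mentioned above:

```yaml
serial:
  # Placeholder port; substitute the actual device path (assumption, not from the report)
  port: /dev/ttyUSB0
  # The only value changed between the two tests: ezsp (works) vs ember (does not)
  adapter: ember
  # Must match the baudrate the flashed firmware was built for
  baudrate: 230400
```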

What did you expect to happen?

I expected ember to at least work as ezsp (or better) - it didn't.

How to reproduce it (minimal and precise)

I don't know if you could reproduce it... No idea what to tell you.

Btw, you can ignore these 2 devices: "Door Sensor" and "T&H" - they are not here at the moment.

Zigbee2MQTT version

1.38.0

Adapter firmware version

7.4.2 [GA]

Adapter

zbdongle-e

Setup

Addon on home assistant (HA in a Proxmox VM with plenty of resources).

Debug log

Due to "There was an error creating your Issue: body is too long, body is too long (maximum is 65536 characters). " - I'm adding the debug log as a file.

z2m.txt

Nerivec commented 2 months ago

From the couple of lines in your log file (it seems you copied the content of the issue into the file; it only has a couple of log lines), it looks like you are affected by https://github.com/Koenkk/zigbee2mqtt/issues/22453. See if some of the procedures given by other users can help in your case. But it's a weird one... Since you don't have many devices, I suggest you reset Z2M and re-pair all devices to see if that works better. (You can do that by setting the network configs to GENERATE; a brand new network will then be created automatically on the next start. See the docs.)

advanced:
  pan_id: GENERATE
  ext_pan_id: GENERATE
  network_key: GENERATE
  # set to the best channel for your environment
  channel: 25

Onepamopa commented 2 months ago

@Nerivec I don't see anything encouraging in 22453. People switching back to ezsp.

Here's my 2 cents on the subject - don't deprecate ezsp until you are certain ember works.

Nerivec commented 2 months ago

Look into the earlier comments; people tried various things to help debug the issue. It seems some had "broken" networks that mostly worked on ezsp but won't on ember, which requires creating a new network. Others appear to have been affected by interference (at the coordinator), and fixing just that got rid of the issue entirely... Unfortunately, the wide variety of fixes and the fact that ezsp lacks some related implementations (it just "looks" like switching back fixes the issue) don't help narrow things down.

ember works well for most. It's been running for months without ever doing so much as a timeout in my own network. I also get regular feedback from some very large networks... This bug is very annoying indeed, but since neither darkxst nor I can reproduce it, it makes it near impossible to debug further until someone affected can really tackle it (would need to sniff, and debug the firmware to see what's going on)... In the meantime, ezsp is deprecated but it's not going anywhere, it just won't be the focus of updates anymore.

7floor commented 2 months ago

@Nerivec my 2 cents:

In my case, there is no "broken network", as I created a fresh new network with ember and it doesn't work. I mean, I just got my first ever Sonoff Dongle-E, flashed it with 7.4.3 and installed Z2M 1.39.0 for the first time ever, configured as ember. Then I paired a few Tuya thermometers, and all worked at the beginning. Then after some time I started getting those "Broadcast failed" errors, devices would not report, and after removal I could not re-pair them anymore. Switching to ezsp resolved all my issues.

My setup is: HASS OS on RPi 4, Z2M v1.39.0 as an addon, Sonoff Dongle-E 7.4.3.0, no hw flow control.

Nerivec commented 2 months ago

@7floor That's my own "live" setup you seem to have, except I'm using an Intel NUC instead of the Pi and I'm still on 7.4.1 (not much difference between the two, though, judging from the silabs notes). Any chance you can try the latest edge version? Concurrency was implemented, so it should be a lot better at handling spam, and there was also a lot of rework derived from v8 support.

Latest reports I've seen on the broadcast issue seem to be centered around the same scenario "it worked at first, and then began having issues after some time", which suggests something is slowly breaking down somewhere, likely around the adapter, something that ezsp must not be implementing.

If the edge version doesn't help, and you have some time to spare to test a few things, find me on Discord; you can DM me from the zigbee2mqtt server.

7floor commented 2 months ago

@Nerivec I have no idea how I can try it, since like I said I'm on HASSIO with official addon https://github.com/zigbee2mqtt/hassio-zigbee2mqtt

Update: Never mind, I've installed the edge version of hassio addon and switched to ember. Will watch it for some time then let you know.

Onepamopa commented 2 months ago

> @Nerivec I have no idea how I can try it, since like I said I'm on HASSIO with official addon https://github.com/zigbee2mqtt/hassio-zigbee2mqtt

We keep telling them there's a problem, they keep telling us "it works on my live setup" ......

7floor commented 1 month ago

@Nerivec OK, here are some results. It's important to note I have a very simple network, as I'm just starting with Z2M and experimenting. I only added 5 Tuya thermo-hygrometers lying on the table 3 meters from the coordinator. I've been running the edge addon with ember for a couple of days (since you suggested doing so) and it was absolutely fine. Not a single warning in the logs, and devices were reporting consistent readings. I even thought that the issues were resolved in the edge version, until... I started moving my devices around. One of the thermometers went to the basement (poor link quality) and stopped reporting. I brought it back near the coordinator but it didn't start reporting. I reset it (normally these thermometers report immediately on reset), but this time it didn't. I decided to re-pair it and opened the network, and immediately got those broadcast errors. The device disappeared from the network and could not be paired anymore.

So my conclusion is that the ember stack is somehow not tolerant of bad communication with devices when link quality is low.

Nerivec commented 1 month ago

That kind of network-level move happens without the involvement of Z2M (the Zigbee network, i.e. the routers and coordinator, deals with this on its own). How long did you let the device sit in the new location before moving it again? If it had to move through the mesh to find a new route to the coordinator, it can take some time to settle down (it's technically pretty fast, but some devices have been known to be really bad at this). I assume these were end devices you moved, from your description. Note that if you move a router like this, that can seriously impact the network and it may take a lot longer to settle down (a couple of hours usually, as the network is forced to heal itself).

What do you mean by "reset"? Just pushing the button to wake it up, or something specific to that device model?

The behavior with the re-pair however may indicate something about that broadcast issue. I'll try to replicate your exact procedure, see if I can finally reproduce the error. Can you share the logs from these tests (debug level hopefully)?
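
Debug-level logging is set under `advanced` in `configuration.yaml`. A minimal sketch, assuming the rest of the file is left unchanged; `log_level` is the key used by Zigbee2MQTT 1.x:

```yaml
advanced:
  # Emit debug-level logs; enable this before reproducing the move / re-pair steps
  log_level: debug
```

Enabling it before repeating the procedure means the log covers the moment the broadcast errors first appear.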

7floor commented 1 month ago

The device was at the "bad" location for about 45 minutes until I realized it hadn't reported for too long. And there are no routers whatsoever in my network, just a coordinator and 5 end devices of the same type (battery-powered Tuya thermometers). So no mesh and no topology changes here; it's just a star topology of 5 devices connected to the coordinator. As for the reset, yes, it's just a short press on the button to wake it up. They usually report immediately (but not this time).

I have no logs unfortunately.