Koenkk / zigbee-herdsman

A Node.js Zigbee library
MIT License
482 stars 300 forks source link

Check for a changed network address when routing fails in zstack #1183

Closed deviantintegral closed 2 months ago

deviantintegral commented 2 months ago

This follows up from discussion at https://github.com/Koenkk/zigbee2mqtt/discussions/15874#discussioncomment-10542315.

With this PR, when my zbmini changes network addresses herdsman can fix it up. This will work well with availability checks. It only works in response to a command from zigbee2mqtt, and not if you trigger a message by flipping the switch itself. If anyone has ideas on making that work too, I'd love to hear them!

This fails two tests, so it's possible this actually introduces a bug. I'm going to run with this on my local network for a few days and see if anything odd happens.

saveriol commented 2 months ago

This would fix #20045 for z-stack adapters (but the issue is present in ember too so if someone wants to have a look at it, I can test it.)

saveriol commented 2 months ago

I think the tests fails because here the test assumes 6 znp calls but you're adding a call in your PR here. So correctly there are 7 znp calls.

Same for the other failed test.

deviantintegral commented 2 months ago

I fixed that locally, and there's other differences in the fixtures I haven't investigated yet to see if they make sense or are actual regressions.

Koenkk commented 2 months ago

CC @Nerivec for the Ember part

Nerivec commented 2 months ago

Seems like something that should be moved higher up the chain, not duplicated in each adapter (it's a regular ZDO req)?

deviantintegral commented 2 months ago

Seems like something that should be moved higher up the chain, not duplicated in each adapter (it's a regular ZDO req)?

So you're thinking of something like doing this at https://github.com/Koenkk/zigbee-herdsman/blob/1ff1dc77bd5637c05f2d8865cbbb850371c3a645/src/controller/model/endpoint.ts#L292 and perhaps exposing some new public methods on the individual adapter classes?

Nerivec commented 2 months ago

We're going to move the ZDO requests out of adapters, so the nwk address request would become a function of Device and then somewhere in the logic, we'd call it to fix the scenario behind this issue. Seems like something that would be served better as a reaction to an event, instead of forcing it during a failed request. Maybe when a message from an unknown device is received? Do you still have a log from the ZBMini when Z2M doesn't recognize it anymore? To see what it still tries to communicate to Z2M. Can the behavior be replicated identically (with any device) by manually changing a nwk address in the database, then starting Z2M, so the actual address doesn't match what Z2M knows?

saveriol commented 2 months ago

@Nerivec if it's useful, with Vimar devices I can pretty much replicate the issue every time I power them off.

Koenkk commented 2 months ago

@saveriol interesting, could you provide the debug log?

See this on how to enable debug logging.

saveriol commented 2 months ago

@Koenkk yes of course! Here's what I done:

2024-09-15 09:47:06 enabled debug log (log attached here: "log of poweroff") 2024-09-15 09:48:20 powered off all 14592.0 devices (they have "Luce" as prefix in the description) 2024-09-15 09:50:00 powered on all 14592.0 devices 2024-09-15 10:00:00 these devices are marked as offline in Z2M devices tab:

Luce bagno (0x9E45)

Luce specchio bagno (0x81F6)

Luce corridoio (0x33AA)

Luce terrazzo studio (0xA250)

Luce specchio bagno piccolo (0xFEC5)

Luce camera (0x2CAB)

Luce bagno piccolo (0x0E63)

Luce terrazzo sala (0x06F8)

Luce sala TV (0xE23B)

Luce studio (0x553E)

some other 14592.0 are back online after power on:

Luce terrazzo camera (0xB519)

Luce cucina (0x872F)

Luce sala divano (0xC3CE)

2024-09-15 10:04:00 I start the "recovery" procedure of going around the home and toggling devices and looking in the logs for "Data is from unknown device with address" so I can replace the IDs in the database

cp database.db database.db.backup.20240915

#stop Z2M at 10.15.00

#Luce bagno
sed --in-place 's/\"nwkAddr\":40517,/\"nwkAddr\":16349,/' database.db

#Luce specchio bagno
sed --in-place 's/\"nwkAddr\":33270,/\"nwkAddr\":17548,/' database.db

#Luce corridoio
sed --in-place 's/\"nwkAddr\":13226,/\"nwkAddr\":25327,/' database.db

#Luce terrazzo studio
sed --in-place 's/\"nwkAddr\":41552,/\"nwkAddr\":11746,/' database.db

#Luce specchio bagno piccolo
sed --in-place 's/\"nwkAddr\":65221,/\"nwkAddr\":26338,/' database.db

#Luce camera
sed --in-place 's/\"nwkAddr\":11435,/\"nwkAddr\":46119,/' database.db

#Luce bagno piccolo
sed --in-place 's/\"nwkAddr\":3683,/\"nwkAddr\":63743,/' database.db

#Luce terrazzo sala
sed --in-place 's/\"nwkAddr\":1784,/\"nwkAddr\":2131,/' database.db

#Luce sala TV
sed --in-place 's/\"nwkAddr\":57915,/\"nwkAddr\":35491,/' database.db

#Luce studio
sed --in-place 's/\"nwkAddr\":21822,/\"nwkAddr\":41487,/' database.db

#start Z2M at 10.15.35 (log attached here: "log after fix")

2024-09-15 10:16:46 all devices are back online. Downloading both logs log after fix.log log of poweroff.log

Koenkk commented 2 months ago

I think we should do a ZDO network address request by ieeeAddr here: https://github.com/Koenkk/zigbee-herdsman/blob/master/src/controller/controller.ts#L722, need to wait a bit for the ZDO refactor is done, as this request is currently not supported yet (https://github.com/Koenkk/zigbee-herdsman/pull/1187)

Nerivec commented 2 months ago

@Koenkk That's what I was thinking too, but there is the issue of close-by networks using default network config that trigger this codepath like crazy. Obviously, it's a "bad setup" example, but adding the ZDO request on top would add to the useless spamming. Been a bit busy to look into it in more details though. Like you said, the ZDO refactor should make this easier wherever it ends up being triggered, and will fix it for all adapters at once.

Koenkk commented 2 months ago

@Nerivec I was thinking to do it only once for every unknown networkAddress

Nerivec commented 2 months ago

It's not unknown as far as Z2M is concerned though, it's just no longer matching that of the network (no way to know unless we fail a request, then do the ZDO to see if the address changed). Also you need the EUI64 to do the ZDO (which is not available from everywhere in the code). It could be done more efficiently from availability (on offline, send ZDO to confirm if really offline), but that would require it to be enabled for devices with that issue... I'll run some tests once the ZDO refactor is over, see if I can force Z2M to falter like with that ZBMINI.

deviantintegral commented 2 months ago

Excellent, thanks for integrating this into the other PR. I've switched over to the dev branch and will report if I see anything odd from the refactor.