Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0
11.34k stars 1.62k forks source link

Joining new devices fails after a few days of uptime #3177

Closed sjorge closed 3 years ago

sjorge commented 4 years ago

Bug Report

What happened

I tried to join a new device, it failed.

The network/coordinator usb has been up for a about 1-1.5days now, joining new devices fail. Physically unplugging and replugging the usb (with z2m stopped ofcourse) re allows joining of new devices.

I did a capture of one of those failed join requests: https://pkg.blackdot.be/extras/zigbee/failed_join.pcapng

Not sure where it fails afterwards, it looks like z2m doesn't respond to the update device at all?

What did you expect to happen

The device to be able to join.

How to reproduce it (minimal and precise)

I can always reproduce this after a about 1-3 days after plugging in/out the USB device (rebooting the server does not help)

Debug Info

zigbee2mqtt version: 1.12.0-dev (commit #d52d520) CC253X firmware version: {"type":"zStack12","meta":{"transportrev":2,"product":0,"majorrel":2,"minorrel":6,"maintrel":3,"revision":20190608}}

sjorge commented 4 years ago

While I'm waiting a bit to test to see if it is fixed, Tho85 on discord seems to have a similar problem and that stated after pairing a Xiaomi Light Sensor...

Which I was also using up until a week ago, I wonder if the spammy nature of the sensor plays a part in breaking the Z-Stack firmwares timer or something.

sjorge commented 4 years ago

https://pkg.blackdot.be/cores/zigbee/srctr_pair_fai.pcapng

The changes didn't help, the bulb gets kicked out immediately.

Koenkk commented 4 years ago

Please try with this firmware: CC2531ZNP-Prod_source_routing_20200530.hex.zip

It won't fix the issue, but during joining it will send leave request to non-existing addresses, this will help me to determine where in the code things go wrong. After flashing first check if you are still able to join a device (to make sure I didn't mess up).

sjorge commented 4 years ago

I just realized something... I don't think unplugged the coordinator after applying the patch... Would that have a possible effect on the result?

As it might have already set the timer, it expired. The new join doesn't set the timer but it may have had an expired timer value lingering around?

I'll try to flash the stick somewhere next week.

Koenkk commented 4 years ago

@sjorge yes, that could definitely cause an issue. So please make sure to do this. Nevertheless still try with that new firmware as it will provide extra information in case it still fails.

sjorge commented 4 years ago

New firmware flashed, now we wait a bit and try too pair. So if I understood correctly you added the equalivant of a debug print to the firmware bu sending a remove to a particular unique set of non existing devices in different paths of the code?

sjorge commented 4 years ago

The firmware is not very responsive...

zigbee2mqtt:error 2020-05-31 19:14:13: Failed to ping 'bedroom/bed_lamp/bulb'
zigbee2mqtt:info  2020-05-31 19:14:13: MQTT publish: topic 'zigbee2mqtt/bedroom/desk_lamp/remote', payload '{"battery":74,"linkquality":42,"update_available":false,"click":"off"}'
zigbee2mqtt:info  2020-05-31 19:14:15: MQTT publish: topic 'zigbee2mqtt/bedroom/desk_lamp/remote', payload '{"battery":74,"linkquality":42,"update_available":false,"click":"on"}'
zigbee2mqtt:info  2020-05-31 19:14:27: MQTT publish: topic 'zigbee2mqtt/masterbedroom/lamp/bulb2/availability', payload 'online'
zigbee2mqtt:info  2020-05-31 19:14:27: MQTT publish: topic 'zigbee2mqtt/masterbedroom/lamp/bulb1/availability', payload 'online'
zigbee2mqtt:info  2020-05-31 19:14:28: MQTT publish: topic 'zigbee2mqtt/masterbedroom/lamp', payload '{"state":"OFF","brightness":254}'
zigbee2mqtt:info  2020-05-31 19:14:28: MQTT publish: topic 'zigbee2mqtt/masterbedroom/lamp/bulb2', payload '{"state":"OFF","linkquality":47,"update_available":false,"brightness":254}'
zigbee2mqtt:info  2020-05-31 19:14:28: MQTT publish: topic 'zigbee2mqtt/masterbedroom/lamp', payload '{"state":"OFF","brightness":254}'
zigbee2mqtt:info  2020-05-31 19:14:28: MQTT publish: topic 'zigbee2mqtt/masterbedroom/lamp/bulb1', payload '{"update_available":false,"state":"OFF","linkquality":31,"brightness":254}'
zigbee2mqtt:error 2020-05-31 19:14:35: Failed to read state of 'masterbedroom/lamp/bulb1' after reconnect
zigbee2mqtt:error 2020-05-31 19:14:36: Failed to read state of 'masterbedroom/lamp/bulb2' after reconnect
zigbee2mqtt:info  2020-05-31 19:14:48: MQTT publish: topic 'zigbee2mqtt/bathroom/lamp/bulb/availability', payload 'online'
zigbee2mqtt:info  2020-05-31 19:14:49: MQTT publish: topic 'zigbee2mqtt/bathroom/lamp', payload '{"state":"OFF","brightness":254,"color_temp":400,"color":{"x":0.4599,"y":0.4106}}'
zigbee2mqtt:info  2020-05-31 19:14:49: MQTT publish: topic 'zigbee2mqtt/bathroom/lamp/bulb', payload '{"state":"OFF","linkquality":42,"update_available":false,"brightness":254,"color_temp":400,"color":{"x":0.4599,"y":0.4106}}'
zigbee2mqtt:info  2020-05-31 19:14:58: MQTT publish: topic 'zigbee2mqtt/masterbedroom/repeater/availability', payload 'online'
zigbee2mqtt:info  2020-05-31 19:14:59: MQTT publish: topic 'zigbee2mqtt/masterbedroom/repeater', payload '{"linkquality":42,"update_available":false}'
zigbee2mqtt:error 2020-05-31 19:15:12: Failed to read state of 'bathroom/lamp/bulb' after reconnect

About 1/10 changes mode actually make it through

sjorge commented 4 years ago

https://pkg.blackdot.be/cores/zigbee/errors.pcapng

There is one near packet 330 but a few before that, but they scrolled by to fast. Note, this is not during pairing... this is just existing devices dropping and failing to rejoin after the device announcement... going to refresh the normal source routing firmware tomorrow morning as nothing is working atm.

Got my issue tabs messed up.

sjorge commented 4 years ago

Back on the vanilla source routing firmware, network was back to normal/stable after ~5min or so.

Still have the diff to controller.js applied so 🤞 the first time it failed because I had not rest the stick, we'll know in a day or 5 I guess.

sjorge commented 4 years ago

Tried to pair a device today, same issue :( so the change to non timer version didn't seem to help. Adaptor was reset when I reflashed normal src routing firmware when the debug one broke everything.

Koenkk commented 4 years ago

Can you sniff the traffic of the debug one? (just to see where it fails)

sjorge commented 4 years ago

https://pkg.blackdot.be/cores/zigbee/errors.pcapng was captures with the debug firmware when it was locking up. Which was almost immediately after a unplug, plug, start z2m

Koenkk commented 4 years ago

It seems that my changed indeed didn't work, I will try to figure out an alternative way to do debugging.

FaBjE commented 4 years ago

It seems that my changed indeed didn't work, I will try to figure out an alternative way to do debugging.

Is USART 0 or USART 1 still free in the software? It would require a little soldering, but I believe you can route the Rx & Tx to the (tiny) extension header on the side of the CC2531 USB Dongle (Table 7-1. Peripheral I/O Pin Mapping) and the "Additional test pins" from the schematic If you connect a FTDI cable to that you can do debug printing from the dongle.

It is a very common way to debug embedded software, so it would surprise me if there isn't something in the TI software stack for that already.

sjorge commented 4 years ago

Soldering is a bit beyond my skill level... based on my succes rate.

~ sjorge

On 7 Jun 2020, at 11:52, FaBjE notifications@github.com wrote:

 It seems that my changed indeed didn't work, I will try to figure out an alternative way to do debugging.

Is USART 0 or USART 1 still free in the software? It would require a little soldering, but I believe you can route the Rx & Tx to the (tiny) extension header on the side of the CC2531 USB Dongle (Table 7-1. Peripheral I/O Pin Mapping) and the "Additional test pins" from the schematic If you connect a FTDI cable to that you can do debug printing from the dongle.

It is a very common way to debug embedded software, so it would surprise me if there isn't something in the TI software stack for that already.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or unsubscribe.

FaBjE commented 4 years ago

If it is a feasible solution to help Koenkk debugging I can do some soldering and we can arrange some postage/shipping scheme

Koenkk commented 4 years ago

@FaBjE what is required for this? Iar workbench only?

@sjorge I've modified the firmware a bit, hopefully it doesn't crash now: CC2531ZNP-Prod_source_routing_20200609.zip

FaBjE commented 4 years ago

@Koenkk Just some code. Compiler has nothing to do with it. (I'm well familiar with IAR, Not so much with the TI Z-Stack)

For debug printing, you just configure the uart in code at 115200 or anything. and put printf statements in the code. https://e2e.ti.com/support/wireless-connectivity/zigbee-and-thread/f/158/t/169089

The processor pins output uart at TTL level, so you need a converter from TTL to pc, something like this is the easiest to use. https://www.antratek.nl/ftdi-usb-to-serial-ttl-cable-3-3v-ttl

This requires a little soldering on the CC2531 usb dongle pcb as it doesn't have a ready-to-use connector for it, But it has the appropriate processor pins on the pinheader pads at the side.

Than you just open a COM terminal on your PC an can see what is going on in the software ;)

Koenkk commented 4 years ago

@FaBjE thanks for sharing this, very interested in this and will probably will be useful in the future.

FaBjE commented 4 years ago

@FaBjE thanks for sharing this, very interested in this and will probably will be useful in the future.

Happy to help :) If you need any assistance integrating this, let me know.

sjorge commented 4 years ago

I’ll flash the new firmware this weekend

sjorge commented 4 years ago

Flash the firmware but immediately the network fell apart.

I will upload a small capture later where I use a remote to toggle a group of bulbs, then try to change it via mqtt, then try to change a single bulb. Then turn them back on via the remote.

Going to reflash and get everything back up and running first.

sjorge commented 4 years ago

https://pkg.blackdot.be/cores/zigbee/bad2.pcapng

I think I know what is causing the issue though, after every device announce from an actually device, e.g. the hue motion sensor. The firmware sends a flood of debug announcements and I think that trips up the network... so the device never rejoins.

I also noticed the ikea remotes don't send a announcement when I push a button so probably why they work fine. The first time A bulb sends some data (attReport) it also announces itself again but it gets met with a flood of messages and it never rejoins.

I flashed the std source routing firmware again and only the devices that had tried to rejoin, had issue. The once that had not yet send a message were fine.

It might slower but perhaps we could toggle the led to green and red at 2 most likely points where you think the issues is and turn it of at startup.

It's a lot less info to go on, but it would not send zigbee traffic that could cause other devices to be unhappy?

Koenkk commented 4 years ago

I don't understand yet why it doesn't work, your sniff doesn't contain any messages send from the coordinator, probably the code crashes or something.

Anyway, based on the idea of @FaBjE I found a way to output debug logging to zigbee-herdsman. Try with this firmware: CC2531ZNP-Prod_20200614.zip , now run zigbee2mqtt with DEBUG=zigbee-herdsman:adapter:zStack:znp:AREQ npm start and when pairing a device you will see this logging:

image

sjorge commented 4 years ago

I'll reflash later today.

So I just run z2m as normal, once I noticed joining no longer works I stop and start it manually with the debugging? (given only a znc reset or physical unplug/replug fixes it, and a z2m restart has no effect)

Koenkk commented 4 years ago

Always run with this logging (there won't be any overhead) and just check the log when joining fails for the first time.

sjorge commented 4 years ago

Of to a good start, this firmware is stable! I will adapt my init script to run with the extra debugging now.

sjorge commented 4 years ago

Everything stopped working today, got a bit of the log. https://gist.github.com/sjorge/bcf51e3d4b8269788bf61aa8b927b339

Seems to be a memory issue on the ZNC, reset via mqtt did not work. Had to unplug and wait a bit before replugging.

Not sure this is due to the logging being added, or that his just highlighted the actually issue were some memory is never cleared. A full association table also sounds like a memory thing.

Koenkk commented 4 years ago

Forgot to deallocate the send debug message, should be fixed with: CC2531ZNP-Prod.zip

sjorge commented 4 years ago

Will flash tomorrow, just got home. Lets hope that one runs for a bit longer until the pairing issue manifests.

sjorge commented 4 years ago

Updated the stick, now we wait

sjorge commented 4 years ago

Still working, but looks like my CC got unplugged an replugged a day ago, probably while someone was loading the laundry :( Been a bit AFK too, so no update for now.

sjorge commented 3 years ago

Very odd, I was able to join a bulb today after 6 days of uptime. Were there any other changes aside from the debug printing?

Koenkk commented 3 years ago

@sjorge no there we no other changes. What debug logging did you got? Theoretically: the added debug statements increase the execution time, in case this bug turns out to be a race condition this could actually "solve" it.

sjorge commented 3 years ago

On 2020-07-01 22:04, Koen Kanters wrote:

@sjorge [1] no there we no other changes. What debug logging did you got? Theoretically: the added debug statements increase the execution time, in case this bug turns out to be a race condition this could actually "solve" it.

-- You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub [2], or unsubscribe [3].

I already turned of my laptop, I'll pair the bulb again tomorrow and capture the output to a file.

-- ~ sjorge

Links:

[1] https://github.com/sjorge [2] https://github.com/Koenkk/zigbee2mqtt/issues/3177#issuecomment-652620178 [3] https://github.com/notifications/unsubscribe-auth/AAC4WEMKHTXWOZSRQGT5UDTRZOJFXANCNFSM4LRO5I4Q

sjorge commented 3 years ago

Still works fine after 10 days, so maybe it is some timing thing.

https://gist.github.com/sjorge/7bab8456eb865848db3e5ed9c4dd55fc here is the log, it wasn't fully scewed in so 2 device announces, but it joined fine after I tightened it in the socket without doing anything else.

Koenkk commented 3 years ago

Looks good, I also see the logging I've added. I'm not sure how to continue here yet. Maybe wait for some more days or see if anybody else can confirm it has been fixed with this FW?

FaBjE commented 3 years ago

If the firmware is "stable" I can give it a shot. I need to find a moment to do the initial setup though.

Edit: Can the debug log be enabled/logged in the docker container?

sjorge commented 3 years ago

I noticed no difference in stability with this or the normal firmware, so I'd say it's stable. Not sure about docker as I don't use it.

FaBjE commented 3 years ago

@Koenkk Do you know if I can enable the debug logging in the docker container?

Koenkk commented 3 years ago

@FaBjE yes, see https://www.zigbee2mqtt.io/information/debug.html#docker

sjorge commented 3 years ago

I just got a new device to test delivered, so will have another go at pairing a totally new device today or tomorrow.

sjorge commented 3 years ago

Uptime has been 2 weeks, still working fine with debug logging turned on!

Koenkk commented 3 years ago

Good, TI asked me to figure out how much delay has to be added, I will try to provide 2 new firmwares this weekend where debug logging is replaced with 2 fixed delays.

FaBjE commented 3 years ago

Just updated my coordinator to CC2531ZNP-Prod.zip. I had to add pan_id: 0x1a63 to the config. Apparently the old one pan_id: 6755 was changed. The zigbee network is online again and seems good.

I added "- DEBUG=zigbee-herdsman:adapter:zStack:znp:AREQ" to my docker-compose file. And I see the adapter debug messages (displaying all messages payload) to the console. But they are not saved to the log.txt file.

I will try to join a new device in a few days.

Koenkk commented 3 years ago

Can you try this firmware: CC2531ZNP-Prod_20200720.zip (has some builtin sleeps without debug logging)

sjorge commented 3 years ago

Mostly done moving stuff over to my new place now, once I have my ZNC back up and running, I'll flash and try the new firmware.

sjorge commented 3 years ago

I flashed the new firmware, going to pair some bulbs now.

sjorge commented 3 years ago

Pairing isn't working well, I managed to pair only 2 bulbs but now it's locked up. I'm going to revert to vanilla source routing and finish pairing everything as I have lots of bulbs to go

sjorge commented 3 years ago

It seems to mostly timeout in the interviewing/configuration steps