Koenkk / zigbee2mqtt

Zigbee 🐝 to MQTT bridge 🌉, get rid of your proprietary Zigbee bridges 🔨
https://www.zigbee2mqtt.io
GNU General Public License v3.0

Can only pair devices when close to the coordinator (ZZH) #7762

Closed hoggerz closed 3 years ago

hoggerz commented 3 years ago

Hi,

I recently ran into this while running the latest dev branch with a ZZH: I can only pair or re-pair devices when I’m very close to the coordinator. Is there any reason for this? I haven’t actually added any new devices, but when I tried to re-pair a temperature sensor after its battery died, nothing at all showed up in the logs while pairing in its original location. Only when I moved it next to the coordinator did it pair, and then first time. There are a total of 77 devices in my network plus the coordinator; 31 of these are router devices (IKEA bulbs and 3 IKEA signal repeaters), so there should have been a router close enough.

hoggerz

What happened

Can’t pair when near a router device, only when very close to the coordinator.

What did you expect to happen

Previously this has never been an issue, although I’ve not had to pair anything recently and I’m usually on Dev branch.

How to reproduce it (minimal and precise)

Enable joining and attempt to pair near a router that’s not close to the coordinator.

Debug info

Zigbee2MQTT version: 1.19.1-dev commit: 217ce221
Adapter hardware: CC26X2R1 - ZZH
Adapter firmware version: 20210430

Koenkk commented 3 years ago

Can you provide a sniff of this? https://www.zigbee2mqtt.io/how_tos/how_to_sniff_zigbee_traffic.html

hoggerz commented 3 years ago

Hi Koen,

Thanks, see below. This time it was a brand new IKEA remote (W2045) that I paired, and exactly the same thing happens.

Key: C5:B8:29:A7:FB:8E:E9:DC:02:7E:EB:7F:39:F0:06:EE
End device ID (when it eventually paired right next to the coordinator): 0x842e14fffe59ee88

Attached are two (unsuccessful) pairing captures near IKEA bulbs and an IKEA signal repeater in my lounge, and one (successful) pairing right next to the ZZH stick. Let me know if you need anything else.

PairingNearRouter.zip PairingNearRouter2.zip PairingNearCoordinator.zip

Koenkk commented 3 years ago

Thanks, something goes wrong at the coordinator side. Can you try flashing this firmware and provide me the herdsman debug logging when pairing close to a router?

See https://www.zigbee2mqtt.io/information/debug.html#zigbee-herdsman-debug-logging on how to enable the herdsman debug logging. Note that this is only logged to STDOUT and not to log files.

znp_CC26X2R1_LAUNCHXL_tirtos_ccs.hex.zip

hoggerz commented 3 years ago

Thanks Koen,

I’m away until Friday, but I’ll do it once I’m back home and get you the logs over as soon as I can.

hoggerz

seaverd commented 3 years ago

@Koenkk

I am having the same issues as @hoggerz. When you look through the issues and discussions, several people are having very similar issues with the zigbee2mqtt and the zzh usb stick (see links at bottom of post).

I reached out to Omer the founder of the zzh stick, provided logs and he indicated that he believed it was a software/firmware issue. Please note that I also tried ZHA and got the same results (realize they both use the same zigbee-herdsman coordinator firmware...so wondering if that is where the issue lies). I am willing to provide herdsman logs, but don't want to derail @hoggerz issue...but figured I would offer assistance since he is not available until Friday.

I am running zigbee2mqtt in docker and have added -e DEBUG=zigbee-herdsman* to the docker run command. When doing so does this include the herdsman debug within the regular log? I attached a sample log file...its a little messy as I had to change my usb device mappings. Can you just verify that it includes the herdsman debugging...if so, I can start a clean network and provide logs, sniffing, etc. to keep this issue moving forward...or if necessary can also make a minimal network with just one of the routers below.

315b86a525be.log

I am also wondering if it makes sense with people who are having this issue to list the routers that they have on their network in case there is some common thread of a device causing issues with the network.

My routers are as follows:

Here are some links to the other discussions and issues that match my experience:

Issues:
[https://github.com/Koenkk/zigbee2mqtt/issues/7722]
[https://github.com/Koenkk/zigbee2mqtt/issues/7083]
[https://github.com/Koenkk/zigbee2mqtt/issues/7266]
[https://github.com/Koenkk/zigbee2mqtt/issues/6837]
[https://github.com/Koenkk/zigbee2mqtt/issues/7609]

Discussions:
[https://github.com/Koenkk/zigbee2mqtt/discussions/7381]
[https://github.com/Koenkk/zigbee2mqtt/issues/7722]
[https://github.com/Koenkk/zigbee2mqtt/issues/7651]
[https://github.com/Koenkk/zigbee2mqtt/discussions/5941]

Koenkk commented 3 years ago

@seaverd I'm not sure what to do with the log you provided. I see a successful pair in it, but the context is totally missing. If providing any logging, please keep it as short as possible (e.g. start capturing when putting the device in pairing mode). Explain what you did and preferably also add a sniff: https://www.zigbee2mqtt.io/how_tos/how_to_sniff_zigbee_traffic.html

seaverd commented 3 years ago

@Koenkk

Attached is a short zigbee2mqtt log file set to debug, inclusive of herdsman logs. Also attached is a sniff taken during an attempted pairing.

The Zigbee2mqtt docker container was shut down and the ZZH usb stick (on an extension cable) was removed for 30 seconds. All routers listed in my post above were already paired with the coordinator, but I unplugged them while the docker container was stopped. The ZZH was plugged back in and the container was started. I then restored power to all 6 routers. Next I went to zigbee2mqtt, disabled join all and switched it to join (TeaKettle), which is the Xiaomi plug closest to the device I was about to sniff. Once the join was active, I started the sniff. The endpoint device, bathroomtemp, would not pair. I then disabled the join, set it to join all devices and attempted to pair the bathroomtemp device again. The pair was successful. At this point I stopped the sniff, so it should contain only two attempted pairs. Based on my previous post, this device will stay connected to the network for a couple of hours and then drop completely. I restarted the sniff after saving the one attached to this post and can post it once this device drops, if that is helpful.

If you need the network key it is: 1C:A9:C9:90:FA:94:46:6D:6E:19:58:93:0E:7C:1C:A8 bathroomtemp device is: 0x00158d00044a711b (0xA367)

It happens with other devices too; I am just trying to keep the sample small per your request. Let me know if you need anything else.

I just saw you posted an alternate coordinator firmware to try. I can test that tomorrow to see if there is any improvement.

Thanks, Dan

zigbee2mqttlog.txt bathroomtemp_sniff.gz

Koenkk commented 3 years ago

@seaverd I expected bathroomtemp_sniff.gz to contain a Wireshark pcapng file, but it contained 315b86a525be.log. Was this a mistake?

seaverd commented 3 years ago

@Koenkk

So sorry! That was a mistake. See attached sniff
bathroomtemp_only.zip

seaverd commented 3 years ago

@Koenkk Interested in what you find. Oddly enough, the bathroomtemp device has not dropped out yet. I wonder whether that is because I powered everything down before pairing, or because I normally re-pair all 20 of my endpoints and this time I have only paired a single device.

Attached is the network map…wondering why bathroomtemp, kitchenmotion and kitchenmotion2 are not shown with a connection to either a router or the coordinator.

(Attached image: network map)

Koenkk commented 3 years ago

I've investigated the sniff. The funny thing is that the device joins via the same device in both the failed and successful attempt. The first attempt fails because the coordinator does not follow up with a Transport Key after the Update Device. Can you flash the firmware from this post: https://github.com/Koenkk/zigbee2mqtt/issues/7762#issuecomment-860830987 and then provide the herdsman debug log + sniff with this fw?
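The failure pattern described above (an Update Device that is never followed by a Transport Key) can be checked mechanically once a sniff is reduced to a list of command names. A minimal sketch, assuming the capture has already been exported to a hypothetical list of `(timestamp, source, command)` tuples sorted by time; the tuple layout, the command strings and the 5 second window are illustrative assumptions, not a format that Wireshark or zigbee-herdsman produces:

```python
from typing import List, Tuple

# Hypothetical, simplified view of a sniff: (seconds, source, command name).
Frame = Tuple[float, str, str]


def unanswered_update_devices(frames: List[Frame], window: float = 5.0) -> List[Frame]:
    """Return every "Update Device" frame with no "Transport Key" within `window` seconds."""
    missing = []
    for i, (ts, src, cmd) in enumerate(frames):
        if cmd != "Update Device":
            continue
        # Look only at frames that come after this one in the capture.
        answered = any(
            later_cmd == "Transport Key" and ts <= later_ts <= ts + window
            for later_ts, _, later_cmd in frames[i + 1:]
        )
        if not answered:
            missing.append((ts, src, cmd))
    return missing


if __name__ == "__main__":
    failed_join = [
        (0.0, "joiner", "Beacon Request"),
        (0.4, "router", "Update Device"),
    ]
    good_join = failed_join + [(0.6, "coordinator", "Transport Key")]
    print(unanswered_update_devices(failed_join))  # one unanswered Update Device
    print(unanswered_update_devices(good_join))    # empty list
```

An empty result means every Update Device in the window was followed by a Transport Key; any returned frame marks a join attempt that stalled at the coordinator.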

About the Xiaomi battery powered devices:

hoggerz commented 3 years ago

Hi Koen,

Just flashed the new firmware and repeated what I did previously. I've also included a Wireshark capture from the same time. Let me know if you need anything else.

pairingdebug.zip PairingDebug.pcap.zip

hoggerz

Koenkk commented 3 years ago

@hoggerz in your sniff I see exactly the same problem as @seaverd and @sjorge have. The Update Device is not followed up by Transport Key from the coordinator making the join fail.

@hoggerz @sjorge I have indeed started to wonder whether it has something to do with the new backup mechanism. @hoggerz did you ever reflash your coordinator after updating to 1.19.0 and before flashing the debug fw I provided? (When was the last time you reflashed your coordinator, not counting the debug fw?)

hoggerz commented 3 years ago

Hi Koen,

Last time I updated the coordinator was 08/05/21 with CC2652R_coordinator_20210430, I would have been running on whatever the latest dev branch currently released around that date.

hoggerz

sjorge commented 3 years ago

@hoggerz @sjorge I have indeed started to wonder whether it has something to do with the new backup mechanism. @hoggerz did you ever reflash your coordinator after updating to 1.19.0 and before flashing the debug fw I provided? (When was the last time you reflashed your coordinator, not counting the debug fw?)

Seems a likely candidate yes, my issues started when testing CC1352P2_CC2652P_other_coordinator_20210430.hex, my simple test was rejoining a light fixture with 4 GU10 which failed. I'm on dev, I am fairly certain it was post new backup/restore merge.

Reverting to CC1352P2_CC2652P_other_coordinator_20210120.hex fixed it at the time. Less certain, but I think at least my USB was unplugged when I was recabling some stuff in my rack, and the next time I wanted to join a bulb for testing it also stopped working on that firmware.

Since then I've been on a mix of those two and a few of your debug ones. I've had other distractions the past few days so I did not dig further into the new backup/restore code, but as I mentioned, on at least a few occasions it was also rather slow at startup. I don't remember if these were after a flash or after a USB unplug. Of course, when I tried to trigger it with logging, it didn't show the behavior. My plan was to look closely at the code and see the different code paths it can take, as there seem to be different stages it can start or continue from. I was going to focus on cold (USB unplugged for a while) vs warm (USB still plugged in and running) start. But yeah, I haven't had the time to dig into it.

I did verify that pairing super close to the zzhp-lite has a very high success rate. But once I am about 50cm away it is nearly impossible.

Koenkk commented 3 years ago

@hoggerz that makes the backup procedure itself being the issue less likely. This was merged on 13 May (https://github.com/Koenkk/zigbee-herdsman/pull/303), the last time the backup procedure was triggered on your side was 08/05/21 which is before this.

What could still be the problem is that the 20210430 firmware messes something up in the memory, and when this is restored on the old firmware the problem stays.

sjorge commented 3 years ago

What could still be the problem is that the 20210430 firmware messes something up in the memory, and when this is restored on the old firmware the problem stays.

I use -ewv as flags to cc-bsl, would that not completely wipe the memory on the device?

Koenkk commented 3 years ago

@sjorge -evw does, but if the corrupt part is in the backup it will be restored again (it's just speculation, I don't know if it is the actual issue)

sjorge commented 3 years ago

@sjorge -evw does, but if the corrupt part is in the backup it will be restored again (it's just speculation, I don't know if it is the actual issue)

Hmm, I did start seeing "ext_pan is reversed" warnings. Could I try reversing the PAN in the configuration.yaml and try again?

hoggerz commented 3 years ago

Hi Koen,

Tried on the older firmware, no luck either, here's the logs from that anyway:

pairing-olderFW.zip PairingOldFW.zip

Here's the logs from the Newer debug firmware, same thing, wouldn't pair.

PairingNewDebug.zip pairing-Newdebug.zip

Anything else you want me to try?

hoggerz

sjorge commented 3 years ago

Just out of interest, how long does the restore step take after flashing a different firmware?

seaverd commented 3 years ago

@Koenkk

Thanks for your reply and I am aware of the issues with xiaomi devices. I haven’t tried new batteries, so good suggestion. Last night I decided to add the balance of my devices to the network. Prior to adding the devices I updated to the firmware you posted in this thread and started a sniff while pairing. Unfortunately dinner time fell in the middle of pairing so the log file and sniff are long (couple of hours). That being said, the first device I added, I attempted to pair it via a router…it didn’t pair so I then switched to pair all for all of the balance of the devices.

All devices appeared to pair correctly and right after pairing functioned as expected. However devices started dropping off the network. Oddly of my 22 devices it seems the first 11 paired seem to have stayed connected. I am not home and can post the logs if you wish….but they are long….but maybe they need to be to see why devices are dropping.

Let me know if you would like me to post or repair the network from scratch to keep the logs shorter.

Koenkk commented 3 years ago

Hmm, I did start seeing "ext_pan is reversed" warnings. Could I try reversing the PAN in the configuration.yaml and try again?

@sjorge reversing the pan id in the configuration.yaml won't change anything on the coordinator side. This has been put in place because z2m versions before 1.19.0 set a reversed pan id, so now both reversed and non-reversed are allowed.

@hoggerz I see some debug logging now, can you check again with this fw? (will not fix it but contains some more logging).
znp_CC26X2R1_LAUNCHXL_tirtos_ccs.hex.zip

@seaverd devices dropping off the network is a separate issue (let's stick to joining issues only in this issue). Please create a new issue. To debug, provide a sniff when triggering the sensor after it has dropped from the network (e.g. click the reset button once on a Xiaomi device).

Note: I will be on a short holiday this weekend so will respond on Monday again.

hoggerz commented 3 years ago

Hi Koen,

Thanks, here are captures and logging with the newer firmware. Let me know if it's any help and if you need anything else.

PairingNewDebug2.zip pairing-Newdebug2.zip

hoggerz

Koenkk commented 3 years ago

@hoggerz can you try with this fw? (may fix it) znp_CC26X2R1_LAUNCHXL_tirtos_ccs.hex.zip

hoggerz commented 3 years ago

Hi Koen,

Thanks, I won’t be back home until Friday again I’m afraid, but I’ll let you know as soon as I can. I appreciate your help with this.

hoggerz

seaverd commented 3 years ago

@Koenkk

I flashed the new fw, was unable to join a specific router. See attached zip files. One is for joining my TeaKettle router...the second is when Join All is selected..the join all paired but will likely drop. I will create another issue and hit reset button once per your request.

zigbee2mqtt_join_all_paired.zip zigbee2mqtt_teakettle_unable to join.zip

seaverd commented 3 years ago

@Koenkk

Just to clarify, I installed the firmware you asked @hoggerz to test and provided my logs and sniff files in the post above. The new firmware did not solve my joining issue.

Please advise if you need me to test anything else.

Koenkk commented 3 years ago

@seaverd in zigbee2mqtt_teakettle_unable to join.zip the reason for join not working does seem to be on the joining device side.

After the joining device sends the beacon request, devices respond with a beacon, but the joining device does not send the association request, so the join is not initiated. It does send it in zigbee2mqtt_join_all_paired.zip:

(Image: Screenshot 2021-06-24 at 11 02 24)

seaverd commented 3 years ago

@Koenkk

Just to clarify, do you think the issue is with the endpoint device or with the router device? To me it seems odd that I can rarely get a device to join via a router (I never had this issue on my original cc2531).

Would it make sense to force remove a routing device and then sniff the join to make sure that the join is happening properly at that level?

My routing devices are listed in first post above, I am not aware of any known incompatibilities….would it make sense to create a new network with say…just the ikea tradfri outlets as I know a lot of people use them and then see if I can permit join through the ikea outlet?

Just looking for a next step as since moving to the zzh, I have never gotten my network to work as well as it did with the cc2531. About the only way my network works well is if I don’t pair any routers and just join endpoints directly to the coordinator….then once endpoints are done, then pair routers. Issue is I never get benefits from routers and my endpoints at the fringes over time will drop due to poor signal strength. Other issue is if I need to add a new endpoint I have to take routers offline so new device pairs with coordinator and doing this is just not practical.

sjorge commented 3 years ago

@koenkk I went over the major changes since the new backup stuff landed to rule them out, and try without them, no effect I still can't join when further than a few cm from the stick.

I reverted:

Not 100% confident, but pretty sure these are not to blame. I'm going to try to replicate with the zzh I have for testing (not the zzhp-lite I have for prod) by flashing the February firmware, joining 2 routers, spacing them out and then trying to join a device on the far end of the line without LOS to the zzh.

If that works (hopefully) I will stop z2m, upgrade to the newer firmware, start z2m and try again.

Any holes with that plan?

Edit: some observation I forgot to mention

  "coordinator_ieee": "00124b00228120b5",
  "extended_pan_id": "31d4571f2aefd6c3",

That's from my current coordinator backup... should these not be the same? My dev network backup has them matching.

Edit 2: With a newly provisioned network on my dev coordinator, both coordinator_ieee and extended_pan_id also match in the first backup!
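The comparison above can be scripted. A minimal sketch, assuming the two values are read from the backup as hex strings exactly as quoted in this comment; the helper name `compare_ids` is hypothetical, and the byte-reversed check is an illustrative extra, included only because reversed and non-reversed extended PAN IDs are both said to be accepted earlier in this thread:

```python
def compare_ids(coordinator_ieee: str, extended_pan_id: str) -> str:
    """Classify how two hex strings relate: "match", "reversed" or "different"."""
    ieee = bytes.fromhex(coordinator_ieee)
    ext_pan = bytes.fromhex(extended_pan_id)
    if ieee == ext_pan:
        return "match"
    # Same bytes in opposite order, e.g. after an endianness mix-up.
    if ieee == ext_pan[::-1]:
        return "reversed"
    return "different"


if __name__ == "__main__":
    # Values quoted from the backup excerpt in this comment.
    print(compare_ids("00124b00228120b5", "31d4571f2aefd6c3"))  # different
```

For the values quoted here the result is "different", so the mismatch is not just a byte-order artifact.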

Using fw 20210120 I paired a tradfri repeater and then a bulb about 5m from the coordinator. Will let it run for a bit, stop z2m and upgrade the fw tomorrow to 20210430 and remove the furthest bulb and replace it with a GU10 one to see if that still works.

sjorge commented 3 years ago

@Koenkk I managed to successfully replicate it with my spare zzh that I use for my dev network and 2 Tradfri routers (one bulb and a repeater)

  1. flash CC2652R_coordinator_20210120.hex
  2. remove coordinator_backup.json and database.db
  3. create a configuration.yaml with pan_id and such set to GENERATE
    # ...
    advanced:
      pan_id: GENERATE
      network_key: GENERATE
      channel: 11
    # ...
  4. start z2m
  5. join a router and move it ~5m away
  6. join a router about ~2m from the first router; you now have a working network that can join far from the coordinator
  7. stop z2m
  8. switch to CC2652R_coordinator_20210430.hex using -ewv to wipe the zzh
  9. start z2m
  10. wait for ~5min (probably not needed)
  11. stop z2m
  12. switch to CC2652R_coordinator_20210120.hex using -ewv to wipe the zzh
  13. start z2m
  14. remove the furthest router
  15. try to add a new or re-add the same router at the same location, this now consistently fails

I am out of time for now. Next time I have a bit of time to dig into this I will use CC2652R_coordinator_20210120 twice. That should either work, meaning backup/restore is good, or also fail, meaning backup/restore is somehow to blame.

DiggaTS commented 3 years ago

Damn... I have the same problem. My coordinator is the ConBee stick. I can't pair devices via a router.

Koenkk commented 3 years ago

I found out that when permitting joining only via a specific device, joining would always fail. This happens because the permit join message is only sent to that specific router; the coordinator does not receive it, so it will never transport the network key to the joining device. I've now fixed this issue in the firmware.
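To make the reasoning concrete, here is a toy model of the behaviour described above. It is a sketch under stated assumptions, not the firmware logic: the `Node` class, the `permit_join` and `can_join` helpers and the flag names are all hypothetical, and the only rule encoded is the one from this comment (the coordinator must also have joining permitted before it transports the network key).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Node:
    name: str
    joining_open: bool = False


def permit_join(nodes: Dict[str, Node], target: Optional[str], include_coordinator: bool) -> None:
    """Open joining on all nodes (target=None) or on a single router.

    include_coordinator=False mimics the reported bug: the request only reaches `target`.
    """
    for node in nodes.values():
        node.joining_open = target is None or node.name == target
    if include_coordinator:
        nodes["coordinator"].joining_open = True


def can_join(nodes: Dict[str, Node], parent: str) -> bool:
    """A join via `parent` completes only if the parent accepts the device and the
    coordinator is also permitting joins, so that it sends the network key."""
    return nodes[parent].joining_open and nodes["coordinator"].joining_open


if __name__ == "__main__":
    net = {n: Node(n) for n in ("coordinator", "TeaKettle")}
    permit_join(net, "TeaKettle", include_coordinator=False)
    print(can_join(net, "TeaKettle"))  # False: the coordinator never saw the permit join
    permit_join(net, "TeaKettle", include_coordinator=True)
    print(can_join(net, "TeaKettle"))  # True
```

In this model a permit join all (`target=None`) opens the coordinator as well, which is consistent with join all behaving differently from join via a specific router.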

@sjorge can you try if it also fixes the issue in your case? However I don't expect it to fix it when you did a permit join all (not a specific router)

sjorge commented 3 years ago

@sjorge can you try if it also fixes the issue in your case? However I don't expect it to fix it when you did a permit join all (not a specific router)

No effect. I also tried by picking a bulb on my so about 20cm from my test bulb, no change.

I have not had time to repeat the above test with flashing the same firmware or starting with the newer firmware yet.

seaverd commented 3 years ago

@Koenkk

Shame on me for not reading your entire post. Flashed the firmware you posted for @sjorge zzhp on my zzh as I am anxious to get this issue resolved. Totally my fault, now my zzh is bricked. Good thing I ordered the debug kit for the zzh when I originally bought it. Just ordered the debugger from TI. Willing to test any zzh firmware once I get the debugger early next week.

Koenkk commented 3 years ago

@castorw probably found the cause of this issue, he is currently looking for a fix (some info is missing from the backup). I confirmed that this issue only happens after reflashing the stick.

sjorge commented 3 years ago

Hmm, that would make sense. My other test, where I flashed the same firmware with -ewr, also had the issue, which I wasn't expecting, so I had planned to redo it before posting.

DiggaTS commented 3 years ago

I have the same problems with a deCONZ stick and a new CC2652RB stick (Slaesh's CC2652RB stick)... I can't bind any devices via a router. I can only bind directly to the coordinator... With Slaesh's stick I have bad LQI too (Gartenbeleuchtung_Lampe is 2 meters away, line of sight).

DiggaTS commented 3 years ago

Damn... this is the worst case scenario. The parents of my motion sensors are not the bulbs that are 10cm away; they latch onto the coordinator 7 meters away, hurrah! I'll wait 1-2 days... maybe...

sjorge commented 3 years ago

@castorw probably found the cause of this issue, he is currently looking for a fix (some info is missing from the backup). I confirmed that this issue only happens after reflashing the stick.

Is it missing in the backup? Or just not restored. If the former, that would imply recreating the network right?

I was also able to replicate it with flash, erase, flash of the same firmware now, so it doesn't seem firmware related at all (so the newer ones are probably indeed fine).

Koenkk commented 3 years ago

@sjorge it is in the backup before the flash, not restored correctly and therefore not in the new backup after the flash.

sjorge commented 3 years ago

@sjorge it is in the backup before the flash, not restored correctly and therefore not in the new backup after the flash.

OK, that would mean a re-pair then, unless there is a backup of the backup. Gonna be a pain with the in-wall stuff, but I could maybe shuffle these to my dev coordinator 🤞

Koenkk commented 3 years ago

You will probably only need to repair the routers that you want to pair via. Maybe a /set factory_reset will do it.

sjorge commented 3 years ago

Oh, so in theory I could start with the routers close to the coordinator, re-pair them and work my way outward so all routers are re-paired 👍 That's worth a shot. I'm guessing the missing info is in the devices list, where it stores per-router key material; that is missing in my current backup except for one newly paired router.

So I'm guessing this is the problematic part:

[
  {
    "link_key": {
      "key": "51c0371456e5a8051f9722c114823428",
      "rx_counter": 0,
      "tx_counter": 0
    }
  }
]

It's only present for one bulb I joined next to the coordinator but missing for all other entries. If so it should be easy to check once the bug is fixed.
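Once a fix is available, a check like the following could confirm it. This is a minimal sketch, assuming the backup is JSON with a top-level `devices` list whose entries may carry a `link_key` object as in the excerpt above; the `devices` and `ieee_address` names and the sample addresses are assumptions for illustration and should be compared against a real `coordinator_backup.json` before relying on it.

```python
import json
from typing import Any, Dict, List


def devices_without_link_key(backup: Dict[str, Any]) -> List[str]:
    """Return an identifier for every device entry that has no usable link key."""
    missing = []
    for index, device in enumerate(backup.get("devices", [])):
        link_key = device.get("link_key") or {}
        if not link_key.get("key"):
            # Fall back to the list position when no address field is present.
            missing.append(device.get("ieee_address", f"device #{index}"))
    return missing


if __name__ == "__main__":
    # Made-up sample data; only the link_key layout is taken from the excerpt above.
    sample = json.loads("""
    {
      "devices": [
        {"ieee_address": "aa", "link_key": {"key": "51c0371456e5a8051f9722c114823428",
                                            "rx_counter": 0, "tx_counter": 0}},
        {"ieee_address": "bb"}
      ]
    }
    """)
    print(devices_without_link_key(sample))  # ['bb']
```

Running it on a backup taken before and after a reflash would show whether entries lose their key material across the restore.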

MattWestb commented 3 years ago

Updating the link key is only done by Zigbee 3 (Zigbee PRO, to be more exact) devices, which request a new trust center link key when pairing. HA 1.x and ZLL devices (if not Zigbee PRO) do not do that and use the well-known "master key".

So among your IKEA devices, only the old CWS 600 lm and the old motion sensor do not request the key update.

The frame counter must also be OK, or the device throws the packet away because it thinks it is under a replay attack.

Also, with the new TI radios, look for a corrupted IEEE address saved in the NVR; that can happen and cause problems.

castorw commented 3 years ago

@sjorge As we found out, the firmwares behave pretty differently. In some cases the firmware fills up the TCLK table end-to-start or randomly, while I iterate the table from beginning to end. I will work on this when I get a chance so we can make the backup properly.
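Why the fill order matters can be shown with a toy table. This is a hypothetical illustration, not the zigbee-herdsman code: it assumes a fixed-size table where unused slots are `None` and compares a reader that stops at the first empty slot with one that scans every slot. Whether the real backup code stops early is an assumption made here only to illustrate the effect of end-to-start filling.

```python
from typing import List, Optional


def read_until_gap(table: List[Optional[str]]) -> List[str]:
    """Collect entries from the start and stop at the first empty slot."""
    found = []
    for entry in table:
        if entry is None:
            break
        found.append(entry)
    return found


def read_all_slots(table: List[Optional[str]]) -> List[str]:
    """Scan the whole table and keep every used slot, wherever it sits."""
    return [entry for entry in table if entry is not None]


if __name__ == "__main__":
    filled_from_start = ["key-a", "key-b", None, None]
    filled_from_end = [None, None, "key-b", "key-a"]
    print(read_until_gap(filled_from_start))  # ['key-a', 'key-b']
    print(read_until_gap(filled_from_end))    # []
    print(read_all_slots(filled_from_end))    # ['key-b', 'key-a']
```

A reader that scans every slot is insensitive to the order in which a given firmware populates the table.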

Hankanman commented 3 years ago

Hey all, reporting the same issue here: I have to pair devices right next to the coordinator (zzh). I bought two dedicated router devices, which pair fine and appear to route traffic (but I'm just going on the map here), yet I am still unable to pair devices via them. My network is growing (see below) and I wanted to use routers to extend range into the garden, but I can't join devices from there as it is out of range of the coordinator. I have read all of the above. What is the latest to help with testing, or are we awaiting a fix at the moment? (Happy to buy another zzh for sniffing or whatever.)

(Image: Screenshot 2021-07-12 115326)

ant-ds commented 3 years ago

I seem to be having the same issue, but with a CC2531 stick as main coordinator. Plenty of routers around my house as well. I need to get within about 30 cm of the stick for it to recognize anything.

I am running the Z-Stack_Home_1.2 (default) firmware of 27 November 2020. I reflashed my stick recently but that didn't fix the issue. My Z2M HA addon is on version 1.20.0-1.

In December 2020, I was running Z2M through docker and didn't have this issue (can't find the version I was using). I hope this information is of some help.

sjorge commented 3 years ago

So I re-paired every device using join via, snaking my way out from the coordinator; it took something like 4h. After restarting it's broken again though 🤔 Shouldn't it only happen on flashing a new firmware, not on a z2m restart?