Koenkk / Z-Stack-firmware

Compilation instructions and hex files for Z-Stack firmwares
MIT License

Firmware 20221226 causes some NWK_TABLE_FULL errors with only 32 devices in network (19 routers) #418

Closed joaquinvacas closed 1 year ago

joaquinvacas commented 1 year ago

image

This Philips Hue light always worked fine till this afternoon after I updated firmware to 20221226.

joaquinvacas commented 1 year ago

image

Also, network is very slow, no hardware changes but firmware.

MattL0 commented 1 year ago

I also got this error with 20221226; I can't remember why or when.

20221102 has been the most stable so far for me with the Sonoff 3.0 CC2652P USB stick.

bruvv commented 1 year ago

Yup, me too. Hardware: CC2652P (Ebyte E72-2G4M20S1E). After flashing the new firmware I also made sure the NVRAM was clean (node zStackEraseAllNvMem.js /dev/tty.wchusbserial14310 = Clearing all NVMEM items finished, deleted 58 items).

Edit: Alright, I might have made a mistake: I flashed the launchpad firmware although I should have used the normal one. Retrying now.

Edit 2: So far it has been online for 2 hours and I have re-paired every device. No issues so far.

Edit 3: New firmware seems fine when flashed with the correct firmware.

joaquinvacas commented 1 year ago

Yup, same issue here running the Sonoff 3.0. I even went as far as deleting everything and starting over: removing MQTT, removing Zigbee2MQTT, deleting all the files, and re-pairing all the devices. But it wasn't pairing at all anymore, with no errors and no debug logs.

If it is of any use:

info  2023-01-31 17:17:13: Logging to console and directory: '/config/zigbee2mqtt/log/2023-01-31.17-17-12' filename: log.txt
info  2023-01-31 17:17:13: Starting Zigbee2MQTT version 1.29.2 (commit #unknown)
info  2023-01-31 17:17:13: Starting zigbee-herdsman (0.14.83-hotfix.0)
info  2023-01-31 17:17:17: zigbee-herdsman started (resumed)
info  2023-01-31 17:17:17: Coordinator firmware version: '{"meta":{"maintrel":1,"majorrel":2,"minorrel":7,"product":1,"revision":20221226,"transportrev":2},"type":"zStack3x0"}'
info  2023-01-31 17:17:17: Set transmit power to '10'
info  2023-01-31 17:17:17: Currently 0 devices are joined:
warn  2023-01-31 17:17:17: `permit_join` set to  `true` in configuration.yaml.
warn  2023-01-31 17:17:17: Allowing new devices to join.
warn  2023-01-31 17:17:17: Set `permit_join` to `false` once you joined all devices.
info  2023-01-31 17:17:17: Zigbee: allowing new devices to join.
info  2023-01-31 17:17:18: Connecting to MQTT server at mqtt://core-mosquitto:1883
info  2023-01-31 17:17:18: Connected to MQTT server
info  2023-01-31 17:17:18: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"online"}'
info  2023-01-31 17:17:18: Started frontend on port 0.0.0.0:8099
info  2023-01-31 17:17:18: MQTT publish: topic 'zigbee2mqtt/bridge/state', payload '{"state":"online"}'
info  2023-01-31 17:17:18: Zigbee2MQTT started!
info  2023-01-31 17:19:12: Zigbee: disabling joining new devices.
info  2023-01-31 17:19:12: MQTT publish: topic 'zigbee2mqtt/bridge/response/permit_join', payload '{"data":{"time":254,"value":false},"status":"ok","transaction":"ea565-1"}'
info  2023-01-31 17:19:14: Zigbee: allowing new devices to join.
info  2023-01-31 17:19:14: MQTT publish: topic 'zigbee2mqtt/bridge/response/permit_join', payload '{"data":{"time":254,"value":true},"status":"ok","transaction":"ea565-2"}'
info  2023-01-31 17:19:18: Succesfully changed options
info  2023-01-31 17:19:18: MQTT publish: topic 'zigbee2mqtt/bridge/response/options', payload '{"data":{"restart_required":false},"status":"ok","transaction":"ea565-3"}'
info  2023-01-31 17:19:20: Zigbee: disabling joining new devices.
info  2023-01-31 17:19:20: MQTT publish: topic 'zigbee2mqtt/bridge/response/permit_join', payload '{"data":{"time":254,"value":false},"status":"ok","transaction":"ea565-4"}'
info  2023-01-31 17:19:22: Zigbee: disabling joining new devices.
info  2023-01-31 17:19:22: Succesfully changed options
info  2023-01-31 17:19:22: MQTT publish: topic 'zigbee2mqtt/bridge/response/options', payload '{"data":{"restart_required":false},"status":"ok","transaction":"ea565-5"}'
info  2023-01-31 17:19:24: Zigbee: allowing new devices to join.
info  2023-01-31 17:19:24: Succesfully changed options
info  2023-01-31 17:19:24: MQTT publish: topic 'zigbee2mqtt/bridge/response/options', payload '{"data":{"restart_required":false},"status":"ok","transaction":"ea565-6"}'
info  2023-01-31 17:20:17: Succesfully changed options
info  2023-01-31 17:20:17: MQTT publish: topic 'zigbee2mqtt/bridge/response/options', payload '{"data":{"restart_required":false},"status":"ok","transaction":"ea565-7"}'
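Since several reports in this thread hinge on which firmware revision is actually running, here is a minimal Python sketch for pulling the revision out of the "Coordinator firmware version" line shown in the log above. The function name `firmware_revision` is made up for illustration; the sample line is copied from the log.

```python
import json
import re

# Sample line copied from the log above.
LINE = (
    'info  2023-01-31 17:17:17: Coordinator firmware version: '
    '\'{"meta":{"maintrel":1,"majorrel":2,"minorrel":7,"product":1,'
    '"revision":20221226,"transportrev":2},"type":"zStack3x0"}\''
)


def firmware_revision(line):
    """Return (type, revision) from a 'Coordinator firmware version' line, or None."""
    match = re.search(r"Coordinator firmware version: '(\{.*\})'", line)
    if match is None:
        return None
    info = json.loads(match.group(1))  # the quoted part of the line is plain JSON
    return info["type"], info["meta"]["revision"]


print(firmware_revision(LINE))
```

For the line above this prints the adapter type `zStack3x0` and the revision `20221226`, which is a quick way to confirm that a flash or rollback really took effect.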

Same here, using the ITead Sonoff 3.0 stick. I forgot to mention it, but I'm using the launchpad firmware.

WojtaszekMarek commented 1 year ago

Same here with a Sonoff Dongle Plus. I've got 80 entities (80 routers), 8 directly paired with the coordinator.

joaquinvacas commented 1 year ago

Not sure if @Koenkk is aware of this. Tomorrow is the new Z2M release day and I think he's going to announce this firmware as stable.

Koenkk commented 1 year ago

@joaquinvacas

joaquinvacas commented 1 year ago

Tried both cc-flasher (there's a Docker container with it, which I have used a lot in the past) and the TI Flasher tool on Windows. Neither of them had an apparent problem.

Using launchpad coordinator 20221226 firmware with the CC2652P from the ITead Sonoff Zigbee 3.0 USB Dongle.

cloudbr34k84 commented 1 year ago

It's far from stable... I'm running 141 devices, around 120 of them routers, and for weeks I have been getting this NWK Table Full error. Now that I have updated the firmware I'm back to square one with Failed to connect to the adapter (Error: SRSP - SYS - ping after 6000ms) and Z2M can't even start up.

Koenkk commented 1 year ago

@joaquinvacas that is strange indeed, I cannot come up with an immediate solution. When TI releases a new SDK I will compile a firmware and you can try it out. For now you can safely stay on 20220219.

joaquinvacas commented 1 year ago

@joaquinvacas that is strange indeed, I cannot come up with an immediate solution. When TI releases a new SDK I will compile a firmware and you can try it out. For now you can safely stay on 20220219.

Don't worry man, you already do a lot of work! Just reporting because there were no hw changes so only fw can make those errors.

I'll roll back to 20220219 until then. ☺️

RaNo99 commented 1 year ago

I've just encountered the same problem. All of a sudden, NWK_TABLE_FULL started occurring randomly, and the network became sluggish and unreliable. I correlate it with manually restarting a few devices on the network, which is something I've done many times before without any issues.

What helped was re-pairing one of the routers and one of the end devices (!). The network has been working like a charm again for over a day now.

Coordinator CC2652P with CC1352P2_CC2652P_other_coordinator_20220219.zip + 9 routers + 4 end devices.

hellcry37 commented 1 year ago

For me the Z-Stack_3.x.0_coordinator_20221226 firmware was beyond bad; my network dropped offline entirely. I went back to the latest stable and re-paired everything. I could not even salvage the network.

JeffPixelSplash commented 1 year ago

I also got this error with 20221226 buit can remember why and when.

20221107 has been the msot stable so far for me with sonoff 3.0 cc2652p usb stick

Do you have any clue how I can find that version? I flashed to 20221226 and it's been a disaster. I seriously need to find some version of the firmware which works or my WAF is going to plummet.

MattL0 commented 1 year ago

yes here

CC1352P2_CC2652P_launchpad_coordinator_20221102.zip

MattL0 commented 1 year ago

Looks like i made some typos here. I am on 20221102*

nickrbogdanov commented 1 year ago

@Koenkk How do you recommend debugging these firmware images after building + flashing them? Is there a way to get console output, diagnostic information, and/or "breadcrumbs" out of the TI microcontroller to figure out what events led up to a random failure (such as a NWK_TABLE_FULL error that happens after several days)?

I just upgraded from 20220219 to 20221226 because on my mid-sized network (~40 devices) the ZC was constantly dropping reports. 20221226 seems to have fixed this for me, possibly due to increasing MAX_NEIGHBOR_ENTRIES. But if there are other issues on 20221226 I'd like to be able to debug them. I do have a J-Link probe and can order the TI dev boards if needed.

Koenkk commented 1 year ago

@nickrbogdanov there are some generic docs about debugging: https://software-dl.ti.com/ccs/esd/documents/users_guide/ccs_debug-main.html, but I believe this error comes from the proprietary part of the firmware and can therefore not be debugged by us.

guillaume042 commented 1 year ago

I'm switching to this issue as mine has been flagged as a duplicate. Downgrading the firmware did not make things better on my side. I'm still losing devices.

Too many devices? Total: 77 (end devices: 45, routers: 32).

I tried switching off the 2.4 GHz Wi-Fi completely. I changed the USB cable to the Sonoff. I tried without a USB cable.

Each time, I reconnect all the devices, but I get the NWK table error and devices go offline.

guillaume042 commented 1 year ago

Just a little feedback. I've downgraded. I've "recreated" the network from scratch: changing the Zigbee channel, clearing the coordinator, etc. Wi-Fi is switched off in the house, and the neighbors' Wi-Fi is not overlapping the Zigbee channel. Two hours later, I got the first NWK Table Full errors and 2 devices went offline. It also seems that some end devices (like motion sensors), even though flagged online, don't send data.

Koenkk commented 1 year ago

Can you try the 20221102 firmware? https://github.com/Koenkk/Z-Stack-firmware/tree/develop/coordinator/Z-Stack_3.x.0/bin

guillaume042 commented 1 year ago

Can you try the 20221102 firmware? https://github.com/Koenkk/Z-Stack-firmware/tree/develop/coordinator/Z-Stack_3.x.0/bin

As we say in French: "Les grands esprits se rencontrent" (great minds think alike). I just moved to the 20221102 firmware from the dev branch an hour ago. I'm re-pairing all end devices and I will see if it is better.

guillaume042 commented 1 year ago

OK, it's getting better. I still get some errors in the logs and the network is a little 'laggy', but devices are staying online.

Error 2023-03-26 11:41:04 Publish 'set' 'state' to 'chaut_lumiere' failed: 'Error: Command 0x847127fffe104661/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> ZDO - extRouteDisc - {"dstAddr":56358,"options":0,"radius":30}' failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))'
Error 2023-03-26 11:42:41 Publish 'set' 'state' to 'chaut_lumiere' failed: 'Error: Command 0x847127fffe104661/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> ZDO - extRouteDisc - {"dstAddr":56358,"options":0,"radius":30}' failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))'
Error 2023-03-26 11:43:41 Publish 'set' 'state' to 'chaut_lumiere' failed: 'Error: Command 0x847127fffe104661/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (SREQ '--> ZDO - extRouteDisc - {"dstAddr":56358,"options":0,"radius":30}' failed with status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))'
Error 2023-03-26 11:45:18 Publish 'set' 'state' to 'chaut_lumiere' failed: 'Error: Command 0x847127fffe104661/1 genOnOff.on({}, {"sendWhen":"immediate","timeout":10000,"disableResponse":false,"disableRecovery":false,"disableDefaultResponse":false,"direction":0,"srcEndpoint":null,"reservedBits":0,"manufacturerCode":null,"transactionSequenceNumber":null,"writeUndiv":false}) failed (Timeout - 56358 - 1 - 7 - 6 - 11 after 10000ms)'
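To get a feel for how often these failures happen and which devices are affected, something like the following Python sketch could tally them from a Zigbee2MQTT log. The regular expressions are guesses based only on the error lines quoted above, the sample lines are abbreviated with "...", and `summarize` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns inferred from the error lines quoted in this thread (an assumption,
# not a documented Zigbee2MQTT log format).
PUBLISH_RE = re.compile(r"Publish 'set' '\w+' to '([^']+)' failed")
STATUS_RE = re.compile(r"failed with status '\(0x[0-9a-fA-F]+: (\w+)\)'")


def summarize(lines):
    """Count failed publishes per (device, reason)."""
    counts = Counter()
    for line in lines:
        device = PUBLISH_RE.search(line)
        if device is None:
            continue  # not a failed-publish line
        status = STATUS_RE.search(line)
        if status:
            reason = status.group(1)  # e.g. NWK_TABLE_FULL
        elif "(Timeout" in line:
            reason = "TIMEOUT"
        else:
            reason = "OTHER"
        counts[(device.group(1), reason)] += 1
    return counts


# Abbreviated versions of two of the lines above ("..." marks elided text).
sample = [
    "Error 2023-03-26 11:41:04 Publish 'set' 'state' to 'chaut_lumiere' failed: "
    "'Error: Command ... failed (SREQ '--> ZDO - extRouteDisc - ...' failed with "
    "status '(0xc7: NWK_TABLE_FULL)' (expected '(0x00: SUCCESS)'))'",
    "Error 2023-03-26 11:45:18 Publish 'set' 'state' to 'chaut_lumiere' failed: "
    "'Error: Command ... failed (Timeout - 56358 - 1 - 7 - 6 - 11 after 10000ms)'",
]

for (device, reason), count in sorted(summarize(sample).items()):
    print(device, reason, count)
```

Running it over a full day's log rather than the two sample lines would show whether NWK_TABLE_FULL is concentrated on a few devices or spread across the network.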

cloudbr34k84 commented 1 year ago

@Koenkk same issue. I'm trying to update the firmware using TI Flash Programmer and I get the message below. Any ideas?

Initiate access to target: COM4 using 2-pin cJTAG.
Reading file: C:/Users/cloud/Documents/Zigbee/firmware/CC1352P2_CC2652P_launchpad_coordinator_20221102/CC1352P2_CC2652P_launchpad_coordinator_20221102.hex.
Unknown record type: 3.
Reset target ...
Reset of target successful.
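The "Unknown record type: 3" message presumably refers to an Intel HEX record type; in that format, type 03 is a Start Segment Address record, which some tools do not accept. A small Python sketch like the one below can list which record types a .hex file contains. The three sample records are generic textbook examples, not taken from the firmware file, and `record_types` is a hypothetical helper name.

```python
from collections import Counter

# Record type names defined by the Intel HEX format.
RECORD_TYPES = {
    0: "Data",
    1: "End Of File",
    2: "Extended Segment Address",
    3: "Start Segment Address",
    4: "Extended Linear Address",
    5: "Start Linear Address",
}


def record_types(hex_text):
    """Count Intel HEX record types found in the given file contents."""
    counts = Counter()
    for raw in hex_text.splitlines():
        line = raw.strip()
        if not line.startswith(":"):
            continue  # skip blank or non-record lines
        # Layout: ':' + length (2 hex digits) + address (4) + type (2) + data + checksum
        counts[int(line[7:9], 16)] += 1
    return counts


# Generic example records (not from the firmware discussed here).
sample = "\n".join([
    ":0B0010006164647265737320676170A7",  # type 00, data
    ":0400000300003800C1",                # type 03, start segment address
    ":00000001FF",                        # type 01, end of file
])

for rtype, count in sorted(record_types(sample).items()):
    print(f"type {rtype:02d} ({RECORD_TYPES.get(rtype, 'unknown')}): {count}")
```

To inspect a real file, pass `open(path).read()` to `record_types`. The sketch only reads the type field and does not validate checksums.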

guillaume042 commented 1 year ago

@Koenkk same issue, trying to update firmware using TI flash Programmer. I get this message below?? any ideas?? Initiate access to target: COM4 using 2-pin cJTAG.

Reading file: C:/Users/cloud/Documents/Zigbee/firmware/CC1352P2_CC2652P_launchpad_coordinator_20221102/CC1352P2_CC2652P_launchpad_coordinator_20221102.hex. Unknown record type: 3. Reset target ... Reset of target successful.

Maybe it won't help, but I use this for my Sonoff: docker run --rm --device /dev/ttyUSB0:/dev/ttyUSB0 -e FIRMWARE_URL=https://github.com/Koenkk/Z-Stack-firmware/raw/develop/coordinator/Z-Stack_3.x.0/bin/CC1352P2_CC2652P_launchpad_coordinator_20221102.zip ckware/ti-cc-tool -ewv -p /dev/ttyUSB0 --bootloader-sonoff-usb

ERjouy commented 1 year ago

TI flash programmer doesn't work with the latest firmware (error: Unknown record type: 3.), you need to use, for example, the following python script: https://github.com/JelmerT/cc2538-bsl.

For the Sonoff Dongle-P stick, I used the following command line: python3.10 .\cc2538-bsl.py -evw -p COM3 --bootloader-sonoff-usb "C:\Users\manu\Desktop\CC1352P2_CC2652P_launchpad_coordinator_20221226.hex"

Regarding the disconnections mentioned, I have another Sonoff Dongle-P configured as a router, and it would go offline after a few hours. Reconnecting the dongle would solve the problem. I used zigpy-cli (energy_scan command) to search for a less crowded channel and decided to change the channel from 11 to 25; since then, the dongle has remained online. I changed the channel with the same tool (change-channel command --channel 25).
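As background for the channel change, the Python sketch below computes the centre frequency of 2.4 GHz Zigbee channels (2405 MHz for channel 11, then 5 MHz steps) and checks them against Wi-Fi channels. The 22 MHz Wi-Fi width, the 2 MHz Zigbee width, and the default Wi-Fi channels 1, 6 and 11 are assumptions for illustration; real networks may use other channels or wider bandwidths.

```python
def zigbee_center_mhz(channel):
    """Centre frequency of a 2.4 GHz Zigbee (IEEE 802.15.4) channel, 11..26."""
    if not 11 <= channel <= 26:
        raise ValueError("2.4 GHz Zigbee channels are 11..26")
    return 2405 + 5 * (channel - 11)


def wifi_center_mhz(channel):
    """Centre frequency of a 2.4 GHz Wi-Fi channel, 1..13."""
    if not 1 <= channel <= 13:
        raise ValueError("this sketch only covers Wi-Fi channels 1..13")
    return 2407 + 5 * channel


def overlapping_wifi(zigbee_channel, wifi_channels=(1, 6, 11),
                     wifi_width_mhz=22, zigbee_width_mhz=2):
    """Return the Wi-Fi channels whose band overlaps the given Zigbee channel."""
    # Two bands overlap when their centres are closer than half the summed widths.
    limit = (wifi_width_mhz + zigbee_width_mhz) / 2
    zigbee = zigbee_center_mhz(zigbee_channel)
    return [w for w in wifi_channels if abs(zigbee - wifi_center_mhz(w)) < limit]


for channel in (11, 25, 26):
    print(channel, zigbee_center_mhz(channel), "MHz, overlaps Wi-Fi:",
          overlapping_wifi(channel))
```

Under these assumptions channel 11 sits inside Wi-Fi channel 1, while channels 25 and 26 fall outside Wi-Fi channels 1, 6 and 11. An energy scan, as described above, remains the better guide because it measures what is actually on the air.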

guillaume042 commented 1 year ago

Ok it's getting better. Still got some errors in the logs and the networks is a little 'laggy' but devices are staying online.

Not so stable: I've got 3 routers offline and not responding after a few hours (and more NWK table full errors). A lot of end devices (motion sensors) are also not working even though they show as online.

Edit: I'm trying to find some documentation on this error, but the Zigbee implementation is hard for me to understand. It seems to have something to do with the number of routes in the mesh.

nickrbogdanov commented 1 year ago

Just to follow up: I've been running 20221226 for almost a month and it's been rock solid compared to 20220219. No problems at all with missing reports or dropped commands anymore.

So it's possible that whatever knobs were changed in 20221226 vastly improved reliability on my network topology, while breaking it for others.

It would be useful if the FW (or maybe a sniffer) could gather metrics on whatever anomalies are seen during operation, so that we can zero in on which settings might be to blame for the instability.

Another option might be to make several builds using different parameters, to narrow down which change(s) caused the regression.

cloudbr34k84 commented 1 year ago

TI flash programmer doesn't work with the latest firmware (error: Unknown record type: 3.), you need to use, for example, the following python script: https://github.com/JelmerT/cc2538-bsl.

For the Sonoff Dongle-p key, I used the following command line: python3.10 .\cc2538-bsl.py -evw -p COM3 --bootloader-sonoff-usb "C:\Users\manu\Desktop\CC1352P2_CC2652P_launchpad_coordinator_20221226.hex"

Regarding the disconnections mentioned, I have another Sonoff Dongle-p key configured as a router, and it would become offline after a few hours. Reconnecting the key would solve the problem. I decided to change the channel from 11 to 25 by using zigpy-cli (energy_scan command) to search for a less crowded channel, and since then, the key remains online. I also changed the channel with the same tool (change-channel command --channel 25).

It's not working with any firmware on either Sonoff Dongle I have. I don't know how to use Python. Everything was bloody fine 3 days ago.

guillaume042 commented 1 year ago

image

:-(

Help

guillaume042 commented 1 year ago

I really think it has something to do with the number of devices (no proof, just a feeling), because sometimes a device comes back online but another one goes offline.

nickrbogdanov commented 1 year ago

I really think it has something to do with the number of devices (no proof, just a feeling), because sometimes a device comes back online but another one goes offline.

The main problem I had with the old 20220219 firmware was that the ZC would drop reports from a device for several minutes at a time, and then magically recover. This cycle would keep repeating every 10 minutes or so. This suggested to me that it had trouble juggling the number of (peer?) devices that I had in my network, so it could only maintain state for a subset of them at once.

I have about 40 devices in my network.

guillaume042 commented 1 year ago

I really think it has something to do with the number of devices (no proof, just a feeling), because sometimes a device comes back online but another one goes offline.

The main problem I had with the old 20220219 firmware was that the ZC would drop reports from a device for several minutes at a time, and then magically recover. This cycle would keep repeating every 10 minutes or so. This suggested to me that it had trouble juggling the number of (peer?) devices that I had in my network, so it could only maintain state for a subset of them at once.

I have about 40 devices in my network.

32 routers and 41 end devices; 14 devices offline (a mix of routers and end devices).

zen2 commented 1 year ago

I have been using firmware 20221226 since it appeared on the master branch.

I have practically no problems with this firmware, but:

So far, anyway, these errors haven't caused any specific problems. But I sometimes observe some latency, and overall the devices' LQI is lower than with previous firmwares.

My network:

I wonder:

guillaume042 commented 1 year ago

Hello,

What should I do? Change my coordinator? If yes, to which one? Sorry, I'm stuck with this issue and need to find a way forward.

Regards

cloudbr34k84 commented 1 year ago

Sorry, I've been meaning to jump on this thread. I recently had issues where I was getting this error plus the 4000 & 6000 error, so I opted to redo my network of 130 devices. I also moved to channel 26. I added all my router devices first and everything was fast. As soon as I started adding battery devices, I noticed the Zigbee2MQTT frontend getting slower, especially when I pressed the pair button. Sometimes when I pressed it, it wouldn't respond, and on a few occasions Zigbee2MQTT crashed.

guillaume042 commented 1 year ago

I've just come across this: https://github.com/Koenkk/Z-Stack-firmware/issues/375

@Koenkk is it possible to make it grow a little more?

Koenkk commented 1 year ago

@guillaume042 growing the table more isn't the solution, it's already very big. Probably the ageing has to be changed. I will look into this as soon as TI releases their new SDK (which should be soon).
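To illustrate the general idea of table size versus ageing, here is a toy Python model of a fixed-capacity table whose entries expire after a number of ticks. It is emphatically not how Z-Stack implements its tables (that part of the firmware is proprietary, as noted earlier in the thread); the class `ToyRouteTable`, the function `failures`, and all the numbers are invented for the sketch.

```python
class ToyRouteTable:
    """Fixed-capacity table whose entries expire after max_age ticks (toy model only)."""

    def __init__(self, capacity, max_age):
        self.capacity = capacity
        self.max_age = max_age
        self.entries = {}  # destination -> tick when the entry was last used

    def expire(self, now):
        """Drop entries that have not been used for max_age ticks or more."""
        self.entries = {dest: used for dest, used in self.entries.items()
                        if now - used < self.max_age}

    def add(self, dest, now):
        """Return True if the route was stored, False if the table is full."""
        self.expire(now)
        if dest not in self.entries and len(self.entries) >= self.capacity:
            return False  # loosely analogous to a "table full" status
        self.entries[dest] = now
        return True


def failures(capacity, max_age, destinations=40, ticks=200):
    """Request one route per tick, cycling through the destinations; count rejections."""
    table = ToyRouteTable(capacity, max_age)
    return sum(not table.add(tick % destinations, tick) for tick in range(ticks))


print("slow ageing:", failures(capacity=20, max_age=100))
print("fast ageing:", failures(capacity=20, max_age=10))
```

In this toy setup, with 40 destinations and room for 20 entries, slow ageing lets the first 20 destinations hold their slots indefinitely, so half of the 200 requests are rejected; with fast ageing, stale entries are released before the table fills and nothing is rejected. Whether the real firmware behaves similarly is unknown from this thread.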

guillaume042 commented 1 year ago

@guillaume042 growing the table more isn't the solution, it's already very big. Probably the ageing has to be changed. I will look into this as soon as TI releases their new SDK (which should be soon).

Thank you ! :-)

zen2 commented 1 year ago

@Koenkk Is it possible to know how much of the routing table is in use?

Koenkk commented 1 year ago

@zen2 AFAIK not

Koenkk commented 1 year ago

Let's continue in https://github.com/Koenkk/Z-Stack-firmware/issues/439