espressif / esp-zigbee-sdk

Espressif Zigbee SDK
Apache License 2.0
178 stars 31 forks source link

Network loss (TZ-147) #37

Closed diazmanuel closed 1 year ago

diazmanuel commented 1 year ago

Hello!!

I left the device in a network and the next day it lost the connection and could not reconnect, the signal that reaches the handler is ESP_ZB_BDB_SIGNAL_STEERING with status -1 as if the network to which it was connected had been deleted from the memory

Is there something I can do to solve it or to better diagnose the problem, any recommendations or a way to check if the network was erased from the memory of the chip

chshu commented 1 year ago

@diazmanuel Is it the issue that the network info was erases from flash unexpected? Could you share us the whole log when running into the issue?

diazmanuel commented 1 year ago

@chshu i think thats the issue, this case has happened to me on other occasions but I thought it was due to rewriting the chip, unfortunately I don't have the complete log because it happened at night while I wasn't watching it, I'll try to debug when this happens. On the other hand, isn't there a command to read the memory where the network and its key are stored? so I could check if it was really deleted or the error is due to something else

xieqinan commented 1 year ago

@diazmanuel ,

The ESP-Zigbee-SDK currently does not provide access memory APIs for users.

To better understand your issue, could you please provide more details about the specific environment you're working in? Here are some questions that will help me understand the context:

What is the network topology in your test, and what devices are present in the network? Is the end device experiencing periods of detachment from the Zigbee network, or is it failing to join the network for a certain period of time? Have you performed any operations related to Zigbee memory? In general, if network data is being written, the Zigbee stack may encounter failure during the network initialization process rather than during network steering.

Furthermore, to ensure Zigbee network security, the Zigbee coordinator typically sets a minimum commissioning time. After the coordinator closes the BDB (Base Device Behavior) commissioning, the end devices or routers may fail to steer the network or join the Zigbee network. In such cases, you should ensure that the end device joins the network during the period(It is 180 seconds in esp-zigbee-sdk) or restart the coordinator.

diazmanuel commented 1 year ago

hello @xieqinan , after some research, I think I have found the problem, on the board I was using up to 6 relays were activated, which produced an increase in current and due to the poor quality of the cables, a voltage protection was trigger and the chip reset. After the reset, the chip is unable to re-initialize the zigbee stack, so it is left out of the network. After the reset, the relays are not activated since they are not lach, even though the chip is manually restarted is unable to start the zigbee stack again.

In addition, after the restart, some errors appear in the log

Brownout detector was triggered

ESP-ROM:esp32c6-20220919
Build:Sep 19 2022
rst:0xc (SW_CPU),boot:0x6e (SPI_FAST_FLASH_BOOT)
Saved PC:0x4001975a
SPIWP:0xee
mode:DIO, clock div:2
load:0x40875720,len:0x181c
load:0x4086c410,len:0xd44
load:0x4086e610,len:0x2dbc
entry 0x4086c41a
I (27) boot: ESP-IDF v5.2-dev-823-g903af13e84 2nd stage bootloader
I (27) boot: compile time Jun 21 2023 11:41:52
I (29) boot: chip revision: v0.0
I (31) boot.esp32c6: SPI Speed      : 40MHz
I (36) boot.esp32c6: SPI Mode       : DIO
I (41) boot.esp32c6: SPI Flash Size : 2MB
I (46) boot: Enabling RNG early entropy source...
I (51) boot: Partition Table:
I (54) boot: ## Label            Usage          Type ST Offset   Length
I (62) boot:  0 nvs              WiFi data        01 02 00009000 00006000
I (69) boot:  1 phy_init         RF data          01 01 0000f000 00001000
I (77) boot:  2 factory          factory app      00 00 00010000 00089800
I (84) boot:  3 zb_storage       Unknown data     01 81 0009a000 00004000
I (92) boot:  4 zb_fct           Unknown data     01 81 0009e000 00000400
I (99) boot: End of partition table
I (103) esp_image: segment 0: paddr=00010020 vaddr=42058020 size=0d680h ( 54912) map
I (133) esp_image: segment 1: paddr=0001d6a8 vaddr=40800000 size=02970h ( 10608) load
I (139) esp_image: segment 2: paddr=00020020 vaddr=42000020 size=54184h (344452) map
I (272) esp_image: segment 3: paddr=000741ac vaddr=40802970 size=0af44h ( 44868) load
I (292) esp_image: segment 4: paddr=0007f0f8 vaddr=4080d8c0 size=01874h (  6260) load
I (303) boot: Loaded app from partition at offset 0x10000
I (304) boot: Disabling RNG early entropy source...
I (315) cpu_start: Unicore app
I (315) cpu_start: Pro cpu up.
W (325) clk: esp_perip_clk_init() has not been implemented yet
I (331) cpu_start: Pro cpu start user code
I (332) cpu_start: cpu freq: 160000000 Hz
I (332) cpu_start: Application information:
I (335) cpu_start: Project name:     HA_customized_switch
I (341) cpu_start: App version:      1
I (345) cpu_start: Compile time:     Jun 21 2023 11:41:43
I (351) cpu_start: ELF file SHA256:  5737b0cac...
I (356) cpu_start: ESP-IDF:          v5.2-dev-823-g903af13e84
I (363) cpu_start: Min chip rev:     v0.0
I (368) cpu_start: Max chip rev:     v0.99 
I (372) cpu_start: Chip rev:         v0.0
I (377) heap_init: Initializing. RAM available for dynamic allocation:
I (384) heap_init: At 408155A0 len 00067070 (412 KiB): D/IRAM
I (391) heap_init: At 4087C610 len 00002F54 (11 KiB): STACK/DIRAM
I (398) heap_init: At 50000010 len 00003FF0 (15 KiB): RTCRAM
I (405) spi_flash: detected chip: generic
I (409) spi_flash: flash io: dio
W (412) spi_flash: Detected size(4096k) larger than the size in the binary image header(2048k). Using the size in the binary image header.
I (426) sleep: Configure to isolate all GPIO pins in sleep state
I (432) sleep: Enable automatic switching of GPIO sleep configuration
I (439) coexist: coex firmware version: ebddf30
I (445) coexist: coexist rom version 5b8dcfa
I (450) app_start: Starting scheduler on CPU0
I (455) main_task: Started on CPU0
I (455) main_task: Calling app_main()
E (465) task_wdt: esp_task_wdt_init(593): TWDT already initialized
I (465) gpio: GPIO[4]| InputEn: 1| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (475) gpio: GPIO[2]| InputEn: 0| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (485) gpio: GPIO[3]| InputEn: 0| OutputEn: 1| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (495) gpio: GPIO[21]| InputEn: 1| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:2 
I (505) taskPerifericos: Perifericos Handler Task Created
I (505) phy_init: phy_version 202,b4b3263,May 17 2023,20:14:14
I (525) phy: error: chgp_cal no done!!!
error: pll_cal exceeds 2ms!!!
error: pll_cal exceeds 2ms!!!
I (615) taskZigbee: Zigbee Handler Task Created
I (625) taskZigbee: ZDO signal: 23, status: -1
I (625) taskZigbee: Zigbee stack initialized
I (715) taskPerifericos: perifericos Handler Task Initialized
W (1205) taskZigbee: Failed to initialize Zigbee stack (status: -1)
W (1785) taskZigbee: Failed to initialize Zigbee stack (status: -1)
W (2375) taskZigbee: Failed to initialize Zigbee stack (status: -1)
W (2955) taskZigbee: Failed to initialize Zigbee stack (status: -1)
W (3535) taskZigbee: Failed to initialize Zigbee stack (status: -1)
xieqinan commented 1 year ago

@diazmanuel , The issue can be explained as follows:

The overload current may cause the network data to be erased from the NVRAM (memory) of your Zigbee end device. In this scenario, the device mistakenly believes it still exists on the network and sends a "REBOOT SIGNAL" to request the application layer to reboot and reinitialize. You can use the log statement ESP_LOGW(TAG, "Failed to initialize Zigbee stack (signal %s, status: %s)", esp_zb_zdo_signal_to_string(sig_type), esp_err_to_name(err_status)); to verify this assumption.

diazmanuel commented 1 year ago

@xieqinan mm is something different, in fact, it does not lose the data as such, since after leaving it off for some time, it reconnects to the network, I think it has to do with the error that is displayed in the login, although I do not know what it is due to, but when that error disappears in the log, the system works again and connects correctly to the network without having to add it to said network again

I (525) phy: error: chgp_cal no done!!!
error: pll_cal exceeds 2ms!!!
error: pll_cal exceeds 2ms!!!
xieqinan commented 1 year ago

@diazmanuel , The problem seems to be quite tricky. If you don't mind, please copy the code below and replace the existing function void esp_phy_load_cal_and_init(void) in the ~/esp-idf/components/esp_phy/src/phy_init.c file. More helpful log message will be printed to address the problem. Please provide me with the specific log for both the rejoin success and initialization failure cases.

extern void pbus_print(void);
extern void phy_cal_print(void);
extern void phy_i2c_check(void);
extern void phy_reg_check(void);

void esp_phy_load_cal_and_init(void)
{
    char * phy_version = get_phy_version_str();
    ESP_LOGI(TAG, "phy_version %s", phy_version);

#if CONFIG_IDF_TARGET_ESP32S2
    phy_eco_version_sel(efuse_hal_chip_revision() / 100);
#endif
    esp_phy_calibration_data_t* cal_data =
            (esp_phy_calibration_data_t*) calloc(sizeof(esp_phy_calibration_data_t), 1);
    if (cal_data == NULL) {
        ESP_LOGE(TAG, "failed to allocate memory for RF calibration data");
        abort();
    }

#if CONFIG_ESP_PHY_REDUCE_TX_POWER
    const esp_phy_init_data_t* phy_init_data = esp_phy_get_init_data();
    if (phy_init_data == NULL) {
        ESP_LOGE(TAG, "failed to obtain PHY init data");
        abort();
    }

    esp_phy_init_data_t* init_data = (esp_phy_init_data_t*) malloc(sizeof(esp_phy_init_data_t));
    if (init_data == NULL) {
        ESP_LOGE(TAG, "failed to allocate memory for phy init data");
        abort();
    }

    memcpy(init_data, phy_init_data, sizeof(esp_phy_init_data_t));
    if (esp_reset_reason() == ESP_RST_BROWNOUT) {
        esp_phy_reduce_tx_power(init_data);
    }
#else
    const esp_phy_init_data_t* init_data = esp_phy_get_init_data();
    if (init_data == NULL) {
        ESP_LOGE(TAG, "failed to obtain PHY init data");
        abort();
    }
#endif

#if CONFIG_ESP_PHY_ENABLE_USB
    phy_bbpll_en_usb(true);
#endif

#ifdef CONFIG_ESP_PHY_CALIBRATION_AND_DATA_STORAGE
    esp_phy_calibration_mode_t calibration_mode = CONFIG_ESP_PHY_CALIBRATION_MODE;
    uint8_t sta_mac[6];
    if (esp_rom_get_reset_reason(0) == RESET_REASON_CORE_DEEP_SLEEP) {
        calibration_mode = PHY_RF_CAL_NONE;
    }
    esp_err_t err = esp_phy_load_cal_data_from_nvs(cal_data);
    if (err != ESP_OK) {
        ESP_LOGW(TAG, "failed to load RF calibration data (0x%x), falling back to full calibration", err);
        calibration_mode = PHY_RF_CAL_FULL;
    }

    ESP_ERROR_CHECK(esp_efuse_mac_get_default(sta_mac));
    memcpy(cal_data->mac, sta_mac, 6);
    esp_err_t ret = register_chipv7_phy(init_data, cal_data, calibration_mode);
    if (ret == ESP_CAL_DATA_CHECK_FAIL) {
        ESP_LOGW(TAG, "saving new calibration data because of checksum failure, mode(%d)", calibration_mode);
    }

    if ((calibration_mode != PHY_RF_CAL_NONE && err != ESP_OK) ||
            (calibration_mode != PHY_RF_CAL_FULL && ret == ESP_CAL_DATA_CHECK_FAIL)) {
        err = esp_phy_store_cal_data_to_nvs(cal_data);
    } else {
        err = ESP_OK;
    }
#else
    register_chipv7_phy(init_data, cal_data, PHY_RF_CAL_FULL);
#endif

#if CONFIG_ESP_PHY_IMPROVE_RX_11B
    ESP_LOGW(TAG, "PHY enable improve rx 11b");
    phy_improve_rx_special(true);
#endif

#if CONFIG_ESP_PHY_REDUCE_TX_POWER
    esp_phy_release_init_data(phy_init_data);
    free(init_data);
#else
    esp_phy_release_init_data(init_data);
#endif

    ESP_ERROR_CHECK(esp_deep_sleep_register_hook(&phy_close_rf));
#if !CONFIG_IDF_TARGET_ESP32
    ESP_ERROR_CHECK(esp_deep_sleep_register_hook(&phy_xpd_tsens));
#endif

    pbus_print();
    phy_cal_print();
    phy_i2c_check();
    phy_reg_check();

    free(cal_data); // PHY maintains a copy of calibration data, so we can free this
}
zwx1995esp commented 1 year ago

Tips: this four functions may give out a huge logs(maybe 600+ lines).

xieqinan commented 1 year ago

Hello,

I (525) phy: error: chgp_cal no done!!!
error: pll_cal exceeds 2ms!!!
error: pll_cal exceeds 2ms!!!

Does the phenomenon still occur on your device? If you have any questions regarding the issue, feel free to discuss them here. Let's work together to resolve it.

The problem seems to be quite tricky. If you don't mind, please copy the code below and replace the existing function void esp_phy_load_cal_and_init(void) in the ~/esp-idf/components/esp_phy/src/phy_init.c file. More helpful log message will be printed to address the problem. Please provide me with the specific log for both the rejoin success and initialization failure cases.

Furthermore, if you have managed to capture logs related to the issue, please provide them for us. Your logs would be highly valuable and helpful in troubleshooting and resolving the problem.

diazmanuel commented 1 year ago

hi @xieqinan At the moment I changed the way I feed my board so I am not having this problem, but if I feed it again in the other way it continues to happen,at the moment I couldn't modify this parameter because it delays the rest of the development, I am currently testing the system and I have some network loss problems. I will try to recover the logs (since the problem occurs after a couple of days) and add it here. I ask you not to close the issue and give me time to capture these new logs

xieqinan commented 1 year ago

Hi @diazmanuel , Thank you for your support regarding the issue. The problem will be monitored and maintained until it has been fixed.

diazmanuel commented 1 year ago

hi @xieqinan , After a few weeks of leaving the device connected, I was able to conclude that the problem is not a loss of the network. the error only occurs due to the low quality of the power supply, which ends up triggering a chip protection. For some unknown reason, once the chip restarts due to this protection, it is unable to initialize the Zigbee stack (as I mentioned earlier with the Logs) and thanks to the aging timeout that was set at 64 minutes, the device "lost" the connection to the network.

In summary, the problem is not a loss of connection, but rather the initialization of the zigbee stack after this protection is triggered.

chshu commented 1 year ago

@diazmanuel Thanks for the update, I will close this issue since it's actually power supply issue. Let us know if any further questions.