espressif / esp-zigbee-sdk

Espressif Zigbee SDK
Apache License 2.0
143 stars 23 forks source link

HA_color_dimmable_light example does not properly rejoin network after reboot (TZ-49) #1

Closed jurriaan closed 1 year ago

jurriaan commented 1 year ago

I'm currently trying the examples on a ESP32-C6, and it seems like the HA_color_dimmable_light does not properly rejoin the network after joining for the first time.

I do have the idea that it is due to the following code:

https://github.com/espressif/esp-zigbee-sdk/blob/b8dd9c5a249cc40208ef000aa1917f7ddc149a5f/examples/esp_zigbee_HA_sample/HA_color_dimmable_light/main/esp_zb_light.c#L87-L95

Which is the opposite way of handling compared to the esp_HA_customized_switch:

https://github.com/espressif/esp-zigbee-sdk/blob/b8dd9c5a249cc40208ef000aa1917f7ddc149a5f/examples/esp_zigbee_customized_devices/customized_client/main/esp_HA_customized_switch.c#L189-L209

After replacing the code in HA_color_dimmable_light with that setup (so that it starts network steering when the err_status != ESP_OK), it properly rejoins (except for some crashes on assert failed: zb_esp_radio_process zb_esp_mac.c:290 (false) panics, see https://github.com/espressif/esp-zboss-lib/issues/1).

likunqiao097304 commented 1 year ago

@jurriaan could you explain little more about how you make color dimmable light example to first joined and then failed to rejoin later? How the zigbee devices and roles you are setting? It would be good from my side to duplicate your failure situation. BTW, I tried run a color dimmable light and a color dimmable switch example on two esp32c6 board, it seems okay.

likunqiao097304 commented 1 year ago

@jurriaan also, how do you make the color_dimmable_light example int to err_status !=OK when in the case ESP_ZB_BDB_SIGNAL_DEVICE_FIRST_START/ESP_ZB_BDB_SIGNAL_DEVICE_REBOOT? Maybe you could try to do the idf.py -p PORT erase-flash before flash the light example? It will clean the NVRAM that saved before.

jurriaan commented 1 year ago

@jurriaan could you explain little more about how you make color dimmable light example to first joined and then failed to rejoin later? How the zigbee devices and roles you are setting? It would be good from my side to duplicate your failure situation. BTW, I tried run a color dimmable light and a color dimmable switch example on two esp32c6 board, it seems okay.

I will do some more tests tomorrow, but what I did was I paired the color dimmable light to my existing zigbee network. When I press reset/restart the ESP it disconnected and didn't rejoin the network. I had to permit joining to the network and pair the device again.

I found this behavior weird, so I started to look into the other examples, that's when I noticed that esp_HA_customized_switch.c has different logic in the handling of the ESP_ZB_BDB_SIGNAL_DEVICE_REBOOT signal. It basically did exactly the opposite of the HA_color_dimmable_light:

In the HA_color_dimmable_light, on a ESP_ZB_BDB_SIGNAL_DEVICE_REBOOT signal, when the status is ESP_OK, it starts network steering, while in the esp_HA_customized_switch it starts the networks steering when the status not is equal to ESP_OK

When I changed the HA_color_dimmable_light example so the condition was reversed, so network steering starts only when the status != ESP_OK, it started to rejoin properly on device restart.

likunqiao097304 commented 1 year ago

I will do some more tests tomorrow, but what I did was I paired the color dimmable light to my existing zigbee network. When I press reset/restart the ESP it disconnected and didn't rejoin the network. I had to permit joining to the network and pair the device again.

It is weird that when you reset/restart the ESP it always cause the ESP_ZB_BDB_SIGNAL_DEVICE_FIRST_START/ESP_ZB_BDB_SIGNAL_DEVICE_REBOOT error feedback. If you have two esp32c6 board, could you try one flash with color_dimmable_light, one flash with color_dimmable_switch, to try reset/restart if it could rejoin, then let's start to compare what is the different log or behavior between us.

I will check from my side to duplicate failure condition for the status !=ESP_OK, also check rejoin process if I could find.

jurriaan commented 1 year ago

I'm going to do some more test with the original color_dimmable_light, since I just noticed I have changed more of the original example during my tests.

It seems that calling esp_zb_start with true (auto_start enabled) also changes the behavior of the statuses returned. I enabled auto_start, since that's also what esp_HA_customized_switch does, and it seemed to me based on the documentation that it causes the zigbee stack to use the NVRAM to load/store parameters, but I might be interpreting it wrongly.

It initializes, load some parameters from NVRAM and proceed with startup. Startup means either Formation (for ZC), rejoin or discovery/association join.

I also modified the example to run in router mode instead of end device (since router mode is the common thing to use for powered lights).

I will do new tests with the original example and report back!

likunqiao097304 commented 1 year ago

I also modified the example to run in router mode instead of end device (since router mode is the common thing to use for powered lights).

This shouldn't be a issue. You can choose light to be a router or end-device

jurriaan commented 1 year ago

Report of my findings using the original example

Tested on esp-zigbee-sdk 1a52786a4dd972c153ffd77e4da8a3d33fa4eaac, esp-idf 732df630cd56826bbf79da6643f9aa18c33a17cca with esp-zboss-lib 0.3.1.

Initially I couldn't add the device to my network, noticed the default channel was 13, but my network is on 15. I updated the zb_fct partition with an updated channel mask of 0x8000, and now it seems like it starts to join in the ZHA integration of home assistant (which manages my zigbee network), however, during the initial phase of joining (device found - starting interview in home assistant) the esp crashes on the assert issue. This means it continuously tries to join my network, with a different NWK value.

Then I added esp_zb_set_network_channel(15); directly after the esp_zb_init() call and erased the flash to clear the zb_fct partition. After this change it doesn't immediatly crash with the assert issue but successfully joins to the network and starts working:

I (567) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 23, status: -1
I (567) ESP_ZB_COLOR_DIMM_LIGHT: Zigbee stack initialized
I (567) ESP_ZB_COLOR_DIMM_LIGHT: Start network steering
I (2517) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 50, status: 0
I (2527) ESP_ZB_COLOR_DIMM_LIGHT: Joined network successfully (Extended PAN ID: e5:18:52:1b:4a:ed:c0:01, PAN ID: 0x4d90)

When I restart the device, I do get a similar output to the above, which does give me the idea that it tries to fully rejoin the network every boot (please correct me if I'm wrong).

I disable permit join on my network, since the device should've been joined successfully now and restart the device again.

Now it stays in a loop of trying to join the network and crashing on the assert issue:

I (567) main_task: Returned from app_main()
I (577) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 23, status: -1
I (577) ESP_ZB_COLOR_DIMM_LIGHT: Zigbee stack initialized
I (577) ESP_ZB_COLOR_DIMM_LIGHT: Start network steering
I (3227) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (6897) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (10577) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (14247) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (17917) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (21587) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (25257) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (28927) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (32597) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (36267) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (39937) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (43607) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)

assert failed: zb_esp_radio_process zb_esp_mac.c:290 (false) ......

I'm not fully aware of how this works internally, but I do get the idea that this example does not properly persist the network parameters (or something) and just tries to join a new network every boot.

Next thing I tried is using esp_zb_start(true) instead of esp_zb_start(false) and flipping the condition in the ESP_ZB_BDB_SIGNAL_DEVICE_REBOOT case. (based on the esp_HA_customized_switch` example)

I rejoin the device to the network:

I (567) main_task: Returned from app_main()
I (577) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 23, status: -1
I (3227) ESP_ZB_COLOR_DIMM_LIGHT: Start network steering
I (5887) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (9557) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)
I (12427) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 50, status: 0
I (12457) ESP_ZB_COLOR_DIMM_LIGHT: Joined network successfully (Extended PAN ID: e5:18:52:1b:4a:ed:c0:01, PAN ID: 0x4d90)

Which worked just fine, so I disable permit join on my network and then try to reboot the device.

After the reboot the device becomes available immediately in my network!

I (464) main_task: Returned from app_main()
I (474) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 23, status: -1
I (2324) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 50, status: 0
E (7424) ESP_ZB_COLOR_DIMM_LIGHT: Failed to initialize Zigbee stack (status: 0)
I (52964) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 50, status: 0
E (53664) ESP_ZB_COLOR_DIMM_LIGHT: Failed to initialize Zigbee stack (status: 0)
I (102794) ESP_ZB_COLOR_DIMM_LIGHT: Light color x change to:24903
I (102794) ESP_ZB_COLOR_DIMM_LIGHT: Light color y change to:11140
I (105634) ESP_ZB_COLOR_DIMM_LIGHT: Light level change to:173

The error ESP_ZB_COLOR_DIMM_LIGHT: Failed to initialize Zigbee stack (status: 0) is a bit weird, but the device now works properly, and survives restarts without the need to fully rejoin to the network.

I still see the assert issue happening a few times, it did connect properly a few times, but after some reboots it ended up with this state again unfortunately:

I (567) main_task: Returned from app_main()
I (577) ESP_ZB_COLOR_DIMM_LIGHT: ZDO signal: 23, status: -1
I (3227) ESP_ZB_COLOR_DIMM_LIGHT: Start network steering
I (5887) ESP_ZB_COLOR_DIMM_LIGHT: Network steering was not successful (status: -1)

Just a guess here, but can it be that after a connection issue it forgets to connect to the network and tries to join a new network?

This is the final diff of the example:

diff --git a/examples/esp_zigbee_HA_sample/HA_color_dimmable_light/main/esp_zb_light.c b/examples/esp_zigbee_HA_sample/HA_color_dimmable_light/main/esp_zb_light.c
index 7f66485..2df9a85 100644
--- a/examples/esp_zigbee_HA_sample/HA_color_dimmable_light/main/esp_zb_light.c
+++ b/examples/esp_zigbee_HA_sample/HA_color_dimmable_light/main/esp_zb_light.c
@@ -86,7 +86,7 @@ void esp_zb_app_signal_handler(esp_zb_app_signal_t *signal_struct)
         break;
     case ESP_ZB_BDB_SIGNAL_DEVICE_FIRST_START:
     case ESP_ZB_BDB_SIGNAL_DEVICE_REBOOT:
-        if (err_status == ESP_OK) {
+        if (err_status != ESP_OK) {
             ESP_LOGI(TAG, "Start network steering");
             esp_zb_bdb_start_top_level_commissioning(ESP_ZB_BDB_MODE_NETWORK_STEERING);
         } else {
@@ -117,12 +117,13 @@ static void esp_zb_task(void *pvParameters)
     /* initialize Zigbee stack with Zigbee end-device config */
     esp_zb_cfg_t zb_nwk_cfg = ESP_ZB_ZED_CONFIG();
     esp_zb_init(&zb_nwk_cfg);
+    esp_zb_set_network_channel(15);
     /* set the color dimmable light device config */
     esp_zb_color_dimmable_light_cfg_t light_cfg = ESP_ZB_DEFAULT_COLOR_DIMMABLE_LIGHT_CONFIG();
     esp_zb_ep_list_t *esp_zb_color_dimmable_light_ep = esp_zb_color_dimmable_light_ep_create(HA_ESP_LIGHT_ENDPOINT, &light_cfg);
     esp_zb_device_register(esp_zb_color_dimmable_light_ep);
     esp_zb_device_add_set_attr_value_cb(attr_cb);
-    ESP_ERROR_CHECK(esp_zb_start(false));
+    ESP_ERROR_CHECK(esp_zb_start(true));
     esp_zb_main_loop_iteration();
 }
likunqiao097304 commented 1 year ago

I'm not fully aware of how this works internally, but I do get the idea that this example does not properly persist the network parameters (or something) and just tries to join a new network every boot.

This is what I am checking for.

After the reboot the device becomes available immediately in my network!

Good to hear it works. For the assert failed as I mentioned in the esp-zboss-lib, we will fix that.

jurriaan commented 1 year ago

Good to hear it works. For the assert failed as I mentioned in the esp-zboss-lib, we will fix that.

It does work initially in that configuration, but it does unfortunately lose connection again a few reboots/crashes later.

As mentioned can it be that also after a connection issue it forgets how to connect to the network and tries to join a new network?

Thanks for the quick responses! :)

likunqiao097304 commented 1 year ago

Good to hear it works. For the assert failed as I mentioned in the esp-zboss-lib, we will fix that.

It does work initially in that configuration, but it does unfortunately lose connection again a few reboots/crashes later.

As mentioned can it be that also after a connection issue it forgets how to connect to the network and tries to join a new network?

Thanks for the quick responses! :)

Rejoin issue has been fixed in the latest release, please take a look

jurriaan commented 1 year ago

Nice! it seems like the issue is fixed!

likunqiao097304 commented 1 year ago

Nice! it seems like the issue is fixed!

Could you close the issue if you see this is fixed