espressif / esp-matter

Espressif's SDK for Matter
Apache License 2.0
679 stars 155 forks source link

Groups don't survive reboot (CON-863) #713

Open jonsmirl opened 11 months ago

jonsmirl commented 11 months ago

Maybe an IPv6 issue?

1) Make a Matter group with a Device A as a light switch 2) Bind the switch to the group 3) Add Device B light to group -- both devices are on same wifi router. 4) Verify that Device A can send commands to device B. Like toggle the light on/off. 5) this works for me as expected,

5) Reboot device A and don't reboot Device B 6) Try using the grouped switch on Device A For me it doesn't work. 7) Reboot the wifi router. I have Amazon Eero router. 8) Now switch will work.

Is something wrong with how the IPv6 multicast was constructed? The CHIP log on device A says it is sending the request, but device B does not receive it.

I will need to build a promiscuous wifi sniffer to see if Device A is sending the packets. Right now I can't tell if Device A is not sending or if the router is not retransmitting. You guys likely have the right equipment around to check this.

Update -- MDNS works after reboot. Note that Matter is using IPv6 groups. But is MDNS using IPv6 or IPv4?

wqx6 commented 11 months ago

I cannot reproduce this error. Could you please share the commands you sent to the switch and the light and check if there is the log of chip[IN]: Joining Multicast Group with address UDP after you reboot the switch?

jonsmirl commented 11 months ago

My current opinion is that it is the router causing this. IPv6 groups did not work at all on my old router. Now I have bought a new Amazon Eero one. If we assume it is the router, is there anything that can be done from the ESP side to force the router to rebuild the group? I am still testing to try and understand what is happening.

It would be very useful if CHIP people put together some Android/IOS apps which could be used to test IPv6 multicast groups. Then you could use two or three phones to test routers. I suspect there are a lot of home routers with various IPv6 bugs.

jonsmirl commented 11 months ago

Can you try reflashing when you reboot? I have not changed the fabric ID and group number. I did not erase the partition with the commissioning data in it.

What I think is going on that when the node reboots and hooks to the multicast group, it is not hooking to the same group as before the reboot. Instead it is making a new one which is not connected to the previous group. I tried using my wifi sniffer to look at it, but everything is encrypted so I don't know what is going on.

I have six Eeros connected over Ethernet backbone. So the multi-cast packets have to propagate through that network for things to work right.

jonsmirl commented 11 months ago

What are these new errors, I started getting them after I updated CHIP.

I (359340729) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number  0x0000000000070049 due to overflow: event priority_level: 1
I (359340989) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number  0x000000000007004A due to overflow: event priority_level: 1
I (359341149) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number  0x000000000007004B due to overflow: event priority_level: 1
I (359337479) chip[EM]: <<< [E:36749i S:34997 M:121391162] (S) Msg TX to 2:AB0DE00000000000 [4507] [UDP:192.168.1.72%st1:48013] --- Type 0001:05 (IM:ReportData)
I (359337619) chip[EM]: >>> [E:36749i S:34997 M:160426002 (Ack:121391162)] (S) Msg RX from 2:AB0DE00000000000 [4507] --- Type 0001:01 (IM:StatusResponse)
I (359337619) chip[IM]: Received status response, status is 0x00
I (359337619) chip[EM]: <<< [E:36749i S:34997 M:121391163 (Ack:160426002)] (S) Msg TX to 2:AB0DE00000000000 [4507] [UDP:192.168.1.72%st1:48013] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (359340719) h: active_key 4
I (359340719) ep: Initial button pressed 1
I (359340719) esp_matter_attribute: ********** W : Endpoint 0x0005's Cluster 0x0000003B's Attribute 0x00000001 is 1 **********
I (359340719) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 EP 5 CL 59 AT 1
I (359340719) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 1008009233 4
I (359340719) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 17 4
I (359340719) ep: app_driver_attribute_update EP 5 CL 59 AT 1 PTR 0x3c159138
I (359340729) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 EP 5 CL 59 AT 1
I (359340729) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 btype 1008009233 4
I (359340729) chip[ZCL]: SwitchServer: OnInitialPress
I (359340729) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number  0x0000000000070049 due to overflow: event priority_level: 1
I (359340739) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x0000003B's Attribute 0x00000001 is 1 **********
I (359340739) chip[EM]: <<< [E:36750i S:34997 M:121391164] (S) Msg TX to 2:AB0DE00000000000 [4507] [UDP:192.168.1.72%st1:48013] --- Type 0001:05 (IM:ReportData)
I (359340839) chip[EM]: >>> [E:36750i S:34997 M:160426003 (Ack:121391164)] (S) Msg RX from 2:AB0DE00000000000 [4507] --- Type 0001:01 (IM:StatusResponse)
I (359340839) chip[IM]: Received status response, status is 0x00
I (359340839) chip[EM]: <<< [E:36750i S:34997 M:121391165 (Ack:160426003)] (S) Msg TX to 2:AB0DE00000000000 [4507] [UDP:192.168.1.72%st1:48013] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (359340959) h: active_key 4
I (359340959) ep: Short button release 1
I (359340959) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00000000 is 0 **********
I (359340959) chip[ZCL]: Toggle ep5 on/off from state 0 to 1
I (359340959) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x0000FFFC is 1 **********
I (359340959) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00004001 is 0 **********
I (359340959) chip[ZCL]: On Command - OffWaitTime :  0
I (359340959) esp_matter_attribute: ********** W : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00004002 is 0 **********
I (359340959) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 EP 5 CL 6 AT 16386
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 1008009233 4
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 17 4
I (359340969) ep: app_driver_attribute_update EP 5 CL 6 AT 16386 PTR 0x3c159138
I (359340969) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 EP 5 CL 6 AT 16386
I (359340969) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 btype 1008009233 4
I (359340969) chip[ZCL]: On/Toggle Command - Stop Timer
I (359340969) esp_matter_attribute: ********** W : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00004000 is 1 **********
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 EP 5 CL 6 AT 16384
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 1008009233 4
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 17 4
I (359340969) ep: app_driver_attribute_update EP 5 CL 6 AT 16384 PTR 0x3c159138
I (359340969) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 EP 5 CL 6 AT 16384
I (359340969) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 btype 1008009233 4
I (359340969) esp_matter_attribute: ********** W : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00000000 is 1 **********
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 EP 5 CL 6 AT 0
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 1008009233 4
I (359340969) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 17 4
I (359340969) ep: app_driver_attribute_update EP 5 CL 6 AT 0 PTR 0x3c159138
I (359340969) ep: Power 1 RGB 255 204 165
I (359340969) ep: Update button position 5 power 1 R 255 G 204 B 165
I (359340969) ep: Button pos 5 LED 2:  63 41 51, rgb 255 204 165
I (359340979) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 EP 5 CL 6 AT 0
I (359340979) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 btype 1008009233 4
I (359340979) esp_matter_attribute: ********** W : Endpoint 0x0005's Cluster 0x00000005's Attribute 0x00000003 is 0 **********
I (359340979) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 EP 5 CL 5 AT 3
I (359340979) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 1008009233 4
I (359340979) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 17 4
I (359340979) ep: app_driver_attribute_update EP 5 CL 5 AT 3 PTR 0x3c159138
I (359340979) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 EP 5 CL 5 AT 3
I (359340979) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 btype 1008009233 4
I (359340979) esp_matter_attribute: ********** W : Endpoint 0x0005's Cluster 0x0000003B's Attribute 0x00000001 is 0 **********
I (359340979) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 EP 5 CL 59 AT 1
I (359340979) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 1008009233 4
I (359340979) ep: app_attribute_update_cb type 0 priv 0x3fcb5d94 btype 17 4
I (359340979) ep: app_driver_attribute_update EP 5 CL 59 AT 1 PTR 0x3c159138
I (359340979) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 EP 5 CL 59 AT 1
I (359340979) ep: app_attribute_update_cb type 1 priv 0x3fcb5d94 btype 1008009233 4
I (359340979) chip[ZCL]: SwitchServer: OnShortRelease
I (359340989) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number  0x000000000007004A due to overflow: event priority_level: 1
I (359340989) ep: group_cb
I (359340989) chip[-]: Group CB cluster 6 command 1
I (359340999) chip[EM]: <<< [E:36751i S:549 M:133068707] (G) Msg TX to 2:FFFFFFFFFFFF0225 [4507] [UDP:[FF35:40:FDFA:B000::1D00:225]:5540] --- Type 0001:08 (IM:InvokeCommandRequest)
I (359340999) ep: Command On
I (359341139) h: active_key 4
I (359341139) ep: Multipress Complete 1
I (359341149) chip[ZCL]: SwitchServer: OnMultiPressComplete
I (359341149) chip[EVL]: Dropped 1 event from buffer with priority 1 and event number  0x000000000007004B due to overflow: event priority_level: 1
I (359341749) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000005's Attribute 0x00000006 is 0 **********
I (359341749) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000005's Attribute 0x00000003 is 0 **********
I (359341749) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00000000 is 1 **********
I (359341749) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00004000 is 1 **********
I (359341749) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x00000006's Attribute 0x00004002 is 0 **********
I (359341759) esp_matter_attribute: ********** R : Endpoint 0x0005's Cluster 0x0000003B's Attribute 0x00000001 is 0 **********
I (359341769) chip[EM]: <<< [E:36752i S:34997 M:121391166] (S) Msg TX to 2:AB0DE00000000000 [4507] [UDP:192.168.1.72%st1:48013] --- Type 0001:05 (IM:ReportData)
I (359341839) chip[EM]: >>> [E:36752i S:34997 M:160426004 (Ack:121391166)] (S) Msg RX from 2:AB0DE00000000000 [4507] --- Type 0001:01 (IM:StatusResponse)
I (359341839) chip[IM]: Received status response, status is 0x00
I (359341849) chip[EM]: <<< [E:36752i S:34997 M:121391167 (Ack:160426004)] (S) Msg TX to 2:AB0DE00000000000 [4507] [UDP:192.168.1.72%st1:48013] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (359341869) app_main: Dump #17967 Free internal 108592 biggest block 43008 total 108592
I (359350589) ROUTE_HOOK: Ignore invalid ICMP packet
I (359350589) ROUTE_HOOK: Received RIO
I (359350589) ROUTE_HOOK: prefix FD87:31C3:3CE

here's my code calling the APIs

       // short press release
        ESP_LOGI(TAG, "Short button release %d", position);

        app::CommandHandler commandObj(nullptr);
        // Press moves Position from 0 (idle) to 1 (press)
        lock::chip_stack_lock(portMAX_DELAY);
        if (position == UP)
        {
            OnOff::Commands::On::DecodableType commandData;
            app::ConcreteCommandPath commandPath = {epLightID, OnOff::Id, OnOff::Commands::On::Id};
            emberAfOnOffClusterOnCallback(&commandObj, commandPath, commandData);
        }
        else
        {
            OnOff::Commands::Off::DecodableType commandData;
            app::ConcreteCommandPath commandPath = {epLightID, OnOff::Id, OnOff::Commands::Off::Id};
            emberAfOnOffClusterOffCallback(&commandObj, commandPath, commandData);
        }
        chip::app::Clusters::Switch::Attributes::CurrentPosition::Set(epLightID, 0);
        // LongPress event takes newPosition as event data
        switch_cluster::event::send_short_release(epLightID, position);
        lock::chip_stack_unlock();

        if (position == UP)
        {
            OnOff::Commands::On::Type commandDataT;

            client::command_handle_t cmd_handle;
            uint16_t local_endpoint_id = epLightID;
            cmd_handle.cluster_id = OnOff::Id;
            cmd_handle.command_id = OnOff::Commands::On::Id;
            cmd_handle.is_group = true;
            cmd_handle.command_data = &commandDataT;

            lock::status_t lock_status = lock::chip_stack_lock(portMAX_DELAY);
            if (lock_status == lock::FAILED)
            {
                ESP_LOGE(TAG, "Could not lock chip stack");
            }
            client::cluster_update(local_endpoint_id, &cmd_handle);
            if (lock_status == lock::SUCCESS)
            {
                lock::chip_stack_unlock();
            }
            ESP_LOGI(TAG, "Command On");
        }
        else
        {
            OnOff::Commands::Off::Type commandDataT;

            client::command_handle_t cmd_handle;
            uint16_t local_endpoint_id = epLightID;
            cmd_handle.cluster_id = OnOff::Id;
            cmd_handle.command_id = OnOff::Commands::Off::Id;
            cmd_handle.is_group = true;
            cmd_handle.command_data = &commandDataT;

            lock::status_t lock_status = lock::chip_stack_lock(portMAX_DELAY);
            if (lock_status == lock::FAILED)
            {
                ESP_LOGE(TAG, "Could not lock chip stack");
            }
            client::cluster_update(local_endpoint_id, &cmd_handle);
            if (lock_status == lock::SUCCESS)
            {
                lock::chip_stack_unlock();
            }
            ESP_LOGI(TAG, "Command Off");
        }
    }
jonsmirl commented 11 months ago

I increased the event buffer sizes and the Dropped error went away. Hopefully those buffers are in PSRAM.

jonsmirl commented 11 months ago

Need to sort out https://github.com/espressif/esp-matter/issues/721 and https://github.com/espressif/esp-matter/issues/714 before I can test more on this one.

dhrishi commented 3 weeks ago

@jonsmirl Did you get a chance to verify this? The above two issues you mentioned are closed. Please close the issue if resolved