espressif / esp-homekit-sdk

550 stars 99 forks source link

Revisit: Decryption error/Connection lost on ESP32 with multiple Apple devices running Home #18

Closed Mixiaoxiao closed 3 years ago

Mixiaoxiao commented 3 years ago

I am the author of Mixiaoxiao/Arduino-HomeKit-ESP8266.

This official HomeKit sdk I am using currently to build my newer projects, and I have made an Arduino library version of this sdk.

But I also got the issue same as https://github.com/espressif/esp-homekit-sdk/issues/14.

Named "Decryption error/Connection lost on ESP32 with multiple Apple devices running Home".

My issue is as follows:

  1. Both my iPhone and iPad (as a HomeKit hub) connected to ESP, both response ok
  2. My iPhone goes screen off (its WiFi will go sleep too)
  3. log show: Decryption error/Connection lost. Marking session as invalid
  4. then log show: two lines of HomeKit Session terminated (that is, both my iPhone and iPad connections are closed)
  5. My iPad shows “No response”

The problem is that, when my iPhone is disconnected (or just WiFi is not active), my iPad's(same iCloud ID with iPhone) connection will be closed too. This behavior is NOT as expected. Same as Brawrdon's comment: https://github.com/espressif/esp-homekit-sdk/issues/14#issuecomment-737817149

I figure out the related codes as follows:

  1. int hap_decrypt_data(...)
...
if (read_fn(frame->data, 2, context) < 2)   
    return hap_decrypt_error(session);
...
  1. hap_decrypt_error(hap_secure_session_t *session)
int hap_decrypt_error(hap_secure_session_t *session)
{
    ESP_MFI_DEBUG(ESP_MFI_DEBUG_INFO, "Decryption error/Connection lost. Marking session as invalid");
    if (session) {
        session->state = STATE_INVALID;
            hap_close_ctrl_sessions(session->ctrl);
    }
    return HAP_FAIL;
}
  1. void hap_close_ctrl_sessions(hap_ctrl_data_t *ctrl)
    for (i = 0; i < HAP_MAX_SESSIONS; i++) {
            if (!hap_priv.sessions[i])
                    continue;
        if (hap_priv.sessions[i]->ctrl == ctrl) {
            httpd_sess_trigger_close(hap_priv.server, hap_priv.sessions[i]->conn_identifier);
        }
    }

Consider that, in int hap_decrypt_data(...)

int len = read_fn(frame->data, 2, context); 

read_fn here indeed calls the hap_httpd_raw_recv, that is the recv (lwip_recv), try receiving data from tcp connection. if len is -1, means the tcp-socket is closed; if len is 0, means nothing received, that is reading is timeout or the peer's WiFi is not active (in sleep mode) now; if len is 1, means we only received 1 byte during a receiving-timeout time (SO_RCVTIMEO=0x1006=4102, in seconds, maybe?), here we can do consider that we really got the Decryption error/Connection lost. According to the protocol of HomeKit, here, at least 2 bytes should be received.

So, here, should be modified to

        int len = read_fn(frame->data, 2, context);
        if (len == 0) { //nothing received, but I think we should NOT consider this is a 'decrypt_error'
            return 0;
        }
        if (len < 2) {
            //len is -1 or 1
            //len = -1: socket disconnected
            //len =  1: try receiving 2 bytes timeout (SO_RCVTIMEO is set in esp_hap_ip_services.c)
            return hap_decrypt_error(session);
        }

Then, in hap_close_ctrl_sessions, all sessions with same ctrl (session->ctrl) will be closed. In my debug, both my iPhone and iPad are of same ctrl. All iOS devices with same iCloud ID will have a same ctrl. I am not sure, but it should be. Here, the problem come out. Only one device disconnected, but why close all sessions with this ctrl???

So, here, should be modified to

//only close the disconnected session
void hap_close_ctrl_sessions_fix(hap_secure_session_t *session)
{
    if (!session)
        return;
    int i;
    for (i = 0; i < HAP_MAX_SESSIONS; i++) {
            if (!hap_priv.sessions[i])
                    continue;
        if (hap_priv.sessions[i] == session) {
                    hap_report_event(HAP_EVENT_CTRL_DISCONNECTED, (session->ctrl->info.id),
                            sizeof((session->ctrl->info.id)));
                    httpd_sess_trigger_close(hap_priv.server, hap_priv.sessions[i]->conn_identifier);
        }
    }
}

Another thing is that, in WiFi connection with other devices, it is really difficult to exactly-know the connection is lost or not. The peer's WiFi can be in sleep mode(but the connection in lwip can NOT know that), and it' WiFi will response nothing (lwip will always receive 0 byte). So I think the keepalive of tcp is a good method to do this. When the peer's WiFi is in sleep mode for a long time, the lwip can know that and just drop it. This behavior is really meaningful for ESP8266 with small memory(ram), it can quickly release the tcp buffer memory when the peer is no longer active. And the keepalive option has been implemented both in maximkulkin/esp-homekit/server.c#L3705 and my Mixiaoxiao/Arduino-HomeKit-ESP8266/arduino_homekit_server.cpp#L3006.

I do know the keepalive is not allowed in HomeKit Protocol. HAP Non-Commercial Version section 6.2.3: "HAP accessory servers must not use keepalive messages, which periodically wake up iOS devices". But I do consider it is ok to make a config to enable keepalive for hobby projects.

I have just create a bugfix-commit including all above-mentioned modifications for my Arduino-HomeKit-ESP. Tested for 24 hours and works fine.

shahpiyushv commented 3 years ago

@Mixiaoxiao thanks for the detailed analysis. Rightly said, the below snippet indeed seems wrong and a root cause for the issues reported.

    for (i = 0; i < HAP_MAX_SESSIONS; i++) {
            if (!hap_priv.sessions[i])
                    continue;
        if (hap_priv.sessions[i]->ctrl == ctrl) {
            httpd_sess_trigger_close(hap_priv.server, hap_priv.sessions[i]->conn_identifier);
        }
    }

I have verified with your fix and it works as intended.

The changes in the check for return value of read_fn() may not be required though, since the code is normally supposed to reach here, only when the select call in http_server returns and the particular socket descriptor is ready to be read. Even if we return 0 from here, the http server code will close the session assuming that the peer has closed the connection.

Do you plan to raise a PR for these fixes, or should I add them myself? In my code, I have renamed hap_close_ctrl_sessions to hap_ctrl_close_all_sessions and added a new hap_ctrl_close_single_session as per the code you shared. I will also add a config option for keepalive, which will be enabled automatically for non MFi HomeKit applications.

Meanwhile, thanks for your efforts for the Arduino support :)

Mixiaoxiao commented 3 years ago

Thanks for your quick replay. It is ok to modify by yourself, following the sdk's code style. I consider it will be nice to make a comment in your modification with a link of this issuse analysis, if it is possible.

Also, I sugest the function names:

  1. change hap_close_ctrl_sessions (ctrl) to hap_close_sessions_of_ctrl (ctrl)
  2. create a new hap_close_session (session) to close a certain session (irrelevant with ctrl)
  3. the name of hap_decrypt_error may be not suitable here, since it is caused by a connection lost in most cases
drewandre commented 3 years ago

Hopefully this helps someone in the future, but I was receiving this error after assigning a value that was out of range for a characteristic that can only accept values of 0-100. I was using the provided hap_char_hue_create fxn to set the lightbulb hue but didn't see that within that fxn there's a hap_char_float_set_constraints fxn that shows that this specific characteristic has a lower bound of 0 and an upper of 360. It would be nice to be warned when passing a value that falls outside of characteristic ranges but it was pretty obvious after a minute of following the function calls.

shahpiyushv commented 3 years ago

@Mixiaoxiao , the specific fix and the keep alive config option have now been added. Thanks for debugging and the finding root cause of the issue.

Brawrdon commented 3 years ago

@Mixiaoxiao thanks for looking into this! After rebooting my router 700 times, I managed to solve the issue by assigning a static IP address to my ESP32. I've removed the static assignment to test this fix and it seems to be working. I'll report if I notice a crash!

Off topic but I just wanted to add that I'm also working on an Arduino version of this SDK and it's currently available via PlatformIO. 😄

shahpiyushv commented 3 years ago

Off topic but I just wanted to add that I'm also working on an Arduino version of this SDK and it's currently available via PlatformIO. 😄

@Brawrdon that's good to know. Thanks for your efforts :)