espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.49k stars 7.26k forks source link

Close the tcp connection, when server doesn't end the websocket gracefully (IDFGH-9274) #10657

Open ankit-thealchemist opened 1 year ago

ankit-thealchemist commented 1 year ago

Answers checklist.

IDF version.

v4.4.3

Operating System used.

Linux

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

esp32 devtkit c v4

Power Supply used.

USB

What is the expected behavior?

We are developing the EV charger, using the OCPP protocol 1.6 which needed to be used over the websocket. A instance of a server having the issue, it sometimes closes the websocket without closing the TCP connection, as you can see in the attacched log. After that our websocket connection doesn't work. It just donot connect to the server again. I tried manually disconnecting the websocket and try to reconnect it, but it doesn't work.

On the other hand other linux based devices doesn't have the same issue, so its hard for me to convince the server team about the websocket server fault.

As a workaround, whenever we receive such error, we need to disconnect and reconnect the wifi. This seems to resolve the issue, but it takes some time in time critical event.

Is there is a solution to reset the TCP connection without resetting the wifi?

What is the actual behavior?

image

Steps to reproduce.

\

Debug Logs.

No response

More Information.

No response

ESP-YJM commented 1 year ago

@ankit-thealchemist Could you please offer the simple code that when your websocket client disconnected from the server, how do you re-connect? If you can offer some log, it may be more helpful.

ankit-thealchemist commented 1 year ago

@ESP-YJM Thanks for the speedy response! If I gave you the simple code, it won't work, because it gives error on my particular server. You can take the simple websocket example as the code. I just then disconnect the wifi by esp_wifi_disconnect(), then reconnect the wifi, it resovles the problem in reconnecting the websocket.

As far as I can understand about the problem is that, In the normal circumstances, if we close the websocket, we terminate the websocket connection by sending the close frame and then close the underlying TCP connection as a norm. In my case, server send the close frame, but donot send the request to close the TCP connection. Esp wait for closing the Tcp connection, and its wait forever to close the tcp connection. When I disconnect from wifi connection, it destroy the whole netif stack associated with the wifi adapter.It destroy all the tcp connection and due to this, the underlying tcp connection closes. After reconnecting the wifi, the whole tcp connection stack is recreated, which solves the problem, of reconnection with the web socket connection.

Please let me know if you need anything more.

ESP-YJM commented 1 year ago

@ankit-thealchemist I check the code, ESP not wait forever to close the TCP connection, it will wait about 1s and then close TCP itself. You can refer to code https://github.com/espressif/esp-idf/blob/v4.4.3/components/esp_websocket_client/esp_websocket_client.c#L700-L703, it will break the while cycle, and call esp_transport_close(client->transport) to close TCP. And then you only need call esp_websocket_client_start, it may connect websocket server. You can add some log to track.

chipweinberger commented 1 year ago

ESP-IDF only closest sockets using LRU purge.

this is generally what you want.

if you want some other behavior, you can convert the request object that you get in the HTTP handler into fd. see httpd_req_Xxx() functions

you can then close that socket whenever you like.

I also recommend taking a look at the http code. it's not that complicated. see here https://github.com/espressif/esp-idf/issues/10594#issuecomment-1399841433

ESP-YJM commented 1 year ago

Hi @chipweinberger , If I understand @ankit-thealchemist correct, this is not an http server issue, this is an esp_websocket_client component issue.

ankit-thealchemist commented 1 year ago

@ESP-YJM

@ankit-thealchemist I check the code, ESP not wait forever to close the TCP connection, it will wait about 1s and then close TCP itself. You can refer to code https://github.com/espressif/esp-idf/blob/v4.4.3/components/esp_websocket_client/esp_websocket_client.c#L700-L703, it will break the while cycle, and call esp_transport_close(client->transport) to close TCP. And then you only need call esp_websocket_client_start, it may connect websocket server. You can add some log to track.

image

yes it will break the while cycle and clear tcp layer and delete the task. The problem it never call out the 'WEBSOCKET_EVENT_CLOSED' event(when ret = 0, and unfortunately thats my case) and I will never be able to catch this event in the websocket event handler in main application . Due to this I never able to start the websocket connection again. Please provide the alternative resolution to this.

ankit-thealchemist commented 1 year ago

@ESP-YJM Any update on the issue?

ESP-YJM commented 1 year ago

@ankit-thealchemist You can remove break when ret is 0 and try.

hitecSmartHome commented 3 months ago

I have this exact same problem. I got a close frame from the server and it just drops the connection. My ESP stucks in this state and can not reconnect. Also it does not see any close so it thinks its still connected and every other send fails.

void WebsocketClient::handler(void* event_handler_arg, esp_event_base_t event_base, int32_t event_id, void* event_data) {
    esp_websocket_event_data_t* data = (esp_websocket_event_data_t*)event_data;

    switch (event_id) {
        case WEBSOCKET_EVENT_CONNECTED:
            if (connectCB) { connectCB(); }
            return;

        case WEBSOCKET_EVENT_DISCONNECTED:
            if (disconnectCB) { disconnectCB(); }
            return;

        case WEBSOCKET_EVENT_ERROR:
            if (errorCB) { errorCB(); }
            return;

        case WEBSOCKET_EVENT_DATA:
            handleWebSocketData(data);
            return;

        default:
            return;
    }
}

void WebsocketClient::handleWebSocketData(esp_websocket_event_data_t* data) {
    if (data->op_code == 0x08 && data->data_len == 2) {
        printf("[WebsocketClient] - Got Close data\n");
        return;
    }

    // Check for ping data
    if (data->op_code == 0x09) {
        printf("[WebsocketClient] - Got Ping data\n");
        return;
    }

    // Check for pong data
    if (data->op_code == 0x0A) {
        printf("[WebsocketClient] - Got Pong data\n");
        return;
    }

}

This is how I got here. I can't reinit. esp_websocket_client_close and esp_websocket_client_destroy can not be called from the event handler and they also does nothing outside of the handler either.

I also can't reinit the wifi because I'm using wired ethernet.

adityachd commented 3 weeks ago

Hi. I was having the same issue. The websocket server sends opcode 8 but does not close the tcp connection (after 1 hour as configured on the server I used). The server did not close the TCP connection and ESP would wait for TCP connection to get closed, which would never happen. To solve this, I configured the server to close the TCP connection after sending ws close request, opcode 8. Now ESP retries to make a websocket connection. For anyone else who has the same issue, I also achieved this same functionality by tweaking the esp IDF code a bit. Here is the function(inside esp_websocket_client.c) that had to be changed for achieving reconnection in such a case,

static void esp_websocket_client_task(void *pv)
{
    const int lock_timeout = portMAX_DELAY;
    esp_websocket_client_handle_t client = (esp_websocket_client_handle_t) pv;
    client->run = true;

    //get transport by scheme
    client->transport = esp_transport_list_get_transport(client->transport_list, client->config->scheme);

    if (client->transport == NULL) {
        ESP_LOGE(TAG, "There are no transports valid, stop websocket client");
        client->run = false;
    }
    //default port
    if (client->config->port == 0) {
        client->config->port = esp_transport_get_default_port(client->transport);
    }

    client->state = WEBSOCKET_STATE_INIT;
    xEventGroupClearBits(client->status_bits, STOPPED_BIT | CLOSE_FRAME_SENT_BIT);
    int read_select = 0;
    while (client->run) {
        if (xSemaphoreTakeRecursive(client->lock, lock_timeout) != pdPASS) {
            ESP_LOGE(TAG, "Failed to lock ws-client tasks, exiting the task...");
            break;
        }
        switch ((int)client->state) {
            case WEBSOCKET_STATE_INIT:
                if (client->transport == NULL) {
                    ESP_LOGE(TAG, "There are no transport");
                    client->run = false;
                    break;
                }
                if (esp_transport_connect(client->transport,
                                          client->config->host,
                                          client->config->port,
                                          client->config->network_timeout_ms) < 0) {
                    ESP_LOGE(TAG, "Error transport connect");
                    esp_websocket_client_abort_connection(client);
                    break;
                }
                ESP_LOGD(TAG, "Transport connected to %s://%s:%d", client->config->scheme, client->config->host, client->config->port);

                client->state = WEBSOCKET_STATE_CONNECTED;
                client->wait_for_pong_resp = false;
                esp_websocket_client_dispatch_event(client, WEBSOCKET_EVENT_CONNECTED, NULL, 0);

                break;
            case WEBSOCKET_STATE_CONNECTED:
                if ((CLOSE_FRAME_SENT_BIT & xEventGroupGetBits(client->status_bits)) == 0) { // only send and check for PING
                                                                                                          // if closing hasn't been initiated
                    if (_tick_get_ms() - client->ping_tick_ms > client->config->ping_interval_sec*1000) {
                        client->ping_tick_ms = _tick_get_ms();
                        ESP_LOGD(TAG, "Sending PING...");
                        esp_transport_ws_send_raw(client->transport, WS_TRANSPORT_OPCODES_PING | WS_TRANSPORT_OPCODES_FIN, NULL, 0, client->config->network_timeout_ms);

                        if (!client->wait_for_pong_resp && client->config->pingpong_timeout_sec) {
                            client->pingpong_tick_ms = _tick_get_ms();
                            client->wait_for_pong_resp = true;
                        }
                    }

                    if ( _tick_get_ms() - client->pingpong_tick_ms > client->config->pingpong_timeout_sec*1000 ) {
                        if (client->wait_for_pong_resp) {
                            ESP_LOGE(TAG, "Error, no PONG received for more than %d seconds after PING", client->config->pingpong_timeout_sec);
                            esp_websocket_client_abort_connection(client);
                            break;
                        }
                    }
                }

                if (read_select == 0) {
                    ESP_LOGV(TAG, "Read poll timeout: skipping esp_transport_read()...");
                    break;
                }
                client->ping_tick_ms = _tick_get_ms();

                if (esp_websocket_client_recv(client) == ESP_FAIL) {
                    ESP_LOGE(TAG, "Error receive data");
                    esp_websocket_client_abort_connection(client);
                    break;
                }
                break;
            case WEBSOCKET_STATE_WAIT_TIMEOUT:

                if (!client->config->auto_reconnect) {
                    ESP_LOGE(TAG, "Auto reconnect off after timeout");
                    client->run = false;
                    break;
                }
                if (_tick_get_ms() - client->reconnect_tick_ms > client->wait_timeout_ms) {
                    client->state = WEBSOCKET_STATE_INIT;
                    client->reconnect_tick_ms = _tick_get_ms();
                    ESP_LOGD(TAG, "Reconnecting...");
                }
                break;
            case WEBSOCKET_STATE_CLOSING:
                // if closing not initiated by the client echo the close message back
                if ((CLOSE_FRAME_SENT_BIT & xEventGroupGetBits(client->status_bits)) == 0) {
                    ESP_LOGD(TAG, "Closing initiated by the server, sending close frame");
                    esp_transport_ws_send_raw(client->transport, WS_TRANSPORT_OPCODES_CLOSE | WS_TRANSPORT_OPCODES_FIN, NULL, 0, client->config->network_timeout_ms);
                    xEventGroupSetBits(client->status_bits, CLOSE_FRAME_SENT_BIT);
                }
                break;
            default:
                ESP_LOGD(TAG, "Client run iteration in a default state: %d", client->state);
                break;
        }
        xSemaphoreGiveRecursive(client->lock);
        if (WEBSOCKET_STATE_CONNECTED == client->state) {
            read_select = esp_transport_poll_read(client->transport, 1000); //Poll every 1000ms
            if (read_select < 0) {
                ESP_LOGE(TAG, "Network error: esp_transport_poll_read() returned %d, errno=%d", read_select, errno);
                esp_websocket_client_abort_connection(client);
            }
        } else if (WEBSOCKET_STATE_WAIT_TIMEOUT == client->state) {
            // waiting for reconnecting...
            vTaskDelay(client->wait_timeout_ms / 2 / portTICK_RATE_MS);
        } else if (WEBSOCKET_STATE_CLOSING == client->state &&
                  (CLOSE_FRAME_SENT_BIT & xEventGroupGetBits(client->status_bits))) {
            ESP_LOGD(TAG, " Waiting for TCP connection to be closed by the server");
            // int ret = esp_transport_ws_poll_connection_closed(client->transport, 1000);
            int ret = esp_websocket_client_abort_connection(client);
            vTaskDelay(client->wait_timeout_ms / 2 / portTICK_RATE_MS);
            client->state = WEBSOCKET_STATE_INIT;
            client->reconnect_tick_ms = _tick_get_ms();
            ESP_LOGD(TAG, "Reconnecting..., run = %d, transport =%d",client->run,client->transport==NULL?0:1); 
            if (ret == 0) {
                // still waiting
                // break;
            }
            if (ret < 0) {
                ESP_LOGW(TAG, "Connection terminated while waiting for clean TCP close");
            }
            // client->run = false;
            // client->state = WEBSOCKET_STATE_UNKNOW;
            esp_websocket_client_dispatch_event(client, WEBSOCKET_EVENT_CLOSED, NULL, 0);
            // break;
        }
    }

    esp_transport_close(client->transport);
    xEventGroupSetBits(client->status_bits, STOPPED_BIT);
    client->state = WEBSOCKET_STATE_UNKNOW;
    vTaskDelete(NULL);
}

Hope this helps someone