espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
12.91k stars 7.09k forks source link

Websocket client disconnects and never reconnects (IDFGH-13129) #14072

Open hitecSmartHome opened 1 week ago

hitecSmartHome commented 1 week ago

Answers checklist.

IDF version.

4.4.7

Espressif SoC revision.

ESP-32 Wrover-E revision v3.0

Operating System used.

Windows

How did you build your project?

VS Code IDE

If you are using Windows, please specify command line type.

None

Development Kit.

ESP-32 Wrover-E

Power Supply used.

External 3.3V

What is the expected behavior?

Websocket client should reconnect

What is the actual behavior?

Websocket client does not reconnect

Steps to reproduce.

static void eventHandler(void* event_handler_arg, esp_event_base_t event_base, int32_t event_id, void* event_data) {
    // Access the instance through the provided argument
    WebsocketClient* instance = static_cast<WebsocketClient*>(event_handler_arg);
    instance->handler(event_handler_arg, event_base, event_id, event_data);
}

void WebsocketClient::init() {
    client = esp_websocket_client_init(&config);
    esp_websocket_register_events(
        client,
        WEBSOCKET_EVENT_ANY,
        reinterpret_cast<esp_event_handler_t>(eventHandler),
        static_cast<void*>(this));
    esp_websocket_client_start(client);
}

void WebsocketClient::handler(void* event_handler_arg, esp_event_base_t event_base, int32_t event_id, void* event_data) {
    esp_websocket_event_data_t* data = (esp_websocket_event_data_t*)event_data;

    switch (event_id) {
        case WEBSOCKET_EVENT_CONNECTED:
            if (connectCB) { connectCB(); }
            return;

        case WEBSOCKET_EVENT_DISCONNECTED:
            if (disconnectCB) { disconnectCB(); }
            return;

        case WEBSOCKET_EVENT_ERROR:
            if (errorCB) { errorCB(); }
            return;

        case WEBSOCKET_EVENT_DATA:
            handleWebSocketData(data);
            return;

        default:
            return;
    }
}

void WebsocketClient::handleWebSocketData(esp_websocket_event_data_t* data) {
    if (data->op_code == 0x08 && data->data_len == 2) {
        Serial.println("[WebsocketClient] - Got Close data");
        return;
    }

    // Check for ping data
    if (data->op_code == 0x09) {
        if(pingCB){ pingCB(); }
        return;
    }

    // Ensure payload_offset is within bounds
    if (data->payload_offset >= PACKET_BUFFER_SIZE) {
        Serial.println("[WebsocketClient] - Invalid payload offset");
        return;
    }

    // Check if incoming data will fit into the buffer
    if (data->payload_offset + data->data_len > PACKET_BUFFER_SIZE) {
        Serial.println("[WebsocketClient] - Buffer overflow");
        return;
    }

    // Copy the data to the packetBuffer
    memcpy(packetBuffer + data->payload_offset, data->data_ptr, data->data_len);

    // Check if all data has been received
    if (data->payload_offset + data->data_len == data->payload_len) {
        // Null-terminate the incoming data
        packetBuffer[data->payload_len] = '\0';
        // Process the complete data or notify the user
        if (frameCB && data->payload_len > 0) {
            frameCB(packetBuffer, data->payload_len);
        }
    }
}

Debug Logs.

[WebsocketClient] - Got Close data
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected
[ServerRouter] - Disconnected

More Information.

I'm using Arduino as a component of IDF with PlatformIO.

websocket config

webSocket.config.uri = config.serverWSURL;
webSocket.config.port = 443;
webSocket.config.transport = WEBSOCKET_TRANSPORT_OVER_SSL;
webSocket.config.path = "/test";
webSocket.config.headers = headers;
webSocket.config.task_stack = 6000;
webSocket.config.keep_alive_enable = true;
webSocket.config.disable_auto_reconnect = false;

I have tried to set a timer when the disconnect happens and reinit the socket after 30 sec

bool WebsocketClient::reInit(){
    esp_err_t err = esp_websocket_client_destroy(client);
    if(err != ESP_OK){
        return false;
    }
    init();
    return true;
}

Does not help.

gabsuren commented 1 week ago

@hitecSmartHome Thank you for raising this question. I attempted to reproduce the issue, but I was unable to. I executed both the client and server, and observed that the client successfully sent a connect event. When I shut down the server, I noticed that the client attempted to reconnect.

I tried ESP-IDF-v4.4 (which is quite old) and latest ESP-IDF and both worked fine. Please find below my example maybe it can help you

#include <stdio.h>
#include "esp_wifi.h"
#include "esp_system.h"
#include "nvs_flash.h"
#include "esp_event.h"
#include "protocol_examples_common.h"

#include "freertos/FreeRTOS.h"
#include "freertos/task.h"
#include "freertos/semphr.h"
#include "freertos/event_groups.h"

#include "esp_log.h"
#include "esp_websocket_client.h"
#include "esp_event.h"

#define NO_DATA_TIMEOUT_SEC 5

static const char *TAG = "WEBSOCKET";

static TimerHandle_t shutdown_signal_timer;
static SemaphoreHandle_t shutdown_sema;

static void shutdown_signaler(TimerHandle_t xTimer)
{
    ESP_LOGI(TAG, "No data received for %d seconds, signaling shutdown", NO_DATA_TIMEOUT_SEC);
    xSemaphoreGive(shutdown_sema);
}

#if CONFIG_WEBSOCKET_URI_FROM_STDIN
static void get_string(char *line, size_t size)
{
    int count = 0;
    while (count < size) {
        int c = fgetc(stdin);
        if (c == '\n') {
            line[count] = '\0';
            break;
        } else if (c > 0 && c < 127) {
            line[count] = c;
            ++count;
        }
        vTaskDelay(10 / portTICK_PERIOD_MS);
    }
}

#endif /* CONFIG_WEBSOCKET_URI_FROM_STDIN */

static void websocket_event_handler(void *handler_args, esp_event_base_t base, int32_t event_id, void *event_data)
{
    esp_websocket_event_data_t *data = (esp_websocket_event_data_t *)event_data;
    switch (event_id) {
    case WEBSOCKET_EVENT_CONNECTED:
        ESP_LOGI(TAG, "WEBSOCKET_EVENT_CONNECTED");
        break;
    case WEBSOCKET_EVENT_DISCONNECTED:
        ESP_LOGI(TAG, "WEBSOCKET_EVENT_DISCONNECTED");
        break;
    case WEBSOCKET_EVENT_DATA:
        ESP_LOGI(TAG, "WEBSOCKET_EVENT_DATA");
        ESP_LOGI(TAG, "Received opcode=%d", data->op_code);
        if (data->op_code == 0x08 && data->data_len == 2) {
            ESP_LOGW(TAG, "Received closed message with code=%d", 256*data->data_ptr[0] + data->data_ptr[1]);
        } else {
            ESP_LOGW(TAG, "Received=%.*s", data->data_len, (char *)data->data_ptr);
        }
        ESP_LOGW(TAG, "Total payload length=%d, data_len=%d, current payload offset=%d\r\n", data->payload_len, data->data_len, data->payload_offset);

        xTimerReset(shutdown_signal_timer, portMAX_DELAY);
        break;
    case WEBSOCKET_EVENT_ERROR:
        ESP_LOGI(TAG, "WEBSOCKET_EVENT_ERROR");
        break;
    }
}

static void websocket_app_start(void)
{
    esp_websocket_client_config_t websocket_cfg = {};

    shutdown_signal_timer = xTimerCreate("Websocket shutdown timer", NO_DATA_TIMEOUT_SEC * 1000 / portTICK_PERIOD_MS,
                                         pdFALSE, NULL, shutdown_signaler);
    shutdown_sema = xSemaphoreCreateBinary();

#if CONFIG_WEBSOCKET_URI_FROM_STDIN
    char line[128];

    ESP_LOGI(TAG, "Please enter uri of websocket endpoint");
    get_string(line, sizeof(line));

    websocket_cfg.uri = line;
    ESP_LOGI(TAG, "Endpoint uri: %s\n", line);

#else
    websocket_cfg.uri = CONFIG_WEBSOCKET_URI;
    websocket_cfg.task_stack = 6000;                                                                                    
    websocket_cfg.keep_alive_enable = true;                                                                             
    websocket_cfg.disable_auto_reconnect = false;

#endif /* CONFIG_WEBSOCKET_URI_FROM_STDIN */

    ESP_LOGI(TAG, "Connecting to %s...", websocket_cfg.uri);

    esp_websocket_client_handle_t client = esp_websocket_client_init(&websocket_cfg);
    esp_websocket_register_events(client, WEBSOCKET_EVENT_ANY, websocket_event_handler, (void *)client);

    esp_websocket_client_start(client);
    xTimerStart(shutdown_signal_timer, portMAX_DELAY);
    char data[32];
    int i = 0;
    while (i < 5) {
        if (esp_websocket_client_is_connected(client)) {
            int len = sprintf(data, "hello %04d", i++);
            ESP_LOGI(TAG, "Sending %s", data);
            esp_websocket_client_send_text(client, data, len, portMAX_DELAY);
        }
        vTaskDelay(1000 / portTICK_RATE_MS);
    }

    xSemaphoreTake(shutdown_sema, portMAX_DELAY);
    esp_websocket_client_close(client, portMAX_DELAY);
    ESP_LOGI(TAG, "Websocket Stopped");
    esp_websocket_client_destroy(client);
}

void app_main(void)
{
    ESP_LOGI(TAG, "[APP] Startup..");
    ESP_LOGI(TAG, "[APP] Free memory: %d bytes", esp_get_free_heap_size());
    ESP_LOGI(TAG, "[APP] IDF version: %s", esp_get_idf_version());
    esp_log_level_set("*", ESP_LOG_INFO);
    esp_log_level_set("WEBSOCKET_CLIENT", ESP_LOG_DEBUG);
    esp_log_level_set("TRANSPORT_WS", ESP_LOG_DEBUG);
    esp_log_level_set("TRANS_TCP", ESP_LOG_DEBUG);

    ESP_ERROR_CHECK(nvs_flash_init());
    ESP_ERROR_CHECK(esp_netif_init());
    ESP_ERROR_CHECK(esp_event_loop_create_default());

    /* This helper function configures Wi-Fi or Ethernet, as selected in menuconfig.
     * Read "Establishing Wi-Fi or Ethernet Connection" section in
     * examples/protocols/README.md for more information about this function.
     */
    ESP_ERROR_CHECK(example_connect());

    websocket_app_start();
}
SMi1984DE commented 1 week ago

Unfortunately, the debug output isn't very helpful, but I've still got a hunch here. It might be the same issue as in https://github.com/espressif/arduino-esp32/issues/9712, where the hostname couldn't be resolved anymore after automatically switching to IPv6. I experienced a similar problem with connections dropping after about half an hour to two hours, leading to a loop that required a reset. Best regards...

hitecSmartHome commented 1 week ago

Thank you for your help. The thing is that I have observed is that the server sends a data frame with a close data in it.

if (data->op_code == 0x08 && data->data_len == 2) {
    Serial.println("[WebsocketClient] - Got Close data");
    return;
}

After my ESP got this message, It does not see any clean close initiation and nothing. It just stops working after that. No error cb, no close cb.

When I try to close it after this data packet, it does not work also. I have tried to destroy the ws client, tried to just stop it and tried to disconnect from it. Nothing works. The ESP is stuck in this state and can never reconnect. Doesn't even try to in itself.

hitecSmartHome commented 1 week ago

My case is similar to this https://github.com/espressif/esp-idf/issues/10657 I can't reinit. esp_websocket_client_close and esp_websocket_client_destroy can not be called from the event handler and they also does nothing outside of the handler either.

I also can't reinit the wifi because I'm using wired ethernet.

void WebsocketClient::init() {
    client = esp_websocket_client_init(&config);
    esp_websocket_register_events(
        client,
        WEBSOCKET_EVENT_ANY,
        reinterpret_cast<esp_event_handler_t>(eventHandler),
        static_cast<void*>(this));
    esp_websocket_client_start(client);
}

bool WebsocketClient::reInit(){
    esp_err_t err = esp_websocket_client_destroy(client);
    if(err != ESP_OK){
        return false;
    }
    init();
    return true;
}

I have tried to call reinit after a timer when I got this close frame, no help. I probably have some weird issue with my server or the proxy which does not close the socket cleanly, but the fact that the ESP stuck in this state is really strange. Even if the server just drops the connection, the ESP should reconnect. I have set the debug level to 5 and I got no other debug output.

SMi1984DE commented 6 days ago

I also can't reinit the wifi because I'm using wired ethernet.

But I thought we were talking about the ESP32-WROVER-E? Where do you plug in the Ethernet cable? ;)

hitecSmartHome commented 6 days ago

Into an rj45 socket ;)

SMi1984DE commented 5 days ago

Okay, I wasn't aware that the ESP32WROVER-E could also be used as an Ethernet controller, when using an external physical interface device (PHY). So far, it was mainly a WLAN/Bluetooth module for me. I have now taken a closer look at the technical specifications. I also assumed that the IPv6/DNS problem described in arduino-esp32/issues/9712 is more likely related to the WLAN functionality, as the described misbehavior has only been reported in this context there, ...as far as I know. But the Arduino-ESP32 doesn't have an RJ45 socket, so there should be fewer ethernet problems with this configuration of course. ;)

However, I could well imagine that the ESP32 also exhibits such behavior when working as an Ethernet controller. Perhaps it partially uses the same program libraries or similar functions for ethernet and wifi. Now the question is:

Is your ESP32 maybe trying to connect to the router using IPv6 even though you haven't enabled it? This could explain why it's not working properly, since it might get a response over IPv6 from the Router but actually is using IPv4 only. I therefore recommend checking this or activating IPv6 to see if the described behavior persists.

hitecSmartHome commented 5 days ago

Thank you for the help really. I dont understand how this is related. The websocket connects fine initially and it can reconnect when it gets a clean close or in almost any other case. Will check the ipv6 options regardless.

SMi1984DE commented 5 days ago

I dont understand how this is related.

Well, if the ESP32 briefly informs your router (at some point, during operation) that it is reachable with its MAC address a1:b2:c3:xy under an IPv6 address now, the router will remember this IPv6 address for a while for the corresponding MAC address and will send its response to the IPv6 address. However, the ESP32 receives nothing because it is only reachable via IPv4. That's the theory, at least.

Maybe it could also help, to deactivate IPv6 in the Router. I don't know.

See also https://github.com/espressif/arduino-esp32/issues/9712#issuecomment-2187911297

hitecSmartHome commented 5 days ago

Iam on arduino 2.0.17 and idf 4.4.7. I use arduino as a component

hitecSmartHome commented 5 days ago

Also my router does not have any IPV6 settings. It doesn't even list any IPV6 address.

I can enable the ipv6 interface of the ethernet as well as the wifi hovewer. Will check it out.

ETH.begin(
    ETH8720_PHY_ADDR,
    PIN_PHY_POWER,
    PIN_SMI_MDC,
    PIN_SMI_MDIO,
    ETH_PHY_LAN8720,
    ETH_CLK_MODE
);
ETH.enableIpV6();

This calls

bool ETHClass::enableIpV6()
{
    return tcpip_adapter_create_ip6_linklocal(TCPIP_ADAPTER_IF_ETH) == 0;
}
gabsuren commented 4 days ago

Iam on arduino 2.0.17 and idf 4.4.7. I use arduino as a component

Sorry, Is there a specific reason for using the older ESP-IDF v4.4? Please note that it will reach its end of life this July. The latest stable version of ESP-IDF is v5.3, and the latest stable ESP WebSocket client can be found here: ESP WebSocket Client v1.2.3.

Anyway, I tried to reproduce the issue using the example with ESP-IDF v4.4, and it seems to work correctly as it attempts to reconnect after receiving a close message from the server. Have you tested this example?

Here is the link for reference: GitHub Issue #14072.

hitecSmartHome commented 4 days ago

The specific reason is that Espressif don't want to support PIO at all so the guys are lacking behind. We, who uses PIO and Arduino as a component of IDF can't upgrade to the latest cores. It just simply not supported so we stuck at 2.0.17 and 4.4.7 See https://github.com/platformio/platform-espressif32/issues/1225

This happens only when the server sends a close DATA message and not a clean close. On clean close it reconnects fine. I could not reproduce this on the server side myself yet. If I restart the server or suspend it it always sends a clean close. It just sometimes drops the ESPs with an OP code of 0x08 in a DATA frame. It was totally accidental when I have seen this frame. Since then, I can see that the server sends this packet and the websocket traffic completly stops after that but the ESP thinks it's still connected.

void WebsocketClient::handleWebSocketData(esp_websocket_event_data_t* data) {
    if (data->op_code == 0x08 && data->data_len == 2) {
        printf("[WebsocketClient] - Got Close data\n");
        return;
    }
}

And it does not help even if I reinit the websocket client.

void WebsocketClient::init() {
    client = esp_websocket_client_init(&config);
    esp_websocket_register_events(
        client,
        WEBSOCKET_EVENT_ANY,
        reinterpret_cast<esp_event_handler_t>(eventHandler),
        static_cast<void*>(this));
    esp_websocket_client_start(client);
}

bool WebsocketClient::reInit(){
    esp_err_t err = esp_websocket_client_destroy(client);
    if(err != ESP_OK){ return false; }
    init();
    return true;
}

I have 4 ESP modules on test. Two of them using WIFI, two of them using Ethernet. I turned on IPV6 on all four, for wifi and eth too. So far they are online and connected to the server 3-4 hours ago. No reconnection issues so far.

hitecSmartHome commented 4 days ago

Okay, one of the ESP with ethernet has gone offline after 3 hour and 57 min. It still thinks its connected but no traffic. The two with the wifi is ok so far