espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.89k stars 7.32k forks source link

Misleading esp-tls error messages (IDFGH-5162) #6940

Closed AxelLin closed 3 years ago

AxelLin commented 3 years ago

Environment

Module or chip used: ESP32-WROOM-32E IDF version: v4.4-dev-960-gcf457d412a77 Build System: idf.py Compiler version: xtensa-esp32-elf-gcc (crosstool-NG esp-2020r3) 8.4.0 Operating System: Linux Power Supply: USB

Problem Description

Everytime I notice below error messages in current master tree I have to remind myself it's actually nothing to do with TLS.

E (20544) esp-tls: couldn't get hostname for :XXX.XXX.XXX: getaddrinfo() returns 202, addrinfo=0x0 E (20544) esp-tls: Failed to open new connection E (20548) TRANSPORT_BASE: Failed to open a new connection E (20555) MQTT_CLIENT: Error transport connect

If I check event->error_handle->esp_tls_last_esp_er (when got MQTT_EVENT_ERROR), I got 0x8001 (ESP_ERR_ESP_TLS_CANNOT_RESOLVE_HOSTNAME)

It is a very misleading message because my setting is to connect mqtt broker port 1883 without TLS. Any chance to fix it? (v4.3 or earier versions do not have this issue)

negativekelvin commented 3 years ago

https://github.com/espressif/esp-idf/issues/6744#issuecomment-802869633

ESP-YJM commented 3 years ago

Hi @AxelLin, now tcp and ssl connection are both in esp-tls component. You can refer the commit. I admit that this description is indeed confused. But it makes no effect on code function.

AxelLin commented 3 years ago

@ESP-YJM

I know that, but that is not a good excuse! The thing is the refactor causes some unwanted side-effect.

  1. It now takes more memory because raw tcp will allocate memory for tls.
  2. The misleading error message is annoying

As far as I can tell, this is a bug and regression since older esp-idf does not has such issues. (IMHO, A refactor should not has such side-effect)

ESP-YJM commented 3 years ago

@AxelLin Yes, I agree with that. And we plan to fix unwanted side-effect you said in future version. Our colleagues is working on this.

negativekelvin commented 3 years ago

Tag could be renamed to "esp-tl" or "esp-tl(s)"

AxelLin commented 3 years ago

Tag could be renamed to "esp-tl" or "esp-tl(s)"

It's not just about rename the tag. For example, I got 0x8001 if hostname lookup failure. Anyone will interpret it as TLS error (ESP_ERR_ESP_TLS_CANNOT_RESOLVE_HOSTNAME) but it actually nothing to do with TLS.

negativekelvin commented 3 years ago

CANNOT_RESOLVE_HOSTNAME is not confusing. Either they can rename everything or they can specify that esp-tls stands for "transport layer system" not "transport layer security". Retrain our brains is easier. Tls already has alternate meaning in esp-idf too as "thread local storage".

david-cermak commented 3 years ago

Hi @AxelLin

As @ESP-YJM mentioned above, we're trying to address the memory issue, please find below a WIP patch which avoids tls struct allocation for plain TCP connections:

tcp_transport-Use-tcp_connect-esp_tls-for-TCP.patch.txt

As for the misleading error message, I'm afraid we cannot do much, as the esp-tls in this context is a name of the component, which is used as an underlying transport (should it be a TLS or non-TLS). This component has always supported non-tls connections (we just didn't use it).

If you break down the error code above ESP_ERR_ESP_TLS_CANNOT_RESOLVE_HOSTNAME:

AxelLin commented 3 years ago

My understanding is the esp-tls was a layer for providing a common inteface using different TLS library. See the naming and the options in esp-tls menuconfig, it's clearly for TLS. I know with recent refactor it's also for plain TCP now, but I'm not convinced it's a good change.

Revert this refactor is an option, but it's up to you. I don't see too much benefit with this refactor (how much memory saved? firmware size reduced?). You reduce the lines of code, but now when you want to debug plain TCP issues you will need to read the more complex esp-tls code.

david-cermak commented 3 years ago

I'm still convinced it was a good change. The main reason was to reduce code duplication, and the maintenance effort on plain socket operations. I understand your concern, but with the proposed fix above, debugging of TCP issues would be pretty easy, as would be only related to the tcp_connect() API.

See the naming and the options in esp-tls menuconfig, it's clearly for TLS.

This is true, but the plain TCP transport was supported before and it usually is supported by default. You can also connect to a plain TCP server with openssl, which may say:

Verification: OK
---
New, (NONE), Cipher is (NONE)

or

system lib:crypto/bio/b_addr.c:730:Name or service not known
connect:errno=22

in case of TCP connection issues.

AxelLin commented 3 years ago

Hi @AxelLin

As @ESP-YJM mentioned above, we're trying to address the memory issue, please find below a WIP patch which avoids tls struct allocation for plain TCP connections:

tcp_transport-Use-tcp_connect-esp_tls-for-TCP.patch.txt

I hit below error after applying this patch: E (1221122) esp-tls: Failed to create socket (family 2 socktype 1 protocol 0) E (1221122) TRANSPORT_BASE: Failed to open a new connection: 32770 E (1221126) HTTP_CLIENT: Connection failed, sock < 0 esp_http_client_open failed, err=0x7002 ESP_ERR_HTTP_CONNECT errno=23

david-cermak commented 3 years ago

Oh, socket exhaustion issue (if we connect manually, we have to close the socket too)

diff --git a/components/tcp_transport/transport_ssl.c b/components/tcp_transport/transport_ssl.c
index f118b6bf5e2..82e89152bc6 100644
--- a/components/tcp_transport/transport_ssl.c
+++ b/components/tcp_transport/transport_ssl.c
@@ -277,6 +277,8 @@ static int ssl_close(esp_transport_handle_t t)
         ret = esp_tls_conn_destroy(ssl->tls);
         ssl->conn_state = TRANS_SSL_INIT;
         ssl->ssl_initialized = false;
+    } else if (ssl && ssl->sockfd >= 0) {
+        close(ssl->sockfd);
     }
     return ret;
 }

Thanks for the early feedback and sorry for the inconvenience! we do have a test on socket leaks, but was too hasty in posting patches before running it.

AxelLin commented 3 years ago

Hi @david-cermak

With above fix, now I no longer hit socket exhaustion issue.

But sometimes I hit below issue using v4.4-dev-1254-g639e7ad494d9 + all the patch/fix mentioned in this thread. ( I don't recall such issue when I test with v4.3 branch).

Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Core 0 register dump: PC : 0x401396cf PS : 0x00060330 A0 : 0x801a1eda A1 : 0x3ffdeea0 0x401396cf: tcp_write at /home/axel/esp/esp-idf/components/tcp_transport/transport_ssl.c:215 (discriminator 6)

A2 : 0x00000000 A3 : 0x00679e31 A4 : 0x00000001 A5 : 0x00002710 A6 : 0x00000600 A7 : 0x3ffbbee4 A8 : 0x801396c5 A9 : 0x3ffdee80 A10 : 0x3f431d4d A11 : 0x3f4241be A12 : 0x00000038 A13 : 0x00000004 A14 : 0x3ffb5150 A15 : 0x3ffdecf0 SAR : 0x0000001d EXCCAUSE: 0x0000001c EXCVADDR: 0x000006c0 LBEG : 0x40116a62 LEND : 0x40116a69 LCOUNT : 0x00000000 0x40116a62: lwip_selscan at /home/axel/esp/esp-idf/components/lwip/lwip/src/api/sockets.c:1905 (discriminator 3)

0x40116a69: lwip_selscan at /home/axel/esp/esp-idf/components/lwip/lwip/src/api/sockets.c:1898

Backtrace:0x401396cc:0x3ffdeea00x401a1ed7:0x3ffdeed0 0x40139e71:0x3ffdeef0 0x400deccd:0x3ffdef10 0x400dee75:0x3ffdef70 0x401a427b:0x3ffdef90 0x400deea1:0x3ffdefe0 0x401396cc: tcp_write at /home/axel/esp/esp-idf/components/tcp_transport/transport_ssl.c:215 (discriminator 6)

0x401a1ed7: esp_transport_write at /home/axel/esp/esp-idf/components/tcp_transport/transport.c:206

0x40139e71: esp_http_client_write at /home/axel/esp/esp-idf/components/esp_http_client/esp_http_client.c:1324 (inlined by) esp_http_client_write at /home/axel/esp/esp-idf/components/esp_http_client/esp_http_client.c:1316

My application calls esp_http_client_write() when this issue happen.

AxelLin commented 3 years ago

Hi @david-cermak

With above fix, now I no longer hit socket exhaustion issue.

But sometimes I hit below issue using v4.4-dev-1254-g639e7ad494d9 + all the patch/fix mentioned in this thread. ( I don't recall such issue when I test with v4.3 branch).

Guru Meditation Error: Core 0 panic'ed (LoadProhibited). Exception was unhandled.

Core 0 register dump: PC : 0x401396cf PS : 0x00060330 A0 : 0x801a1eda A1 : 0x3ffdeea0 0x401396cf: tcp_write at /home/axel/esp/esp-idf/components/tcp_transport/transport_ssl.c:215 (discriminator 6)

A2 : 0x00000000 A3 : 0x00679e31 A4 : 0x00000001 A5 : 0x00002710 A6 : 0x00000600 A7 : 0x3ffbbee4 A8 : 0x801396c5 A9 : 0x3ffdee80 A10 : 0x3f431d4d A11 : 0x3f4241be A12 : 0x00000038 A13 : 0x00000004 A14 : 0x3ffb5150 A15 : 0x3ffdecf0 SAR : 0x0000001d EXCCAUSE: 0x0000001c EXCVADDR: 0x000006c0 LBEG : 0x40116a62 LEND : 0x40116a69 LCOUNT : 0x00000000

I think above exception is not related to the patch discussed here.

I also applied lwip-optimization-config-option-LWIP_TCPIP_CORE_LOCKING.patch.txt in my current code, and I thought there is no impact with the test because my config disabled the LWIP_TCPIP_CORE_LOCKING.

As I mentioned in https://github.com/espressif/esp-idf/issues/6919#issuecomment-826734896 The lwip-optimization-config-option-LWIP_TCPIP_CORE_LOCKING.patch.txt also changed LWIP_NETCONN_SEM_PER_THREAD setting to 1 when !LWIP_TCPIP_CORE_LOCKING. By changing LWIP_NETCONN_SEM_PER_THREAD back to 0, it seems no longer hit the exception so far.

david-cermak commented 3 years ago

Thanks for the clarification. This patch really needs some cleanup, but nothing related to the issue above.

About the core-locking (have replied in the linked issue): LWIP_NETCONN_SEM_PER_THREAD should be set to 1 on the version you referred (v4.4-dev-1254-g639e7ad494d9)

AxelLin commented 3 years ago

In below changes:

+static int tcp_write(esp_transport_handle_t t, const char *buffer, int len, int timeout_ms)
+{
+    int poll;
+    transport_esp_tls_t *ssl = ssl_get_context_data(t);
+
+    if ((poll = esp_transport_poll_write(t, timeout_ms)) <= 0) {
+        ESP_LOGW(TAG, "Poll timeout or error, errno=%s, fd=%d, timeout_ms=%d", strerror(errno), ssl->tls->sockfd, timeout_ms);

ssl->tls is NULL, so it should be:

ESP_LOGW(TAG, "Poll timeout or error, errno=%s, fd=%d, timeout_ms=%d", strerror(errno), ssl->sockfd, timeout_ms);
AxelLin commented 3 years ago

Hi @david-cermak

(This is probably off-topic) I'm tracing above exception I hitted, and it seems happen after esp_http_client_write() returns 0. So I'm wondering if esp_http_client_write() returns 0 is an error or not? https://www.esp32.com/viewtopic.php?f=13&t=20782

Then I check current implemenation in of tcp_write/ssl_write in components/tcp_transport/transport_ssl.c What happen if esp_transport_poll_write returns 0? it makes tcp_write returns 0 (without sending any data). then esp_http_client_write() returns 0. If I do a following esp_http_client_write() to write more data after esp_http_client_write() returning 0, it trigger the exception. (All the callers in esp-idf check if the ret < 0, is it consider 0 as success. Is such checking correct? Or should it change to check ret <= 0?)

Similar question in esp-mqtt library: One of the esp_transport_write() caller checks "if (write_len < 0)", the other one checks "if (write_len <= 0)", which one is correct? https://github.com/espressif/esp-mqtt/blob/master/mqtt_client.c#L579 https://github.com/espressif/esp-mqtt/blob/master/mqtt_client.c#L866

david-cermak commented 3 years ago

@AxelLin This is a good question, I think neither of the examples you gave handles the writes 100% correctly. I would say that when esp_http_client_write() returns 0, it is not an error, but an indication of the timeout, a means of exiting this loop (without an error) https://github.com/espressif/esp-idf/blob/master/components/esp_http_client/esp_http_client.c#L1333-L1343

(of course this is a bit simplified implementation, we should start a timer just before looping and exit the loop once it expires and we haven't sent all the data, but I believe a good and simple approximation)

esp_http_client_write() returning 0, it trigger the exception.

This is strange, are you getting the error posted above? (LoadProhibited in the lwip_selscan() ? Could you please check the errno when getting the 0 ret value? (and perhaps enable lwip-socket debug). The only thing I could think of is that the connection gets closed cleanly, by the FIN flag and we don't correctly check for this condition in the tcp_write() (I believe this would show up in the socket log), similarly to what's being done for tcp_read() here:

https://github.com/espressif/esp-idf/blob/master/components/tcp_transport/transport_ssl.c#L216-L217


As for the mqtt library, we should either check for <= 0 or do-while until we transmit all the data (the same way as it's done in the esp_http_client). I'll do the former first and try to recreate the condition with incomplete writes.

MPC-BlackBox commented 2 years ago

Hi. I am using eclipse with the ESP32 IDF plugin 2.3.02021111091633.

I have a wifi application that has connections for a UDP, HTTP webserver, sntp server and MQTT.

For testing I have the ESP32 connected to a mobile phone AP. If I switch off the AP and switch on again about every 1 in 10 goes I get the following Log:

E (3700561) esp-tls: Failed to create socket (family 2 socktype 1 protocol 0)
E (3700561) esp-tls: Failed to open new connection
E (3700561) TRANS_SSL: Failed to open a new connection
E (3700571) MQTT_CLIENT: Error transport connect
I (3700571) GATEWAY_MAIN: MQTT_EVENT_ERROR
I (3700581) GATEWAY_MAIN: MQTT_EVENT_DISCONNECTED
I (3708321) RAINDIRECTOR_END_NODE: Message sent
I (3709731) LORA_MODEM: Message from Device Address 22011C6F MHDR = 40
W (3709731) MQTT_CLIENT: Publish: Losing qos0 data when client not connected
I (3709731) GATEWAY_MAIN: sent publish successful, msg_id=0
W (3709951) MQTT_CLIENT: Publish: Losing qos0 data when client not connected
I (3709951) GATEWAY_MAIN: sent publish successful, msg_id=0
I (3715581) GATEWAY_MAIN: Other event id:7
E (3715581) esp-tls: Failed to create socket (family 2 socktype 1 protocol 0)
E (3715581) esp-tls: Failed to open new connection
E (3715581) TRANS_SSL: Failed to open a new connection
E (3715591) MQTT_CLIENT: Error transport connect
I (3715591) GATEWAY_MAIN: MQTT_EVENT_ERROR
I (3715601) GATEWAY_MAIN: MQTT_EVENT_DISCONNECTED

This repeats forever. The MQTT client refuses to reconnect. If I reset the ESP32 everything recovers okay. 

Any thoughts?
AxelLin commented 2 years ago

This repeats forever. The MQTT client refuses to reconnect. If I reset the ESP32 everything recovers okay.

Any thoughts?

@MPC-BlackBox Please create a new issue.

Divraj-Dev commented 2 years ago

@MPC-BlackBox Did you get a resolution on this? I am facing the same issue. Do you have link to the new issue ?

AxelLin commented 2 years ago

@MPC-BlackBox Did you get a resolution on this? I am facing the same issue. Do you have link to the new issue ?

@divrajBevie This issue was closed and the topic of this issue is not related to your issue. Please create a new issue for tracking.