espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.91k stars 7.33k forks source link

MbedTLS often fails Client side certificate validation in handshake (IDFGH-1928) #4127

Closed enricop closed 4 years ago

enricop commented 5 years ago

Environment

Problem Description

In our project we are using Mutual Authentication with esp_mqtt and esp_http_client. We run extensive testing and we saw very often that MbedTLS fails during the validation of Client Side certificate. The first step of the handshake, when it validates the Server side certificate with the provided CA always goes fine.

To make it clearer: our client side certificates are signed correctly (2048bit RSA) and for around 30% of the cases, ESP_MQTT authenticates correctly with the broker with mutual auth.

The main problem now is "BIGNUM - Memory allocation failed" but printing heap_caps_print_heap_info() it shows that we have 100K of free internal RAM. Moreover sometimes the TLS handshake takes too long to run on the ESP32 and the connection is reset. But sniffing the traffic on the server side we don't see anything strange. Why sometimes it does work and sometimes does not? 70% of the times I get the errors below.

We saw similar problems with esp_http_client, but they are quite rare.

@david-cermak please have a look at this two debug logs that occurs very often when connecting

Debug Logs

I (553807) mqtt_main_event_cb: MQTT_EVENT_BEFORE_CONNECT E (555697) esp-tls: mbedtls_ssl_handshake returned -0x10 I (555697) esp-tls: Certificate verified. E (555697) esp-tls: Failed to open new connection E (555697) TRANS_SSL: Failed to open a new connection E (555727) MQTT_CLIENT: Error transport connect W (555727) mqtt_main_event_cb: MQTT_EVENT_ERROR: Last esp error code: 0x8010 = ESP_ERR_MBEDTLS_SSL_HANDSHAKE_FAILED W (555737) mqtt_main_event_cb: Last mbedtls failure: 0x10 = BIGNUM - Memory allocation failed I (555737) mqtt_main_event_cb: MQTT_EVENT_DISCONNECTED


I (514495) mqtt_main_event_cb: MQTT_EVENT_BEFORE_CONNECT E (520035) esp-tls: mbedtls_ssl_handshake returned -0x7280 I (520035) esp-tls: Certificate verified. E (520035) esp-tls: Failed to open new connection E (520045) TRANS_SSL: Failed to open a new connection E (520055) MQTT_CLIENT: Error transport connect W (520055) mqtt_main_event_cb: MQTT_EVENT_ERROR: Last esp error code: 0x8010 = ESP_ERR_MBEDTLS_SSL_HANDSHAKE_FAILED W (520065) mqtt_main_event_cb: Last mbedtls failure: 0x7280 = SSL - The connection indicated an EOF I (520075) mqtt_main_event_cb: MQTT_EVENT_DISCONNECTED


if neeed I can provide more verbose logging

SDKConfig

sdkconfig

enricop commented 5 years ago

This is the log with CONFIG_MBEDTLS_DEBUG_LEVEL_VERBOSE=y mbedtls_verbose_mutual_authentication_errors.txt

The error codes are not always the same...it doesn't fail exactly in the same way all the times.

david-cermak commented 5 years ago

Hi @enricop

I think in general these frequent connection failures could be indeed connected with allocation issues. Does any other allocation or task creation take place after printing heap info? I would suggest trying to debug mbedtls allocs by alterring mbedtls_calloc() in components/mbedtls/mbedtls/library/platform.c into:

#include "esp_heap_caps.h"

void * mbedtls_calloc( size_t nmemb, size_t size )
{
    void * m = (*mbedtls_calloc_func)( nmemb, size );
    if (!m) {
        printf("Failed to allocate: %d x %d !\n", nmemb, size);
        heap_caps_print_heap_info(MALLOC_CAP_INTERNAL|MALLOC_CAP_8BIT);
    }
    return m;
}

Even if allocation succeeds, it may take more time and might be timeouted on the broker side. here is an example of my local mosquitto connection tests when reducing free heap space:

1569916599: New connection from 192.168.2.11 on port 8883.
1569916622: Client <unknown> has exceeded timeout, disconnecting.

Failing in a similar fashion you posted (the mbedtls verbose log) Which broker do you use? Can you please check the error message on server side?

Could you please also try to set external memory as Memory allocation strategy in mbetls configuration just to check if the issues go away? Moreover this option could help: Asymmetric in/out fragment length, again in mbedtls component config. Noticed it wasn't set in your sdkconfig.

enricop commented 5 years ago

Hi @david-cermak ,

when I enable mbedtls debug log, I receive very often the two errors below, in particular:

mbedtls_ssl_fetch_input() fail:

I (42552) mbedtls: ssl_tls.c:2722 ssl->f_recv(_timeout)() returned 0 (-0x0000) W (42562) mbedtls: ssl_tls.c:4973 mbedtls_ssl_fetch_input() returned -29312 (-0x7280) W (42572) mbedtls: ssl_tls.c:4344 ssl_get_next_record() returned -29312 (-0x7280) W (42582) mbedtls: ssl_tls.c:5906 mbedtls_ssl_read_record() returned -29312 (-0x7280) I (42582) mbedtls: ssl_tls.c:8094 <= handshake E (42592) esp-tls: mbedtls_ssl_handshake returned -0x7280 I (42592) esp-tls: Certificate verified. E (42602) esp-tls: Failed to open new connection E (42602) TRANS_SSL: Failed to open a new connection E (42612) MQTT_CLIENT: Error transport connect W (42612) mqtt_main_event_cb: MQTT_EVENT_ERROR: Last esp error code: 0x8010 = ESP_ERR_MBEDTLS_SSL_HANDSHAKE_FAILED W (42632) mqtt_main_event_cb: Last mbedtls failure: 0x7280 = SSL - The connection indicated an EOF

mbedtls_ssl_flush_output() fail:

I (58812) mbedtls: ssl_tls.c:2779 ssl->f_send() returned -80 (-0x0050) W (58812) mbedtls: ssl_tls.c:3472 mbedtls_ssl_flush_output() returned -80 (-0x0050) W (58822) mbedtls: ssl_tls.c:3315 ssl_write_record() returned -80 (-0x0050) W (58832) mbedtls: ssl_tls.c:5889 mbedtls_ssl_write_handshake_msg() returned -80 (-0x0050) I (58842) mbedtls: ssl_tls.c:8094 <= handshake E (58842) esp-tls: mbedtls_ssl_handshake returned -0x50 I (58852) esp-tls: Certificate verified. E (58852) esp-tls: Failed to open new connection E (58862) TRANS_SSL: Failed to open a new connection E (58872) MQTT_CLIENT: Error transport connect W (58872) mqtt_main_event_cb: MQTT_EVENT_ERROR: Last esp error code: 0x8010 = ESP_ERR_MBEDTLS_SSL_HANDSHAKE_FAILED W (58882) mqtt_main_event_cb: Last mbedtls failure: 0x50 = UNKNOWN ERROR CODE (0050)

these looks not related to memory allocation. Just before starting any connection attemp i print the internal heap info and I've got:

Heap summary for capabilities 0x00000800: At 0x3ffb2730 len 15448 free 0 allocated 15316 min_free 0 largest_free_block 0 alloc_blocks 25 free_blocks 0 total_blocks 25 At 0x3ffc6a24 len 65535 free 9864 allocated 55380 min_free 9832 largest_free_block 9184 alloc_blocks 61 free_blocks 3 total_blocks 64 At 0x3ffaff10 len 240 free 0 allocated 152 min_free 0 largest_free_block 0 alloc_blocks 14 free_blocks 0 total_blocks 14 At 0x3ffb6388 len 7288 free 0 allocated 7136 min_free 0 largest_free_block 0 alloc_blocks 30 free_blocks 0 total_blocks 30 At 0x3ffb9a20 len 16648 free 0 allocated 16284 min_free 0 largest_free_block 0 alloc_blocks 83 free_blocks 0 total_blocks 83 At 0x3ffc6a08 len 103928 free 0 allocated 103452 min_free 0 largest_free_block 0 alloc_blocks 111 free_blocks 0 total_blocks 111 At 0x3ffe0440 len 129984 free 69648 allocated 58436 min_free 69132 largest_free_block 69596 alloc_blocks 464 free_blocks 3 total_blocks 467 At 0x40078000 len 32768 free 32732 allocated 0 min_free 32724 largest_free_block 32732 alloc_blocks 0 free_blocks 1 total_blocks 1 At 0x40099a80 len 25984 free 25948 allocated 0 min_free 25948 largest_free_block 25948 alloc_blocks 0 free_blocks 1 total_blocks 1 Totals: free 138192 allocated 256156 min_free 137636 largest_free_block 69596

The broker server is handled by another company so we don't know the software. They just said that they don't see any particular error on the server side.

We already tried with external memory allocation enable but the performance are much worse and the handshake almost always fails with different errors.

enricop commented 5 years ago

Increasing the CPU speed from 160mhz to 240mhz solves the handshake problem

iosix commented 5 years ago

I found that you need to increase the watchdog timeout to 60s and put a 30s limit on the socket read. Then enable hardware acceleration and you should be getting handshakes in around 8s even at 80mhz.

Robert Vogt IV CEO IOSiX, LLC 1300 Tefft Ct #1 Saline, MI 48176 P: +1-855-OBD-1939 C: +1-734-730-9690 robert@iosix.com

On Oct 3, 2019, at 7:59 AM, enricop notifications@github.com wrote:

Increasing the CPU speed from 160mhz to 240mhz solves the handshake problem

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

enricop commented 5 years ago

@iosix thanks. How can you increase those timeouts on esp-idf?

iosix commented 5 years ago

make menuconfig -> Component config -> ESP32-specific -> Task Watchdog timeout period (seconds) -> set to 60

For socket: struct timeval interval = {30, 0}; lwip_setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, (char )&interval, sizeof(struct timeval)); lwip_setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, (char )&interval, sizeof(struct timeval));

By the way, I recommend using lwip non-blocking connect and then initializing mbedtls instead of using mbedtls_net_connect as the mbedtls function doesn't support non-blocking and thus can get stuck if your connection drops, etc:

fd = lwip_socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); struct sockaddr_in addr; memset(&addr, 0, sizeof(addr)); addr.sin_len = sizeof(addr); addr.sin_family = AF_INET; addr.sin_port = PP_HTONS(7901); addr.sin_addr.s_addr = inet_addr(SERVERip); if (fd < 0) { break; } u32 opt; opt = lwip_fcntl(fd, F_GETFL, 0); opt |= O_NONBLOCK; lwip_fcntl(fd, F_SETFL, opt);

struct timeval interval = {30, 0}; lwip_setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, (char )&interval, sizeof(struct timeval)); lwip_setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, (char )&interval, sizeof(struct timeval));

if ((ret = lwip_connect(fd, (struct sockaddr*)&addr, sizeof(addr))) != 0) { fd_set readset, writeset, errset; struct timeval tv; FD_ZERO(&readset); FD_SET(fd, &readset); FD_ZERO(&writeset); FD_SET(fd, &writeset); FD_ZERO(&errset); FD_SET(fd, &errset); tv.tv_sec = 30; tv.tv_usec = 0; ret = lwip_select(fd + 1, &readset, &writeset, &errset, &tv); debug_printf("select returned %d\r\n", ret);

if (ret != 1) { ret = lwip_shutdown(fd, 2); ret = lwip_close(fd); if (ret == 0) { sock = get_socket(fd); if (sock) free_socket(sock, 1); } break; } else { debug_printf("Connected to server\r\n"); server_fd.fd = fd; } } else { debug_printf("Connected to server\r\n"); server_fd.fd = fd; }

To do this you also need to make get_socket and free_socket not static in lwip library and declare them in your code as:

struct lwip_sock ESP_IRAM_ATTR get_socket(int s); void free_socket(struct lwip_sock sock, int is_tcp);

Thanks.

Robert Vogt IV CEO IOSiX, LLC 1300 Tefft Ct #1 Saline, MI 48176 Reception: (855) OBD-1939 Direct/Cell: +1-734-730-9690 robert@iosix.com

On Thu, Oct 3, 2019 at 12:19 PM enricop notifications@github.com wrote:

@iosix https://github.com/iosix thanks. How can you increase those timeout on esp-idf?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/espressif/esp-idf/issues/4127?email_source=notifications&email_token=AAUXMBY22OVBXTGWPWQ4ONDQMYLPXA5CNFSM4I3EL5HKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEAIX2WI#issuecomment-538017113, or mute the thread https://github.com/notifications/unsubscribe-auth/AAUXMB6YBDA7OHTZBTOZNVLQMYLPXANCNFSM4I3EL5HA .

enricop commented 5 years ago

2892

after increasing speed to 240mhz it worked just for one day. the issue is always present now, and still occurs on connections.

I think this is all caused by mbedtls using external ram in our case.

copercini commented 5 years ago

I'm hitting probably the same problem: Internal heap remaining is 124200 bytes and AWS handshake fails

Setting mbedtls to psram or as 240mhz, it just works

ThomasRogg commented 5 years ago

If it is the PSRAM bug, working without PSRAM OR in unicore mode fixes the problem.

enricop commented 5 years ago

Hi @neoniousTR we have the following option enabled in sdkconfig, together with PSRAM enabled.

CONFIG_FREERTOS_UNICORE=y

Is this the option for "unicore mode" as you refer in #2892 ? If it is this one, we are experiencing PSRAM cache corruption problems even with this enabled.

ThomasRogg commented 5 years ago

Interesting. We do not. We've tested a lot on low level and all our problems stopped when switching to unicore. We are also using mbedtls with PSRAM, works like a charm.

How do you know that this problem has to do with the PSRAM cache?

enricop commented 5 years ago

this is why it is related to PSRAM cache corruption:

Screenshot_corrupt

ThomasRogg commented 5 years ago

I hate memory bugs.. The reason for these can be everywhere.. Dangling pointers, and other things.

Same here: This is not a proof that this is the PSRAM issue.

enricop commented 5 years ago

Hi all, with release/v4.0 and this sdkconfig.txt we have always this error below:

W (59604) mbedtls: ssl_cli.c:2997 mbedtls_ecdh_calc_secret() returned -16 (-0x0010)

See the full log here: https://pastebin.com/raw/rD9rPb3F

projectgus commented 5 years ago

Hi @enricop ,

Thanks for providing all this detail. For future reference, please use a service like https://gist.github.com to paste long logs. It makes the comment thread easier to follow. I've taken the liberty of moving the log dump to pastebin.com and linking it.

CONFIG_MBEDTLS_INTERNAL_MEM_ALLOC=y

W (59604) mbedtls: ssl_cli.c:2997 mbedtls_ecdh_calc_secret() returned -16 (-0x0010)

(59654) on_error: Last mbedtls failure: 0x10 = BIGNUM - Memory allocation failed free external memory: 4067340 free internal memory: 119768 free malloc() memory: 4127320

MBEDTLS_ERR_MPI_ALLOC_FAILED -0x0010 indicates that either a malloc() has failed or a big number grew larger than 40,000 bytes in size (unlikely).

The free internal memory at the end looks quite high. To be sure that free memory has been high during the entire operation (rather than running out mid-operation and then freeing a lot of memory during cleanup), can you please try adding the following code from the place you print the free memory at the end:

printf("free internal memory: %u lifetime minimum: %u\n",
          heap_caps_get_free_size(MALLOC_CAP_INTERNAL|MALLOC_CAP_8BIT),
          heap_caps_get_minimum_free_size(MALLOC_CAP_INTERNAL|MALLOC_CAP_8BIT));
printf("integrity check: %d\n", heap_caps_check_integrity(MALLOC_CAP_INTERNAL|MALLOC_CAP_8BIT, true));
enricop commented 5 years ago

Hi @projectgus

thank you very much for the code snippet for printing the free memory. We are using this and the free memory is much lower than the one printed by heap_caps_print_heap_info.

We set the mbedtls alloc config to : CONFIG_MBEDTLS_DEFAULT_MEM_ALLOC=y to try to use external memory. Now we have other errors but once in 20/30 times it connects to the broker.

tls_error_31102019

sdkconfig2.txt

enricop commented 5 years ago

after it connects, the connection if oftern cut down by this error:

tls_transport_read

negativekelvin commented 5 years ago

Have you tried this https://github.com/espressif/esp-idf/issues/3624

mahavirj commented 5 years ago

@enricop One observation from your sdkconfig:

CONFIG_MBEDTLS_ASYMMETRIC_CONTENT_LEN=y
CONFIG_MBEDTLS_SSL_IN_CONTENT_LEN=8192
CONFIG_MBEDTLS_SSL_OUT_CONTENT_LEN=16384

These settings do not look correct, especially IN content length, unless you control server side configuration as well. Here are default IDF settings per release/v4.0, https://github.com/espressif/esp-idf/blob/release/v4.0/components/mbedtls/Kconfig#L68 and https://github.com/espressif/esp-idf/blob/release/v4.0/components/mbedtls/Kconfig#L77

Any specific reason for this? What happens if you revert to IDF recommended values?

enricop commented 5 years ago

hi @mahavirj

now we set to:

CONFIG_MBEDTLS_SSL_MAX_CONTENT_LEN=16384
# CONFIG_MBEDTLS_ASYMMETRIC_CONTENT_LEN is not set

with this config the same problem occurs as reported above

projectgus commented 5 years ago

TLS fatal alert message 10 is "unexpected message", it means the server received an invalid message at the TLS layer, and is notifying the client that it's closing the connection.

Could you please post a full log again (can attach the file to the post or put it on gist.github.com), with the updated TLS buffer sizes as shown in your last post?

Also, can you tell us anything about the server you're connecting to?

enricop commented 5 years ago

hi @projectgus

setting mbedtls allocation method to INTERNAL memory (sdkconfig.txt) we have BIGNUM error: (tls_internal_memory_error_full_log.txt):

tls_internal_memory_error

setting mbedtls allocation method to EXTERNAL memory (sdkconfig2.txt) we have ALERT error, both in handshake and while connected (tls_external_memory_error_full_log.txt):

tls_external_memory_error

here there are server side packet capture for the test with external ram allocation: pcap_external_ram_tls_test.zip

If you see the pcap, the fatal error is caused by an "unexpected message" in the handshake. This confirms what @projectgus was saying.

Since everything IS working FINE while we have INTERNAL memory allocation and enough free memory, our server/client/certificates are fine and the problem is in expressif mbedtls #3624 with external memory and gcc8.

In our project we have very low free memory, so to test the mqtt client with INTERNAL memory we are DISABLING some functionalities. For us it is NOT normal that with 60k of free RAM we have BIGNUM allocation error. This error goes away sistematically if we free memory! See full log here: tls_internal_memory_working_full_log.txt. Note that after connection is enstablished the free memory is 14856 and just before it, it was 75512.

negativekelvin commented 5 years ago

You never said whether you tried the updated libc https://github.com/espressif/esp-idf/issues/3624#issuecomment-515170464

enricop commented 5 years ago

You never said whether you tried the updated libc #3624 (comment)

copying the comment libc.a to esp-idf components/newlib/lib/libc.a has the following result: the handshake always completes successfully but then the connection always terminates with the same error as before

negativekelvin commented 5 years ago

/toolchain-path/xtensa-esp32-elf/xtensa-esp32-elf/lib/esp32-psram/libc.a is the one that is linked with psram workaround

enricop commented 5 years ago

thank you very much @negativekelvin , with that libc.a it connects without errors and the connection stays alive using external memory! (tls_external_memory_fixed_full_log.txt)

projectgus commented 5 years ago

Thanks for the suggestion, @negativekelvin . It looks like the root cause for this is the same as for #3624, then.

@enricop If you agree with this assessment then could you please close this issue and click Subscribe on the linked issue? It will be updated once the PSRAM toolchain fix is available.

david-cermak commented 4 years ago

Hi @enricop

Any update? Could you please close the issue if the @projectgus's assessment is okay with you? Or share any other concern otherwise?

enricop commented 4 years ago

hi all, yes with the relaease of toolchain 2019r2, mbedtls handshake is working. sorry for not updating the issue.