espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.58k stars 7.27k forks source link

HTTPS client - MBED TLS Hardware MPI issue(task WDT get triggered after a few minute since i started HTTPS client) (IDFGH-7102) #8710

Closed giorgi877 closed 1 year ago

giorgi877 commented 2 years ago

Environment

Problem Description

Hello, i have started HTTPS client(async mode) according to examples. After a few minutes of working console prints that task WDT triggered and back trace info. when i disable hardware MPI it seems to works.

Decoding stack results 0x400848f5: esp_crosscore_isr at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/esp_system/crosscore_int.c line 92 0x4010028d: mbedtls_mpi_sub_abs at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/bignum.c line 1388 0x40100385: mbedtls_mpi_sub_mpi at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/bignum.c line 1507 0x40100939: mbedtls_mpi_mod_mpi at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/bignum.c line 1963 0x4010cb91: calculate_rinv at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/port/esp_bignum.c line 129 0x4010cde8: mbedtls_mpi_exp_mod at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/port/esp_bignum.c line 99 0x4010a98b: mbedtls_rsa_public at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/rsa.c line 789 0x4010bd91: mbedtls_rsa_rsassa_pkcs1_v15_verify at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/rsa.c line 2426 0x4010be3d: mbedtls_rsa_pkcs1_verify at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/rsa.c line 2493 0x40108f46: rsa_verify_wrap at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/pk_wrap.c line 113 0x40108c89: mbedtls_pk_verify_restartable at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/pk.c line 309 0x400ff221: mbedtls_ssl_handshake_client_step at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/ssl_cli.c line 2904 0x400fd42d: mbedtls_ssl_handshake_step at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/ssl_tls.c line 8358 0x400fd44c: mbedtls_ssl_handshake_step at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/ssl_tls.c line 8363 0x400fd472: mbedtls_ssl_handshake at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/mbedtls/mbedtls/library/ssl_tls.c line 8377 0x400f8f65: esp_mbedtls_verify_certificate at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/esp-tls/esp_tls_mbedtls.c line 273 0x400f8955: esp_tls_low_level_conn at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/esp-tls/esp_tls.c line 425 0x400f8d39: esp_tls_conn_new_async at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/esp-tls/esp_tls.c line 511 0x401404fb: esp_tls_connect_async at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/tcp_transport/transport_ssl.c line 85 0x40140531: ssl_connect_async at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/tcp_transport/transport_ssl.c line 96 0x40157ae5: esp_transport_connect_async at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/tcp_transport/transport.c line 181 0x400f9e39: esp_http_client_connect at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/esp_http_client/esp_http_client.c line 1210 0x400fab08: esp_http_client_perform at C:/Users/GIORGI877/eclipse/esp-idf-v4.4/components/esp_http_client/esp_http_client.c line 1051 0x400d77a3: httpClientTask at ../main/main.c line 282

Expected Behavior

Works, if i disable hardware MPI form MBEDTLS config

Actual Behavior

Task WDT trigger and core hung

Steps to reproduce

Disable hardware MPI

Code to reproduce this issue

esp_http_client_config_t config = {
.url = "https://portal-rsig.ge/dev/",
//.path = "",
//.keep_alive_enable = false,
.method = HTTP_METHOD_POST,
.event_handler = _http_event_handler,
.crt_bundle_attach = esp_crt_bundle_attach,
.user_data = local_response_buffer, // Pass address of local buffer to get response
.is_async = true,
.timeout_ms = HTTP_REQUEST_TIMEOUT, //3000
};

            esp_http_client_handle_t client = esp_http_client_init(&config);
            esp_http_client_set_header(client, "Content-Type", "application/json");

            do {
                err = esp_http_client_perform(client);
            }while(err == ESP_ERR_HTTP_EAGAIN);

            if (err == ESP_OK) {
                ESP_LOGI(TAG, "HTTPS Status = %d, content_length = %d",
                        esp_http_client_get_status_code(client),
                        esp_http_client_get_content_length(client));
            } else {
                ESP_LOGE(TAG, "Error perform http request %s", esp_err_to_name(err));
            }
        esp_http_client_cleanup(client);

Debug Logs

E (1079115) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (1079115) task_wdt:  - IDLE (CPU 1)
E (1079115) task_wdt: Tasks currently running:
E (1079115) task_wdt: CPU 0: IDLE
E (1079115) task_wdt: CPU 1: Http client
E (1079115) task_wdt: Print CPU 0 (current core) backtrace

Backtrace:0x400848F5:0x3FFB11800x400829F1:0x3FFB11A0 0x4010028D:0x3FFC9480 0x40100385:0x3FFC94B0 0x40100939:0x3FFC94D0 0x4010CB91:0x3FFC9500 0x4010CDE8:0x3FFC9530 0x4010A98B:0x3FFC95C0 0x4010BD91:0x3FFC95F0 0x4010BE3D:0x3FFC9620 0x40108F46:0x3FFC9650 0x40108C89:0x3FFC9690 0x400FF221:0x3FFC96C0 0x400FD42D:0x3FFC9770 0x400FD44C:0x3FFC9790 0x400FD472:0x3FFC97B0 0x400F8F65:0x3FFC97D0 0x400F8955:0x3FFC97F0 0x400F8D39:0x3FFC9870 0x401404FB:0x3FFC9890 0x40140531:0x3FFC98C0 0x40157AE5:0x3FFC98E0 0x400F9E39:0x3FFC9900 0x400FAB08:0x3FFC9930 0x400D77A3:0x3FFC9950 

Other items if possible

laukik-hase commented 2 years ago

Hello, @giorgi877!

Thank you for sending the issue report. I was investigating this issue but unfortunately could not reproduce it with the code snippet you provided on the ESP-IDF v4.4 tag.

Could you please attach the following -

  1. The sdkconfig file for your example
  2. Debug logs (with CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y)
giorgi877 commented 2 years ago

Hello, @giorgi877!

Thank you for sending the issue report. I was investigating this issue but unfortunately could not reproduce it with the code snippet you provided on the ESP-IDF v4.4 tag.

Could you please attach the following -

  1. The sdkconfig file for your example
  2. Debug logs (with CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y)

Thank u for your reply!

  1. here is sdconfig for this project. sdkconfig.txt
  2. unfortunately i has not test device here, im not able to provide debug logs right now.

but i can show u where is bug. It seems bug is in hardware RSA Accelerator. Works fine if i use MBED tls MPI(uncheck "enable hardware MPI(bignum) accelartion", fails in case of hardware RSA exponentiation and multiplication calculations. i have tried to set "fallback to software implemenation for larger MPI values" (in case of automatic swithing to software mbedTLS calculation, if number exceeds SOC_RSA_MAX_BIT_LEN value) but the result is the same, works fine only when i enable MBED tls MPI software calculations(uncheck "enable hardware MPI(bignum) accelartion" in sdkonfig->mbedTLS.

Capture1

Below is the code where bug might be in(file: esp_bignum.c)

Capture2

laukik-hase commented 2 years ago

Hello, @giorgi877!

  1. When does your example fail? With CONFIG_MBEDTLS_HARDWARE_MPI enabled or disabled? (Sorry for the confusion)
  2. Could you give me the exact version of ESP-IDF you are working with? Try running the command git describe --tags in ${IDF_PATH}.

Thanks, Laukik.

giorgi877 commented 2 years ago

Hello @laukik-hase ,

  1. Fails when CONFIG_MBEDTLS_HARDWARE_MPI is enabled.
  2. v4.4

Thanks

jammiejamjam commented 2 years ago

Same here

laukik-hase commented 2 years ago

Hello, @giorgi877 and @jammiejamjam!

Sorry for the late reply! I am still not able to reproduce the issue.

Could you try reproducing the issue with the esp_http_client example and your configuration?

Does increasing the task watchdog timeout period at (Top) → Component config → ESP System Settings → ESP_TASK_WDT_TIMEOUT_S help in your use case?

Also, some debug logs (with CONFIG_LOG_DEFAULT_LEVEL_DEBUG=y) would be helpful.

Equidamoid commented 2 years ago

I'm having a potentially similar issue. "Potentially" -- because I don't use https, just http, however, the rest is kinda similar.

I don't have a minimalistic enough example for copypaste, but in general the application is a busy loop polling mqtt (https://github.com/knolleary/pubsubclient) with occasional HTTP POST requests reporting some status. The http code is as follows:

    logf("pre-http {}", url);
    auto t_start = std::chrono::steady_clock::now();
    esp_http_client_config_t conf{
            .url = url.c_str(),
            .method = HTTP_METHOD_POST,
            .timeout_ms = 2000,
            .buffer_size = 1024,
            .is_async = true,
            .keep_alive_enable = false,
    };
    auto cli = esp_http_client_init(&conf);
    esp_http_client_set_post_field(cli, payload.data(), payload.size());
    while(1){
        auto err = esp_http_client_perform(cli);
        if (err == ESP_ERR_HTTP_EAGAIN){
            delay(10); // tried force-yielding, doesn't help at all
            continue;
        } else if (err == ESP_OK){
//        logf("HTTP status = {}, content_length = {}",
//             esp_http_client_get_status_code(cli),
//             esp_http_client_get_content_length(cli));
        } else {
            logf("HTTP error {}", esp_err_to_name(err));
        }
        break;
    }
    esp_http_client_cleanup(cli);
    logf("post-http {}", url);
    auto dt = std::chrono::duration_cast<milliseconds>(std::chrono::steady_clock::now() - t_start).count();
    if (dt > 200){
        logf("HTTP request to {} took {} ms", url, dt);
    }

I also tried without is_async, same result.

I tried to narrow it down as much as the setup allows, here is what I've got so far:

Infortunately I'm bound to platformio and can't easily experiment with esp-idf config.

AxelLin commented 2 years ago

@laukik-hase

I just check the code in master tree, the esp_http_client_perform() returns ESP_ERR_HTTP_EAGAIN when esp_http_client_read / esp_http_client_fetch_headers timeout. What is the correct way to handle "esp_http_client_perform() returns ESP_ERR_HTTP_EAGAIN when is_async is false"? Just calling esp_http_client_perform() again? or any other actions required ?

BTW, the error message "Connection timed out before data was ready!" seems misleading because it's not a connection timeout, but a read/receive timeout.

laukik-hase commented 2 years ago

Hello, @AxelLin!

For sync connections, if esp_http_client_performs() returns ESP_ERR_HTTP_EAGAIN, you will have to retry.

AxelLin commented 2 years ago

Hello, @AxelLin!

For sync connections, if esp_http_client_performs() returns ESP_ERR_HTTP_EAGAIN, you will have to retry.

@laukik-hase The esp_http_client_fetch_headers() now has different behavior in different esp-idf branches. e.g. In master tree, it returns -ESP_ERR_HTTP_EAGAIN if esp_transport_read() returns 0. In v4.3, it returns ESP_FAIL if esp_transport_read() returns 0.

i.e. In previous esp-idf versions, esp_http_client_perform() returns ESP_ERR_HTTP_EAGAIN only when is_async is true.

So does that mean the older esp-idf versions need fix to return ESP_ERR_HTTP_EAGAIN if read timeout?

AxelLin commented 2 years ago

Hello, @giorgi877 and @jammiejamjam!

Sorry for the late reply! I am still not able to reproduce the issue.

Could you try reproducing the issue with the esp_http_client example and your configuration?

Looks like https://github.com/espressif/esp-idf/issues/9371 is the same issue.

AxelLin commented 2 years ago

@laukik-hase

I hit the issue: esp32-wroom-32e with v5.1-dev-286-gbcb34ca7ae

E (359745) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time: E (359745) task_wdt: - IDLE (CPU 1) E (359745) task_wdt: Tasks currently running: E (359745) task_wdt: CPU 0: IDLE E (359745) task_wdt: CPU 1: http_cli E (359745) task_wdt: Aborting. E (359745) task_wdt: Print CPU 1 backtrace

Backtrace: 0x4014415f:0x3ffdea50 0x401441f1:0x3ffdea70 0x401445ee:0x3ffdea90 0x4014e9f5:0x3ffdeab0 0x4014ebf5:0x3ffdeae0 0x4014c612:0x3ffdeb50 0x4014d2f8:0x3ffdeb80 0x4014d358:0x3ffdebb0 0x4014ae3f:0x3ffdebd0 0x4014abda:0x3ffdec00 0x401418d4:0x3ffdec20 0x4013f8d1:0x3ffdecd0 0x4013f937:0x3ffdecf0 0x40134d35:0x3ffded10 0x401349c9:0x3ffded30 0x40134aa7:0x3ffded70 0x4013865a:0x3ffdeda0 0x4019b8fb:0x3ffdedd0 0x40139c20:0x3ffdedf0 0x40139eb1:0x3ffdee10 0x400db0a3:0x3ffdee30 0x400db6ba:0x3ffdefa0 0x4014415f: mpi_sub_hlp at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1201 (inlined by) mbedtls_mpi_sub_abs at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1239

0x401441f1: mbedtls_mpi_sub_mpi at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1312

0x401445ee: mbedtls_mpi_mod_mpi at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1815 (inlined by) mbedtls_mpi_mod_mpi at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/bignum.c:1799

0x4014e9f5: calculate_rinv at /home/axel/esp/esp-idf/components/mbedtls/port/esp_bignum.c:200 (discriminator 2)

0x4014ebf5: esp_mpi_exp_mod at /home/axel/esp/esp-idf/components/mbedtls/port/esp_bignum.c:388 (inlined by) mbedtls_mpi_exp_mod at /home/axel/esp/esp-idf/components/mbedtls/port/esp_bignum.c:460

0x4014c612: mbedtls_rsa_public at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/rsa.c:767

0x4014d2f8: mbedtls_rsa_rsassa_pkcs1_v15_verify at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/rsa.c:2158

0x4014d358: mbedtls_rsa_pkcs1_verify at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/rsa.c:2210

0x4014ae3f: rsa_verify_wrap at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/pk_wrap.c:93

0x4014abda: mbedtls_pk_verify_restartable at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/pk.c:316

0x401418d4: ssl_parse_server_key_exchange at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/ssl_cli.c:3215 (inlined by) mbedtls_ssl_handshake_client_step at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/ssl_cli.c:4233

0x4013f8d1: mbedtls_ssl_handshake_step at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:5255 (inlined by) mbedtls_ssl_handshake_step at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:5225

0x4013f937: mbedtls_ssl_handshake at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:5317 (inlined by) mbedtls_ssl_handshake at /home/axel/esp/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:5293

0x40134d35: esp_mbedtls_handshake at /home/axel/esp/esp-idf/components/esp-tls/esp_tls_mbedtls.c:195

0x401349c9: esp_tls_handshake at /home/axel/esp/esp-idf/components/esp-tls/esp_tls.c:83 (discriminator 15) (inlined by) esp_tls_low_level_conn at /home/axel/esp/esp-idf/components/esp-tls/esp_tls.c:437 (discriminator 15)

0x40134aa7: esp_tls_conn_new_sync at /home/axel/esp/esp-idf/components/esp-tls/esp_tls.c:464

0x4013865a: ssl_connect at /home/axel/esp/esp-idf/components/tcp_transport/transport_ssl.c:109

0x4019b8fb: esp_transport_connect at /home/axel/esp/esp-idf/components/tcp_transport/transport.c:123

0x40139c20: esp_http_client_connect at /home/axel/esp/esp-idf/components/esp_http_client/esp_http_client.c:1274

0x40139eb1: esp_http_client_open at /home/axel/esp/esp-idf/components/esp_http_client/esp_http_client.c:1429

boarchuz commented 2 years ago

I'm also getting this.

The earliest point where I've identified the misbehaviour is at this line here: https://github.com/espressif/esp-idf/blob/eea8629fa18a75755903474c8356268812f26a5e/components/mbedtls/port/esp_bignum.c#L642

Add this just below to confirm: if(mpi_words(Z) != z_words) printf("Incorrect z_words; will now lock up\n");

Causing it to get stuck in this loop here: https://github.com/espressif/mbedtls/blob/7b428b1bf260ce1fec4e3ffb6494070d439a3a67/library/bignum.c#L1765-L1766

From what I can tell, typically z_words == 66 in mpi_mult_mpi_failover_mod_mult, so hw_words == 80. The hardware does its thing and the 80 words are read out via esp_mpi_read_result_hw_op. If all went well, the last 14 words of those 80 should be 0s. For some reason, very rarely, all 80 will be filled with junk(?) instead.

MartyMcFlyInTheSky commented 2 years ago

I'm having the same issue. Newest esp-idf (esp-idf 5.1). We have only sporadic wifi in our room so it could associated to that.

mahavirj commented 2 years ago

We are still investigating this issue.

If you would like to have quick workaround then you can disable CONFIG_MBEDTLS_HARDWARE_MPI and that should fix the problem.

Highlighting few points based on our investigations so far:

We are evaluating this further and keep you posted on the progress.

MartyMcFlyInTheSky commented 2 years ago

@mahavirj Indeed disabling hardware MPI seems to fix things. We didn't have any issues so far until now. I'm also using IDF5.x and I can't remember this problem happened before.. but I could be wrong.

mahavirj commented 2 years ago

@giorgi877 @AxelLin @boarchuz @MartyMcFlyInTheSky

Update:

So far our analysis shows culprit as the code responsible to zero initialize MPI peripheral memory block here:

https://github.com/espressif/esp-idf/blob/1c84cfde14dcffdc77d086a5204ce8a548dce935/components/mbedtls/port/esp32/bignum.c#L80-L83

This code is not behaving as per the expectation. Erroneously some memory location here still contains the garbage value. This mainly happens when compiler converts this loop into memset call (from ROM library) with optimization flags enabled. This started happening with latest toolchain from ESP-IDF v5.x release.

We are evaluating 2 fixes here:

  1. Use DPORT_WRITE_REG (volatile writes) wrappers to write to the MPI peripheral. This would have slight performance penalty, we are still analyzing the impact. 0001-esp32-mpi-use-DPORT-wrappers-to-write-register-data.zip

  2. Disable compiler -ftree-loop-distribute-patterns flag for the specific routine. It appears like this flag has been default enabled with GCC 10.x, causing library calls to be used (e.g., call to memset) 0001-esp32-mpi-fix-issue-with-data-corruption-for-release.zip

Note: Please apply either of the fix using git am -p0 <file_name>

Please try out these fixes and let us know if you still run into any issues. This will help to confirm our analysis.

AxelLin commented 1 year ago

@giorgi877 Do you still hit the issue on v4.4? (The fix is not for v4.4.)

giorgi877 commented 1 year ago

@AxelLin, I'v fixed above issue my own, will post an udate later. Thanks.

WangShuoran commented 1 year ago

I am find any question related to this problem: https://stackoverflow.com/questions/46996893/gcc-replaces-loops-with-memcpy-and-memset https://stackoverflow.com/questions/6410595/getting-gcc-to-compile-without-inserting-call-to-memcpy/33818680#33818680 Does the memset or memcpy do not work as we want is memcpy() can not find in this find running? (I am a runoob in compiler)

boarchuz commented 1 year ago

@mahavirj Just wanted to mention, in light of https://github.com/espressif/esp-idf/issues/10403, that the fix seems to be effective for my case. Thankyou.