espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.79k stars 7.31k forks source link

MPI: Timed out waiting for completion of MPI Interrupt (IDFGH-11118) #12289

Closed yizi15 closed 10 months ago

yizi15 commented 1 year ago

Answers checklist.

IDF version.

v5.1.1

Operating System used.

Windows

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

PowerShell

Development Kit.

ESP32-S3-WROOM-1 MCN8R8

Power Supply used.

Battery

What is the expected behavior?

i need get a https handle to download file

What is the actual behavior?

Some times, the function is return null, Some times, function return handle and could download the file from https At esp-idf v4.4, the function blow is work ok.

Steps to reproduce.

`HttpsDownloadHandle https_download_begin(const esp_http_client_config_t http_config, struct https_download_arg* arg) {

if(http_config == nullptr || arg == nullptr || arg->https_buf == nullptr || arg->https_buf_size < arg->header_size || arg->header_size < 0) { return nullptr; } HttpsDownloadHandle self = ql_malloc(sizeof(HttpsDownloadHandle)); assert_ret(self, nullptr); memset(self, 0, sizeof(HttpsDownloadHandle)); self->http_client = esp_http_client_init(http_config); if (self->http_client == NULL) { log_error("Failed to initialise HTTP connection"); goto failure; } esp_err_t err = _http_connect(self->http_client); if (err != ESP_OK) { ESP_LOGE(TAG, "Failed to establish HTTP connection"); goto http_cleanup; } self->state = e_https_download_init; self->arg = arg; return self; http_cleanup: _http_cleanup(self->http_client); failure: if(self) { ql_free(self); } return nullptr; };`

Config is below:

' esp_http_client_config_t config = { .url = GET_VER_INFO(this_type).image_path,//https://sleeppydata.oss-cn-beijing.aliyuncs.com/xxx .timeout_ms = 60000, .keep_alive_enable = true, .skip_cert_common_name_check = true, .crt_bundle_attach = esp_crt_bundle_attach, };'

Debug Logs.

is 

E (1243994) MPI: Timed out waiting for completion of MPI Interrupt
(1695375323647) time_sync_task: remote ms 1695375323641
E (1246012) MPI: Timed out waiting for completion of MPI Interrupt
E (1246012) esp-x509-crt-bundle: PK verify failed with error FFFFBD7F
E (1246015) esp-x509-crt-bundle: Failed to verify certificate
E (1246023) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x3000
E (1246028) esp-tls: Failed to open new connection
E (1246034) transport_base: Failed to open a new connection
E (1246041) HTTP_CLIENT: Connection failed, sock < 0
E (1246046) _http_connect: Failed to open HTTP connection: ESP_ERR_HTTP_CONNECT
E (1246053) https_download_begin: Failed to establish HTTP connection

More Information.

No response

mahavirj commented 1 year ago

@yizi15

Please share your sample code along with the sdkconfig. You may modify https_request example from IDF to recreate your failure case and share the patch here.

yizi15 commented 1 year ago

@mahavirj I found a way to reproduce https download failure, but the error log is different from the previous one. I (15794) TAG: Connected to AP, begin http example I (46240) esp-x509-crt-bundle: Certificate validated E (48828) esp-aes: Timed out waiting for completion of AES Interrupt E (48829) esp-aes: esp_aes_dma_wait_complete failed E (48829) esp-tls-mbedtls: read error :-0x0001: E (48840) transport_base: esp_tls_conn_read error, errno=Connection already in progress E (48852) https_download_begin: Failed to establish HTTP connection

http_example.zip

mahavirj commented 1 year ago

@yizi15

AES interrupt timeout issue was recently fixed on release/v5.1 branch with https://github.com/espressif/esp-idf/commit/560518939871895b53db6a55c8d800aa42355894. Can you please try your sample application on latest release/v5.1 branch once?

yizi15 commented 1 year ago

@mahavirj I checkout release/V5.1, the log is below: I (14781) example_common: - IPv6 address: fe80:0000:0000:0000:4a27:e2ff:fee9:0fa8, type: ESP_IP6_ADDR_IS_LINK_LOCAL I (14792) TAG: Connected to AP, begin http example E (22426) MPI: Timed out waiting for completion of MPI Interrupt E (24444) ERROR A stack overflow in task tst has been detected.

Backtrace: 0x40379fce:0x3fcd5410 0x40384225:0x3fcd5430 0x403870d6:0x3fcd5450 0x403859ff:0x3fcd54d0 0x40387224:0x3fcd54f0 0x4038721a:0x00000100 |<-CORRUPTED 0x40379fce: panic_abort at D:/Espressif5/frameworks/esp_idf_v5.1_test/esp-idf/components/esp_system/panic.c:452

0x40384225: esp_system_abort at D:/Espressif5/frameworks/esp_idf_v5.1_test/esp-idf/components/esp_system/port/esp_system_chip.c:84

0x403870d6: vApplicationStackOverflowHook at D:/Espressif5/frameworks/esp_idf_v5.1_test/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:581

0x403859ff: vTaskSwitchContext at D:/Espressif5/frameworks/esp_idf_v5.1_test/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:3728

0x40387224: _frxt_dispatch at D:/Espressif5/frameworks/esp_idf_v5.1_test/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:450

0x4038721a: _frxt_int_exit at D:/Espressif5/frameworks/esp_idf_v5.1_test/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/portasm.S:245

the stack of task 'tst' is 4096

yizi15 commented 1 year ago

when i clone the idf, no heap\tlsf found, then i copy the dir from v5.1.1 (download from https://dl.espressif.cn/dl/idf-installer/esp-idf-tools-setup-offline-5.1.1.exe)

yizi15 commented 1 year ago

http_example.zip is run at ESP32-S3-DevKitC-1 V1.0

yizi15 commented 1 year ago

the bug is only reproduce on the audio(i2s0 i2c0), uart1 and i2c1 enable at the same,

mahavirj commented 1 year ago

@yizi15

Some observations from the sample app you shared:

  1. There is no reason to block the app_main, you can simply return from it. Once returned, the main task shall be deleted.
  2. Code like following can be replaced with more relaxed tick period or simply by vTaskSuspend, vTaskDelete
    while (1)
    {
        vTaskDelay(1);
    }
  3. Please monitor stack usage of different tasks using uxTaskGetStackHighWaterMark. Typically for TLS related operations, we recommend stack sizes to be well above 5-6KB.
yizi15 commented 1 year ago

Now, the stack is 8192. ' UBaseType_t WaterMark = uxTaskGetStackHighWaterMark(NULL); ESP_LOGI("", "WaterMark %u\n", WaterMark);'

I (24954) esp-x509-crt-bundle: Certificate validated E (28023) esp-aes: Timed out waiting for completion of AES Interrupt E (28024) esp-aes: esp_aes_dma_wait_complete failed E (28024) esp-tls-mbedtls: read error :-0x0001: E (28035) transport_base: esp_tls_conn_read error, errno=Connection already in progress E (28047) https_download_begin: Failed to establish HTTP connection I (28050) : WaterMark 4644

assert failed: https_download_perform https_download.c:185 (handle)

@mahavirj Can you reproduce this bug?

mahavirj commented 1 year ago

@yizi15 Please cherry-pick the fix from https://github.com/espressif/esp-idf/issues/12289#issuecomment-1736619821 and see if that fixes your problem.

yizi15 commented 1 year ago

@yizi15 Please cherry-pick the fix from #12289 (comment) and see if that fixes your problem.

image It's my local code , it is same to https://github.com/espressif/esp-idf/commit/560518939871895b53db6a55c8d800aa42355894

hmalpani commented 1 year ago

Hello @yizi15 I tried https_request example with the URL from the sample application you shared. I am unable to reproduce the issue. Can you share a minimal application that can reproduce this issue?

Here are my logs:

I (5440) example: Start https_request example
I (5440) example: https_request using crt bundle
I (5450) main_task: Returned from app_main()
I (5460) wifi:<ba-add>idx:1 (ifx:0, 68:d7:9a:81:31:24), tid:0, ssn:0, winSize:64
I (6690) esp-x509-crt-bundle: Certificate validated
I (7570) example: Connection established...
I (7570) example: 137 bytes written
I (7580) example: Reading HTTP response...
I (221250) example: connection closed
I (221250) example: 10...
I (222250) example: 9...
I (223250) example: 8...
I (224250) example: 7...
I (225250) example: 6...
I (226250) example: 5...
yizi15 commented 1 year ago

@hmalpani http_example.zip

jakub-kocka commented 1 year ago

when i clone the idf, no heap\tlsf found, then i copy the dir from v5.1.1 (download from https://dl.espressif.cn/dl/idf-installer/esp-idf-tools-setup-offline-5.1.1.exe)

Hello @yizi15, It looks like that you have had another issue during this with installed esp-idf.

I haven't been able to reproduce this, the directory should be there so it must have been some exceptional case. Do you have installation logs?

yizi15 commented 1 year ago

@jakub-kocka
When I remove heal/tlsf, run "git submodule update", heal/tlsf works fine. There may be network fluctuations during the first installation. However, AES also has timeout issues.

hmalpani commented 1 year ago

Hello @yizi15 Can you please try with the suggestions here and let me know if they helps resolving the issue.

yizi15 commented 1 year ago

@hmalpani All the suggestions were tried but no work,

ves011 commented 1 year ago

Hello

I have the same issue but in different context. Im my case MPI timeout is hit when trying to connect to a MQTT broker with dual authentication Here is the log of the error

E (18:39:51.102) MPI: Timed out waiting for completion of MPI Interrupt E (18:39:53.122) MPI: Timed out waiting for completion of MPI Interrupt E (18:39:55.122) MPI: Timed out waiting for completion of MPI Interrupt E (18:39:55.125) esp-tls-mbedtls: mbedtls_ssl_handshake returned -0x2700 I (18:39:55.126) esp-tls-mbedtls: Failed to verify peer certificate! E (18:39:55.126) esp-tls: Failed to open new connection E (18:39:55.127) transport_base: Failed to open a new connection E (18:39:55.130) mqtt_client: Error transport connect I (18:39:55.131) MQTTClient: MQTT_EVENT_ERROR E (18:39:55.132) MQTTClient: Last error reported from esp-tls: 0x801a E (18:39:55.132) MQTTClient: Last error reported from tls stack: 0x2700 I (18:39:55.133) MQTTClient: Last errno string (Success) I (18:39:55.133) MQTTClient: MQTT_EVENT_DISCONNECTED

MQTTClient: tag comes from my mqtt client warpper

I come into this issue while porting the code from 4.4.3 to 5.1.1. On 4.4 everything works fine and no MPI timeot. The strange thing is that i ported 2 applications. App 1 is the main application which connects to the mqtt broker and shows this issue. App 2 is a related OTA application, which the same, connects to mqtt broker, but also to a https server to fetch the firmware, and this one works fine! Both applictions are using the same certificates, and basically the same mqtt client wrapper. Sometimes, right after flashing. app 1 works fine, but after first reset goes into the issue.

The workaround i found is to undef CONFIG_MBEDTLS_MPI_USE_INTERRUPT. After undefining app 1 works fine. Can you comment what is the impact of this workaround?

yizi15 commented 1 year ago

@ves011 My code work fine at espidf_v4.4.2 and esp_adf_v2.4. When upgrade to esp_idf_V5.1.1 and esp_adf V2.6(by esp-idf-tools-setup-offline-5.1.1.exe), the bug appear firstly. In my project, when init the uart1, i2c1 by api of idf and init the audio module (i2s0 and i2c0) at same time, AES/MPI timeout will occur. When not use api of adf, and init i2s0 and i2c0 by api of idf, it work fine. Today, i checkout the release/V5.1, and my bug seems to fixed, even though the https download speeds only is 30k/s.

TempoTian commented 1 year ago

I have met similar issue, the cause is that interrupt is not enough for AES. If you are use 2 core, you can move some interrupt to core 1, you can simply move some peripheral initialized code to another core, then the interrupt alloced by the peripheral in initialized API will more to core 1. Sample code like below:

#define RUN_ASYNC(thread, core, body)  \
void async_task##thread(void* arg) {   \
    body;                              \
    vTaskDelete(NULL);                 \
}                                      \
xTaskCreatePinnedToCore(async_task##thread, #thread, 4*1024, NULL, 3, NULL, core);

static int init_peripheral(void)
{
    _i2s_init();
    _i2c_init();
    init_lcd();
    RUN_ASYNC(i_core1, 1, {
        mount_sdcard();
        init_console();
    });
    return 0;
}
ves011 commented 1 year ago

These are just workarounds :( For me it looks like a degradation which needs to be fixed. @yizi15 had the same problem while migrating the code to v5 (on my understanding) The first step is for espressif guys to accept this is an issue.

@yizi15 you said "i checkout the release/V5.1, and my bug seems to fixed". Do you know what is the fix?

Again can someone comment what is the impact of disabling CONFIG_MBEDTLS_MPI_USE_INTERRUPT?

yizi15 commented 1 year ago

@ves011 '"i checkout the release/V5.1, and my bug seems to fixed' , it's not always work. I move the audio to core1 and work ok.

I have met similar issue, the cause is that interrupt is not enough for AES. If you are use 2 core, you can move some interrupt to core 1, you can simply move some peripheral initialized code to another core, then the interrupt alloced by the peripheral in initialized API will more to core 1. Sample code like below:

#define RUN_ASYNC(thread, core, body)  \
void async_task##thread(void* arg) {   \
    body;                              \
    vTaskDelete(NULL);                 \
}                                      \
xTaskCreatePinnedToCore(async_task##thread, #thread, 4*1024, NULL, 3, NULL, core);

static int init_peripheral(void)
{
    _i2s_init();
    _i2c_init();
    init_lcd();
    RUN_ASYNC(i_core1, 1, {
        mount_sdcard();
        init_console();
    });
    return 0;
}
chipweinberger commented 11 months ago

I am also hitting this issue in v5.1.1, when calling mbedtls_rsa_pkcs1_sign, esp32-s3

edit: same issue on v5.1.2,

edit: tried changing the timeout to 60s, same issue

edit: CONFIG_MBEDTLS_MPI_USE_INTERRUPT=n solves it!

mahavirj commented 11 months ago

Update:

mahavirj commented 11 months ago

Again can someone comment what is the impact of disabling CONFIG_MBEDTLS_MPI_USE_INTERRUPT?

There is no harm in disabling this option, it will keep the CPU in busy-wait state for the MPI operation which otherwise could have allowed other code to run on it. @ves011

mahavirj commented 11 months ago

I am also hitting this issue in v5.1.1, when calling mbedtls_rsa_pkcs1_sign

Could you please check if there was any error corresponding to MPI interrupt allocation failure? @chipweinberger

chipweinberger commented 11 months ago

@mahavirj , there are no "interrupt allocation failed" logs in my output

mahavirj commented 11 months ago

@chipweinberger My bad. Please put some debugs print here and log the return value.

Note: Per my earlier comment, this code is already fixed with https://github.com/espressif/esp-idf/commit/d3be7bda056dfcb7edbac4cf8757ec1ac7d0a1ea and backports are WIP.

ves011 commented 11 months ago

Again can someone comment what is the impact of disabling CONFIG_MBEDTLS_MPI_USE_INTERRUPT?

There is no harm in disabling this option, it will keep the CPU in busy-wait state for the MPI operation which otherwise could have allowed other code to run on it. @ves011

thanks @mahavirj Will keep it disable until the fix will be released

Harshal5 commented 10 months ago

Hello @chipweinberger, @ves011,

Recently we merged some fixes in the AES and MPI modules that were backported to release/v5.1 as well. I would recommend you to check if the issue still persists with the branch release/v5.1. Fixes that were merged were: AES-DMA related fixes (https://github.com/espressif/esp-idf/commit/2fbe919e4cafba8ce2528e8eaa21bfe7fbbe7d8f and https://github.com/espressif/esp-idf/commit/560518939871895b53db6a55c8d800aa42355894), interrupt related fixes (https://github.com/espressif/esp-idf/commit/4c30f2a4a0229979b4700c6acb34d49705fb6670, https://github.com/espressif/esp-idf/commit/e8a5d9ef989f958be9ea6d07280c13f7601a9f47).

Closing this issue as of now, feel free to re-open if the issue still persists.