espressif / esp-aws-iot

AWS IoT SDK for ESP32 based chipsets
Apache License 2.0
270 stars 164 forks source link

Memory allocation problem occurs in AWS IoT OTA process (IDFGH-13815) (CA-345) #230

Closed Gyunamchoi closed 2 weeks ago

Gyunamchoi commented 1 month ago

Answers checklist.

IDF version.

ESP-IDF v5.3 2nd stage bootloader

Espressif SoC revision.

ESP32-WROOM-32UE chip revision: v3.1

Operating System used.

macOS

How did you build your project?

Command line with idf.py

If you are using Windows, please specify command line type.

None

Development Kit.

custom board (ESP32-WROOM-32UE)

Power Supply used.

USB

What is the expected behavior?

The ultimate goal is to proceed with OTA updates through aws-iot.

What is the actual behavior?

An error related to heap_caps occurs when allocating memory such as malloc.

Steps to reproduce.

  1. Power on
  2. AWS IoT connection and MQTT connection via SIM7600EK modem with lwip(ppp) interface
  3. Transmitting shadow and measurement information - works well
  4. Create ota job in aws iot
  5. AWS -> device : topic : $aws/things/(device id)/jobs/notify-next (OK)
  6. device -> AWS : topic : $aws/things/(device id)/jobs/start-next (OK)
  7. AWS -> device : topic : $aws/things/testdev21/jobs/start-next/accepted (OK)
  8. device -> AWS : $aws/things/(device id)/streams/(stream_id)/get/json ( message Publish ERROR )

Debug Logs.

Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.

Core  0 register dump:
PC      : 0x4008d1c6  PS      : 0x00060b33  A0      : 0x8008cb64  A1      : 0x3ffc40e0  
0x4008d1c6: remove_free_block at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:332
 (inlined by) block_locate_free at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:567
 (inlined by) tlsf_malloc at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:1015

A2      : 0x3ffd039c  A3      : 0x00000058  A4      : 0x00000002  A5      : 0x00000036  
A6      : 0x3ffc4c1c  A7      : 0x3ffb55f4  A8      : 0x00000006  A9      : 0xffffffff  
A10     : 0x00000003  A11     : 0xffffffff  A12     : 0x00000468  A13     : 0x3ffb5618  
A14     : 0x3ffc4240  A15     : 0x0000cdcd  SAR     : 0x0000001e  EXCCAUSE: 0x0000001d  
EXCVADDR: 0x0000000b  LBEG    : 0x4000c349  LEND    : 0x4000c36b  LCOUNT  : 0x00000000  
0x4000c349: memcpy in ROM
0x4000c36b: memcpy in ROM

Backtrace: 0x4008d1c3:0x3ffc40e0 0x4008cb61:0x3ffc4100 0x40082875:0x3ffc4120 0x40082992:0x3ffc4140 0x400829cd:0x3ffc4170 0x40082b57:0x3ffc4190 0x40082231:0x3ffc41b0 0x400da61e:0x3ffc41d0 0x4010fa37:0x3ffc4230 0x40145491:0x3ffc4290 0x4010fc3c:0x3ffc42b0 0x40145451:0x3ffc42f0 0x400e5599:0x3ffc4310 0x400e72d1:0x3ffc4330 0x400dbf03:0x3ffc4370 0x400dc261:0x3ffc43a0 0x400dca71:0x3ffc47d0 0x400dcdd9:0x3ffc47f0 0x40147c29:0x3ffc4890 0x401476be:0x3ffc48c0 0x400e50a6:0x3ffc4900 0x400e5b08:0x3ffc4920 0x400e5c6f:0x3ffc4950 0x400e616c:0x3ffc4980 0x400883f9:0x3ffc49b0
0x4008d1c3: remove_free_block at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:331 (discriminator 1)
 (inlined by) block_locate_free at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:567 (discriminator 1)
 (inlined by) tlsf_malloc at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:1015 (discriminator 1)
0x4008cb61: multi_heap_malloc_impl at /Users/odn/esp/esp-idf/components/heap/multi_heap.c:211
0x40082875: aligned_or_unaligned_alloc at /Users/odn/esp/esp-idf/components/heap/heap_caps_base.c:77
0x40082992: heap_caps_aligned_alloc_base at /Users/odn/esp/esp-idf/components/heap/heap_caps_base.c:150
0x400829cd: heap_caps_malloc_base at /Users/odn/esp/esp-idf/components/heap/heap_caps_base.c:170
0x40082b57: heap_caps_calloc_base at /Users/odn/esp/esp-idf/components/heap/heap_caps_base.c:284
0x40082231: heap_caps_calloc at /Users/odn/esp/esp-idf/components/heap/heap_caps.c:255
0x400da61e: esp_vfs_select at /Users/odn/esp/esp-idf/components/vfs/vfs.c:993
0x4010fa37: base_poll_write at /Users/odn/esp/esp-idf/components/tcp_transport/transport_ssl.c:196 (discriminator 1)
0x40145491: esp_transport_poll_write at /Users/odn/esp/esp-idf/components/tcp_transport/transport.c:164
0x4010fc3c: ssl_write at /Users/odn/esp/esp-idf/components/tcp_transport/transport_ssl.c:215
0x40145451: esp_transport_write at /Users/odn/esp/esp-idf/components/tcp_transport/transport.c:148
0x400e5599: esp_mqtt_write at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:704
0x400e72d1: esp_mqtt_client_publish at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:2121
0x400dbf03: mqtt_publish at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:188
0x400dc261: mqtt_recvtopic_jobs at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:545
0x400dca71: mqtt_recvprocessing at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:360 (discriminator 1)
0x400dcdd9: mqtt_eventhandler at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:294
0x40147c29: handler_execute at /Users/odn/esp/esp-idf/components/esp_event/esp_event.c:135
0x401476be: esp_event_loop_run at /Users/odn/esp/esp-idf/components/esp_event/esp_event.c:584
0x400e50a6: esp_mqtt_dispatch_event at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1032
0x400e5b08: deliver_publish at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1099
0x400e5c6f: mqtt_process_receive at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1377
0x400e616c: esp_mqtt_task at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1657
0x400883f9: vPortTaskWrapper at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

More Information.

I will write down the source code part where a very short problem can occur.

Initialze mqtt `esp_err_t mqtt_initialize(){ esp_err_t err = ESP_OK; ESP_LOGI(TAG, "mqtt start intialization."); mqtt_client = esp_mqtt_client_init(&mqtt_config); err = esp_mqtt_client_register_event(mqtt_client, MQTT_EVENT_ANY, &mqtt_eventhandler, mqtt_client); err = esp_mqtt_client_start(mqtt_client); return err; }

void mqtt_eventhandler(void arg, esp_event_base_t base, int32_t event_id, void event_data) esp_mqtt_event_handle_t event = event_data; switch ((esp_mqtt_event_id_t)event_id) { case MQTT_EVENT_CONNECTED: break; case MQTT_EVENT_DISCONNECTED: break;

    case MQTT_EVENT_DATA:
        ESP_LOGI(TAG, "MQTT_EVENT_DATA,  msg_id=%d, topic[%d]=%s", event->msg_id, event->topic_len, event->topic);
        if( event->client == mqtt_client) {
            char* recv_topic = (char*)malloc(event->topic_len);
            strncpy(recv_topic, event->topic, event->topic_len);
            char* recv_data = recv_data = (char*)malloc(event->data_len);
            strncpy(recv_data, event->data, event->data_len);
            mqtt_recvprocessing(recv_topic, recv_data);
        }
       break;

}

esp_err_t mqtt_recvprocessing(char recv_topic, char recv_data){ ( parsing aws iot topic ) if(!strcmp(parts, "jobs")){ // Event Data string to JSON Structure err = mqtt_recvtopic_jobs(strtok(NULL, END_TOKEN), recv_data);
} }

esp_err_t mqtt_recvtopic_jobs(char remain_topic, char event_data){ esp_err_t err = ESP_OK; // Event Data string to JSON Structure cJSON json_jobs = cJSON_Parse(event_data); if(json_jobs != NULL){ cJSON json_execution = cJSON_GetObjectItem(json_jobs, "execution"); cJSON* json_status = cJSON_GetObjectItem(json_execution, "status");

    if((json_status != NULL) && ( json_status->valuestring != NULL )){
        char* job_status = json_status->valuestring; 
        if(!strcmp(job_status, "QUEUED" )){
            char* job_topic = (char*)malloc(sizeof("$aws/things//jobs/start-next") + strlen(device_id));
            sprintf(job_topic, "$aws/things/%s/jobs/start-next", device_id);
            mqtt_publish(job_topic, CLIENT_TOKEN);
            free(job_topic);
        } 
        else if(!strcmp(job_status, "IN_PROGRESS" )){
            ESP_LOGI(TAG, "jobs remain_topic =%s  event_data = %s ", remain_topic, event_data);                    
            ESP_LOGI(TAG, "job status : %s", cJSON_Print(json_status));
            cJSON* json_jobdocument = cJSON_GetObjectItem(json_execution, "jobDocument");
            cJSON* json_otadocument = cJSON_GetObjectItem(json_jobdocument, "afr_ota");
            char* stream_id = cJSON_GetObjectItem(json_otadocument, "streamname")->valuestring;
            if((json_otadocument != NULL) && (cJSON_IsObject(json_otadocument))){
                const esp_partition_t* running_partition = esp_ota_get_running_partition();
                if(running_partition->subtype == ESP_PARTITION_SUBTYPE_APP_FACTORY)
                {
                    // current Partition is Factory -> OTA Update 
                    ESP_LOGI(TAG, "Execute new OTA update task. (from factory partition) ");
                    // ota start topic build
                    char* topic_mqtt_start = (char*)malloc(sizeof("$aws/things//streams//get/json") + strlen(device_id) + strlen(stream_id));
                    sprintf(topic_mqtt_start, "$aws/things/%s/streams/%s/get/json", device_id, stream_id);                    
                    // start message build 
                    char* stream_id = cJSON_GetObjectItem(json_otadocument, "streamname")->valuestring;
                    cJSON* files = cJSON_GetObjectItem(json_otadocument, "files");
                    cJSON* file = cJSON_GetArrayItem(files, 0);
                    uint16_t file_id = (uint16_t)cJSON_GetObjectItem(file, "fileid")->valueint;
                    char* cert_file = cJSON_GetObjectItem(file, "certfile")->valuestring;
                    int file_size = cJSON_GetObjectItem(file, "filesize")->valueint;
                    // ota stream build
                    ota_stream.total.stream_id = stream_id;
                    ota_stream.total.file_id = file_id;
                    ota_stream.total.file_size = file_size;
                    ota_stream.total.cert_file = cert_file;
                    ota_stream.total.remain_block = file_size / CONFIG_OTA_STREAM_BLOCK_SIZE +1 ;
                    ota_stream.request_block_num = 16;
                    ota_stream.received_blocks = (int16_t*)malloc(ota_stream.request_block_num);
                    for(int i=0; i<ota_stream.request_block_num; i++){
                        ota_stream.received_blocks[i] = -1;
                    }
                    ota_stream.received_image_size = 0;
                    ota_stream.block_offset = 0;                        
                    // build payload
                //    cJSON* json_payload = ota_build_payload_stream();
                //    char* payload = cJSON_Print(json_payload);
                //    cJSON_Delete (json_payload);
                //    return payload;*/
                    **char msg_mqtt_start[1000] ;
                    sprintf(msg_mqtt_start, "{\"c\":%s, \"s\":%d, \"f\":%d,\"l\":%d,\"o\":%d,\"n\":%d}",
                        CONFIG_OTA_CLIENT_TOKEN, CONFIG_OTA_STREAM_VERSION, (int)ota_stream.total.file_size, 
                        CONFIG_OTA_STREAM_BLOCK_SIZE, ota_stream.block_offset, ota_stream.request_block_num);

// char* msg_mqtt_start = ota_request_start_stream(json_otadocument); ESP_LOGI(TAG, "topic_mqtt_start =%s msg_mqtt_start = %s ", topic_mqtt_start, msg_mqtt_start);
mqtt_publish(topic_mqtt_start, msg_mqtt_start);** ESP_LOGI(TAG, "Start ota [%s] from factory mode to stream image file request complete.",stream_id); free(topic_mqtt_start); free(msg_mqtt_start); } else { ` mqtt message process

  1. When an MQTT event occurs, the event handler processes the event and passes the topic and data as char* arguments.
  2. Parse the topic
  3. If it is jobs, pass it to the jobs process
  4. Receive the $aws/things/(device id)/jobs/notify-next topic and check if the status is queued.
  5. Once the check is complete, publish the $aws/things/(device id)/jobs/start-next message
  6. Then, AWS checks that the message status of the $aws/things/testdev21/jobs/start-next/accepted topic is IN PROGRESS as the device. 8. Once confirmed, publish the $aws/things/(device id)/streams/(stream_id)/get/json message to AWS (In this process, a kernel panic occurs during memory allocation)

When creating cJSON, cJSON_CreateObject() uses malloc, so a kernel panic occurs, and the same problem occurs in esp_mqtt_publish() when publishing. A heap_caps_calloc error occurs when performing memory-related operations. Whether creating with malloc, creating with fixed generation, or using cJSON, the same problem occurs in the same part.

There is about 170KB of free heap memory, and there are no problems at all in periodic message publishing, except for the jobs-related part.

I have no idea whether it is a problem with the mqtt module itself when publishing, a problem with another part, or a problem with json. Please help me.

Thank you.

nileshkale123 commented 1 month ago

Hello @Gyunamchoi,

Thank you for bringing up the issue.

Could you please print the free heap memory before the function fails? Alternatively, if possible, could you create a small working example to reproduce the issue or integrate it into one of the existing IDF examples?

Gyunamchoi commented 1 month ago

Hello @nileshkale123

Could you wait for 2~3days? I will summarize and deliver the following responses.

  1. I will deliver all heap dump data from heap_caps_dump_all()
  2. I will extract the part related to processing the aws iot jobs reserved command among the mqtt handlers or integrate it into the existing sample code and deliver it.

2-1) However, since it is a problem that occurs when connecting to the aws iot core service and creating an ota job, I wonder if the environment is suitable for testing the aws iot core 2-2) I will deliver the job creation process in the aws iot service later when uploading the code or heap data.

Thank you for your response.

Gyunamchoi commented 1 month ago

Hello @nileshkale123

This is all heap data just before the problem occurred. Showing data for heap: 0x3ffae6e0 Block 0x3ffae860 data, size: 184 bytes, Free: No Block 0x3ffae91c data, size: 12 bytes, Free: No Block 0x3ffae92c data, size: 12 bytes, Free: No Block 0x3ffae93c data, size: 16 bytes, Free: No Block 0x3ffae950 data, size: 196 bytes, Free: No Block 0x3ffaea18 data, size: 196 bytes, Free: No Block 0x3ffaeae0 data, size: 432 bytes, Free: No Block 0x3ffaec94 data, size: 88 bytes, Free: No Block 0x3ffaecf0 data, size: 88 bytes, Free: No Block 0x3ffaed4c data, size: 88 bytes, Free: No Block 0x3ffaeda8 data, size: 132 bytes, Free: No Block 0x3ffaee30 data, size: 132 bytes, Free: No Block 0x3ffaeeb8 data, size: 16 bytes, Free: No Block 0x3ffaeecc data, size: 88 bytes, Free: No Block 0x3ffaef28 data, size: 88 bytes, Free: No Block 0x3ffaef84 data, size: 16 bytes, Free: No Block 0x3ffaef98 data, size: 12 bytes, Free: No Block 0x3ffaefa8 data, size: 88 bytes, Free: No Block 0x3ffaf004 data, size: 1028 bytes, Free: No Block 0x3ffaf40c data, size: 344 bytes, Free: No Block 0x3ffaf568 data, size: 1028 bytes, Free: No Block 0x3ffaf970 data, size: 344 bytes, Free: No Block 0x3ffafacc data, size: 344 bytes, Free: No Block 0x3ffafc28 data, size: 12 bytes, Free: No Block 0x3ffafc38 data, size: 16 bytes, Free: No Block 0x3ffafc4c data, size: 12 bytes, Free: No Block 0x3ffafc5c data, size: 16 bytes, Free: No Block 0x3ffafc70 data, size: 344 bytes, Free: No Block 0x3ffafdcc data, size: 344 bytes, Free: No Block 0x3ffaff28 data, size: 88 bytes, Free: No Block 0x3ffaff84 data, size: 16 bytes, Free: No Block 0x3ffaff98 data, size: 100 bytes, Free: No Showing data for heap: 0x3ffb55e0 Block 0x3ffb5990 data, size: 4100 bytes, Free: No Block 0x3ffb6998 data, size: 4100 bytes, Free: No Block 0x3ffb79a0 data, size: 1540 bytes, Free: No Block 0x3ffb7fa8 data, size: 1540 bytes, Free: No Block 0x3ffb85b0 data, size: 344 bytes, Free: No Block 0x3ffb870c data, size: 20 bytes, Free: No Block 0x3ffb8724 data, size: 16 bytes, Free: No Block 0x3ffb8738 data, size: 20 bytes, Free: No Block 0x3ffb8750 data, size: 20 bytes, Free: No Block 0x3ffb8768 data, size: 28 bytes, Free: No Block 0x3ffb8788 data, size: 600 bytes, Free: No Block 0x3ffb89e4 data, size: 88 bytes, Free: No Block 0x3ffb8a40 data, size: 2820 bytes, Free: No Block 0x3ffb9548 data, size: 344 bytes, Free: No Block 0x3ffb96a4 data, size: 88 bytes, Free: No Block 0x3ffb9700 data, size: 44 bytes, Free: No Block 0x3ffb9730 data, size: 88 bytes, Free: No Block 0x3ffb978c data, size: 48 bytes, Free: No Block 0x3ffb97c0 data, size: 48 bytes, Free: No Block 0x3ffb97f4 data, size: 20 bytes, Free: No Block 0x3ffb980c data, size: 24 bytes, Free: No Block 0x3ffb9828 data, size: 56 bytes, Free: No Block 0x3ffb9864 data, size: 56 bytes, Free: No Block 0x3ffb98a0 data, size: 56 bytes, Free: No Block 0x3ffb98dc data, size: 56 bytes, Free: No Block 0x3ffb9918 data, size: 56 bytes, Free: No Block 0x3ffb9954 data, size: 56 bytes, Free: No Block 0x3ffb9990 data, size: 108 bytes, Free: No Block 0x3ffb9a00 data, size: 344 bytes, Free: No Block 0x3ffb9b5c data, size: 132 bytes, Free: No Block 0x3ffb9be4 data, size: 132 bytes, Free: No Block 0x3ffb9c6c data, size: 132 bytes, Free: No Block 0x3ffb9cf4 data, size: 32 bytes, Free: No Block 0x3ffb9d18 data, size: 32 bytes, Free: No Block 0x3ffb9d3c data, size: 88 bytes, Free: No Block 0x3ffb9d98 data, size: 40 bytes, Free: No Block 0x3ffb9dc4 data, size: 20 bytes, Free: No Block 0x3ffb9ddc data, size: 12 bytes, Free: No Block 0x3ffb9dec data, size: 12 bytes, Free: No Block 0x3ffb9dfc data, size: 20 bytes, Free: No Block 0x3ffb9e14 data, size: 12 bytes, Free: No Block 0x3ffb9e24 data, size: 12 bytes, Free: No Block 0x3ffb9e34 data, size: 88 bytes, Free: No Block 0x3ffb9e90 data, size: 88 bytes, Free: No Block 0x3ffb9eec data, size: 16 bytes, Free: No Block 0x3ffb9f00 data, size: 196 bytes, Free: No Block 0x3ffb9fc8 data, size: 88 bytes, Free: No Block 0x3ffba024 data, size: 20 bytes, Free: No Block 0x3ffba03c data, size: 20 bytes, Free: No Block 0x3ffba054 data, size: 20 bytes, Free: No Block 0x3ffba06c data, size: 12 bytes, Free: No Block 0x3ffba07c data, size: 216 bytes, Free: No Block 0x3ffba158 data, size: 3588 bytes, Free: No Block 0x3ffbaf60 data, size: 344 bytes, Free: No Block 0x3ffbb0bc data, size: 88 bytes, Free: No Block 0x3ffbb118 data, size: 104 bytes, Free: No Block 0x3ffbb184 data, size: 16 bytes, Free: No Block 0x3ffbb198 data, size: 16 bytes, Free: No Block 0x3ffbb1ac data, size: 260 bytes, Free: No Block 0x3ffbb2b4 data, size: 12 bytes, Free: No Block 0x3ffbb2c4 data, size: 12 bytes, Free: No Block 0x3ffbb2d4 data, size: 12 bytes, Free: No Block 0x3ffbb2e4 data, size: 20 bytes, Free: No Block 0x3ffbb2fc data, size: 100 bytes, Free: No Block 0x3ffbb364 data, size: 468 bytes, Free: No Block 0x3ffbb53c data, size: 24 bytes, Free: No Block 0x3ffbb558 data, size: 56 bytes, Free: No Block 0x3ffbb594 data, size: 128 bytes,Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).

This is the error message from the part where the error occurred.

`Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).

Core 0 register dump: PC : 0x4000921f PS : 0x00060a34 A0 : 0x80007d16 A1 : 0x3ffc4210
0x4000921f: uart_tx_one_char in ROM

A2 : 0xeb80c00f A3 : 0x10000000 A4 : 0x00000000 A5 : 0x00060a23
A6 : 0x00000001 A7 : 0x3ffb9e94 A8 : 0x3ff40000 A9 : 0x00000020
A10 : 0x00800000 A11 : 0x3ff4001c A12 : 0x00000000 A13 : 0x00060a23
A14 : 0x80000000 A15 : 0x3ffae888 SAR : 0x00000004 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x4000c2e0 LEND : 0x4000c2f6 LCOUNT : 0x00000000
0x4000c2e0: memcpy in ROM 0x4000c2f6: memcpy in ROM

Backtrace: 0x4000921c:0x3ffc4210 0x40007d13:0x3ffc4230 0x40007c69:0x3ffc4250 0x40008148:0x3ffc4270 0x40148c49:0x3ffc4300 0x40148f47:0x3ffc4320 0x40148cd9:0x3ffc4340 0x400d49ab:0x3ffc4360 0x400d49be:0x3ffc4380 0x400dc541:0x3ffc43a0 0x400dcd85:0x3ffc47d0 0x400dd0ed:0x3ffc47f0 0x40147fb9:0x3ffc4890 0x40147a4e:0x3ffc48c0 0x400e53ba:0x3ffc4900 0x400e5e1c:0x3ffc4920 0x400e5f83:0x3ffc4950 0x400e6480:0x3ffc4980 0x400883f9:0x3ffc49b0 0x4000921c: uart_tx_one_char in ROM 0x40007d13: ets_write_char_uart in ROM 0x40007c69: ets_write_char in ROM 0x40008148: ets_printf in ROM 0x40148c49: multi_heap_dump_tlsf at /Users/odn/esp/esp-idf/components/heap/multi_heap.c:361 (discriminator 4) 0x40148f47: tlsf_walk_pool at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:771 0x40148cd9: multi_heap_dump at /Users/odn/esp/esp-idf/components/heap/multi_heap.c:371 (discriminator 1) 0x400d49ab: heap_caps_dump at /Users/odn/esp/esp-idf/components/heap/heap_caps.c:450 0x400d49be: heap_caps_dump_all at /Users/odn/esp/esp-idf/components/heap/heap_caps.c:457 0x400dc541: mqtt_recvtopic_jobs at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:532 0x400dcd85: mqtt_recvprocessing at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:359 (discriminator 1) 0x400dd0ed: mqtt_eventhandler at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/Mqtt_handler.c:293 0x40147fb9: handler_execute at /Users/odn/esp/esp-idf/components/esp_event/esp_event.c:135 0x40147a4e: esp_event_loop_run at /Users/odn/esp/esp-idf/components/esp_event/esp_event.c:584 0x400e53ba: esp_mqtt_dispatch_event at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1032 0x400e5e1c: deliver_publish at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1099 0x400e5f83: mqtt_process_receive at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1377 0x400e6480: esp_mqtt_task at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1657 0x400883f9: vPortTaskWrapper at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

Core 1 register dump: PC : 0x400858ee PS : 0x00060f34 A0 : 0x800d4496 A1 : 0x3ffb84c0
0x400858ee: esp_cpu_wait_for_intr at /Users/odn/esp/esp-idf/components/esp_hw_support/cpu.c:64

A2 : 0x00000000 A3 : 0x00000000 A4 : 0x3ffaf900 A5 : 0x3ffaf8e0
A6 : 0x40081ed0 A7 : 0x00000001 A8 : 0x800eb55a A9 : 0x3ffb8480
0x40081ed0: ipc_task at /Users/odn/esp/esp-idf/components/esp_system/esp_ipc.c:53

A10 : 0x00000000 A11 : 0x00000000 A12 : 0x3ffaf8e0 A13 : 0x3ffaf8c0
A14 : 0x00000001 A15 : 0x3ffb85b4 SAR : 0x00000000 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000

Backtrace: 0x400858eb:0x3ffb84c0 0x400d4493:0x3ffb84e0 0x400894bd:0x3ffb8500 0x400883f9:0x3ffb8520 0x400858eb: xt_utils_wait_for_intr at /Users/odn/esp/esp-idf/components/xtensa/include/xt_utils.h:82 (inlined by) esp_cpu_wait_for_intr at /Users/odn/esp/esp-idf/components/esp_hw_support/cpu.c:55 0x400d4493: esp_vApplicationIdleHook at /Users/odn/esp/esp-idf/components/esp_system/freertos_hooks.c:58 0x400894bd: prvIdleTask at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:4344 (discriminator 1) 0x400883f9: vPortTaskWrapper at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134`

This is the log just before heap output raw log. ` I (607459) [mqtt]: Event dispatched from event loop event_id=6 I (607459) [mqtt]: MQTT_EVENT_DATA, msg_id=0, topic[38]=$aws/things/testdev21/jobs/notify-next{"timestamp":1728619018,"execution":{"jobId":"AFR_OTA-testdev21_update42","status":"QUEUED","queuedAt":1728619016,"lastUpdatedAt":1728619016,"versionNumber":1,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-ef4fecab-c7d0-4812-a5fd-fb86a739abdf","files":[{"filepath":"/","filesize":730304,"fileid":0,"certfile":"/main/certs/aws_codesign.crt","sig-sha256-ecdsa":"MEUCIEzzqvV1b6ra97DIV+AJzx2uVUtWiG4xtuPDi6dV5nYOAiEAmV7pw9Y3cPwb8l5SEHLPzqx7920FiQ9awPHd4dbS+/4="}]}}}} '�$�8��c��@&� �d��{I� I (610509) [mqtt]: sent publish successful, msg_id=0, topic = $aws/things/testdev21/jobs/start-next I (610509) [mqtt]: Event dispatched from event loop event_id=6 I (610509) [mqtt]: MQTT_EVENT_DATA, msg_id=0, topic[37]=$aws/things/testdev21/jobs/start-next{"clientToken": "device-dev"}ecution":{"jobId":"AFR_OTA-testdev21_update42","status":"QUEUED","queuedAt":1728619016,"lastUpdatedAt":1728619016,"versionNumber":1,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-ef4fecab-c7d0-4812-a5fd-fb86a739abdf","files":[{"filepath":"/","filesize":730304,"fileid":0,"certfile":"/main/certs/aws_codesign.crt","sig-sha256-ecdsa":"MEUCIEzzqvV1b6ra97DIV+AJzx2uVUtWiG4xtuPDi6dV5nYOAiEAmV7pw9Y3cPwb8l5SEHLPzqx7920FiQ9awPHd4dbS+/4="}]}}}} '�$�8��c��@&� �d��{I� I (610569) [mqtt]: Event dispatched from event loop event_id=6 I (610569) [mqtt]: MQTT_EVENT_DATA, msg_id=0, topic[46]=$aws/things/testdev21/jobs/start-next/accepted{"clientToken":"device-dev","timestamp":1728619019,"execution":{"jobId":"AFR_OTA-testdev21_update42","status":"IN_PROGRESS","queuedAt":1728619016,"startedAt":1728619019,"lastUpdatedAt":1728619019,"versionNumber":2,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-ef4fecab-c7d0-4812-a5fd-fb86a739abdf","files":[{"filepath":"/","filesize":730304,"fileid":0,"certfile":"/main/certs/aws_codesign.crt","sig-sha256-ecdsa":"MEUCIEzzqvV1b6ra97DIV+AJzx2uVUtWiG4xtuPDi6dV5nYOAiEAmV7pw9Y3cPwb8l5SEHLPzqx7920FiQ9awPHd4dbS+/4="}]}}}}�砂���Bp����5��e$�'�tޠ��g�r�� x��������$� A�9W��8�1�� r��FXy<yQ��Y�d�< I (610639) [mqtt]: jobs event_data = {"clientToken":"device-dev","timestamp":1728619019,"execution":{"jobId":"AFR_OTA-testdev21_update42","status":"IN_PROGRESS","queuedAt":1728619016,"startedAt":1728619019,"lastUpdatedAt":1728619019,"versionNumber":2,"executionNumber":1,"jobDocument":{"afr_ota":{"protocols":["MQTT"],"streamname":"AFR_OTA-ef4fecab-c7d0-4812-a5fd-fb86a739abdf","files":[{"filepath":"/","filesize":730304,"fileid":0,"certfile":"/main/certs/aws_codesign.crt","sig-sha256-ecdsa":"MEUCIEzzqvV1b6ra97DIV+AJzx2uVUtWiG4xtuPDi6dV5nYOAiEAmV7pw9Y3cPwb8l5SEHLPzqx7920FiQ9awPHd4dbS+/4="}]}}}}, I (610689) [mqtt]: job status : "IN_PROGRESS" I (610699) [mqtt]: Execute new OTA update task. (from factory partition) Showing data for heap: 0x3ffae6e0

To explain the log, " Event dispatched from event loop event_id=6 " => 데이터 이벤트 발생 ` 01-02 : aws -> device : $aws/things/(id)/jobs/notify-next 수신 ( status QUEUED ) 03 : device -> aws : $aws/things/(id)/jobs/start-next 로 응답 ( status QUEUED ) 04-05 : aws -> device : $aws/things/(id)/jobs/start-next 수신 ( status QUEUED ) 06-08 : aws -> device : $aws/things/(id)/jobs/start-next/accepted 수신 ( status IN_PROGRESS ) 09 after : device -> aws : $aws/things/(id)/streams/(stream-id)/get/json (Cannot proceed due to error)

`

The data that the device needs to pass to AWS is as follows. char msg_mqtt_start[1000] ; sprintf(msg_mqtt_start, "{\"c\":%s, \"s\":%d, \"f\":%d,\"l\":%d,\"o\":%d,\"n\":%d}", CONFIG_OTA_CLIENT_TOKEN, CONFIG_OTA_STREAM_VERSION, (int)ota_stream.total.file_size, CONFIG_OTA_STREAM_BLOCK_SIZE, ota_stream.block_offset, ota_stream.request_block_num); "{ "c": "device-dev", "s": 1, "f": 730304, "l": 4096, "o": 0, "n": 0 }"

It looks like the data from ota should be sent to the stream immediately after this, but an error occurs before that happens.

SoucheSouche commented 1 month ago

@Gyunamchoi, it seems that the heap memory got corrupted along the line. you have to find out where exactly. You can use heap_caps_check_integrity_all(true); (esp_heap_caps.h) at different places in the code to find which part of it induced the corruption. Make sure that the comprehensive poisoning is enabled in the menuconfig ( Component configHeap memory debuggingheap corruption detection).

Gyunamchoi commented 1 month ago

hello @SoucheSouche.

As you advised, I enabled comprehensive poisoning, added heap_caps_check_integrity_all(true); in several places, and checked for memory corruption. This is the code of the corrupt heap right before the problem occurred. Please check.

`I (50769) [mqtt]: Execute new OTA update task. (from factory partition) CORRUPT HEAP: Bad tail at 0x3ffcf320. Expected 0xbaad5678 got 0xffffffff CORRUPT HEAP: Invalid data at 0x3ffcf334. Expected 0xfefefefe got 0x3ffc52b4 CORRUPT HEAP: Invalid data at 0x3ffcf338. Expected 0xfefefefe got 0x3767029e CORRUPT HEAP: Invalid data at 0x3ffcf33c. Expected 0xfefefefe got 0x78141a98 CORRUPT HEAP: Invalid data at 0x3ffcf340. Expected 0xfefefefe got 0x1d7fce4e CORRUPT HEAP: Invalid data at 0x3ffcf344. Expected 0xfefefefe got 0xadf51b98 CORRUPT HEAP: Invalid data at 0x3ffcf348. Expected 0xfefefefe got 0x8d482231 CORRUPT HEAP: Invalid data at 0x3ffcf34c. Expected 0xfefefefe got 0xd2d47c66 CORRUPT HEAP: Invalid data at 0x3ffcf350. Expected 0xfefefefe got 0x48d1ad6b CORRUPT HEAP: Invalid data at 0x3ffcf354. Expected 0xfefefefe got 0xf45bfd9c CORRUPT HEAP: Invalid data at 0x3ffcf358. Expected 0xfefefefe got 0xdbcbf209 CORRUPT HEAP: Invalid data at 0x3ffcf35c. Expected 0xfefefefe got 0xf4ed7d9c CORRUPT HEAP: Invalid data at 0x3ffcf360. Expected 0xfefefefe got 0xe3302886 CORRUPT HEAP: Invalid data at 0x3ffcf364. Expected 0xfefefefe got 0x2c285da1 CORRUPT HEAP: Invalid data at 0x3ffcf368. Expected 0xfefefefe got 0x457f49b1 CORRUPT HEAP: Invalid data at 0x3ffcf36c. Expected 0xfefefefe got 0xd925344d CORRUPT HEAP: Invalid data at 0x3ffcf370. Expected 0xfefefefe got 0x8e55f80a CORRUPT HEAP: Invalid data at 0x3ffcf374. Expected 0xfefefefe got 0x5f5837f3 CORRUPT HEAP: Invalid data at 0x3ffcf378. Expected 0xfefefefe got 0xb8ff7f04 CORRUPT HEAP: Invalid data at 0x3ffcf37c. Expected 0xfefefefe got 0x3702ca47 CORRUPT HEAP: Invalid data at 0x3ffcf380. Expected 0xfefefefe got 0xb8c93118 CORRUPT HEAP: Invalid data at 0x3ffcf384. Expected 0xfefefefe got 0x0e9f0a15 CORRUPT HEAP: Invalid data at 0x3ffcf388. Expected 0xfefefefe got 0xba692d56 CORRUPT HEAP: Invalid data at 0x3ffcf38c. Expected 0xfefefefe got 0xed367bec CORRUPT HEAP: Invalid data at 0x3ffcf390. Expected 0xfefefefe got 0xe090a69a CORRUPT HEAP: Invalid data at 0x3ffcf394. Expected 0xfefefefe got 0xd6097dbf CORRUPT HEAP: Invalid data at 0x3ffcf398. Expected 0xfefefefe got 0x2ba89c60 CORRUPT HEAP: Invalid data at 0x3ffcf39c. Expected 0xfefefefe got 0xe9e7d1c2 CORRUPT HEAP: Invalid data at 0x3ffcf3a0. Expected 0xfefefefe got 0xaa32d2de CORRUPT HEAP: Invalid data at 0x3ffcf3a4. Expected 0xfefefefe got 0x68a640d7 CORRUPT HEAP: Invalid data at 0x3ffcf3a8. Expected 0xfefefefe got 0x62264983 CORRUPT HEAP: Invalid data at 0x3ffcf3ac. Expected 0xfefefefe got 0x4355f993 CORRUPT HEAP: Invalid data at 0x3ffcf3b0. Expected 0xfefefefe got 0x56fbca0d CORRUPT HEAP: Invalid data at 0x3ffcf3b4. Expected 0xfefefefe got 0x5baf764c CORRUPT HEAP: Invalid data at 0x3ffcf3b8. Expected 0xfefefefe got 0xbaad5678 CORRUPT HEAP: Invalid data at 0x3ffcf3bc. Expected 0xfefefefe got 0x00000649 CORRUPT HEAP: Invalid data at 0x3ffcf3c0. Expected 0xfefefefe got 0x3ffb5664 CORRUPT HEAP: Invalid data at 0x3ffcf3c4. Expected 0xfefefefe got 0x3ffb5664 CORRUPT HEAP: Invalid data at 0x3ffcfa04. Expected 0xfefefefe got 0x3ffcf3b8 CORRUPT HEAP: Invalid data at 0x3ffcfa08. Expected 0xfefefefe got 0x00000042 CORRUPT HEAP: Invalid data at 0x3ffcfa0c. Expected 0xfefefefe got 0xabba1234 CORRUPT HEAP: Invalid data at 0x3ffcfa10. Expected 0xfefefefe got 0x0000002c CORRUPT HEAP: Invalid data at 0x3ffcfa14. Expected 0xfefefefe got 0x3ffc52b4 CORRUPT HEAP: Invalid data at 0x3ffcfa18. Expected 0xfefefefe got 0x3ffcebc8 CORRUPT HEAP: Invalid data at 0x3ffcfa1c. Expected 0xfefefefe got 0x00000046 CORRUPT HEAP: Invalid data at 0x3ffcfa20. Expected 0xfefefefe got 0x0000Guru Meditation Error: Core 0 panic'ed (Interrupt wdt timeout on CPU0).

Core 0 register dump: PC : 0x40009222 PS : 0x00060734 A0 : 0x80007d16 A1 : 0x3ffc4da0
0x40009222: uart_tx_one_char in ROM

A2 : 0x00800000 A3 : 0x38000000 A4 : 0x00000000 A5 : 0x0000000d
A6 : 0x3ffcf32c A7 : 0x3ffcf324 A8 : 0x3ff40000 A9 : 0x00000037
A10 : 0x00800000 A11 : 0x3ff4001c A12 : 0x00000000 A13 : 0x3ffc4f30
A14 : 0xffffffff A15 : 0x00000001 SAR : 0x00000005 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x4000c349 LEND : 0x4000c36b LCOUNT : 0xffffffff
0x4000c349: memcpy in ROM 0x4000c36b: memcpy in ROM

Backtrace: 0x4000921f:0x3ffc4da0 0x40007d13:0x3ffc4dc0 0x40007c69:0x3ffc4de0 0x40008106:0x3ffc4e00 0x4008cea1:0x3ffc4e90 0x4008ccb1:0x3ffc4eb0 0x40148e9b:0x3ffc4ed0 0x40149201:0x3ffc4ef0 0x40149343:0x3ffc4f10 0x4014938f:0x3ffc4f30 0x40148ee7:0x3ffc4f60 0x400d42f5:0x3ffc4f80 0x400d4319:0x3ffc4fa0 0x400dce47:0x3ffc4fc0 0x400dbd82:0x3ffc4fe0 0x400dc509:0x3ffc5030 0x400dc882:0x3ffc5050 0x401481bd:0x3ffc50f0 0x40147c52:0x3ffc5120 0x400e4dfe:0x3ffc5160 0x400e5860:0x3ffc5180 0x400e59c7:0x3ffc51b0 0x400e5ec4:0x3ffc51e0 0x40088425:0x3ffc5210 0x4000921f: uart_tx_one_char in ROM 0x40007d13: ets_write_char_uart in ROM 0x40007c69: ets_write_char in ROM 0x40008106: ets_printf in ROM 0x4008cea1: verify_fill_pattern at /Users/odn/esp/esp-idf/components/heap/multi_heap_poisoning.c:153 0x4008ccb1: multi_heap_internal_check_block_poisoning at /Users/odn/esp/esp-idf/components/heap/multi_heap_poisoning.c:421 0x40148e9b: tlsf_check_hook at /Users/odn/esp/esp-idf/components/heap/multi_heap.c:329 0x40149201: integrity_walker at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:689 0x40149343: tlsf_walk_pool at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:771 0x4014938f: tlsf_check_pool at /Users/odn/esp/esp-idf/components/heap/tlsf/tlsf.c:798 0x40148ee7: multi_heap_check at /Users/odn/esp/esp-idf/components/heap/multi_heap.c:350 (discriminator 1) 0x400d42f5: heap_caps_check_integrity at /Users/odn/esp/esp-idf/components/heap/heap_caps.c:422 0x400d4319: heap_caps_check_integrity_all at /Users/odn/esp/esp-idf/components/heap/heap_caps.c:431 0x400dce47: ota_request_start_stream at /Users/odn/workspace/odn_device_rev2/components/ota_update/ota_updater.c:84 0x400dbd82: mqtt_recvtopic_jobs at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/mqtt_handler.c:517 0x400dc509: mqtt_recvprocessing at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/mqtt_handler.c:362 (discriminator 1) 0x400dc882: mqtt_eventhandler at /Users/odn/workspace/odn_device_rev2/components/mqtt_handle/mqtt_handler.c:296 0x401481bd: handler_execute at /Users/odn/esp/esp-idf/components/esp_event/esp_event.c:135 0x40147c52: esp_event_loop_run at /Users/odn/esp/esp-idf/components/esp_event/esp_event.c:584 0x400e4dfe: esp_mqtt_dispatch_event at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1032 0x400e5860: deliver_publish at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1099 0x400e59c7: mqtt_process_receive at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1377 0x400e5ec4: esp_mqtt_task at /Users/odn/esp/esp-idf/components/mqtt/esp-mqtt/mqtt_client.c:1657 0x40088425: vPortTaskWrapper at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134

Core 1 register dump: PC : 0x4008591a PS : 0x00060a34 A0 : 0x800d3dd6 A1 : 0x3ffb86c0
0x4008591a: esp_cpu_wait_for_intr at /Users/odn/esp/esp-idf/components/esp_hw_support/cpu.c:64

A2 : 0x00000000 A3 : 0x00000000 A4 : 0x3ffaf9f0 A5 : 0x3ffaf9d0
A6 : 0x40081efc A7 : 0x00000001 A8 : 0x800eaf9e A9 : 0x3ffb8680
0x40081efc: ipc_task at /Users/odn/esp/esp-idf/components/esp_system/esp_ipc.c:53

A10 : 0x00000000 A11 : 0x00000001 A12 : 0x800887ca A13 : 0x3ffc4b60
A14 : 0x00000003 A15 : 0x3ffb87c4 SAR : 0x00000000 EXCCAUSE: 0x00000005
EXCVADDR: 0x00000000 LBEG : 0x00000000 LEND : 0x00000000 LCOUNT : 0x00000000

Backtrace: 0x40085917:0x3ffb86c0 0x400d3dd3:0x3ffb86e0 0x400894e9:0x3ffb8700 0x40088425:0x3ffb8720 0x40085917: xt_utils_wait_for_intr at /Users/odn/esp/esp-idf/components/xtensa/include/xt_utils.h:82 (inlined by) esp_cpu_wait_for_intr at /Users/odn/esp/esp-idf/components/esp_hw_support/cpu.c:55 0x400d3dd3: esp_vApplicationIdleHook at /Users/odn/esp/esp-idf/components/esp_system/freertos_hooks.c:58 0x400894e9: prvIdleTask at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/tasks.c:4344 (discriminator 1) 0x40088425: vPortTaskWrapper at /Users/odn/esp/esp-idf/components/freertos/FreeRTOS-Kernel/portable/xtensa/port.c:134`

kvp1703 commented 2 weeks ago

Hi @Gyunamchoi, Can you help share the esp-aws-iot branch you're on? Also, can you try with standard OTA example ? We looked at your code, could you verify whether recv_topic and recv_data are being properly freed, including edge cases? If not, please provide the relevant code snippet that handles the freeing or utilization of these resources.

Gyunamchoi commented 2 weeks ago

@kvp1703 Oh I'm Sorry. It's been resolved, but I didn't close it. There was a subtle mismatch in the size and timing when allocating, using, and freeing memory. It's resolved now and it works fine. Thank you for your interest.