CORRUPT HEAP when freeing scratch buffers allocated for ESP32-S3 optimized conv kernels

AIWintermuteAI commented 10 months ago

Hi, @vikramdattu! Thank you for your amazing work. We integrated it into our Edge Impulse SDK, so more users can benefit from optimized NN inference with ESP-NN.

Currently we only support the ESP32 optimized kernels, and I'm working on adding support for the ESP32-S3 as well, since this chip is gaining popularity with users. When testing a compiled (interpreterless) object detection model, I'm running into a CORRUPT HEAP error when freeing scratch buffers. This makes me think there may be heap corruption caused by a write inside the optimized conv kernel, i.e. the kernel writing outside the allocated scratch buffer. I'm attaching the complete project here; this is the sample output I'm getting:

Edge Impulse standalone inferencing (Espressif ESP32)
Left 11712 bytes after tensor_boundary
CORRUPT HEAP: Bad head at 0x3fce0f34. Expected 0xabba1234 got 0x00000000

assert failed: multi_heap_free multi_heap_poisoning.c:253 (head != NULL)

Backtrace:0x403759fe:0x3fcf44c00x40379135:0x3fcf44e0 0x4037f2d1:0x3fcf4500 0x4037ef39:0x3fcf4620 0x40375bf2:0x3fcf4640 0x4037f2ed:0x3fcf4660 0x42006e75:0x3fcf4680 0x420077a6:0x3fcf46a0 0x420061ea:0x3fcf46c0 0x420068be:0x3fcf4700 0x420069b9:0x3fcf48c0 0x42006c9e:0x3fcf4960 0x42024ffb:0x3fcf4b40 0x4037bb59:0x3fcf4b60 
0x403759fe: panic_abort at /Users/dmitrymaslov/esp/esp-idf/components/esp_system/panic.c:402

0x40379135: esp_system_abort at /Users/dmitrymaslov/esp/esp-idf/components/esp_system/esp_system.c:121

0x4037f2d1: __assert_func at /Users/dmitrymaslov/esp/esp-idf/components/newlib/assert.c:85

0x4037ef39: multi_heap_free at /Users/dmitrymaslov/esp/esp-idf/components/heap/multi_heap_poisoning.c:253 (discriminator 1)

0x40375bf2: heap_caps_free at /Users/dmitrymaslov/esp/esp-idf/components/heap/heap_caps.c:305

0x4037f2ed: free at /Users/dmitrymaslov/esp/esp-idf/components/newlib/heap.c:39

0x42006e75: ei_free(void*) at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/edge-impulse-sdk/porting/espressif/ei_classifier_porting.cpp:85

0x420077a6: tflite_learn_33_reset(void (*)(void*)) at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/tflite-model/tflite_learn_33_compiled.cpp:1486 (discriminator 2)

0x420061ea: _ZL20inference_tflite_runPK10ei_impulseP28ei_config_tflite_eon_graph_tyP12TfLiteTensorS5_S5_PhP19ei_impulse_result_tb$constprop$117 at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/edge-impulse-sdk/classifier/inferencing_engines/tflite_eon.h:125

0x420068be: run_nn_inference_image_quantized(ei_impulse const*, ei::ei_signal_t*, ei_impulse_result_t*, void*, bool) at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/edge-impulse-sdk/classifier/inferencing_engines/tflite_eon.h:334

0x420069b9: process_impulse at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/edge-impulse-sdk/classifier/ei_run_classifier.h:499
 (inlined by) process_impulse at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/edge-impulse-sdk/classifier/ei_run_classifier.h:196
[example-standalone-inferencing-espressif-esp32.zip](https://github.com/espressif/esp-nn/files/12499254/example-standalone-inferencing-espressif-esp32.zip)

0x42006c9e: app_main at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/edge-impulse-sdk/classifier/ei_run_classifier.h:609
 (inlined by) app_main at /Users/dmitrymaslov/github/example-standalone-inferencing-espressif-esp32/main/main.cpp:116

0x42024ffb: main_task at /Users/dmitrymaslov/esp/esp-idf/components/freertos/port/port_common.c:129 (discriminator 2)

0x4037bb59: vPortTaskWrapper at /Users/dmitrymaslov/esp/esp-idf/components/freertos/port/xtensa/port.c:131

Let me know if you have time to try reproducing the error, or if you have any insights on how I can debug it myself.

vikramdattu commented 10 months ago

@AIWintermuteAI thanks for the ESP32-S3 port. Let me try the example and get back to you.

vikramdattu commented 10 months ago

@AIWintermuteAI I was able to reproduce the issue. Thankfully, I could figure out the reason for the corruption.

You will need to do aligned allocs to fix it.

Please apply the following change and let me know if this fixes it for you:

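__attribute__((weak)) void *ei_malloc(size_t size) {
    // return malloc(size);
    // 16-byte aligned allocation for the ESP32-S3 optimized kernels
    return aligned_alloc(16, size);
}

__attribute__((weak)) void *ei_calloc(size_t nitems, size_t size) {
    // return calloc(nitems, size);
    // aligned_alloc() does not zero memory, so clear it to keep calloc() semantics
    // (memset needs <string.h>)
    void *ptr = aligned_alloc(16, nitems * size);
    if (ptr != NULL) {
        memset(ptr, 0, nitems * size);
    }
    return ptr;
}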

result:

Left 11712 bytes after tensor_boundary
Timing: DSP 7 ms, inference 170 ms, anomaly 0 ms
Object detection bounding boxes:
  face (0.996094) [ x: 32, y: 56, width: 8, height: 16 ]

BTW, I have enabled external RAM (SPI-RAM) from menuconfig so as not to run short of memory, and set the Interrupt WDT timeout to 1000 ms (up from 300).
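A slightly more defensive variant of the allocator change above can be sketched as follows. This is only an illustration, not the SDK's code: the helper name `aligned_calloc16` and the `SCRATCH_ALIGN` macro are hypothetical, and the sketch assumes a C11 toolchain where `aligned_alloc()` is available. It rounds the request up to a multiple of the alignment (which C11 expects of `aligned_alloc()`), rejects multiplication overflow, and zeroes the memory the way `calloc()` would.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed alignment, taken from the 16-byte value used in the fix above. */
#define SCRATCH_ALIGN ((size_t)16)

/* Hypothetical helper: zeroed, 16-byte-aligned replacement for calloc(). */
static void *aligned_calloc16(size_t nitems, size_t size) {
    /* Reject nitems * size overflow, as calloc() would. */
    if (size != 0 && nitems > SIZE_MAX / size) {
        return NULL;
    }
    size_t total = nitems * size;

    /* Round up to a multiple of the alignment without overflowing. */
    if (total > SIZE_MAX - (SCRATCH_ALIGN - 1)) {
        return NULL;
    }
    size_t rounded = (total + SCRATCH_ALIGN - 1) & ~(SCRATCH_ALIGN - 1);
    if (rounded == 0) {
        rounded = SCRATCH_ALIGN; /* avoid a zero-size request */
    }

    void *ptr = aligned_alloc(SCRATCH_ALIGN, rounded);
    if (ptr != NULL) {
        memset(ptr, 0, rounded); /* aligned_alloc() does not clear memory */
    }
    return ptr;
}
```

The returned pointer can be released with plain `free()`, so the existing `ei_free` path would not need to change under this assumption.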

AIWintermuteAI commented 10 months ago

@vikramdattu thanks for the prompt reply! I tested it, and using aligned_alloc indeed works, but only if the scratch buffers are allocated outside of the arena. By default, scratch buffers are allocated at the head of the arena, and in that case I'm getting wrong inference results, which was my original problem.

To give you a bit more context:

I was investigating why the object detection model outputs no results for a particular sample. By comparing intermediate inputs and outputs, I found that the output of the convolutional layer differs from that of the optimized (non-assembly) ESP-NN.
Since this problem was NOT present with a) the optimized (non-assembly) ESP-NN or b) a regular (non-compiled) tflite model, I started looking into scratch buffers, since they are the single major difference.
I found that I can get correct results by placing all scratch buffers from kernels outside of the arena. But then I would get the corrupt heap error that I posted above.
Using aligned_alloc works if scratch buffers are placed outside of the arena. However, if they are placed inside the arena, the original issue is still there: the model outputs wrong detection results (no objects detected).

I'm attaching the project, where scratch buffers are allocated within arena, already with necessary changes made to porting layer (ei_calloc and ei_malloc). If you run it, you'd see that the output is

Edge Impulse standalone inferencing (Espressif ESP32)
Left 11712 bytes after tensor_boundary
Timing: DSP 5 ms, inference 151 ms, anomaly 0 ms

which is incorrect for the same sample. example-standalone-inferencing-espressif-esp32.zip

My suspicion is the same here: that the optimized assembly kernel is somehow writing outside the allocated scratch buffer. If the scratch buffer is located inside the arena, that would corrupt neighboring structures.
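One way to test a suspicion like this on a heap-allocated scratch buffer is to surround it with guard bytes and check them after the kernel runs. The sketch below is a generic debugging aid, not part of ESP-NN or the Edge Impulse SDK: the names `guarded_alloc`, `guards_intact`, `guarded_free`, and the guard size and pattern are all hypothetical. The guard size is a multiple of 16 so the payload keeps the 16-byte alignment discussed above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES   ((size_t)16)  /* hypothetical guard size, multiple of 16 */
#define GUARD_PATTERN 0xA5          /* hypothetical fill value for guard regions */

/* Allocate `size` payload bytes with a guard region before and after. */
static uint8_t *guarded_alloc(size_t size) {
    size_t total = size + 2 * GUARD_BYTES;
    /* aligned_alloc() expects a size that is a multiple of the alignment. */
    size_t rounded = (total + 15) & ~(size_t)15;
    uint8_t *base = aligned_alloc(16, rounded);
    if (base == NULL) {
        return NULL;
    }
    memset(base, GUARD_PATTERN, GUARD_BYTES);                      /* head guard */
    memset(base + GUARD_BYTES + size, GUARD_PATTERN, GUARD_BYTES); /* tail guard */
    return base + GUARD_BYTES; /* payload stays 16-byte aligned */
}

/* Return true if neither guard region has been overwritten. */
static bool guards_intact(const uint8_t *payload, size_t size) {
    const uint8_t *head = payload - GUARD_BYTES;
    const uint8_t *tail = payload + size;
    for (size_t i = 0; i < GUARD_BYTES; i++) {
        if (head[i] != GUARD_PATTERN || tail[i] != GUARD_PATTERN) {
            return false;
        }
    }
    return true;
}

/* Release a buffer obtained from guarded_alloc(). */
static void guarded_free(uint8_t *payload) {
    if (payload != NULL) {
        free(payload - GUARD_BYTES);
    }
}
```

Calling `guards_intact()` right after each kernel invocation would narrow down which layer, if any, writes past its scratch buffer. Note that this only detects overruns of up to `GUARD_BYTES` on either side.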

vikramdattu commented 10 months ago

Thanks for adding additional context to the issue. I shall reproduce the issue and fix it for good!

vikramdattu commented 10 months ago

@AIWintermuteAI this was a similar alignment issue. The scratch buffer allocation did not enforce an aligned boundary.

Please modify the AllocatePersistentBufferImpl function to account for the alignment as below, before returning the pointer:

  }

  current_location -= bytes;

  // align down to the nearest 16-byte boundary
  current_location -= ((uintptr_t) current_location & 15);

  ptr = current_location;
  memset(ptr, 0, bytes);
  ei_printf("ARENA tensor_boundary %p current_location %p bytes %d \n", tensor_boundary, current_location, (int)bytes);
  ei_printf("ARENA %p scratch_buf_size %d \n", ptr, (int)bytes);

  return ptr;
}

You will also need to account for the extra space this alignment can consume (up to 15 bytes per allocation), so the

if (current_location - bytes < tensor_boundary) {

condition should change accordingly.
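Putting the alignment and the adjusted boundary check together, a top-down arena allocation might look like the sketch below. It is a standalone illustration under stated assumptions, not the generated model code: the `scratch_arena_t` struct and `arena_alloc_aligned` function are hypothetical stand-ins for the `tensor_boundary` and `current_location` bookkeeping in AllocatePersistentBufferImpl, and `current_location` is assumed never to be below `tensor_boundary`.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ARENA_ALIGN ((uintptr_t)16)

/* Hypothetical stand-in for the generated model's arena bookkeeping. */
typedef struct {
    uint8_t *tensor_boundary;  /* end of the tensor area; scratch stays above it */
    uint8_t *current_location; /* lowest address handed out so far (grows down) */
} scratch_arena_t;

/* Carve `bytes` off the top of the arena, aligned down to 16 bytes.
 * Returns NULL if the request plus alignment padding does not fit. */
static void *arena_alloc_aligned(scratch_arena_t *arena, size_t bytes) {
    size_t available = (size_t)(arena->current_location - arena->tensor_boundary);
    if (bytes > available) {
        return NULL;
    }

    /* Padding needed to move the start down to a 16-byte boundary (0..15). */
    size_t pad = (size_t)((uintptr_t)(arena->current_location - bytes)
                          & (ARENA_ALIGN - 1));
    if (pad > available - bytes) {
        return NULL; /* the aligned block would cross tensor_boundary */
    }

    uint8_t *ptr = arena->current_location - bytes - pad;
    memset(ptr, 0, bytes);
    arena->current_location = ptr;
    return ptr;
}
```

Computing the padding before moving the pointer means the boundary check covers both the requested bytes and the alignment slack in one place, and the pointer never steps below the start of the arena.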

AIWintermuteAI commented 10 months ago

Hello! I tested the solution and it does work, thank you very much for the help.

I will need to think about how exactly I'm going to incorporate it into our SDK, since AllocatePersistentBufferImpl is shared between all platforms and so far only the ESP32-S3 has required this specific alignment. Perhaps I'll make it a weak function and include the modified function in our porting layer. I'll mark the issue as solved.

espressif / esp-nn

CORRUPT HEAP when freeing scratch buffers allocated for ESP32-S3 optimized conv kernels #7