Closed PerMalmberg closed 6 years ago
Hi @PerMalmberg,
At this point, I'd prefer to discuss this in a single place rather than two places. I've replied to you on the forum for now: https://esp32.com/viewtopic.php?f=2&t=4583&p=19970#p19970
If we determine that this is a bug in the heap diagnostics, please reopen this issue and we'll continue from here. But I don't think this is established quite yet.
Angus
Hi,
This became rather lengthy, please bear with me.
Assuming Heap Debugging is set to "Comprehensive", when allocating a block of memory (multi_heap_malloc()), an additional poison_head_t and poison_tail_t are allocated. Then, it goes on to poison the allocated buffer in poison_allocated_region(), setting the head and tail to HEAD_CANARY_PATTERN (0xABBA1234) and TAIL_CANARY_PATTERN (0xBAAD5678), respectively. Back in multi_heap_malloc(), we call verify_fill_pattern() to verify that the buffer between head and tail contains the FREE_FILL_WORD (0xfefefefe), replacing it with MALLOC_FILL_WORD (0xcececece).
First question: What fills the memory with 0xfe in the first place?
Ok, so we now know that whenever we allocate a buffer, we can expect it to be filled with 0xce, correct?
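To make the layout described above concrete, here is a minimal host-side sketch of it. It is not the ESP-IDF implementation: the sim_* names and struct definitions are hypothetical stand-ins for poison_head_t and poison_tail_t, and plain malloc() replaces the real heap. Only the pattern values are taken from this post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Pattern values as quoted in this post. */
#define HEAD_CANARY_PATTERN 0xABBA1234u
#define TAIL_CANARY_PATTERN 0xBAAD5678u
#define MALLOC_FILL_BYTE    0xCE /* one byte of MALLOC_FILL_WORD */

/* Hypothetical stand-ins for poison_head_t / poison_tail_t. */
typedef struct {
    uint32_t head_canary;
    uint32_t alloc_size; /* user-requested size, used later to locate the tail */
} sim_head_t;

typedef struct {
    uint32_t tail_canary;
} sim_tail_t;

/* Allocate head + user bytes + tail, write both canaries and fill the
 * user area with 0xce. Returns a pointer to the user area, or NULL. */
void *sim_poisoned_malloc(uint32_t size)
{
    uint8_t *raw = malloc(sizeof(sim_head_t) + size + sizeof(sim_tail_t));
    if (raw == NULL) {
        return NULL;
    }

    sim_head_t head = { HEAD_CANARY_PATTERN, size };
    memcpy(raw, &head, sizeof(head));

    uint8_t *user = raw + sizeof(sim_head_t);
    memset(user, MALLOC_FILL_BYTE, size);

    /* The tail directly follows the user bytes, so for odd sizes it sits
     * at an unaligned address; memcpy avoids an unaligned store. */
    sim_tail_t tail = { TAIL_CANARY_PATTERN };
    memcpy(user + size, &tail, sizeof(tail));
    return user;
}

/* Read the tail canary that follows `size` user bytes. */
uint32_t sim_read_tail(const void *user, uint32_t size)
{
    assert(user != NULL);
    uint32_t value;
    memcpy(&value, (const uint8_t *)user + size, sizeof(value));
    return value;
}

/* Release a block obtained from sim_poisoned_malloc(). */
void sim_poisoned_free(void *user)
{
    if (user != NULL) {
        free((uint8_t *)user - sizeof(sim_head_t));
    }
}
```

In this sketch, a 5-byte request yields five 0xce bytes followed immediately by the tail canary, which is why a tail can live at an odd address.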
Moving on to freeing memory.
Again, assuming that Heap Debugging is set to "Comprehensive", when freeing a block of memory (multi_heap_free()), the region being freed is verified via verify_allocated_region() to check that the head and tail values are intact, i.e. that no buffer under/overruns have happened (at least none that wrote any data). Next, the entire buffer, including head and tail, is filled with 0xfe, after which the entire buffer is handed back to the heap.
All correct so far?
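The free-time check described above can be sketched the same way. Again, this is a host-side illustration under assumptions, not the ESP-IDF code: sim_head_t, sim_verify_region() and sim_verify_and_poison_free() are hypothetical names, and the message text simply mirrors the "CORRUPT HEAP" line quoted in this post.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Pattern values as quoted in this post. */
#define HEAD_CANARY_PATTERN 0xABBA1234u
#define TAIL_CANARY_PATTERN 0xBAAD5678u
#define FREE_FILL_BYTE      0xFE /* one byte of FREE_FILL_WORD */

/* Hypothetical stand-in for poison_head_t. */
typedef struct {
    uint32_t head_canary;
    uint32_t alloc_size;
} sim_head_t;

enum sim_check { SIM_OK = 0, SIM_BAD_HEAD = 1, SIM_BAD_TAIL = 2 };

/* Check the canaries of a raw block laid out as head | user bytes | tail. */
enum sim_check sim_verify_region(const uint8_t *raw)
{
    assert(raw != NULL);

    sim_head_t head;
    memcpy(&head, raw, sizeof(head));
    if (head.head_canary != HEAD_CANARY_PATTERN) {
        return SIM_BAD_HEAD;
    }

    /* The tail position is derived from the size stored in the head. */
    const uint8_t *tail_addr = raw + sizeof(head) + head.alloc_size;
    uint32_t tail;
    memcpy(&tail, tail_addr, sizeof(tail)); /* may be unaligned */
    if (tail != TAIL_CANARY_PATTERN) {
        printf("CORRUPT HEAP: Bad tail at %p. Expected 0x%08x got 0x%08x\n",
               (const void *)tail_addr,
               (unsigned)TAIL_CANARY_PATTERN, (unsigned)tail);
        return SIM_BAD_TAIL;
    }
    return SIM_OK;
}

/* Verify, then overwrite head, user bytes and tail with 0xfe. The caller
 * would hand the block back to the underlying heap afterwards. */
enum sim_check sim_verify_and_poison_free(uint8_t *raw)
{
    enum sim_check result = sim_verify_region(raw);
    if (result != SIM_OK) {
        return result;
    }
    sim_head_t head;
    memcpy(&head, raw, sizeof(head));
    memset(raw, FREE_FILL_BYTE,
           sizeof(head) + head.alloc_size + sizeof(uint32_t));
    return SIM_OK;
}
```

Note that in this sketch a single stray byte written just past the user area is enough to make the tail check fail.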
Now, to my actual problem. I am consistently (as in 100%) getting the stack trace at the bottom of this post very shortly after start up.
First off, I've replaced cJSON with version 1.7.1, as 1.6.0, which ships with ESP-IDF, has a memory bug that causes a too-small buffer to be allocated.
While calling cJSON_free in
smooth::core::json::Value::to_string[abi:cxx11]() in Value.cpp:285
(using the default mapping to free()), I expect that to eventually call multi_heap_free(), which it seems to do just before entering newlib code (the line numbers seem off once it enters this part of the code):
Deeper in the stack it also wants to allocate a block of memory for whatever reason, which eventually ends up in the corrupt heap message and a total stop.
A newly allocated block should have a tail value of 0xBAAD5678, which the message also states, but look closer at what it got:
CORRUPT HEAP: Bad tail at 0x3ffe04d1. Expected 0xbaad5678 got 0xcececece
That is the MALLOC_FILL_WORD, the value used to fill the buffer between head and tail in multi_heap_malloc()/verify_fill_pattern().
I'm not certain that I'm right, but I've literally spent days with cppcheck, valgrind and -fsanitize trying to come to the conclusion that the error lies in my code and not in the code for heap poisoning, but I always come back to the latter, over and over. The failing value read from the tail is no random number; it is always MALLOC_FILL_WORD, and there is no user code that ever writes that specific value.
I've also attempted to use a watchpoint to stop the application when it writes to 0x3ffe04d1:
esp_set_watchpoint(0, (void *)0x3ffe04d1, 4, ESP_WATCHPOINT_STORE);
If I set it at the very start of the program, it seems the memory is first accessed by the Wifi:
Setting the same watchpoint slightly prior to where cJSON_free is called (as in the problem description above), like so:
it seems that cJSON_Print() is the culprit:
And yes, if I don't call cJSON_Print(), I no longer have an issue. However, I can also just not start a task that literally only sleeps in the current use case, and the issue also "goes away". I'm not sure what this tells us.
If I'm looking at all this the wrong way, please tell me. No one will be happier than me if it is possible to limit the search for this issue to my own code.
Update 1: I'm currently running the exact same code with "Light impact" mode in which the part where it writes the *_FILL_WORD bytes to the buffer is disabled and I'm not getting any issues with destroyed tails.
Update 2: Nearly two days later it is still running. Also, cmrogan in this thread seems to be having the exact same issue.
This is the complete stacktrace from which the snippets above are taken