Closed: EDsteve closed this issue 4 months ago.
@LIFsCode @EDsteve After further investigation this crash is now persisting on my setup. I've tried to narrow down the scope but the issue remains, I'll continue more investigation tomorrow.
I haven't tried this yet, but this may be a useful debugging technique. JTAG access is not available as the required pins are already in use.
@OOHehir @EDsteve I think I found the source of the crash. I'll just wrap it up in a commit and add the log here for explanation.
First of all @OOHehir don't worry, things like that happen :) I'm to blame for the crash due to the LIS3DH interrupts ;-) The system is getting rather complex with the different tasks running in parallel, and such things are hard to find. I guess we will have more of this kind of stuff in the future. Anyway, I'm really happy to have you on the project. I wouldn't have been able to do this without you :)
Back to topic: to understand this debugging log, commit ec61336 is needed.
This log file will explain a bit: HeapTrace_EI_task.txt
The actual crash is caused by 3 tasks running in parallel:

1. The code that takes the EdgeImpulse status out of `running`:

```cpp
if (ai_run_enable == false && edgeImpulse.get_status() == EdgeImpulse::Status::running) {
    edgeImpulse.set_status(EdgeImpulse::Status::not_running);
}
```

2. The EI task, which exits once it is no longer in the `running` state. In fact this adds some additional delay, because the `xTaskNotifyWait()` has to run into the timeout first:
```cpp
while (status == Status::running) {
    if (xTaskNotifyWait(
            0 /* ulBitsToClearOnEntry */,
            0 /* ulBitsToClearOnExit */,
            NULL /* pulNotificationValue */,
            portMAX_DELAY /* xTicksToWait */) == pdTRUE) {
        if (inference.buf_ready == 1) {
            // Run classifier from main.cpp
            callback();
            // Update times
            auto change_SystemTimeSecs = time_utils::getSystemTimeSecs() - last_SystemTimeSecs;
            detectingTime_secs += change_SystemTimeSecs;
            totalDetectingTime_secs += change_SystemTimeSecs;
            last_SystemTimeSecs = time_utils::getSystemTimeSecs();
            ESP_LOGV(TAG, "detectingTime_secs: %d", detectingTime_secs);
            ESP_LOGV(TAG, "totalDetectingTime_secs: %d", totalDetectingTime_secs);
        }
    } // if (xTaskNotifyWait())
}
ESP_LOGI(TAG, "deleting task");
vTaskDelete(NULL);
```
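The wait/exit structure above can be mimicked in portable C++ (a rough sketch with `std::condition_variable` standing in for `xTaskNotifyWait()`; the class and member names are mine, not the project's):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Rough portable analogue of the EI task loop: the worker waits for a
// "sample ready" notification and exits once `running_` is cleared.
// Names are illustrative, not taken from the project.
class Worker {
public:
    void notify() {
        { std::lock_guard<std::mutex> lk(m_); ready_ = true; }
        cv_.notify_one();
    }
    void stop() {
        { std::lock_guard<std::mutex> lk(m_); running_ = false; }
        cv_.notify_one();
    }
    // Runs until stop(); returns how many notifications were processed.
    int run() {
        int processed = 0;
        std::unique_lock<std::mutex> lk(m_);
        while (running_ || ready_) {
            if (ready_) {
                ready_ = false;
                ++processed;  // stands in for callback() / the classifier run
            } else {
                // Like xTaskNotifyWait() with a finite timeout instead of
                // portMAX_DELAY, so a status change is noticed even without
                // a final notification.
                cv_.wait_for(lk, std::chrono::milliseconds(100));
            }
        }
        return processed;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    bool running_ = true;
    bool ready_ = false;
};
```

Note the `while (running_ || ready_)` condition: unlike the original loop, a pending notification is still processed even if the status is cleared first, which also removes the extra timeout delay on exit.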
3. The I2SSampler, which notifies the EI task:
```cpp
if (skip_current >= ei_skip_rate) {
    ESP_LOGV(TAG, "Saving sample, skip_current = %d", skip_current);
    inference->buffers[inference->buf_select][inference->buf_count++] = processed_sample_16bit;
    skip_current = 1;
    if (inference->buf_count >= inference->n_samples) {
        inference->buf_select ^= 1;
        inference->buf_count = 0;
        // If running inference & buffer overrun => flag
        if (ai_run_enable && inference->buf_ready == 1) {
            inference_buffer_overrun = true;
        }
        inference->buf_ready = 1;
        if (ei_TaskHandler != NULL)
            xTaskNotify(ei_TaskHandler, (0), eNoAction);
    }
} else {
    ESP_LOGV(TAG, "Not saving sample, skip_current = %d", skip_current);
    skip_current++;
}
```
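For reasoning about when `buf_ready` and the overrun flag get raised, the bookkeeping above can be isolated into a small portable model (simplified, with illustrative names; not the project code):

```cpp
#include <array>
#include <cstdint>
#include <vector>

// Portable model of the I2SSampler bookkeeping: every ei_skip_rate-th sample
// is stored; when a buffer fills, the ping-pong buffers swap and buf_ready is
// raised. If the previous buffer was never consumed, that's an overrun.
// Names and sizes are illustrative, not taken verbatim from the project.
struct InferenceModel {
    static constexpr std::size_t n_samples = 4;
    std::array<std::vector<int16_t>, 2> buffers{
        std::vector<int16_t>(n_samples), std::vector<int16_t>(n_samples)};
    std::size_t buf_select = 0;
    std::size_t buf_count = 0;
    bool buf_ready = false;
    bool overrun = false;
    int ei_skip_rate = 2;
    int skip_current = 1;

    // Feed one processed sample; returns true when a notification would be sent.
    bool feed(int16_t sample) {
        if (skip_current >= ei_skip_rate) {
            buffers[buf_select][buf_count++] = sample;
            skip_current = 1;
            if (buf_count >= n_samples) {
                buf_select ^= 1;
                buf_count = 0;
                if (buf_ready) overrun = true;  // previous buffer not consumed
                buf_ready = true;
                return true;  // stands in for xTaskNotify(ei_TaskHandler, ...)
            }
        } else {
            skip_current++;
        }
        return false;
    }
};
```

With a skip rate of 2 and 4-sample buffers, 16 fed samples fill two buffers; if nobody clears `buf_ready` in between, the second fill trips the overrun flag.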
This is, I guess, some kind of race condition around the `ei_TaskHandler` handle: the I2S sampler checks the pointer, the EI task gets deleted, and then the I2S task sends the notify.
Maybe it is even a compiler problem: since `ei_TaskHandler` is not volatile, the compiler may optimize accesses to it. But more likely it is a synchronization problem, especially since both tasks run on different cores. A possible solution could be (in decreasing order of favorability), but maybe @OOHehir you have another solution in mind:
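One shape such a fix could take, sketched portably with a `std::mutex` standing in for a FreeRTOS critical section (on the ESP32 this would be `taskENTER_CRITICAL()` on a `portMUX_TYPE`, since the tasks run on different cores); all names here are illustrative, not the project code:

```cpp
#include <mutex>

// Sketch of the critical-section fix: both the notify path and the exit path
// take the same lock, so the sampler can never observe a handle whose task
// is halfway through deleting itself.
struct FakeTask { bool alive = true; };  // stand-in for a FreeRTOS task

std::mutex g_handle_mutex;
FakeTask* g_handle = nullptr;  // stand-in for ei_TaskHandler

// I2S sampler side: check the handle and "notify" under the lock.
bool try_notify() {
    std::lock_guard<std::mutex> lk(g_handle_mutex);
    if (g_handle != nullptr && g_handle->alive) {
        return true;  // stands in for xTaskNotify(g_handle, ...)
    }
    return false;
}

// EI task side: clear the handle under the same lock *before* deleting itself.
void task_exit(FakeTask& self) {
    {
        std::lock_guard<std::mutex> lk(g_handle_mutex);
        g_handle = nullptr;
    }
    self.alive = false;  // stands in for vTaskDelete(NULL)
}
```

The key point is ordering: the handle is nulled out under the lock before the task tears itself down, so the check-then-notify on the sampler side can no longer dereference a stale handle.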
@LIFsCode Many thanks for all the investigation. It's strange that I wasn't able to replicate the crash until I tested the new branch you created. I'm sure you would have managed without me, but thanks for the kind words!
@EDsteve @LIFsCode I should have some time later to work on a fix & let you know when I have something.
@LIFsCode @EDsteve I can't be certain (yet), but perhaps this bug has another layer, & @LIFsCode your paranoia is correct :)
It centers on the setting of `ai_run_enable`, which is a global bool defined in main.cpp.
I think the heap issue may occur as follows: Bluetooth (BT) receives a msg "recordOn_DetectOFF" & executes
```cpp
if (ai_mode_change) {
    xQueueSend(rec_ai_evt_queue, &new_ai_mode, (TickType_t)0);
}
```
`new_ai_mode` is a bool local to ElocCommands.cpp, so it doesn't appear to cause any issue. In the meantime I2SMEMSSampler fills a buffer, checks `ai_run_enable` & finds it true (snippet below), so it calls xTaskNotify to run the inference:
```cpp
if (ai_run_enable && ei_TaskHandler != NULL)
    xTaskNotify(ei_TaskHandler, (0), eNoAction);
```
At some point (before, after, during?) in main.cpp the following line is evaluated:
```cpp
if (xQueueReceive(rec_ai_evt_queue, &ai_run_enable, pdMS_TO_TICKS(500))) {
    ESP_LOGI(TAG, "Received AI run enable = %d", ai_run_enable);
```
This maps the change in mode (from BT) directly into (the global) `ai_run_enable`, whilst it is simultaneously being evaluated. I don't know the internals of xQueueReceive; there could be atomic operations occurring internally, but when it's tested initially it discovers a change & begins (internally) to act on this. I imagine (guess) that this causes some sort of weird error, where FreeRTOS, after evaluating the above snippet, clears its queue, but the target happens to be a global variable & that messes up the heap.
To solve this particular issue I've introduced a new local bool for the receive queue as follows:
```cpp
if (xQueueReceive(rec_ai_evt_queue, &change_ai_run_enable, pdMS_TO_TICKS(500))) {
    ESP_LOGI(TAG, "Received AI run enable = %d", change_ai_run_enable);
```
This local is subsequently used to set `ai_run_enable`.
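The same receive-into-a-local-then-publish idea, sketched portably with `std::queue` plus a mutex standing in for the FreeRTOS queue (all names are illustrative, not the project code):

```cpp
#include <mutex>
#include <queue>

// Sketch of the two-step pattern: the queue delivers into a *local* variable,
// and only afterwards is the shared/global flag updated in one place, instead
// of letting the queue write into the global directly.
bool ai_run_enable_model = true;  // stands in for the global ai_run_enable

std::mutex q_mutex;
std::queue<bool> rec_ai_evt_queue_model;  // stands in for rec_ai_evt_queue

void send_mode(bool new_mode) {
    std::lock_guard<std::mutex> lk(q_mutex);
    rec_ai_evt_queue_model.push(new_mode);  // like xQueueSend(..., &new_ai_mode, 0)
}

// Step 1: receive into a local; step 2: apply it to the global.
bool poll_mode_change() {
    bool change_ai_run_enable;  // local receive target, as in the fix
    {
        std::lock_guard<std::mutex> lk(q_mutex);
        if (rec_ai_evt_queue_model.empty()) return false;
        change_ai_run_enable = rec_ai_evt_queue_model.front();
        rec_ai_evt_queue_model.pop();
    }
    ai_run_enable_model = change_ai_run_enable;  // single, explicit publish
    return true;
}
```

The benefit is that the global is written at exactly one, well-understood point, rather than being a receive buffer that another context may be reading at the same moment.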
Also I've used a two step process to change the AI state:
This is similar to the WAVWriter setup which seems to be successful. I've done a few quick tests with the commit I've just pushed & so far so good, I'll do further tests over the weekend...
@EDsteve Sorry this is taking so long!
@OOHehir could it maybe be a problem with the `auto` type? In the end the queue stuff reads via a pointer and a size based on how the queue is set up.
I'm not too familiar with the C++ standard on the `auto` type. bool at least has an implementation-specific size, and maybe if the compiler ends up with a different size for the auto variable it could result in corruption when the auto size is smaller than the bool size. Not sure if this makes sense.
@LIFsCode When I hover over the variable in my IDE (vsc) I can see that the compiler has deduced it as type bool, so I don't think that's the issue. I think the auto keyword in C used to do some strange things, but from what I've read, in C++ since C++11 it's compile-time deduction of type.
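A quick compile-time check supports this: `auto` deduction from a bool initializer yields exactly `bool`, so there is no size mismatch that could corrupt memory (a minimal sketch, not the project code):

```cpp
#include <type_traits>

// auto in C++ (since C++11) is pure compile-time type deduction: initializing
// from a bool expression yields exactly bool, never a differently-sized type.
constexpr auto deduced = true;
static_assert(std::is_same<decltype(deduced), const bool>::value,
              "auto deduced exactly bool");
static_assert(sizeof(deduced) == sizeof(bool), "same size as bool");

// Deduction from an arithmetic difference (like the timing code above)
// likewise follows the initializer's type:
constexpr auto diff = 100u - 40u;  // unsigned int
static_assert(std::is_same<decltype(diff), const unsigned int>::value,
              "auto deduced unsigned int");
```

If these `static_assert`s compile, the deduced types match exactly, ruling out the size-mismatch theory for `auto` variables initialized from bools.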
@EDsteve I ran some more tests (heapDebugging) & so far no crashes. There are some memory exhaustion issues, probably a lot to do with the heap debugging, so I'll start to try & update master with the relevant changes. The branches have diverged a bit, but I'll try to just pick out the changes to get a stable build.
@LIFsCode Do you want to include the SD card speed tests in master? I assume you don't want the heap debugging changes brought across?
@OOHehir @LIFsCode Haven't had the time for detailed tests lately. Sorry about that. I am free again in two days. And can start to do more tests with more devices. Let me know what to test. Heap debugging, master or perf debugging? I can test all if it helps :)
@OOHehir the SD card speed test can be included in the master branch; it should not affect runtime behavior as it's only an isolated command, and it may be helpful for checking SD cards.
Thanks a lot for bringing the bugfixes back to master. All heap debugging stuff can be removed; it all has a negative runtime performance impact. I tried to keep those in isolated commits, so I guess you can just cherry-pick the bugfix commits.
@EDsteve I'll try to have the master branch ready to test, to try & cut down on the testing required. Hopefully we'll be able to remove any redundant branches to keep things organised.
@LIFsCode No problem. I'll pull the fixes from heapDebugging & then come back & see if there's fixes from PerfDebugging.
@EDsteve @LIFsCode I've pulled across what I can see are the relevant changes from heapDebugging into master. I've tried some quick tests of the master & so far so good.
@EDsteve @LIFsCode When you get a chance could you please:
Build & test esp32dev-ei-windows
I think it's safe to close this?
@LIFsCode @OOHehir Not sure if these reboots are connected, but I think it is because of BT turning OFF or ON.
Situation 1
After a few hours of recordon_detecton I triggered the knock sensor and then it rebooted with a Guru Meditation Error. This has happened only once for now. Log files (using commit cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0): Reboot while recordon_detecton.txt, Bootlog after crash.txt
Situation 2
Start the ELOC and just wait. When BT turns off due to the BT timeout, it restarts. This doesn't happen all the time, but it has happened a few times already over the last few weeks.
Using commit: 29325f68d7c831e7b754bf81b368c9930524c5eb Reset when BT turns off.txt
Using commit: 8cc4b9949c1cb9bd53aed99f9485e725fab7dcb0 Reset when BT turns off 2.txt
Using commit: cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0 Reset when BT turns off 3.txt