LIFsCode / ELOC-3.0

Firmware for ELOC 3.0 Hardware
MIT License

BT(?) causes random Guru Meditation Error #66

Closed EDsteve closed 4 months ago

EDsteve commented 5 months ago

@LIFsCode @OOHehir Not sure if these reboots are connected, but I think it is because of BT turning OFF or ON.

Situation 1

After a few hours of recordon_detecton I triggered the knock sensor and then it rebooted with a Guru Meditation Error. This has happened only once so far. Log files (using commit cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0): Reboot while recordon_detecton.txt Bootlog after crash.txt

Situation 2

Start the ELOC and just wait. When BT turns off due to the BT timeout, it restarts. This doesn't happen every time, but it has occurred a few times over the past few weeks.

Using commit: 29325f68d7c831e7b754bf81b368c9930524c5eb Reset when BT turns off.txt

Using commit: 8cc4b9949c1cb9bd53aed99f9485e725fab7dcb0 Reset when BT turns off 2.txt

Using commit: cf51c0378c7afd5d93d96bc6fc62fe502a3b9fc0 Reset when BT turns off 3.txt

OOHehir commented 4 months ago

@LIFsCode @EDsteve After further investigation this crash is now persisting on my setup. I've tried to narrow down the scope, but the issue remains; I'll continue investigating tomorrow.

I haven't tried this yet, but this may be a useful debugging technique. JTAG access is not available as the required pins are already in use.

LIFsCode commented 4 months ago

@OOHehir @EDsteve I think I found the source of the crash. I'll just wrap it up in a commit and add the log here for explanation.

First of all @OOHehir, don't worry, things like that happen :) I'm to blame for the crash due to the LIS3DH interrupts ;-) The system is getting rather complex with the different tasks running in parallel, and such things are hard to find. I guess we will have more of this kind of stuff in the future. Anyway, I'm really happy to have you on the project. I wouldn't have been able to do this without you :)

Back to the topic: to understand this debugging log, commit ec61336 is needed.

This log file explains a bit: HeapTrace_EI_task.txt

Crash relevant components

The actual crash is caused by 3 tasks running in parallel:

  1. the main task setting edgeImpulse.set_status --> not_running
            if (ai_run_enable == false && edgeImpulse.get_status() == EdgeImpulse::Status::running) {
                edgeImpulse.set_status(EdgeImpulse::Status::not_running);
            }
  2. the EI task exiting once it is no longer in the running state. In fact this adds some additional delay, since the loop condition is only re-checked after xTaskNotifyWait() returns:

    while (status == Status::running) {
        if (xTaskNotifyWait(
                0 /* ulBitsToClearOnEntry */,
                0 /* ulBitsToClearOnExit */,
                NULL /* pulNotificationValue */,
                portMAX_DELAY /* xTicksToWait */) == pdTRUE) {
            if (inference.buf_ready == 1) {
                // Run classifier from main.cpp
                callback();

                // Update times
                auto change_SystemTimeSecs = time_utils::getSystemTimeSecs() - last_SystemTimeSecs;
                detectingTime_secs += change_SystemTimeSecs;
                totalDetectingTime_secs += change_SystemTimeSecs;
                last_SystemTimeSecs = time_utils::getSystemTimeSecs();
                ESP_LOGV(TAG, "detectingTime_secs: %d", detectingTime_secs);
                ESP_LOGV(TAG, "totalDetectingTime_secs: %d", totalDetectingTime_secs);
            }
        }  // if (xTaskNotifyWait())
    }
    ESP_LOGI(TAG, "deleting task");

    vTaskDelete(NULL);
  3. the I2SSampler, which notifies the EI task:

    if (skip_current >= ei_skip_rate) {
        ESP_LOGV(TAG, "Saving sample, skip_current = %d", skip_current);

        inference->buffers[inference->buf_select][inference->buf_count++] = processed_sample_16bit;
        skip_current = 1;

        if (inference->buf_count >= inference->n_samples) {
            inference->buf_select ^= 1;
            inference->buf_count = 0;

            // If running inference & buffer overrun => flag
            if (ai_run_enable && inference->buf_ready == 1) {
                inference_buffer_overrun = true;
            }

            inference->buf_ready = 1;
            if (ei_TaskHandler != NULL)
                xTaskNotify(ei_TaskHandler, (0), eNoAction);
        }

    } else {
        ESP_LOGV(TAG, "Not saving sample, skip_current = %d", skip_current);
        skip_current++;
    }

Crash explanation

  1. As identified by @OOHehir, this line directly affects the occurrence of the crash and can be seen as the trigger.
  2. I added an additional log message to catch the task deletion --> see line 62 in the log file --> so this task obviously gets deleted.
  3. The I2S sampler references the ei_TaskHandler handle. I guess this is some kind of race condition: the I2S sampler checks the pointer, the EI task gets deleted, and then the I2S task sends the notify. Maybe it is even a compiler problem, since ei_TaskHandler is not volatile and the compiler may optimize accesses to it. But more likely it is a synchronization problem, especially since the two tasks run on different cores.

final thoughts

  1. I have not seen the crash with b37b962. Maybe that's because of the double check, which helps synchronize the two tasks.
  2. A slightly uneasy feeling remains for me, since the behavior was a bit too reproducible for such a race condition. But this may be my paranoia ;-)
  3. At least this would perfectly explain why it is always only 1 byte that gets corrupted on the heap, since it is only the notify message which ends up somewhere.

A possible solution could be (in decreasing order of favorability), but maybe @OOHehir you have another solution in mind:

OOHehir commented 4 months ago

@LIFsCode Many thanks for all the investigation. It's strange that I wasn't able to replicate the crash until I tested the new branch you created. I'm sure you would have managed without me, but thanks for the kind words!

@EDsteve @LIFsCode I should have some time later to work on a fix & let you know when I have something.

OOHehir commented 4 months ago

@LIFsCode @EDsteve I can't be certain (yet), but perhaps this bug has another layer & @LIFsCode your paranoia is correct :)

It centers on the setting of ai_run_enable, which is a global bool, defined in main.cpp.

I think the heap issue may occur as follows: Bluetooth (BT) receives the message "recordOn_DetectOFF" & executes

if (ai_mode_change) {
    xQueueSend(rec_ai_evt_queue, &new_ai_mode, (TickType_t)0);
}

new_ai_mode is a bool local to ElocCommands.cpp, so it doesn't appear to cause any issue. In the meantime I2SMEMSSampler fills a buffer, checks ai_run_enable & finds it true (snippet below), so it calls xTaskNotify to run the inference

    if (ai_run_enable && ei_TaskHandler != NULL)
        xTaskNotify(ei_TaskHandler, (0), eNoAction);

At some point (before, after, during?) in main.cpp the following line is evaluated:

if (xQueueReceive(rec_ai_evt_queue, &ai_run_enable, pdMS_TO_TICKS(500))) {
    ESP_LOGI(TAG, "Received AI run enable = %d", ai_run_enable);

This maps the change in mode (from BT) directly into (the global) ai_run_enable, whilst that global is simultaneously being evaluated elsewhere. I don't know the internals of xQueueReceive; there could be atomic operations occurring internally, but when it's tested initially it discovers a change & begins (internally) to act on this. I imagine (guess) that this causes some sort of weird error, where FreeRTOS, after evaluating the above snippet, clears its queue, but the write target happens to be a global variable & messes up the heap.

To solve this particular issue I've introduced a new local bool for the receive queue as follows:

if (xQueueReceive(rec_ai_evt_queue, &change_ai_run_enable, pdMS_TO_TICKS(500))) {
    ESP_LOGI(TAG, "Received AI run enable = %d", change_ai_run_enable);

Which is subsequently used to set ai_run_enable.

Also, I've used a two-step process to change the AI state:

  1. ai_run_enable -> used to set REQUESTED change of state
  2. inference.status_running -> used to indicate the ACTUAL state

This is similar to the WAVWriter setup, which seems to be successful. I've done a few quick tests with the commit I've just pushed & so far so good. I'll do further tests over the weekend...

@EDsteve Sorry this is taking so long!

LIFsCode commented 4 months ago

@OOHehir Could it maybe be a problem with the auto type? In the end the queue functions read via a pointer and a size based on how the queue was set up.

I'm not too familiar with the C++ standard on the auto type. bool at least has an implementation-specific size, and if the compiler ends up with a different size for the auto variable, it could result in corruption when the auto size is smaller than the bool size. Not sure if this makes sense.

OOHehir commented 4 months ago

@LIFsCode When I hover over the variable in my IDE (VS Code) I can see that the compiler has deduced it as type bool, so I don't think that's the issue. The auto keyword in C used to do some strange things, but from what I've read, in C++ since C++11 it's compile-time deduction of the type.

(Screenshot from 2024-02-18: the IDE shows the deduced type as bool)

@EDsteve I ran some more tests (heapDebugging) & so far no crashes. There are some memory exhaustion issues, probably largely due to the heap debugging, so I'll start to update master with the relevant changes. The branches have diverged a bit, but I'll try to just pick out the changes to get a stable build.

@LIFsCode Do you want to include the SD card speed tests in master? I assume you don't want the heap debugging changes brought across?

EDsteve commented 4 months ago

@OOHehir @LIFsCode Haven't had the time for detailed tests lately. Sorry about that. I am free again in two days and can then start doing more tests with more devices. Let me know what to test: heap debugging, master or perf debugging? I can test all if it helps :)

LIFsCode commented 4 months ago

@OOHehir The SD card speed test can be included in the master branch; it should not affect runtime behavior, as it's only an isolated command, and it may be helpful for checking SD cards.

Thanks a lot for bringing the bugfixes back to master. All heap debugging stuff can be removed, as it all has a negative impact on runtime performance. I tried to keep those in isolated commits, so I guess you can just cherry-pick the bugfix commits.

OOHehir commented 4 months ago

> @OOHehir @LIFsCode Haven't had the time for detailed tests lately. Sorry about that. I am free again in two days and can then start doing more tests with more devices. Let me know what to test: heap debugging, master or perf debugging? I can test all if it helps :)

@EDsteve I'll try to have the master branch ready to test, to cut down on the testing required. Hopefully we'll be able to remove any redundant branches to keep things organised.

OOHehir commented 4 months ago

> @OOHehir The SD card speed test can be included in the master branch; it should not affect runtime behavior, as it's only an isolated command, and it may be helpful for checking SD cards.
>
> Thanks a lot for bringing the bugfixes back to master. All heap debugging stuff can be removed, as it all has a negative impact on runtime performance. I tried to keep those in isolated commits, so I guess you can just cherry-pick the bugfix commits.

@LIFsCode No problem. I'll pull the fixes from heapDebugging & then come back & see if there are fixes from PerfDebugging.

OOHehir commented 4 months ago

@EDsteve @LIFsCode I've pulled across what I can see as the relevant changes from heapDebugging into master. I've run some quick tests on master & so far so good.

@EDsteve @LIFsCode When you get a chance could you please:

  1. Delete sdkconfig.dev-windows (will be regenerated from sdkconfig.defaults)
  2. Delete sdkconfig.dev-ei-windows (will be regenerated from sdkconfig.defaults)
  3. Under 'Project Tasks > esp32dev-ei-windows > General' select 'Full Clean'
  4. Build & test esp32dev-ei-windows

OOHehir commented 4 months ago

I think it's safe to close this?