ayushsharma82 / WebSerial

A remote terminal library for wireless microcontrollers to log, monitor or debug your firmware/product
https://webserial.pro
GNU Affero General Public License v3.0
459 stars 107 forks source link

Task watchdog triggered because of the busy wait loop #86

Open mathieucarbou opened 1 week ago

mathieucarbou commented 1 week ago

I've been mistakenly running with the non high-perf version of webserial and found an issue in the busy loop which is triggerring the watchdog if activated:

void WebSerialProClass::_wait_for_print_mutex() {
  // Wait for mutex to be released
  if (_print_buffer_mutex) {
    while (_print_buffer_mutex) {
      delayMicroseconds(1);
    }
  }
}

Besides blocking, the while loop is not allowing the ESP to execute another task because delayMicroseconds(1); is too small.

delay or vTaskDelay could be used instead with a value no less than portTICK_PERIOD_MS. The default tick rate is 100Hz, which means portTICK_PERIOD_MS is 10 (10ms per tick). Therefore a delay lower than 10 ms causes no delay.

delayMicroseconds is instead a busy wait as it is implemented so it prevents any task scheduling to happen.

E (236597) task_wdt: Task watchdog got triggered. The following tasks/users did not reset the watchdog in time:
E (236597) task_wdt:  - Core (CPU 1)
E (236597) task_wdt:  - IDLE1 (CPU 1)
E (236597) task_wdt: Tasks currently running:
E (236597) task_wdt: CPU 0: IDLE0
E (236597) task_wdt: CPU 1: Router
E (236597) task_wdt: Aborting.
E (236597) task_wdt: Print CPU 1 backtrace

Backtrace: 0x40085009:0x3ffd4ed0 0x40085047:0x3ffd4ef0 0x4010fb38:0x3ffd4f10 0x4010c9fe:0x3ffd4f30 0x4010cb1e:0x3ffd4f50 0x4010cbb6:0x3ffd4f70 0x4010cbed:0x3ffd4f90 0x401dab7e:0x3ffd4fb0 0x4010b9c6:0x3ffd4fd0 0x4010bd27:0x3ffd5030 0x400efc7c:0x3ffd50b0 0x400efc91:0x3ffd50d0 0x40101f3b:0x3ffd50f0 0x40101f76:0x3ffd5120 0x4010201e:0x3ffd5140 0x40102069:0x3ffd5170 0x40102085:0x3ffd5190
  #0  0x40085009 in esp_timer_impl_get_counter_reg at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_timer/src/esp_timer_impl_lac.c:110
  #1  0x40085047 in esp_timer_impl_get_time at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_timer/src/esp_timer_impl_lac.c:140
  #2  0x4010fb38 in delayMicroseconds at /Users/mat/.platformio/packages/framework-arduinoespressif32@src-c4332d380c103569bb66c530e15d287f/cores/esp32/esp32-hal-misc.c:215
  #3  0x4010c9fe in WebSerialProClass::_wait_for_print_mutex() at lib/WebSerialPro/src/WebSerialPro.cpp:302
  #4  0x4010cb1e in WebSerialProClass::_flush_print_buffer() at lib/WebSerialPro/src/WebSerialPro.cpp:346
  #5  0x4010cbb6 in WebSerialProClass::loop() at lib/WebSerialPro/src/WebSerialPro.cpp:389
  #6  0x4010cbed in WebSerialProClass::write(unsigned char const*, unsigned int) at lib/WebSerialPro/src/WebSerialPro.cpp:247
  #7  0x401dab7e in Print::write(char const*, unsigned int) at /Users/mat/.platformio/packages/framework-arduinoespressif32@src-c4332d380c103569bb66c530e15d287f/cores/esp32/Print.h:66
      (inlined by) Print::print(String const&) at /Users/mat/.platformio/packages/framework-arduinoespressif32@src-c4332d380c103569bb66c530e15d287f/cores/esp32/Print.cpp:91
  #8  0x4010b9c6 in void Mycila::Logger::log<char const*, char const*, char const*, char const*, float>(unsigned char, char const*, char const*, char const*, char const*, char const*, char const*, float) at .pio/libdeps/trial-esp32-debug/MycilaLogger/src/MycilaLogger.h:95
  #9  0x4010bd27 in void Mycila::Logger::info<char const*, char const*, char const*, char const*, float>(char const*, char const*, char const*, char const*, char const*, char const*, float) at .pio/libdeps/trial-esp32-debug/MycilaLogger/src/MycilaLogger.h:56
      (inlined by) Mycila::RouterOutput::applyAutoBypass() at lib/MycilaRouter/MycilaRouterOutput.cpp:184
  #10 0x400efc7c in routerTask::{lambda(void*)#1}::operator()(routerTask) const at src/tasks/Router.cpp:13
  #11 0x400efc91 in void std::__invoke_impl<void, routerTask::{lambda(void*)#1}&, routerTask>(std::__invoke_other, routerTask::{lambda(void*)#1}&, routerTask&&) at /Users/mat/.platformio/packages/toolchain-xtensa-esp32/xtensa-esp32-elf/include/c++/12.2.0/bits/invoke.h:61
      (inlined by) std::enable_if<is_invocable_r_v<void, routerTask::{lambda(void*)#1}&, routerTask>, routerTask::{lambda(void*)#1}&>::type std::__invoke_r<void, routerTask::{lambda(void*)#1}&, routerTask>(std::enable_if&&, (void&&)...) at /Users/mat/.platformio/packages/toolchain-xtensa-esp32/xtensa-esp32-elf/include/c++/12.2.0/bits/invoke.h:111
      (inlined by) std::_Function_handler<void (void*), routerTask::{lambda(void*)#1}>::_M_invoke(std::_Any_data const&, void*&&) at /Users/mat/.platformio/packages/toolchain-xtensa-esp32/xtensa-esp32-elf/include/c++/12.2.0/bits/std_function.h:290
  #12 0x40101f3b in std::function<void (void*)>::operator()(void*) const at /Users/mat/.platformio/packages/toolchain-xtensa-esp32/xtensa-esp32-elf/include/c++/12.2.0/bits/std_function.h:591
  #13 0x40101f76 in Mycila::Task::_run(long long const&) at .pio/libdeps/trial-esp32-debug/MycilaTaskManager/src/MycilaTaskManager.cpp:417
  #14 0x4010201e in Mycila::Task::tryRun() at .pio/libdeps/trial-esp32-debug/MycilaTaskManager/src/MycilaTaskManager.cpp:334
  #15 0x40102069 in Mycila::TaskManager::loop() at .pio/libdeps/trial-esp32-debug/MycilaTaskManager/src/MycilaTaskManager.cpp:106
  #16 0x40102085 in Mycila::TaskManager::_asyncTaskManager(void*) at .pio/libdeps/trial-esp32-debug/MycilaTaskManager/src/MycilaTaskManager.cpp:196
ayushsharma82 commented 1 week ago

I was actually wondering about it... because if you remember Markus, he is also experiencing same issue. Similar crash log, crashed at loop waiting for print buffer mutex to be released.

ayushsharma82 commented 1 week ago

I guess switching to delay(1); will solve it? Let me know if you can test.

mathieucarbou commented 1 week ago

I guess switching to delay(1); will solve it? Let me know if you can test.

it wont on all ESP.

delay is implemented as follow:

void delay(uint32_t ms) {
  vTaskDelay(ms / portTICK_PERIOD_MS);
}

On my ESP32, portTICK_PERIOD_MS is 1 because configTICK_RATE_HZ is 1000. But this depends on boards.

To make sure you trigger a pause which is not 0, you must use something like:

delay(portTICK_PERIOD_MS); delay(N * portTICK_PERIOD_MS); ...

mathieucarbou commented 1 week ago

But the issue is not only here: triggering a short pause is OK, but doing it in a loop is not OK: for example, on a a single core, this small delay might not be enough to let the ESP give some time for bluetooth, wifi, core tasks, etc.

There is no simple solution here, and I think the root of the issue is the mutex implementation on a dual-core system which is not OK.

1) shared resources across core should be volatile for concurrent read and writes only from 1 core (_buffer_mutex is not)

2) only 1 writer core you ever mutate a volatile variable to avoid any unexpected state (but several cores can read) - which is not possible in your impl since both core 1 and core 0 can mutate the flag

3) Otherwise use proper mutex c++ classes

Here, _buffer_mutex is not volatile so a mutation from 1 core will not always be visible to the other core (memory is not always shared across core)

I don't know if Arduino internals are made in a way to simplify that, but these notions apply to the Java language and C++ at least.

https://www.arduino.cc/reference/en/language/variables/variable-scope-qualifiers/volatile/

ayushsharma82 commented 1 week ago

Right... I see into removing these mutex locks completely.

ayushsharma82 commented 3 days ago

@mathieucarbou , What about yield() ?

mathieucarbou commented 3 days ago

yield

yield() is called by delay() and only is relevant for multi-core since it allows another core to run, but yielding in a loop might not solve the problem.

To really solve the issue, the while loop should be removed. It's usually not a good practice to have a busy loop.

A real mutex could be used instead (std:mutex) to protect the shared resource. And depending on the access types (more read than writes, or the opposite), a std::shared_mutex can be used instead to optimise and still allow concurrency access but only lock when needed.

That's one of the reason I don't like the buffered approach within the library, which, IMO, has nothing to do here because it adds complexity and does not really solve the buffering issue which is better solved on the caller side. The websocket layer already has a buffer of messages to dequeue and send, so it can be leveraged instead of having a common shared resource that requires access synchronisation.