cesanta / mongoose-os

Mongoose OS - an IoT Firmware Development Framework. Supported microcontrollers: ESP32, ESP8266, CC3220, CC3200, STM32F4, STM32L4, STM32F7. Amazon AWS IoT, Microsoft Azure, Google IoT Core integrated. Code in C or JavaScript.
https://mongoose-os.com
Other
2.49k stars 430 forks source link

Update to be thread safe and add multi-core support (ie ESP32) #480

Open tripflex opened 5 years ago

tripflex commented 5 years ago

Mongoose OS only currently supports running ESP32 on single core, basically leaving the second core completely unused.

Yes i've seen the thread at: https://forum.mongoose-os.com/index.php?p=/discussion/comment/2356/#Comment_2356

And ticket #253

I opened this issue as technically this is an issue that only single core devices are supported, and multi-core devices that are supported ... only a single core is supported on them.

By closing ticket #253 does that mean that Cesanta has no plans for ever supporting multi-core devices?

While I understand that currently MOS is not thread safe, eventually over time one would assume that new devices will become available, many of them will have multi-core ... and even if it's not an ESP32, i'm sure you guys will want to add new devices ... but does that mean that even though support may be added, that they will be crippled by only supporting a single core?

rtrue983 commented 5 years ago

I haven't read this fully but it seems like the esp-idf supports the two threads... they could at least get two cores running and pin all the mongoose functions to one core. This could possible let a user add additional independent task to the 2nd core to start with?

https://docs.espressif.com/projects/esp-idf/en/latest/api-guides/freertos-smp.html

rtrue983 commented 5 years ago

Then again, this may go back to making mongoose thread safe as the first step and let users create their own task with less worry about threading and interacting with mongoose in those threads

markterrill commented 5 years ago

I'm also very interested in this.

Harvie commented 3 years ago

At very least it would be very helpfull to have thread-safe LOG() macro

rojer commented 3 years ago

it is, actually. logging is thread-safe. the only problem is sometimes the task it's logging from doesn't have enough stack... *printf calls are quite expensive in terms of stack.

Harvie commented 3 years ago

@rojer Do you think the stack running out is the case in the following backtrace? I can't really tell.

Reading symbols from /fw.elf...done.
Remote debugging using 127.0.0.1:1234
0x400880df in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
156         *((int *) 0) = 0;
#0  0x400880df in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
#1  0x40088264 in abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:171
#2  0x400828ee in lock_acquire_generic (lock=0x3ffae9d8, delay=4294967295, 
    mutex_type=4 '\004')
    at /opt/Espressif/esp-idf/components/newlib/locks.c:143
#3  0x40082a0c in _lock_acquire_recursive (lock=0x3ffae9d8)
    at /opt/Espressif/esp-idf/components/newlib/locks.c:171
#4  0x4014e389 in _fwrite_r (ptr=0x3ffc8d58, buf=<optimized out>, size=1, 
    count=24, fp=0x3ffae980) at ../../../.././newlib/libc/stdio/fwrite.c:168
#5  0x4014e404 in fwrite (buf=0x3ffc8b40, size=1, count=24, fp=0x3ffae980)
    at ../../../.././newlib/libc/stdio/fwrite.c:211
#6  0x4017b2a0 in cs_log_print_prefix (level=LL_WARN, file=<optimized out>, 
    ln=<optimized out>) at src/common/cs_dbg.c:86
#7  0x400850b0 in siterRF_RX_cb (pqh=<optimized out>, cb_arg=<optimized out>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/siterRF/src/siterRF.cpp:210
#8  0x40084cec in pq_task (arg=0x3ffb634c <siter_pqh>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/pq/src/pq.c:33
Detaching from program: /fw.elf, Remote target

i am using stack size of (MGOS_TASK_STACK_SIZE_BYTES / STACK_SIZE_UNIT) i've even tried to increase it 10 times and it still crashes the same way

Harvie commented 3 years ago

Ok, sorry. The problem was in directly calling printf() rather than LOG(). Unfortunately i need printf() for debugging, since LOG is too slow for my purposes :-/ But i guess i will have to figure this out...

Harvie commented 3 years ago

No. I take that back. It actualy happens in " cs_log_print_prefix() " call...

rojer commented 3 years ago
    if (!xPortCanYield()) {
        /* In ISR Context */
        if (mutex_type == queueQUEUE_TYPE_RECURSIVE_MUTEX) {
            abort(); /* recursive mutexes make no sense in ISR context */
        }

code around line 143. looks like you're trying to log from an ISR handler? this is not going to work.

rojer commented 3 years ago

from ISR you can use mgos_cd_printf that will print to UART

Harvie commented 3 years ago

Just checked. I am not loging from ISR handler...

Harvie commented 3 years ago

But FreeRTOS scheduler does switch between tasks in ISR, so one task preempting others tasks is basicaly ISR for the other tasks.

Harvie commented 3 years ago

Now after some testing i am pretty sure that this problem happens when i call LOG(...) from freertos task other than main mongoose task. probability is increased when running on the same freertos priority as main mongoose task and 1000Hz tickrate, since context switches are more frequent that way due to time slicing. Perhaps then there is more chance that two tasks call LOG() at once.

rojer commented 3 years ago

yet, somehow xPortCanYield() ends up being false. is the trace always the same as above?

Harvie commented 3 years ago

Yes

Harvie commented 3 years ago

I have GPIO ISR handler, which sends job to queue which is later handled by separate task. So it might be possible that after the GPIO ISR freertos scheduler immediately jumps to different task which is receiving the queue, because it gets unblocked by that ISR sending to queue.

Harvie commented 3 years ago

To be clear, in the ISR i am using this macro after sending job to the queue if queue gives me *pxHigherPriorityTaskWoken == true.

#define YIELD_FROM_ISR(should_yield)        \
  {                                         \
    if (should_yield) portYIELD_FROM_ISR(); \
  }
Harvie commented 3 years ago

I think it would make sense for mongoose loging to use FreeRTOS stream buffers https://www.freertos.org/RTOS-stream-buffer-example.html

That way it would be possible to log to such buffer from anywhere in code. Multiple tasks and ISRs. Stream buffer would then synchronize everything to make it thread safe.

Harvie commented 3 years ago

BTW what about mgos_azure_send_d2c_msg() (and respective mqtt api)? Is that thread safe? I am afraid that it will have same issues as LOG()...

Harvie commented 3 years ago

Until it gets fixed, i came up with temporary workaround:

#include "FreeRTOS.h"
#include "task.h"
#include "mgos_core_dump.h" //mgos_cd_printf() prototype
#include "common/cs_dbg.h"

#ifndef __FILENAME__
/// We don't want to log full paths, so we only use the filename
#define __FILENAME__ (strrchr("/" __FILE__, '/') + 1)
#endif //__FILENAME__

/**
 * mgos macro `LOG()` is not thread safe, so we have this instead.
 * unfortunately it can currently log only to UART...
 */
#define PQ_LOG(l, x)                                  \
  do {                                                \
    if(cs_log_level >= l)  {                          \
      mgos_cd_printf("%.24s:%d:%.16s:\t\t", __FILENAME__, __LINE__, pcTaskGetTaskName(NULL)); \
      mgos_cd_printf x;                               \
      mgos_cd_printf("\n");                           \
    }                                                 \
  } while (0)

(unfortunately this does not play very well with power management enabled https://github.com/cesanta/mongoose-os/issues/556 )

Harvie commented 3 years ago

From what i read here https://github.com/espressif/esp-idf/issues/4838 it feels like the 115200 should work with REF_TICK.

So perhaps we should try it and always use REF_TICK unless user specifies higher baudrate?

Harvie commented 3 years ago

BTW what is expected behaviour of LOG() in ISR? It should probably look like this:

LOG mutex is unlocked, ISR can LOG() LOG mutex is locked, ISR cannot LOG() and therefore should ignore the LOG() call

Even better version:

There is dedicated buffer for ISRs to LOG into. It is dumped to LOG() as soon as it gets unlocked.

What currently happens is

No matter if LOG is locked or not, calling LOG() in ISR causes panic

rojer commented 3 years ago

LOG is not meant to be used in ISR.

On Tue 27 Oct 2020, 09:03 Tomas Mudrunka, notifications@github.com wrote:

BTW what is expected behaviour of LOG() in ISR? It should probably look like this:

LOG mutex is unlocked, ISR can LOG() LOG mutex is locked, ISR cannot LOG() and therefore should ignore the LOG() call

What currently happens is

No matter if LOG is locked or not, calling LOG() in ISR causes panic

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cesanta/mongoose-os/issues/480#issuecomment-717093511, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAEBW2YJSWHBVCMK4KJDIH3SM2EHNANCNFSM4GX2J5MA .

Harvie commented 3 years ago

LOG is not meant to be used in ISR.

Ok, that is not really important for me. Just saying it can be done if needed :-) I will try to write some minimal code needed to reproduce the bug i experience when calling LOG() from multiple tasks, so you can take a look at it.

Harvie commented 3 years ago

I am still strugling to isolate the bug. Might mgos_ints_disable() / mgos_ints_enable() have effect on triggering of this bug by any chance?

Harvie commented 3 years ago

O yeah! I got it. Happens when calling LOG() with ints disabled. At least in some cases...

Remote debugging using 127.0.0.1:1234
0x4008727b in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
156         *((int *) 0) = 0;
#0  0x4008727b in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
#1  0x40087400 in abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:171
#2  0x40082802 in lock_acquire_generic (lock=0x3ffae9d8, delay=4294967295, 
    mutex_type=4 '\004')
    at /opt/Espressif/esp-idf/components/newlib/locks.c:143
#3  0x40082920 in _lock_acquire_recursive (lock=0x3ffae9d8)
    at /opt/Espressif/esp-idf/components/newlib/locks.c:171
#4  0x4013f59d in _fwrite_r (ptr=0x3ffc8380, buf=<optimized out>, size=1, 
    count=24, fp=0x3ffae980) at ../../../.././newlib/libc/stdio/fwrite.c:168
#5  0x4013f618 in fwrite (buf=0x3ffc8220, size=1, count=24, fp=0x3ffae980)
    at ../../../.././newlib/libc/stdio/fwrite.c:211
#6  0x40169f00 in cs_log_print_prefix (level=LL_INFO, file=<optimized out>, 
    ln=<optimized out>) at src/common/cs_dbg.c:86
#7  0x400e6c06 in task_cb (pvParameters=<optimized out>)
    at /home/harvie/Temp/firmware-github/log-crash-test/src/main.c:21
Detaching from program: /fw.elf, Remote target
rojer commented 3 years ago

i bet it can. mgos_ints_disable() translates to portENTER_CRITICAL on freertos, you shouldn't be logging from inside a critical section.

Harvie commented 3 years ago

Can you please add some asserts checking if we are in critical section, so this becomes immediately obvious when somebody runs to this limitation again?

As well as puting big warning to LOG documentation https://mongoose-os.com/docs/mongoose-os/api/core/cs_dbg.h.md that it should not be called from ISR or with disabled interrupts.

//Isolated case of code causing the crash :-)
#include "mgos.h"

enum mgos_app_init_result mgos_app_init(void)
{
    mgos_ints_disable();
    LOG(LL_INFO,("Log from task %d", 456));
    mgos_ints_enable();

    return MGOS_APP_INIT_SUCCESS;
}
rojer commented 3 years ago

well, there are a lot of things you shouldn't be doing with ints disabled, logging is just one fo them - in fact, the set of thinjgs you can safely do with ints disabled is very limited. i guess it's worth documenting somewhere, though

Harvie commented 3 years ago

Anyway. I am satisfied with this conclusion, it is no problem to move LOG() out of the critical section. if i only knew week ago... I might have save myself lots of headache :-)

BTW what about stuff like MQTT? Can that be considered thread safe? What parts of mongoose cause biggest troubles regarding thread safety? So far stuff seems workable.

Hopefully i will soon release my module implementing parallel event loops, works well, just needs bit of polishing... Stay tuned :)

rojer commented 3 years ago

From what i read here espressif/esp-idf#4838 it feels like the 115200 should work with REF_TICK.

So perhaps we should try it and always use REF_TICK unless user specifies higher baudrate?

done in https://github.com/cesanta/mongoose-os/commit/5b937f05a949c6227b6a17267533f490f4b8c661 covers rates up to 500K.

markterrill commented 3 years ago

Anyway. I am satisfied with this conclusion, it is no problem to move LOG() out of the critical section. if i only knew week ago... I might have save myself lots of headache :-)

BTW what about stuff like MQTT? Can that be considered thread safe? What parts of mongoose cause biggest troubles regarding thread safety? So far stuff seems workable.

Hopefully i will soon release my module implementing parallel event loops, works well, just needs bit of polishing... Stay tuned :)

Sounds interesting!

I don't know if we're talking the exact same topic, but I was needing to play with the mgos_pwm _rgb code to implement hardware fading and reliable led blinking. Ended up doing the latter with

static int taskCore = 1;

    xReturned = xTaskCreatePinnedToCore(
        vLEDPWMTask, /* Function that implements the task. */
        "LEDPWMTASK", /* Text name for the task. */
        1024 * 4, /* Stack size in words, not bytes.  ..which might not be the case anymore according to espressif */
        pParams1, /* Parameter passed into the task. */
        MGOS_TASK_PRIORITY + 1,
        &led->xHandle, /* Used to pass out the created task's handle. */
        taskCore  /* Core where the task should be pinned */
        );

Interestingly before pinning to core 1 I was able to see the task getting sent to 1 or 2!

Harvie commented 3 years ago

Interestingly before pinning to core 1 I was able to see the task getting sent to 1 or 2!

Yes, that is an expected behaviour. FreeRTOS will run tasks on both cores as needed, when it is not disabled.

Harvie commented 3 years ago

Now i have firmware which runs multiple threads without problem when using UNICORE=Y, but when i enable all cores it causes following crash. Any ideas?

Remote debugging using 127.0.0.1:1234
0x40088627 in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
156         *((int *) 0) = 0;
#0  0x40088627 in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
#1  0x400887ac in abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:171
#2  0x400d2822 in __assert_func (
    file=0x3f402872 "/opt/Espressif/esp-idf/components/freertos/portmux_impl.inc.h", line=157, func=<optimized out>, 
    failedexpr=0x3f4028f1 "coreID == mux->owner")
    at ../../../.././newlib/libc/stdlib/assert.c:63
#3  0x4008a32c in vPortCPUReleaseMutexIntsDisabledInternal (
    mux=<optimized out>)
    at /opt/Espressif/esp-idf/components/freertos/portmux_impl.inc.h:163
#4  vPortCPUReleaseMutexIntsDisabled (mux=<optimized out>)
    at /opt/Espressif/esp-idf/components/freertos/portmux_impl.h:109
#5  vTaskExitCritical (mux=<optimized out>)
    at /opt/Espressif/esp-idf/components/freertos/tasks.c:4262
#6  0x40085325 in mgos_ints_enable ()
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/freertos/src/mgos_freertos.c:326
#7  0x40109a94 in siterRF_RX_cb (pqh=<optimized out>, cb_arg=<optimized out>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/siterRF/src/siterRF.cpp:259
#8  0x401062dc in pq_task (arg=0x3ffb6d80 <siter_pqh>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/pq/src/pq.c:41
Detaching from program: /fw.elf, Remote target
Harvie commented 3 years ago

@markterrill i've published the code for parallel event loops, you might check it out here:

https://github.com/Harvie/pq https://github.com/mongoose-os-libs/freertos/issues/3

Harvie commented 3 years ago

What do you think about that pq module? Would you accept PR that would integrate that functionality to mongoose os to replace currently used event loop without breaking current API?

rojer commented 3 years ago

i think it's too soon to be replacing mgos main task with it, but i can offer putting it under mongoose-os-libs and see if it gains more usage. looks neat, i'll play with it as well.

Harvie commented 3 years ago

but i can offer putting it under mongoose-os-libs

That would be great! Would i still have read-write access to repo in such case?

i think it's too soon to be replacing mgos main task with it,

i can barely think of any obvious downsides, but i will be happy to receive any comments reviewing the code.

rojer commented 3 years ago

created https://github.com/mongoose-os-libs/pq and sent you an admin invite

Harvie commented 3 years ago

OK, thanks. I've already imported the code.

markterrill commented 3 years ago

Hi @Harvie, I really like the rationale behind this module, but I'd love to see some example code as hinted as coming in the readme. My use case that I've got working fairly well is LEDC fading and blinking via xtask. What I'm finding though is our buzzer alarm sounds (playing RTTL encoded 'songs'!) can get interrupted and not fully execute with certain events like wifi disconnection. The wifi (and cloud) disconnection events fire, which triggers playing the sound and running the status light sequence, and more than likely there is some wifi activity as it tries to reconnect that kicks in a second or so later, but there are multiple contenders there elbowing each other out of the way and the buzzer can get caught on a tone that continues till it's called again!

Overall it's not clear really how they should be setup. I contributed code to the mongoose PWM library that handles the xtask for ledc. I'm guessing to add your library and call any of the buzzer sounds via a low priority pq thread. Appreciate any ideas!

Harvie commented 3 years ago

Hi @Harvie, I really like the rationale behind this module, but I'd love to see some example code as hinted as coming in the readme.

There is some basic example in the readme as you have noticed.

Exact configuration might need you to have bit insight to FreeRTOS tasks in general as pq runs in such task with all consequences. Properly selecting task priority, keeping watchdogs fed, understanding freertos ticks, time slicing and similar stuff will affect performance of the system. For realtime audio output it might be better to use hardware timer in some cases.

markterrill commented 3 years ago

I saw the example and it's good how it shows how to call the various exposed functions. I've got working knowledge of freeRTOS tasks, having taken that deep dive into that world and written a PR that adds the missing code for ESP32 hardware driven LEDC control (https://github.com/mongoose-os-libs/pwm/pull/7).

A concrete implementation mongoose demo app showing how the different priorities and competing jobs work in real life would be a lot more useful.

PS, It's a simple buzzer sounding a couple notes. Not a wave file. The best I can understand is if I queued two PQ jobs with the same priority they would neatly line up sequentially but they could easily just get blown away from MOS's higher priority system functions (ie wifi disconnects) and render the whole exercise pointless.

Harvie commented 3 years ago

PS, It's a simple buzzer sounding a couple notes. Not a wave file.

But it still requires you to follow strict timing and reconfigure the PWM at the exact time the note should be played. Therefore i suggested HW interrupt. Another solution might be to pregenerate PWM waveform to memory buffer and let esp32s i2s peripheral to bitbang it for you on background. (not actualy using i2s DAC, but leveraging I2S DMA to output bunch of bits ressembling PWM. People do this all the time for various reasons).

just get blown away from MOS's higher priority system functions (ie wifi disconnects)

ESP32 has 2 CPU cores, so one might be handling your code while another will handle wifi. Also you can create queue with HIGH priority. I don't completely understand why you create queue with low/same priority and then complain it gets preempted. Just create high priority queue and leave it blocked until it actualy has something to do. That way you will allow lower-priority tasks to run except for few brief moments that you need to use high priority task.

Harvie commented 3 years ago

BTW is it safe to use logging from esp_log.h (esp-idf) in parallel tasks (and/or ISR) ? Eg. calling ESP_LOGI().

rojer commented 3 years ago

yes, logging, mos and idf, is protected by locks and should be safe to use from any task.