cesanta / mongoose-os

Mongoose OS - an IoT Firmware Development Framework. Supported microcontrollers: ESP32, ESP8266, CC3220, CC3200, STM32F4, STM32L4, STM32F7. Amazon AWS IoT, Microsoft Azure, Google IoT Core integrated. Code in C or JavaScript.
https://mongoose-os.com
Other
2.48k stars 430 forks source link

ESP32: Weird firmware corruption when using CONFIG_PM_ENABLE=y #556

Closed Harvie closed 3 years ago

Harvie commented 3 years ago

There is some weird string printed during boot:

image

I've also found that when my code contains any function with IRAM_ATTR attribute, it will get even worse:

image

Even when i do not call the function at all. Just having eg.:

void IRAM_ATTR test(void) {                                                                                                                 
        return;                                                                                                                             
}   
Harvie commented 3 years ago

Just found it only happens when i have CONFIG_PM_ENABLE=y in ESP_IDF_SDKCONFIG_OPTS

rojer commented 3 years ago

this looks like bus clock gets changed and UART baud rate gets out of whack briefly. does it happen at other times or just once during boot?

Harvie commented 3 years ago

Yes, that is what i was thinking actually. Haven't noticed during any other time. But why it does produce more gibberish when using IRAM_ATTR? does it make any sense?

Harvie commented 3 years ago

I have some new observations on this.

i made my own PQ_LOG() macro based on mgos_cd_printf() ( see https://github.com/cesanta/mongoose-os/issues/480#issuecomment-716488590 for more ) and i found that when i use PQ_LOG() with PM enabled it gets corrupted as well, while output of LOG() does not get corrupted.

So i guess the problem lays in fact that mgos_cd_printf() does not use power management lock to prevent CPU from downscaling frequency while transmitting.

Also i've noticed that on different project, where i have lot of tasks and events doing something, the mgos_cd_printf() works well, because firmware probably does not idle enough to allow for PM to downscale very often.

Can you please add power management lock to mgos_cd_printf() in order to prevent this issues?

rojer commented 3 years ago

thanks, i definitely need to look into power management and how it's done. no ETA though...

rojer commented 3 years ago

mgoscd* should definitely not be acquiring any locks, it's only intended to be used during early stages of boot and exception handling (cd is for core dump).

Harvie commented 3 years ago

power management lock is not a mutex-like lock, which would block itself or other tasks until some resource is free. it just prevents the esp32 hardware to go into lower CPU frequency while it is locked and ensures that it would run at full speed while transmitting to the UART...

Harvie commented 3 years ago

I know that cd is for core dump, but i had to use it ( see #480 ), because LOG() is not completely thread safe :-(

rojer commented 3 years ago

ok, i see. isn't whatever manager bus frequency supposed to adjust divider when it changes bus freq? because uart can also receive, not just transmit.

we need to solve the LOG thread safety problem. it's supposed to be, we acquire the lock while logging, so why not?

Harvie commented 3 years ago

we acquire the lock while logging

The power management lock or general loging mutex?

Harvie commented 3 years ago

BTW more info on the PM locks: https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/power_management.html#power-management-locks

rojer commented 3 years ago

we acquire the lock while logging

The power management lock or general loging mutex?

the latter - https://github.com/cesanta/mongoose-os/blob/master/src/mgos_debug.c#L46-L53 cs_log_lock is called from cs_log_print_prefix, cs_log_unlock is called from cs_log_printf after output is done

Harvie commented 3 years ago

The problem is that when i have high priority rtos task blocked on mutex and then i unlock that mutex from ISR, freertos will switch the context immediately from the ISR to the recently unblocked task instead of finishing the interrupted task. Which is okay as long as you have everything synchronised using FreeRTOS primitives...

similar situation can occur when we are time slicing between two task of the same priority. which happens in freertos tick, which is ISR as well.

rojer commented 3 years ago

this is the relevant part - https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-reference/system/power_management.html#dynamic-frequency-scaling-and-peripheral-drivers

so, it looks like the approach is to either use reftick as clock source or take the lock to pin AHB/APB to prevent scaling. i looked into using reftick, but the problem with reftick is that it's very low frequency and does not allow reasonable baud rates. it's 1 mhz, with minimum dividiers you can't even do 115200 with it.

rojer commented 3 years ago

The problem is that when i have high priority rtos task blocked on mutex and then i unlock that mutex from ISR, freertos will switch the context immediately from the ISR to the recently unblocked task instead of finishing the interrupted task.

high-pri task activation does not automatically mean that lower pri task will lose its lock. if a high-pri task get scheduled and it wants to log, but the lock is held by a low-pri task, it will have to yield to low-pri task to let it finish and release the lock, only then proceed.

Which is okay as long as you have everything synchronised using FreeRTOS primitives...

are there alternatives? freertos is our kernel, we use the locking primitives it provides.

Harvie commented 3 years ago

are there alternatives? freertos is our kernel, we use the locking primitives it provides.

This is good. I can only reccomend that. I believe that if this is actualy the case there should be no problem. Maybe something is unsynchronized? Dunno... Some buffer where the logs go before being sent over network in case of udp/mqtt loging?

rojer commented 3 years ago

before we get into the complications of network logging from multiple threads, can you confirm if uart logging is ok?

Harvie commented 3 years ago

I think i tried disabling network loging without any improvement, may try it again. but this backtrace seems as if the panic occured already during the locking phase, before network loging starts. Maybe there is a bug in esp-idf wrapper code for freertos locks.

Remote debugging using 127.0.0.1:1234
0x40087c27 in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
156         *((int *) 0) = 0;
#0  0x40087c27 in invoke_abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:156
#1  0x40087dac in abort ()
    at /opt/Espressif/esp-idf/components/esp32/panic.c:171
#2  0x4008287e in lock_acquire_generic (lock=0x3ffae9d8, delay=4294967295, 
    mutex_type=4 '\004')
    at /opt/Espressif/esp-idf/components/newlib/locks.c:143
#3  0x4008299c in _lock_acquire_recursive (lock=0x3ffae9d8)
    at /opt/Espressif/esp-idf/components/newlib/locks.c:171
#4  0x4014e8d9 in _fwrite_r (ptr=0x3ffc8d08, buf=<optimized out>, size=1, 
    count=24, fp=0x3ffae980) at ../../../.././newlib/libc/stdio/fwrite.c:168
#5  0x4014e954 in fwrite (buf=0x3ffc8b70, size=1, count=24, fp=0x3ffae980)
    at ../../../.././newlib/libc/stdio/fwrite.c:211
#6  0x4017b7a0 in cs_log_print_prefix (level=LL_INFO, file=<optimized out>, 
    ln=<optimized out>) at src/common/cs_dbg.c:86
#7  0x40107c31 in siterRF_RX_cb (pqh=<optimized out>, cb_arg=<optimized out>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/siterRF/src/siterRF.cpp:183
#8  0x4010463c in pq_task (arg=0x3ffb6358 <siter_pqh>)
    at /home/harvie/Temp/firmware-github/display-lvgl/deps/pq/src/pq.c:40
Detaching from program: /fw.elf, Remote target
Found core at 23 - 459467
rojer commented 3 years ago

ah, that one... yeah, looks weird. line numbers are slightly off, but the fwrite invocation is this one. and it's after we've acquired the logging lock. however, what can happen is if this happens while a write is in flight to the same file from some other location. in this case "file" is stderr, maybe it's being written to from somewhere else? are you positive nothin is using logging or printf or such from ISR code?

rojer commented 3 years ago

looked again into using reftick for uart - it is in theory possible to change frequency from 1 mhz to something higher, but it's defined in the IDF in a way taht is really nor supposed to be changed: here.

Harvie commented 3 years ago

it is in theory possible to change frequency from 1 mhz to something higher

Why? It feels to me that 1MHz is enough for 115200. If someone has weird usecase, where he needs more, then he can just disable PM and use different clock. (perhaps mongoose can have conditional to disable REF_TICK when baudrate higher than 115200 is specified). For debuging it should work. If you need high-speed bus, then you should use SPI anyway...

If i remember correctly, it is usual that UARTs internaly divide the frequency by factor of 8. So 1MHz/8 = 125000, which should be good for 115200 bauds probably accounting for start/stop bits. But we have to test it... Is there easy way to switch baudrate and clock source of mongoose debug uart?

Harvie commented 3 years ago

Fixed in #511 Thanks