earlephilhower / arduino-pico

Raspberry Pi Pico Arduino core, for all RP2040 and RP2350 boards
GNU Lesser General Public License v2.1
2.09k stars 434 forks source link

Multicore crash with printf(%f) #614

Closed savejeff closed 2 years ago

savejeff commented 2 years ago

Hi,

i just switched over from the mbed Arduino core for the RP2040 because I'm pretty disappointed with how the code is written and maintained. Multicore is practically impossible to do there.

I ran into a very weird behaviour with Printing to SerialUSB from both cores. I use propper Mutex and the weird thing is: it works when I use const char* or format but not when I call it with a const String&. then it just freezes after the first print

here the code that easily reproduced the problem:


#include "Arduino.h"

mutex_t MUTEX_PRINT;

//#define LOGFUNC Log // DOES WORK
//#define LOGFUNC Log2 // DOES WORK
#define LOGFUNC Log3 // DOES NOT WORK
//#define LOGFUNC Log4 // DOES NOT WORK

#include <stdarg.h>
extern void Log(const char* format, ...) __attribute__((format(printf, 1, 2)));
void Log(const char* format, ...)
{
    mutex_enter_blocking(&MUTEX_PRINT);

    char loc_buf[64];
    char* temp = loc_buf;
    va_list arg;
    va_list copy;
    va_start(arg, format);
    va_copy(copy, arg);
    int len = vsnprintf(temp, sizeof(loc_buf), format, copy);
    va_end(copy);
    if (len < 0) {
        va_end(arg);
        return;
    };
    if (len >= sizeof(loc_buf)) {
        temp = (char*)malloc(len + 1);
        if (temp == NULL) {
            va_end(arg);
            return;
        }
        len = vsnprintf(temp, len + 1, format, arg);
    }
    va_end(arg);

    Serial.print("[");
    Serial.print(millis());
    Serial.print("] ");
    Serial.println(temp);

    if (temp != loc_buf) {
        free(temp);
    }

    mutex_exit(&MUTEX_PRINT);
}

void Log2(const char* msg)
{
    mutex_enter_blocking(&MUTEX_PRINT);

    Serial.print("[");
    Serial.print(millis());
    Serial.print("] ");
    Serial.println(msg);

    mutex_exit(&MUTEX_PRINT);
}

void Log3(const String& msg)
{
    mutex_enter_blocking(&MUTEX_PRINT);

    Serial.print("[");
    Serial.print(millis());
    Serial.print("] ");
    Serial.println(msg);

    mutex_exit(&MUTEX_PRINT);
}

void Log4(const String& msg)
{
    Log2(msg.c_str());
}

/*****************************************************/

static bool start_done = false;

void jtask0()
{
    mutex_init(&MUTEX_PRINT);
    start_done = true;

    uint32_t loopcount = 0;
    while(true)
    {
        loopcount++;

        LOGFUNC("main0: loop");

        delay(100);
    }
}

void jtask1() 
{
    while(!start_done);

    uint32_t loopcount = 0;
    while(true)
    {
        loopcount++;

        LOGFUNC("main1: loop");

        delay(100);
    }
}

void setup() {
    Serial.begin(115200);

    while (!Serial.available()) 
    {
        Serial.print(".");
        delay(1000);
    }
    while(Serial.available())
        Serial.read();
}

void setup1() {

}

void loop() {
    jtask0();
}

void loop1() {
    jtask1();
}

Could somebody try this? you can just switch between the Print Functions with the defines at the top.

I'm trying to figure this out for half a day now and I can't I think I'm going crazy x]

Thx

Here the output of the working Prints:


..[9034] main0: loop
[9034] main1: loop
[9134] main0: loop
[9134] main1: loop
[9234] main0: loop
[9234] main1: loop
[9334] main0: loop
[9334] main1: loop
[9435] main0: loop
[9435] main1: loop
[9535] main0: loop
[9535] main1: loop
[9635] main0: loop
[9635] main1: loop
[9735] main0: loop
[9735] main1: loop
[9835] main0: loop
[9835] main1: loop
[9935] main0: loop
[9935] main1: loop
[10035] main0: loop
[10035] main1: loop
[10135] main0: loop
[10135] main1: loop
[10235] main0: loop

and the not working


..[7003] main1: loop
[7003] main1: loop
[7103] main0: loop

>>> DEADLOCK HERE <<< 

Im using PlatformIO with this config


[env:PICO_ALT_CORE]
platform = https://github.com/maxgerhardt/platform-raspberrypi.git
board = pico
framework = arduino
board_build.core = earlephilhower
board_build.filesystem_size = 0.5m
; note that download link for toolchain is specific for OS. see https://github.com/earlephilhower/pico-quick-toolchain/releases.
platform_packages =
    maxgerhardt/framework-arduinopico@https://github.com/earlephilhower/arduino-pico.git
    maxgerhardt/toolchain-pico@https://github.com/earlephilhower/pico-quick-toolchain/releases/download/1.3.1-a/x86_64-w64-mingw32.arm-none-eabi-7855b0c.210706.zip
earlephilhower commented 2 years ago

Thanks for the detailed MCVE. However, for me using the Arduino IDE and the latest git head, all work fine.

All logs, from Log thru Log4 are running fine and iterating between main0:loop and main1:Loop for me.

Looking at the code, it also seems fine in all. The println(String) method really just uses the .c_str() anyway as you do in Log 4 and at that point a C string is a C string is a C string...

I highly doubt there's any recent change related to this, but if you have the ability to use the latest git master code and toolchain that would be worth trying.

savejeff commented 2 years ago

thanks for the quick response. Okay this alignes with what i was expecting.

This might be caused by switching between the Mbed core and this core. The Mbed core has massive problems with any code running on the second core. The Deadlock started happening after a switch back and forth between the mbed and this core. I'll delete both platform cores and re-install them. I'll also try this on my second laptop to confirm that this is something only happening on this device.

FYI: from what i,ve seen of the Arduino pico code, I find this a very nice project. a year ago I decided against using it due to the hope that the mbed core would be better supported. but that seems now to have been the wrong decision. The use of RTOS in the background is a very good idea. Are there plans to publish an official version in PlatformIO ? Or maybe even a replacement for the current mbed based PIO platform?

earlephilhower commented 2 years ago

The use of RTOS in the background is a very good idea. Are there plans to publish an official version in PlatformIO ? Or maybe even a replacement for the current mbed based PIO platform?

Actually, by default there is no RTOS at all. The 2nd core is fully supported by the SDK and only takes a few lines to enable . We do have FreeRTOS, too, which is a true RTOS and has threads which can move around between cores/etc.

I've just published the toolchain in PIO and it's in the process of being included as a 2nd option. The MBED core is fine for many folks and will probably work better on Arduino branded HW (since their RP2040 Connect Nano board has tons of add'l onboard stuff like a WiFi interface, MEMS accelerometerr, etc.)

maxgerhardt commented 2 years ago

Please remove

; note that download link for toolchain is specific for OS. see https://github.com/earlephilhower/pico-quick-toolchain/releases.
platform_packages =
    maxgerhardt/framework-arduinopico@https://github.com/earlephilhower/arduino-pico.git
    maxgerhardt/toolchain-pico@https://github.com/earlephilhower/pico-quick-toolchain/releases/download/1.3.1-a/x86_64-w64-mingw32.arm-none-eabi-7855b0c.210706.zip

and execute in a CLI:

pio platform update https://github.com/maxgerhardt/platform-raspberrypi.git

The integration was just updated to use the latest core and toolchain. You for example are using a toolchain version that's several versions behind.

If you're still not getting the same behavior as in the Arduino IDE, then that's a problem.

savejeff commented 2 years ago

Thx @maxgerhardt , Ill try that first thing in the morning.

i took the config straight from here. It said "latest". Maybe worth updating the site.

maxgerhardt commented 2 years ago

Indeed, docs are outdated due to me updating the way it works a few hours ago. That's the next target. This is bleeding edge stuff.

savejeff commented 2 years ago

Haha perfect. I'll test it right now

savejeff commented 2 years ago

I've just published the toolchain in PIO and it's in the process of being included as a 2nd option. The MBED core is fine for many folks and will probably work better on Arduino branded HW (since their RP2040 Connect Nano board has tons of add'l onboard stuff like a WiFi interface, MEMS accelerometerr, etc.)

In general, I agree, that most people would never notice the shortcomings of the mbed core. But in this case, having a hardware dual core but only supporting a single core (not by default, but as a hard limitation) is not the right spirit. The ESP32 core lets you start tasks if you want, but you would never notice it if as a casual arduino user. I'm a big fan of what arduino did for standardizing embedded hardware abstraction, but needlessly limiting potential is not the right way. Arduino should be a foundation to get started with always an open door to go further beyond. This arduino-pico core fits this way better than the mbed core.

savejeff commented 2 years ago

I just tested @maxgerhardt instructions and it works now. I'll test it more deeply tomorrow. its getting late here in Germany ^^

Thanks for the help so far!

earlephilhower commented 2 years ago

Looks like we're all set here. Closing

savejeff commented 2 years ago

I didn't close the Issue because i wanted to run some more test to make sure.

sadly I again ran into the same problem. I have no idea want is going on. I ran the exact code on a different laptop with no RP2040 platform installed before and i took a new Pico: still freeze after first 3 prints.

@maxgerhardt could you test if the project runs for you? I've reduced the Pio project to its minimum Demo_RP2040_Deadlock_Clean.zip

and this is the ulf that i get and doesn't work: firmware.zip

Edit: I Tested it on another PC with a complete new installation of VSC. Same Deadlock problem. I also tested the code in Arduino IDE and I can confirm that here everything works with NO Deadlock This might be a problem in the platformio version of the core

maxgerhardt commented 2 years ago

Alright if it doesn't occur in the Arduino IDE but only with PlatformIO, we'll handle the issue in platform-raspberrypi further.

earlephilhower commented 2 years ago

The pico-sdk malloc/etc. is protected with mutexes:

(gdb) disassemble __wrap_free 
Dump of assembler code for function __wrap_free:
   0x10003ab4 <+0>: push    {r4, r5, r6, lr}
   0x10003ab6 <+2>: movs    r4, r0
   0x10003ab8 <+4>: ldr r5, [pc, #20]   ; (0x10003ad0 <__wrap_free+28>)
   0x10003aba <+6>: movs    r0, r5
   0x10003abc <+8>: bl  0x1000cd08 <__mutex_enter_blocking_veneer>
   0x10003ac0 <+12>:    movs    r0, r4
   0x10003ac2 <+14>:    bl  0x10006e24 <free>
   0x10003ac6 <+18>:    movs    r0, r5
   0x10003ac8 <+20>:    bl  0x1000cc88 <__mutex_exit_veneer>
   0x10003acc <+24>:    pop {r4, r5, r6, pc}
   0x10003ace <+26>:    nop         ; (mov r8, r8)
   0x10003ad0 <+28>:    lsrs    r4, r1, #15
   0x10003ad2 <+30>:    movs    r0, #0
End of assembler dump.
(gdb) disassemble __wrap_free 
Dump of assembler code for function __wrap_free:
   0x10003ab4 <+0>: push    {r4, r5, r6, lr}
   0x10003ab6 <+2>: movs    r4, r0
   0x10003ab8 <+4>: ldr r5, [pc, #20]   ; (0x10003ad0 <__wrap_free+28>)
   0x10003aba <+6>: movs    r0, r5
   0x10003abc <+8>: bl  0x1000cd08 <__mutex_enter_blocking_veneer>
   0x10003ac0 <+12>:    movs    r0, r4
   0x10003ac2 <+14>:    bl  0x10006e24 <free>
   0x10003ac6 <+18>:    movs    r0, r5
   0x10003ac8 <+20>:    bl  0x1000cc88 <__mutex_exit_veneer>
   0x10003acc <+24>:    pop {r4, r5, r6, pc}
   0x10003ace <+26>:    nop         ; (mov r8, r8)
   0x10003ad0 <+28>:    lsrs    r4, r1, #15
   0x10003ad2 <+30>:    movs    r0, #0
End of assembler dump.
(gdb) disassemble mutex_enter_blocking
Dump of assembler code for function mutex_enter_blocking:
   0x2000010c <+0>: movs    r3, #208    ; 0xd0
   0x2000010e <+2>: push    {r4, r5, lr}
   0x20000110 <+4>: movs    r5, #0
   0x20000112 <+6>: lsls    r3, r3, #24
   0x20000114 <+8>: movs    r4, #0
   0x20000116 <+10>:    ldrsb   r5, [r3, r5]
   0x20000118 <+12>:    b.n 0x20000128 <mutex_enter_blocking+28>
   0x2000011a <+14>:    ldr r3, [r0, #0]
   0x2000011c <+16>:    dmb sy
   0x20000120 <+20>:    str r4, [r3, #0]
   0x20000122 <+22>:    msr PRIMASK, r1
   0x20000126 <+26>:    wfe
   0x20000128 <+28>:    ldr r2, [r0, #0]
   0x2000012a <+30>:    mrs r1, PRIMASK
   0x2000012e <+34>:    cpsid   i
   0x20000130 <+36>:    ldr r3, [r2, #0]
   0x20000132 <+38>:    cmp r3, #0
   0x20000134 <+40>:    beq.n   0x20000130 <mutex_enter_blocking+36>
   0x20000136 <+42>:    dmb sy
   0x2000013a <+46>:    ldrb    r3, [r0, #4]
   0x2000013c <+48>:    cmp r3, #127    ; 0x7f
   0x2000013e <+50>:    bls.n   0x2000011a <mutex_enter_blocking+14>
   0x20000140 <+52>:    strb    r5, [r0, #4]
   0x20000142 <+54>:    ldr r3, [r0, #0]
   0x20000144 <+56>:    dmb sy
   0x20000148 <+60>:    movs    r2, #0
   0x2000014a <+62>:    str r2, [r3, #0]
   0x2000014c <+64>:    msr PRIMASK, r1
   0x20000150 <+68>:    pop {r4, r5, pc}
End of assembler dump.
(gdb) disassemble __mutex_enter_blocking_veneer
Dump of assembler code for function __mutex_enter_blocking_veneer:
   0x1000cd08 <+0>: push    {r0}
   0x1000cd0a <+2>: ldr r0, [pc, #8]    ; (0x1000cd14 <__mutex_enter_blocking_veneer+12>)
   0x1000cd0c <+4>: mov r12, r0
   0x1000cd0e <+6>: pop {r0}
   0x1000cd10 <+8>: bx  r12
   0x1000cd12 <+10>:    nop
   0x1000cd14 <+12>:    lsls    r5, r1, #4
   0x1000cd16 <+14>:    movs    r0, #0
End of assembler dump.
earlephilhower commented 2 years ago

And the crash shows

(gdb) where
#0  isr_hardfault () at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/rp2_common/pico_standard_link/crt0.S:98
#1  <signal handler called>
#2  0x1000a6f8 in _free_r (reent_ptr=0x20000554 <impure_data>, mem=<optimized out>) at /workdir/repo/newlib/newlib/libc/stdlib/mallocr.c:2708
#3  0x10006e30 in free (aptr=<optimized out>) at /workdir/repo/newlib/newlib/libc/stdlib/malloc.c:170
#4  0x10003ac6 in __wrap_free (mem=0x20002258) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/rp2_common/pico_malloc/pico_malloc.c:69
#5  0x10000a46 in arduino::String::~String (this=0x20040fd4, __in_chrg=<optimized out>) at /home/earle/Arduino/hardware/pico/rp2040/cores/rp2040/api/String.cpp:148
#6  0x100003d0 in jtask1 () at /home/earle/Arduino/sketch_jun09a/sketch_jun09a.ino:121
#7  0x100003ee in loop1 () at /home/earle/Arduino/sketch_jun09a/sketch_jun09a.ino:149
#8  0x10000c26 in main1 () at /home/earle/Arduino/hardware/pico/rp2040/cores/rp2040/main.cpp:55
#9  0x10003b0a in core1_wrapper (entry=0x10000c0d <main1()>, stack_base=<optimized out>) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/rp2_common/pico_multicore/multicore.c:89
#10 0x00000172 in ?? ()
earlephilhower commented 2 years ago

A simple malloc() stress test:

void setup() {
  Serial.begin(115200);
}

void loop() {
  int sz = rand() & 127;
  void *ptr = malloc(sz);
  memset(ptr, 0, sz);
  free(ptr);
}

void setup1() {
}

void loop1() {
  int sz = rand() & 127;
  void *ptr = malloc(sz);
  memset(ptr, 1, sz);
  free(ptr);
}

Doesn't crash, but does hang(!) due to the malloc mutex(!!):

(gdb) where
#0  mutex_enter_blocking (mtx=mtx@entry=0x20000bcc <malloc_mutex>) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/common/pico_sync/mutex.c:38
#1  0x10003570 in __wrap_malloc (size=79) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/rp2_common/pico_malloc/pico_malloc.c:33
#2  0x10001102 in GetDescHIDReport (len=len@entry=0x20041e24) at /home/earle/Arduino/hardware/pico/rp2040/cores/rp2040/RP2040USB.cpp:156
#3  0x1000127e in tud_descriptor_configuration_cb (index=<optimized out>) at /home/earle/Arduino/hardware/pico/rp2040/cores/rp2040/RP2040USB.cpp:192
#4  0x10004348 in process_get_descriptor (p_request=0x20041f40, rhport=0 '\000') at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/lib/tinyusb/src/device/usbd.c:1035
#5  process_control_request (p_request=0x20041f40, rhport=0 '\000') at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/lib/tinyusb/src/device/usbd.c:732
#6  tud_task () at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/lib/tinyusb/src/device/usbd.c:539
#7  0x10001180 in usb_irq () at /home/earle/Arduino/hardware/pico/rp2040/cores/rp2040/RP2040USB.cpp:280
#8  <signal handler called>
#9  0x10006afa in _malloc_r (reent_ptr=0x20000554 <impure_data>, bytes=<optimized out>) at /workdir/repo/newlib/newlib/libc/stdlib/mallocr.c:2522
#10 0x10006948 in malloc (nbytes=<optimized out>) at /workdir/repo/newlib/newlib/libc/stdlib/malloc.c:164
#11 0x10003576 in __wrap_malloc (size=34) at /home/earle/Arduino/hardware/pico/rp2040/pico-sdk/src/rp2_common/pico_malloc/pico_malloc.c:35
#12 0x1000033c in loop () at /home/earle/Arduino/sketch_jun10c/sketch_jun10c.ino:7
#13 0x10000be6 in main () at /home/earle/Arduino/hardware/pico/rp2040/cores/rp2040/main.cpp:132

The app called malloc, which grabbed the mutex and started doing work, but before it was done it got a USB interrupt...which also tried to malloc but can't grab the mutex since, well, it's already grabbed by the routine it interrupted.

This is a SDK/TinyUSB issue and not related to the core. Ouch. :sob:

earlephilhower commented 2 years ago

Adding a delay(5000) to both setup()s and it works fine. It's been running for ~800 million allocations without any hangs so I don't think there is a malloc() issue. Maybe a GCC optimizer problem, or something very subtle in Strings???

earlephilhower commented 2 years ago

https://github.com/earlephilhower/arduino-pico/issues/614#issuecomment-1152769879 should be fixed. Now only the strange crash under certain optimizations...

savejeff commented 2 years ago

I'm still getting weird freezes when using the second core. I'm on PlatformIO, how can I check if I have the fix? This could be, because the platformIO core is not yet updated with the fix.

earlephilhower commented 2 years ago

I'm still getting weird freezes when using the second core. I'm on PlatformIO, how can I check if I have the fix?

You could run your simple sketch above. Log3 and Log4 run fine w/the latest master, under all optimization options, but hang with prior versions.

There's also probably a pio update command of some sort to re-pull the git tree. Again, I'm not a PIO user so can't give exact configs.

savejeff commented 2 years ago

I've looked into the diff log and checked if the changes can be found locally, which is the case. so its up to date. then I'll search elsewhere.

maxgerhardt commented 2 years ago

One fishy thing I fixed in your sketch is the usage of a shared global variable that does not have the volatile on it -- the compiler might then optimize away re-reading this variable and changes done it by the other core might be ignored. Declaring it as

static volatile bool start_done = false;

solves that problem.

Other than that, the sketch happily runs for me with the latest core version without an observable crash and at optimization level -O0 and -Og.

Can you share the piece of code that still hangs in PlatformIO?

savejeff commented 2 years ago

I've not yet had time to search for the problem. I think I'll post a new issue if I'm able no narrow it down.

The task is for a Battery Management System. To this point, I did all the work on the first core alone. The plan is to move all non-critical tasks to the second core. That means i want to use the I2C bus on the second core while the first core only uses the SPI Bus to talk to peripheral devices. I have confirmed that the code is running stable (I've run it continuously for over 24h with no problem) but when I move the I2c code to the second core it starts to freeze up after around 5 to 10 minutes. The weird thing is, I switched to a (proven and well-tested) software I2C implementation that also freezes up.

I have two ideas about what this problem might be:

My guess would be the second option and i want to make sure its not a stupid mistake on my side before posting it. I've not yet tried to attach the debugger. Maybe ill get to it today

savejeff commented 2 years ago

I've done some debugging today. Getting the Picoprobe running with PlatformIO was a breeze.

I recorded a dozen crashes. I can't really figure out what is happening but its something with printf (especially print float) I've attached some screenshots i took when one core crashed. Maybe somebody with more experience understands what is happening here

The other core is mostly currently just sleeping and it happens on both cores. I thought about maybe its a stack overflow but increasing heap and stack to 0x1000 did not help

Crash_3 Crash_2 Crash_1 Crash_5 Crash_4

maxgerhardt commented 2 years ago

Very weird things. It's definitely calling the __wrap_realloc function that was added to multithread-proof its users (e.g. the String class), but still crashes. The _printf_float does a _Balloc (which per this is a big integer alloc), which should use _calloc_r internally, which should (?) call the wrapped calloc though. Maybe that doesn't hold for the _r reentrant versions?

The heap, or some critical RAM content, seems corrupted to me. The second screenshot crashes while loading 32-bits from the address pointed to by r0, which is 0x33373000, which doesn't seem like a RAM address at all (0x20000000 to 0x20042000), but the ASCII string "370" is you read it from left to right.. Hm.

Can you please check that at least one of the crashes still appears with the latest version? (https://github.com/earlephilhower/arduino-pico/issues/614#issuecomment-1151603840 to update)

I'll try and reproduce this with the two cores using printf with %f to make them use _printf_float. The Software I2C might also have something to do with it though.

savejeff commented 2 years ago

The RAM addresses might be incorrect, as i don't know what core the registers are from. but i guess they are the registers from the core that crashed. is there a way to switch to the registers of the other core?

The Load from a non-RAM address might be the load of the format string that is a const char* in the flash. (only guessing here)

I can run the code with as many variations and repetitions as needed. I can switch to the hardware i2c implementation or turn off the i2c communication altogether. but it will be quite hard to reduce the code to a level where I can post it here. the surrounding peripherals are needed for the code. If worse comes to worst I can replace the devices with simulated ones that just return default values. I have a strong feeling it's not about hardware mutex. It's something about vsprintf and memory usage. but why is the other core sleeping then?

I'll continue testing stuff and try to find more evidence

Edit: I don't use volatile. I never had to use it to this point and i run a lot of parallel code on the ESP32 with almost the same code. I can run the code in -O0 to make sure its an optimization problem.

earlephilhower commented 2 years ago

A quick scan of a copy of Newlib shows Balloc calling _calloc_r which is not wrapped. That's the internal allocator and could wind up causing corruption as seen before, which only manifests itself some time later...

http://sourceware.org/git/?p=newlib-cygwin.git;a=blob;f=newlib/libc/stdlib/mprec.c;h=930c984cadbc3831c56537b3e0f4e754808ee859;hb=HEAD#l96

savejeff commented 2 years ago

can this be fixed the same way realloc was wrapped ?

earlephilhower commented 2 years ago

Just thinking about it there seems to be 2 ways to adjust this:

  1. Wrap the XXXXx_r instead of the "well known names"
  2. Adjust Newlib to call the well known, wrapped name here (and elsewhere?)

1 would actually be an issue due to pico-sdk. Need changes there and then they'd be tied to newlib and not glibc.

2 is under our control (I have patch support already) but requires a scan to see if there's any reason they're calling the internal fcn. as well as finding out where else it might need to be patched.

I think 2 is the way we'll go, but it'll be over the weekend to find out. Either way requires a toolchain rebuild (no biggie, but another 80MB download).

savejeff commented 2 years ago

image

just ran without any optimizsations -O0 and got the same crash after around the same usual time of 15-20min. i think that rules out volatile usage

earlephilhower commented 2 years ago

The key issue here wasn't the optimization, but the timing of memory operations in both cores overlapping. There are actually quite a few malloc_rs in the printf code and other bits of Newlib, so I think I'll actually try (1) but with the wrapper of XXX_r being local to this repo and in addition to the upper-level Pico-SDK locks.

IIRC, the ESP32 uses newlib but it also uses its own memory manager(UMM, like we use on the ESP8266 Arduino core), so these kinds of issues wouldn't pop up for them as the code is completely different.

savejeff commented 2 years ago

I just want to make sure its not something i messed up before you sit on this over the weekend. I tried the hardware I2C implementation -> same printf float error.

earlephilhower commented 2 years ago

Looks like the proper way is to fix my Newlib port. Right now sys/lock.h is a noop.

Wrapping calloc_r and malloc_r probably would not work as calloc may end up calling the internal malloc_r if it can't expand the block...

From mlock.c

The <<malloc>> family of routines call these functions when they need to lock
the memory pool.  The version of these routines supplied in the library use
the lock API defined in sys/lock.h.  If multiple threads of execution can
call <<malloc>>, or if <<malloc>> can be called reentrantly, then you need to
define your own versions of these functions in order to safely lock the
memory pool during a call.  If you do not, the memory pool may become
corrupted.

A call to <<malloc>> may call <<__malloc_lock>> recursively; that is,
the sequence of calls may go <<__malloc_lock>>, <<__malloc_lock>>,
<<__malloc_unlock>>, <<__malloc_unlock>>.  Any implementation of these
routines must be careful to avoid causing a thread to wait for a lock
that it already holds.
savejeff commented 2 years ago

Hm. All my printf calls go through a single function what returns a String and its called s_printf(...). If this is actually a concurrency problem i should be able to fix this with a mutex in that function. ill try that tomorrow morning

earlephilhower commented 2 years ago

@savejeff can you give #638 a try? It's not a 100% fix, but should clean up any memory allocator issues at a high level.

Also, if you can run your failing sketch w/o any add'l hardware, can you post it here? I'm trying to make my own MCVE but you seem to have a much more comprehensive way of making the core crash! :laughing:

savejeff commented 2 years ago

Okay i just updated to the latest version. I'll report back if the crashes are fixed with this. If this still failed ill try the mutex in my personal s_printf function.

about a reduced demo version, I already thought about how to do this. I'll try to figure something out but this might take some time. Posting the code here would also mean publishing a lot of the code that I developed for a product I am selling. We would need to arrange something that I can give you the code in private. I'll get back to you when I have a working version without special hardware requirements. Do you have one of these SSH1306 .96" Oled Display breakouts laying around? That would make it somewhat easier.

EDIT1: Okay the lastest version crashed after 8min. EDIT2: The Version where i added mutex to my personal s_printf func is now running for 1hour no problem. At this point the code usually crashed already. I'll peek it running but it seems right now that this dirty-fixes the problem.

maxgerhardt commented 2 years ago

EDIT1: Okay the lastest version crashed after 8min.

Are you using that special fix branch with

platform_packages =
   framework-arduinopico@https://github.com/earlephilhower/arduino-pico.git#preservatif

in the platformio.ini? This is not in the master branch, so regular updating won't get you that update.

savejeff commented 2 years ago

ah okay. no i just did a pio platform update.

I'll try this out later. The version where i added mutex myself is running without crashes so far. pretty sure we have at least proven we are looking in the right place.

savejeff commented 2 years ago

EDIT1: Okay the lastest version crashed after 8min.

Are you using that special fix branch with

platform_packages =
   framework-arduinopico@https://github.com/earlephilhower/arduino-pico.git#preservatif

in the platformio.ini? This is not in the master branch, so regular updating won't get you that update.

I'm still getting the same crashes with this branch/config

image

On the other hand, the manually added mutex in s_printf does work reliably. ran for 4h yesterday.

earlephilhower commented 2 years ago

@savejeff can you try #640

There is a new toolchain build and protection around the floating point format routine, which seems to not be thread safe for some reason. I was able to come up with a stress test which crashed almost instantly. Basically, allocate/deallocate/printf seemed fine, but as soon as I had 2 floating point formats printing in parallel I could make it crash instantly.

With #640 that is cleared up in my own test (plus it has legit locking per the Newlib spec)./

void setup() {
  Serial.begin(115200);
}

void loop() {
  static uint64_t cnt = 0;
  int sz = rand() & 127;
  void *ptr = malloc(sz);
  memset(ptr, 0, sz);
  ptr = realloc(ptr, sz + (rand() & 127));
  snprintf((char *)ptr, sz, "%0.12f", cnt / 1.2e12);
  free(ptr);
  cnt++;
  if (! (cnt % 1000)) Serial.printf("Core 0: %d\n", cnt);
}

void setup1() {
}

void loop1() {
  static uint64_t cnt = 0;
  int sz = rand() & 127;
  void *ptr = malloc(sz);
  memset(ptr, 1, sz);
  ptr = realloc(ptr, sz + (rand() & 127));
  snprintf((char *)ptr, sz, "%0.12f", cnt / 1.2e12);
  free(ptr);
  cnt++;
  if (! (cnt % 1000)) Serial.printf("Core 1: %ld\n", cnt);
}
earlephilhower commented 2 years ago

@savejeff I meant, try #640 . No idea how that typo'd number got there!

Note that you will need the new toolchain installed, too. Not sure what's needed for Platform.io to make that happen. Maybe @maxgerhardt can suggest something?

earlephilhower commented 2 years ago

And with some thinking about it I see the issue. Locks were 1 part, but there is a global __impure_ptr which is used for some inter-call state. Which printf_float uses. In a 1-core multithread setup, you'd set the impure_ptr to a thread-local copy. For multicore I need to see if Newlib has dynamic reentry capabilities

earlephilhower commented 2 years ago

Please wait to try #640. I have implemented the real dynamic reent structs (needs toolchain which is undergoing a full build now). With the change no need to artificially mutex lock functions, and there should be significantly less multicore issues (i.e. they'd all be in my code, not in newlib, at least!).

But the full toolchain rebuild won't finish for a while and it's 4am local time for me, so the final good push will have to wait on :sleeping: !

savejeff commented 2 years ago

There is a new toolchain build and protection around the floating point format routine, which seems to not be thread safe for some reason. I was able to come up with a stress test which crashed almost instantly. Basically, allocate/deallocate/printf seemed fine, but as soon as I had 2 floating-point formats printing in parallel I could make it crash instantly.

ah yes, i was also thinking about a stress test like that. I'm basically doing the same thing in my Code but with less frequent prints.

I'll wait for the new toolchain build and till then I'll test my current implementation with mutex in my s_printf func. Really looking forward to a reliable dual-core version of my software. The second core allowed me to improve timings for safety checks drastically. I think in general the community will greatly benefit from an actually usable RP2040 in dual-core mode.

Just post here when the build is done and good rest till then ;] I might need help from @maxgerhardt how i can get the latest commits and changes into PIO. I currently only know of the pio platform update command and i am not up to speed with how pio puts the platform and platform_package stuff together. A short elaboration on what these configs do in the background would be helpful.

maxgerhardt commented 2 years ago

The PlatformIO integration part will be easy, once the new toolchain build has appeared on https://github.com/earlephilhower/pico-quick-toolchain/releases and since the fix branch is now called lockmeister, you'll just have add instructions for PlatformIO to source the toolchain package (toolchain-rp2040-earlephilhower) and framework package (framework-arduinopico) from a download link / a branch, like

platform_packages =
   framework-arduinopico@https://github.com/earlephilhower/arduino-pico.git#lockmeister
   toolchain-rp2040-earlephilhower@https://github.com/earlephilhower/pico-quick-toolchain/releases/download/1.4.0-a/x86_64-w64-mingw32.arm-none-eabi-f38fa5a.220618.zip

I'll update and test myself once the toolchain package is out.

earlephilhower commented 2 years ago

The toolchain was just uploaded and the lockmeister branch #640 now has the proper links. 1.4.0-b.

I have not published the toolchain to Platfor.IO because it seems like they manually curate things and spamming them with releases would be unwelcome. It is published on GH (since it's 0-effort and automated).

When you guys report back that things are looking good on your end I'll do the publishing of the branch and merce/publishing of a new release.

earlephilhower commented 2 years ago

Latest push also fixes FreeRTOS, properly. The torture test has been running for 20 mins now under it. I think we are good to go...

savejeff commented 2 years ago

I just started reliability tests on my testbench. So far no crashes.

using this config

platform = https://github.com/maxgerhardt/platform-raspberrypi.git
platform_packages =
   framework-arduinopico@https://github.com/earlephilhower/arduino-pico.git#lockmeister
   toolchain-rp2040-earlephilhower@https://github.com/earlephilhower/pico-quick-toolchain/releases/download/1.4.0-b/x86_64-w64-mingw32.arm-none-eabi-cb31b54.220619.zip

If there is no crash in the next hour its a green light from me.

Code ran for an hour without any hiccups. ✅ I'll continue running tests but I have a good feeling about this.

savejeff commented 2 years ago

Very nice. Thx for the thourough investigation!

At what point can i use this in PIO with the config

platform = https://github.com/maxgerhardt/platform-raspberrypi.git
board = pico
framework = arduino
board_build.core = earlephilhower

As far a i understand there needs to be a separate release for it to be updated to the latest arduino-pico code.

earlephilhower commented 2 years ago

Once #645 is in I'm going to do a release which should also do a PIO publish. It takes a couple days to get that out, and then watch https://github.com/platformio/platform-raspberrypi/pull/36 which is the actual inclusion of the core option.

savejeff commented 2 years ago

Ok thx! I have no deadline, so no hurry. I just wanted to know when i can switch fully from the mbed core.