Freeze/crash in About menu since 9f15fd1

nlfx commented 3 years ago

I've had frequent crashes since yesterday merges. Here is how to reliably reproduce the issue:

Open and close About menu twice
Open About menu a third time
Swipe up to try to access screen 2
Watch freezes and is reset by watchdog a few seconds later

https://user-images.githubusercontent.com/69880003/116891847-b9bb7a80-ac1e-11eb-9d82-254a69eaa988.mp4

I've bisected the problem to commit 9f15fd1820e620e42fb717fb5d692d44a3f0f70d. However given its content, I don't understand how it could create this crash. Maybe it's a problem with some completely unrelated part of the code which only becomes visible after this commit due to some timing or memory position constraint?

Could someone with a dev kit and SWD access please see if they manage to get more info about what is creating the above crash?

In case this bug is memory location dependent and strings with file path in the firmware change the behavior, here is my version of the above commit.

Thanks a lot for your help!

ObiKeahloa commented 3 years ago

What exactly have you changed in the commit? My first response too was "memory allocation maybe?" but there seems to be enough memory for doing a simple vibration. It might be something else triggering this maybe the driver?

nlfx commented 3 years ago

What exactly have you changed in the commit?

No change, I just did a git checkout --recurse-submodules 9f15fd1820e620e42fb717fb5d692d44a3f0f70d, compiled it and flashed it.

My first response too was "memory allocation maybe?" but there seems to be enough memory for doing a simple vibration.

What is weird, is that the the crash happens on the About menu (SystemInfo.cpp) which has nothing to do with the motor controller change in this commit.

So indeed, I also initially though that the problem might be that the memory for the stack was at its limit and the additional motorController argument passed down the call stack was enough to start overwriting other memory region. But the fact that the crash happen only when accessing the About menu for the 3rd time would make no sense if that was the real problem. Unless I'm missing something?

So then I though that it might be a memory leak or memory corruption in the LVGL, and that's why opening the same menu multiple times resulted in a crash. But then it makes no sense that I cannot reproduce this crash on the previous commit (d141888b828d9a47bbcdee86c3c22f1a6d415c56) which contains exactly the same LVGL version and same menus.

So without SWD access, I really don't think I can solve this issue. If no one has time to test this, I'll see if I can re-order a dev kit when they'll be back in stock.

Avamander commented 3 years ago

@JF002

But the fact that the crash happen only when accessing the About menu for the 3rd time would make no sense if that was the real problem.

Potential memory leak?

On another note: Something like Valgrind but for remotely debugged MCUs would be so nice. If anyone knows tools like that that work over a gdb remote, please let me know.

ObiKeahloa commented 3 years ago

Is the vibration motor causing some kind of limit? Not really sure as it happening only around the third time as @nlfx told is weird.

Is it maybe a glitch in the Watchdog?

Avamander commented 3 years ago

Vibration motor could cause a brownout event when the battery has a very high internal resistance... but someone with that specific watch would have to actually measure and test out this theory. It's really just predicting on tea leaves when the issue isn't reproducible.

nlfx commented 3 years ago

Is the vibration motor causing some kind of limit?

The crash happens without ever activating the motor. As explained / shown in the above bug description, just opening / closing the About menu 3 time and trying to go to its 2nd screen is enough to crash the watch.

It's really just predicting on tea leaves when the issue isn't reproducible.

Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1820e620e42fb717fb5d692d44a3f0f70d or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described.

ObiKeahloa commented 3 years ago

Is the vibration motor causing some kind of limit?

The crash happens without ever activating the motor. As explained / shown in the above bug description, just opening / closing the About menu 3 time and trying to go to its 2nd screen is enough to crash the watch.

It's really just predicting on tea leaves when the issue isn't reproducible.

Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1 or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described.

Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?

nlfx commented 3 years ago

Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?

It also happens on the 4th, 5th, etc. If the Screen1 of the About menu has been opened 3 times or more, the watch crashes when trying to access the Screen2.

ObiKeahloa commented 3 years ago

Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?

It also happens on the 4th, 5th, etc. If the Screen1 of the About menu has been opened 3 times or more, the watch crashes when trying to access the Screen2.

What intrigued me was that it only happens at the third time.

ObiKeahloa commented 3 years ago

I am not sure if this will fix the error or not , can you try replacing the "components/motor/MotorController.cpp" file with this one(Modified it a bit , not sure if it will fix it or not):

MotorController.cpp.txt

Do rename it back to MotorController.cpp from MotorController.cpp.txt

JF002 commented 3 years ago

The behavior you're describing makes me think of a memory fragmentation issue : after a few executions, the memory (from lvgl, freertos,...) is too fragmented and it cannot find a memory segment big enough for its needs.

However, I've just tried to reproduce the issue on the latest commit on develop with no success for now. It doesn't mean that the issue was auto-magically fixed, but it probably "moved" elsewhere in the code and execution paths :/

EDIT : it happened to me too on this last commit : just need to open/close the app multiple time, and then to scroll down to display the next page, it crashes.

JF002 commented 3 years ago

I can easily reproduce this issue : open/close SystemInfo 2-3 times, then, scroll to the 2nd screen of displayInfo aaand... crash!

I spent far too much hours trying to figure out the cause of this crash, with no luck :/ It looks like the stack of the Display task gets corrupted when it crashes and falls into HardFault_handler:

Here's the stacktrace when the screen is displayed correctly:

I've tried many things : analyze heap, stack and freertos heap/stack usage, disable all the code inside CreateScreen1() and CreateScreen2(), logging, breakpoints,.. but I cannot understand what's happening!

ObiKeahloa commented 3 years ago

This sounds really weird , what happens of we were to start removing info from the screen?

JF002 commented 3 years ago

I've tried to remove the content of the info screen completely, it would only display a black screen... and it still crashed!

And I have no clue why its hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is ok, check IRQ priority in FreeRTOS, disable some code until the problem disappears,...

ObiKeahloa commented 3 years ago

I've tried to remove the content of the info screen completely, it would only display a black screen... and it still crashed!

And I have no clue why its hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is ok, check IRQ priority in FreeRTOS, disable some code until the problem disappears,...

Well , Good Luck👍

JF002 commented 3 years ago

This makes no sense:

checkout and build develop (79f0fcb07aa80eb70385223272e29f2ba5657bc8) -> No error
then checkout 9f15fd1 (the commit referenced in this issue) -> No error
then clean the build folder and try again -> crash

This bug seems to be "moving" on each commit : it's present in 9f15fd1 and ff00873f974cb1400acf7721251167634e60107a (branch move-heap-to-static) but not on current develop (79f0fcb07aa80eb70385223272e29f2ba5657bc8) and this random commit in-between : 13e3463276114dff838fc8fe281754eecfbe9538.

I've identified 2 consecutive commits:

crash : 8c3b250dbfe17be61462adc8f84760ce9a648e55
no crash : 13e3463276114dff838fc8fe281754eecfbe9538

All I can say for now is that this issue is not caused by compiler optimizations (-O3) because it also happens in -Og.

Soooo... This bug is probably present for some time, but it probably corrupts memory that is sometimes critical, sometimes not.

I still have no idea how to debug this... any help is welcome!

EDIT: even more weird:

on ff00873f974cb1400acf7721251167634e60107a, add a 50B buffer in Screen -> No crash!


class Screen {
public:
explicit Screen(DisplayApp* app) : app {app} {
std::fill(bumper.begin(), bumper.end(), 0xaa);
}

//... private: std::array<uint8_t, 50> bumper;


PLEAAAASE help! Don't make me commit this! :D

jonvmey commented 3 years ago

Are any bits set in any of the fault status registers when the problem occurs?

Raupinger commented 3 years ago

@jonvmey CFSR is 0x20000 (INVSTATE) when HardFault_Handler is called HFSR is 0x40000000 (FORCED)

JF002 commented 3 years ago

Logs from NRF SDK:

<error> hardfault: HARD FAULT at 0x00010E2E
<error> hardfault:   R0:  0x20010000  R1:  0x40011008  R2:  0xE000EF30  R3:  0x00000010
<error> hardfault:   R12: 0x20001F74  LR:  0x0001113D  PSR: 0x61000000
<error> hardfault: Cause: The processor has attempted to execute an instruction that makes illegal use of the EPSR.

What does it mean? I don't know!

Itai-Nelken commented 3 years ago

I'm pretty sure someone already found this, but does this help (see the answer): https://stackoverflow.com/questions/53253652/debugging-a-hard-fault-in-arm-cortex-m4

JF002 commented 3 years ago

More info from : https://interrupt.memfault.com/blog/cortex-m-fault-debug

(gdb) p/a *(uint32_t[8] *)$msp
$2 = {0x20010000, 0x40011008, 0xe000ef30, 0x10 <_lv_gc_clear_roots+16>, 0x20001f74 <systemTask+1788>, 0x1113d <xPortStartScheduler+156>, 0x10e2e <vPortStartFirstTask+22>, 0x61000000}

(gdb) p/a *(uint32_t[8] *)$psp
$3 = {0x200080e8 <ucHeap+14016>, 0x2000e4ac, 0x2000db48, 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, 0x0 <_lv_gc_clear_roots>, 0xec7 <std::function<std::unique_ptr<Pinetime::Applications::Screens::Screen, std::default_delete<Pinetime::Applications::Screens::Screen> > ()>::operator()() const+12>, 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, 0x0 <_lv_gc_clear_roots>}

So, if we look at MSP (Main Stack Pointer):

R0 = 0x20010000, R1 = 0x40011008, R2 = 0xe000ef30, R3 = 0x10 <_lv_gc_clear_roots+16>, R12 = 0x20001f74 <systemTask+1788>, LR = 0x1113d <xPortStartScheduler+156>, PC = 0x10e2e <vPortStartFirstTask+22>, xPSR = 0x61000000

-> This is the main stack, where the scheduler of FreeRTOS was called. Disassembly:

vPortStartFirstTask():
   10e18:   4805        ldr r0, [pc, #20]   ; (10e30 <vPortStartFirstTask+0x18>)
   10e1a:   6800        ldr r0, [r0, #0]
   10e1c:   f380 8808   msr MSP, r0
   10e20:   b662        cpsie   i
   10e22:   b661        cpsie   f
   10e24:   f3bf 8f4f   dsb sy
   10e28:   f3bf 8f6f   isb sy
   10e2c:   df00        svc 0
   10e2e:   bf00        nop
   10e30:   00000000    .word   0x00000000

And the coresponding instruction:

   10e2e:   bf00        nop

PSP (Process Stack Pointer) :

R0 = 0x200080e8 <ucHeap+14016> R1 = 0x2000e4ac, R2 = 0x2000db48, R3 = 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408> R12 = 0x0 <_lv_gc_clear_roots>, LR = 0xec7 <std::function<std::unique_ptr<Pinetime::Applications::Screens::Screen, std::default_delete > ()>::operator()() const+12>, PC = 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, xPSR = 0x0 <_lv_gc_clear_roots>

According to the map file, 0x2600 is an instruction from Pinetime::Applications::Screens::SystemInfo::CreateScreen2() which is the method called to create the 2nd screen of system info, exactly what I was trying to do at that time. Here is the dissassembly:

Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2468:   e92d 4ff0   stmdb   sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
    246c:   ed2d 8b02   vpush   {d8}
    2470:   b099        sub sp, #100    ; 0x64
    2472:   4681        mov r9, r0
    2474:   460c        mov r4, r1
    2476:   690d        ldr r5, [r1, #16]
_ZNKSt5arrayIiLj5EE4dataEv():
    2478:   1d2b        adds    r3, r5, #4
_ZNKSt5arrayIiLj5EE3endEv():
    247a:   f105 0018   add.w   r0, r5, #24
    247e:   2200        movs    r2, #0
_ZSt10accumulateIPKiiET0_T_S3_S2_():
    2480:   4298        cmp r0, r3
    2482:   d003        beq.n   248c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x24>
    2484:   f853 1b04   ldr.w   r1, [r3], #4
    2488:   440a        add r2, r1
    248a:   e7f9        b.n 2480 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x18>
_ZNK8Pinetime11Controllers10CircBufferILi5EE10GetAverageEv():
    248c:   7e2b        ldrb    r3, [r5, #24]
    248e:   fb92 f3f3   sdiv    r3, r2, r3
    2492:   9314        str r3, [sp, #80]   ; 0x50
_ZNK8Pinetime11Controllers7Battery7VoltageEv():
    2494:   edd5 7a07   vldr    s15, [r5, #28]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2498:   69e3        ldr r3, [r4, #28]
_ZNK8Pinetime7Drivers12WatchdogView11ResetReasonEv():
    249a:   681b        ldr r3, [r3, #0]
_ZNK8Pinetime7Drivers8Watchdog11ResetReasonEv():
    249c:   681b        ldr r3, [r3, #0]
    249e:   2b08        cmp r3, #8
    24a0:   f200 80a8   bhi.w   25f4 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x18c>
    24a4:   e8df f003   tbb [pc, r3]
    24a8:   979405a3    .word   0x979405a3
    24ac:   91a09d9a    .word   0x91a09d9a
    24b0:   a9              .byte   0xa9
    24b1:   00              .byte   0x00
    24b2:   4b53        ldr r3, [pc, #332]  ; (2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x198>)
    24b4:   9315        str r3, [sp, #84]   ; 0x54
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    24b6:   68e3        ldr r3, [r4, #12]
_ZNK8Pinetime11Controllers8DateTime6UptimeEv():
    24b8:   699a        ldr r2, [r3, #24]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    24ba:   4e52        ldr r6, [pc, #328]  ; (2604 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x19c>)
    24bc:   fba6 3602   umull   r3, r6, r6, r2
    24c0:   0c36        lsrs    r6, r6, #16
    24c2:   4b51        ldr r3, [pc, #324]  ; (2608 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a0>)
    24c4:   fb03 2216   mls r2, r3, r6, r2
    24c8:   4f50        ldr r7, [pc, #320]  ; (260c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a4>)
    24ca:   fba7 3702   umull   r3, r7, r7, r2
    24ce:   0aff        lsrs    r7, r7, #11
    24d0:   f44f 6361   mov.w   r3, #3600   ; 0xe10
    24d4:   fb03 2317   mls r3, r3, r7, r2
    24d8:   4d4d        ldr r5, [pc, #308]  ; (2610 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a8>)
    24da:   fba5 2503   umull   r2, r5, r5, r3
    24de:   096d        lsrs    r5, r5, #5
    24e0:   ebc5 1205   rsb r2, r5, r5, lsl #4
    24e4:   eba3 0382   sub.w   r3, r3, r2, lsl #2
    24e8:   9311        str r3, [sp, #68]   ; 0x44
    24ea:   eebc 7ae7   vcvt.u32.f32    s14, s15
    24ee:   ed8d 7a0e   vstr    s14, [sp, #56]  ; 0x38
    24f2:   f89d 2038   ldrb.w  r2, [sp, #56]   ; 0x38
    24f6:   ee08 2a10   vmov    s16, r2
    24fa:   eeb8 7ac8   vcvt.f32.s32    s14, s16
    24fe:   ee77 7ac7   vsub.f32    s15, s15, s14
    2502:   ed9f 7a44   vldr    s14, [pc, #272] ; 2614 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1ac>
    2506:   ee67 7a87   vmul.f32    s15, s15, s14
    250a:   eefc 7ae7   vcvt.u32.f32    s15, s15
    250e:   edcd 7a0e   vstr    s15, [sp, #56]  ; 0x38
    2512:   f89d 2038   ldrb.w  r2, [sp, #56]   ; 0x38
    2516:   920e        str r2, [sp, #56]   ; 0x38
    2518:   f037 f8b6   bl  39688 <lv_disp_get_default>
lv_scr_act():
    251c:   f02c fb0a   bl  2eb34 <lv_disp_get_scr_act>
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2520:   2100        movs    r1, #0
    2522:   f040 f95b   bl  427dc <lv_label_create>
    2526:   4680        mov r8, r0
    2528:   2101        movs    r1, #1
    252a:   f040 f945   bl  427b8 <lv_label_set_recolor>
    252e:   68e3        ldr r3, [r4, #12]
_ZNK8Pinetime11Controllers8DateTime3DayEv():
    2530:   78da        ldrb    r2, [r3, #3]
    2532:   9213        str r2, [sp, #76]   ; 0x4c
_ZNK8Pinetime11Controllers8DateTime5MonthEv():
    2534:   7899        ldrb    r1, [r3, #2]
    2536:   9112        str r1, [sp, #72]   ; 0x48
_ZNK8Pinetime11Controllers8DateTime4YearEv():
    2538:   8818        ldrh    r0, [r3, #0]
    253a:   900f        str r0, [sp, #60]   ; 0x3c
_ZNK8Pinetime11Controllers8DateTime5HoursEv():
    253c:   f893 c005   ldrb.w  ip, [r3, #5]
    2540:   f8cd c040   str.w   ip, [sp, #64]   ; 0x40
_ZNK8Pinetime11Controllers8DateTime7MinutesEv():
    2544:   f893 a006   ldrb.w  sl, [r3, #6]
_ZNK8Pinetime11Controllers8DateTime7SecondsEv():
    2548:   f893 b007   ldrb.w  fp, [r3, #7]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    254c:   6960        ldr r0, [r4, #20]
    254e:   f00b fb01   bl  db54 <Pinetime::Controllers::BrightnessController::ToString()>
    2552:   9b15        ldr r3, [sp, #84]   ; 0x54
    2554:   930c        str r3, [sp, #48]   ; 0x30
    2556:   900b        str r0, [sp, #44]   ; 0x2c
    2558:   980e        ldr r0, [sp, #56]   ; 0x38
    255a:   900a        str r0, [sp, #40]   ; 0x28
    255c:   ed8d 8a09   vstr    s16, [sp, #36]  ; 0x24
    2560:   f89d 3050   ldrb.w  r3, [sp, #80]   ; 0x50
    2564:   9308        str r3, [sp, #32]
    2566:   9911        ldr r1, [sp, #68]   ; 0x44
    2568:   9107        str r1, [sp, #28]
    256a:   9506        str r5, [sp, #24]
    256c:   9705        str r7, [sp, #20]
    256e:   9604        str r6, [sp, #16]
    2570:   f8cd b00c   str.w   fp, [sp, #12]
    2574:   f8cd a008   str.w   sl, [sp, #8]
    2578:   9d10        ldr r5, [sp, #64]   ; 0x40
    257a:   9501        str r5, [sp, #4]
    257c:   980f        ldr r0, [sp, #60]   ; 0x3c
    257e:   9000        str r0, [sp, #0]
    2580:   9b12        ldr r3, [sp, #72]   ; 0x48
    2582:   9a13        ldr r2, [sp, #76]   ; 0x4c
    2584:   4924        ldr r1, [pc, #144]  ; (2618 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b0>)
    2586:   4640        mov r0, r8
    2588:   f040 f88a   bl  426a0 <lv_label_set_text_fmt>
    258c:   f037 f87c   bl  39688 <lv_disp_get_default>
lv_scr_act():
    2590:   f02c fad0   bl  2eb34 <lv_disp_get_scr_act>
    2594:   4601        mov r1, r0
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    2596:   2200        movs    r2, #0
    2598:   9200        str r2, [sp, #0]
    259a:   4613        mov r3, r2
    259c:   4640        mov r0, r8
    259e:   f02d f86d   bl  2f67c <lv_obj_align>
_ZSt11make_uniqueIN8Pinetime12Applications7Screens5LabelEJiiRPNS1_10DisplayAppERP9_lv_obj_tEENSt9_MakeUniqIT_E15__single_objectEDpOT0_():
    25a2:   6865        ldr r5, [r4, #4]
    25a4:   2028        movs    r0, #40 ; 0x28
    25a6:   f044 fd39   bl  4701c <operator new(unsigned int)>
    25aa:   4604        mov r4, r0
    25ac:   f8cd 8000   str.w   r8, [sp]
    25b0:   462b        mov r3, r5
    25b2:   2204        movs    r2, #4
    25b4:   2101        movs    r1, #1
    25b6:   f000 ffb9   bl  352c <Pinetime::Applications::Screens::Label::Label(unsigned char, unsigned char, Pinetime::Applications::DisplayApp*, _lv_obj_t*)>
_ZNSt10_Head_baseILj0EPN8Pinetime12Applications7Screens6ScreenELb0EEC4IRS4_EEOT_():
    25ba:   f8c9 4000   str.w   r4, [r9]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
    25be:   4648        mov r0, r9
    25c0:   b019        add sp, #100    ; 0x64
    25c2:   ecbd 8b02   vpop    {d8}
    25c6:   e8bd 8ff0   ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
operator()():
    25ca:   4b14        ldr r3, [pc, #80]   ; (261c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b4>)
    25cc:   9315        str r3, [sp, #84]   ; 0x54
    25ce:   e772        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25d0:   4b13        ldr r3, [pc, #76]   ; (2620 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b8>)
    25d2:   9315        str r3, [sp, #84]   ; 0x54
    25d4:   e76f        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25d6:   4b13        ldr r3, [pc, #76]   ; (2624 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1bc>)
    25d8:   9315        str r3, [sp, #84]   ; 0x54
    25da:   e76c        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25dc:   4b12        ldr r3, [pc, #72]   ; (2628 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c0>)
    25de:   9315        str r3, [sp, #84]   ; 0x54
    25e0:   e769        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25e2:   4b12        ldr r3, [pc, #72]   ; (262c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c4>)
    25e4:   9315        str r3, [sp, #84]   ; 0x54
    25e6:   e766        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25e8:   4b11        ldr r3, [pc, #68]   ; (2630 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c8>)
    25ea:   9315        str r3, [sp, #84]   ; 0x54
    25ec:   e763        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25ee:   4b11        ldr r3, [pc, #68]   ; (2634 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1cc>)
    25f0:   9315        str r3, [sp, #84]   ; 0x54
    25f2:   e760        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25f4:   4b10        ldr r3, [pc, #64]   ; (2638 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1d0>)
    25f6:   9315        str r3, [sp, #84]   ; 0x54
    25f8:   e75d        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    25fa:   4b10        ldr r3, [pc, #64]   ; (263c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1d4>)
    25fc:   9315        str r3, [sp, #84]   ; 0x54
    25fe:   e75a        b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
    2600:   0004c778    .word   0x0004c778
    2604:   c22e4507    .word   0xc22e4507
    2608:   00015180    .word   0x00015180
    260c:   91a2b3c5    .word   0x91a2b3c5
    2610:   88888889    .word   0x88888889
    2614:   42c80000    .word   0x42c80000
    2618:   0004c780    .word   0x0004c780
    261c:   0004c750    .word   0x0004c750
    2620:   0004c754    .word   0x0004c754
    2624:   0004c75c    .word   0x0004c75c
    2628:   0004c764    .word   0x0004c764
    262c:   0004c768    .word   0x0004c768
    2630:   0004c770    .word   0x0004c770
    2634:   0004c774    .word   0x0004c774
    2638:   0004c744    .word   0x0004c744
    263c:   0004c748    .word   0x0004c748

And the corresponding instruction :

    2600:   0004c778    .word   0x0004c778

What is it?

In my opinion, it makes more sense to analyze the PSP, as the display task is running at that time and PC correspond exactly to what the firmware was supposed to do at that time. Now, why does this instruction at 0x2600 crashes the CPU?

jonvmey commented 3 years ago

0x2600 doesn't contain an instruction it contains data, so it makes sense that executing it would cause a fault. The real question is how is 0x2600 getting into the PC?

JF002 commented 3 years ago

The real question is how is 0x2600 getting into the PC? The answer might be 42 ;-D

Ok, I've dug into freertos, irq priorities, task priorities,... everything looks good to me.

And then, I looked at the code... for the 1000th time... and here's what I found:

DisplayApp::currentScreen points to the current screen, allocated on the heap.
When Settings is running, DisplayApp::currentScreen points to that instance of Settings
Settings has a list of List as member variable
When you tap on an entry from this list, DisplayApp::StartApp() is called
This method destroys DisplayApp::currentScreen to create a new app (the one you selected in the menu)
Wait... let's have a look at the callstack:
- DisplayApp::lv_taskl_handler
- ...
- List::OnButtonEvent()
- DisplayApp::StartApp()
- DisplayAp::LoadApp()
- currentScreen.reset(nullptr);
When currentScreen is reset, the dtor() of the object (Settings, in this case) is called. The dtor() of Settings will destroy all member variables of Settings, List included! It means that the code is destroying the object that called this method... What happens to the stack in this case? Probably nothing good.

You know what's scary? ApplicationList and Settings work this way since 1.0!

jonvmey commented 3 years ago

Oh... Yeah, that's bad. Definitely use-after-free issues in some (all?) of the "nested" application constructs.

Would it make sense for DisplayApp to have something like a stack of applications for situations like this? Then entering a nested application from another would push the new app onto the stack and make it the currentScreen and exiting a nested application would pop it off of the stack and restore the parent (now the top of the stack) to the currentScreen.

JF002 commented 3 years ago

Yes, I think so. This pattern is used by ApplicationList and Settings to load an app when the entry is selected.

We have a design issue here : we have a unique_ptr AND a raw ptr pointing to the app:

DisplayApp maintains the current app in currentScreen (unique_ptr). Ideally, that unique_ptr should be the only ptr to the app.
The ptr to this that is stored in user_data of the button when creating the app. In fact, the app is leaking a ptr to itself in the constructor

In this case, we are calling the app using the raw pointer, it calls DisplayApp which destroys the app using the unique_ptr... It's "the snake that bites its tail"...

JF002 commented 3 years ago

Would it make sense for DisplayApp to have something like a stack of applications for situations like this?

That's an idea. I tried to avoid nested app to keep the memory usage under control and to keep the code simple, but we can think about that.

JF002 commented 3 years ago

Should be fixed in https://github.com/JF002/InfiniTime/pull/415

JF002 commented 3 years ago

For those interested, here is a simple project that reproduces the issue that was fixed:

A.h:

#pragma once
#include <memory>
#include "B.h"

class A {
  public:
    A() {
      ptr = std::make_unique<B>(this);
    }

    void Test() {
      std::cout << "A::Test" << std::endl;
      ptr->Test();
    }

    void Delete() {
      std::cout << "A::Delete()" << std::endl;
      ptr.reset(nullptr);
    }

  private:
    std::unique_ptr<B> ptr;
};

B.h:

#pragma once

class A;
class B {
  public:
    B(A* a);

    ~B();

    void Test();

  private:
    A* ptr;
    int tt = 0;
};

B.cpp:

#include <iostream>
#include "B.h"
#include "A.h"
B::B(A* a) : ptr{a} {

}

void B::Test() {
  std::cout << "B::Test" << std::endl;
  ptr->Delete();

  std::cout << "Is B alive?" << std::endl;
  tt++;
}

B::~B() {
  std::cout << "B::~B() " << std::to_string(tt) << std::endl;
}

main.cpp:

#include <iostream>
#include "A.h"

int main() {
  std::cout << "Hello, World!" << std::endl;

  A a;
  a.Test();
  return 0;
}

Valgrind is not happy with that code:

$ valgrind ./untitled1 
==24823== Memcheck, a memory error detector
==24823== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==24823== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==24823== Command: ./untitled1
==24823== 
Hello, World!
A::Test
B::Test
A::Delete()
B::~B() 0
Is B alive?
==24823== Invalid read of size 4
==24823==    at 0x109A25: B::Test() (B.cpp:13)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Address 0x4d9c0c8 is 8 bytes inside a block of size 16 free'd
==24823==    at 0x48429AB: operator delete(void*, unsigned long) (vg_replace_malloc.c:814)
==24823==    by 0x1096B4: std::default_delete<B>::operator()(B*) const (unique_ptr.h:85)
==24823==    by 0x109770: std::__uniq_ptr_impl<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:182)
==24823==    by 0x109E10: std::unique_ptr<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:456)
==24823==    by 0x109CF6: A::Delete() (A.h:18)
==24823==    by 0x1099F8: B::Test() (B.cpp:10)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Block was alloc'd at
==24823==    at 0x483FF3F: operator new(unsigned long) (vg_replace_malloc.c:417)
==24823==    by 0x109565: std::_MakeUniq<B>::__single_object std::make_unique<B, A*>(A*&&) (unique_ptr.h:962)
==24823==    by 0x10939E: A::A() (A.h:8)
==24823==    by 0x109254: main (main.cpp:7)
==24823== 
==24823== Invalid write of size 4
==24823==    at 0x109A2F: B::Test() (B.cpp:13)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Address 0x4d9c0c8 is 8 bytes inside a block of size 16 free'd
==24823==    at 0x48429AB: operator delete(void*, unsigned long) (vg_replace_malloc.c:814)
==24823==    by 0x1096B4: std::default_delete<B>::operator()(B*) const (unique_ptr.h:85)
==24823==    by 0x109770: std::__uniq_ptr_impl<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:182)
==24823==    by 0x109E10: std::unique_ptr<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:456)
==24823==    by 0x109CF6: A::Delete() (A.h:18)
==24823==    by 0x1099F8: B::Test() (B.cpp:10)
==24823==    by 0x10943D: A::Test() (A.h:13)
==24823==    by 0x109260: main (main.cpp:8)
==24823==  Block was alloc'd at
==24823==    at 0x483FF3F: operator new(unsigned long) (vg_replace_malloc.c:417)
==24823==    by 0x109565: std::_MakeUniq<B>::__single_object std::make_unique<B, A*>(A*&&) (unique_ptr.h:962)
==24823==    by 0x10939E: A::A() (A.h:8)
==24823==    by 0x109254: main (main.cpp:7)
==24823== 
==24823== 
==24823== HEAP SUMMARY:
==24823==     in use at exit: 0 bytes in 0 blocks
==24823==   total heap usage: 3 allocs, 3 frees, 73,744 bytes allocated
==24823== 
==24823== All heap blocks were freed -- no leaks are possible
==24823== 
==24823== For lists of detected and suppressed errors, rerun with: -s
==24823== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)

nlfx commented 3 years ago

@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-)

Avamander commented 3 years ago

I'll use the opportunity to reference https://liberapay.com/JF002, there's a nice way to support the project and its lead maintainer.

JF002 commented 3 years ago

@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-)

Oooh yeah, I spent way too much time on this issue! But I'm happy I found the issue and fixed it! Thanks for your support !

InfiniTimeOrg / InfiniTime

Freeze/crash in About menu since 9f15fd1 #327