Closed nlfx closed 3 years ago
What exactly have you changed in the commit? My first response too was "memory allocation maybe?" but there seems to be enough memory for doing a simple vibration. It might be something else triggering this maybe the driver?
What exactly have you changed in the commit?
No change, I just did a git checkout --recurse-submodules 9f15fd1820e620e42fb717fb5d692d44a3f0f70d
, compiled it and flashed it.
My first response too was "memory allocation maybe?" but there seems to be enough memory for doing a simple vibration.
What is weird, is that the the crash happens on the About menu (SystemInfo.cpp) which has nothing to do with the motor controller change in this commit.
So indeed, I also initially though that the problem might be that the memory for the stack was at its limit and the additional motorController
argument passed down the call stack was enough to start overwriting other memory region. But the fact that the crash happen only when accessing the About menu for the 3rd time would make no sense if that was the real problem. Unless I'm missing something?
So then I though that it might be a memory leak or memory corruption in the LVGL, and that's why opening the same menu multiple times resulted in a crash. But then it makes no sense that I cannot reproduce this crash on the previous commit (d141888b828d9a47bbcdee86c3c22f1a6d415c56) which contains exactly the same LVGL version and same menus.
So without SWD access, I really don't think I can solve this issue. If no one has time to test this, I'll see if I can re-order a dev kit when they'll be back in stock.
@JF002
But the fact that the crash happen only when accessing the About menu for the 3rd time would make no sense if that was the real problem.
Potential memory leak?
On another note: Something like Valgrind but for remotely debugged MCUs would be so nice. If anyone knows tools like that that work over a gdb remote, please let me know.
Is the vibration motor causing some kind of limit? Not really sure as it happening only around the third time as @nlfx told is weird.
Is it maybe a glitch in the Watchdog?
Vibration motor could cause a brownout event when the battery has a very high internal resistance... but someone with that specific watch would have to actually measure and test out this theory. It's really just predicting on tea leaves when the issue isn't reproducible.
Is the vibration motor causing some kind of limit?
The crash happens without ever activating the motor. As explained / shown in the above bug description, just opening / closing the About menu 3 time and trying to go to its 2nd screen is enough to crash the watch.
It's really just predicting on tea leaves when the issue isn't reproducible.
Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1820e620e42fb717fb5d692d44a3f0f70d or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described.
Is the vibration motor causing some kind of limit?
The crash happens without ever activating the motor. As explained / shown in the above bug description, just opening / closing the About menu 3 time and trying to go to its 2nd screen is enough to crash the watch.
It's really just predicting on tea leaves when the issue isn't reproducible.
Is it not? Did anyone try to reproduce the steps I've described in my bug report with commit 9f15fd1 or any commit more recent? On my watch it reliably crashes 100% of the time if I follow the steps I've described.
Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?
Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?
It also happens on the 4th, 5th, etc. If the Screen1 of the About menu has been opened 3 times or more, the watch crashes when trying to access the Screen2.
Oh thanks for explaining , this seems really weird and the fact that it only happens after three tries (On the third one) , can you try this again but try to access the second menu on the fourth try instead of the third?
It also happens on the 4th, 5th, etc. If the Screen1 of the About menu has been opened 3 times or more, the watch crashes when trying to access the Screen2.
What intrigued me was that it only happens at the third time.
I am not sure if this will fix the error or not , can you try replacing the "components/motor/MotorController.cpp" file with this one(Modified it a bit , not sure if it will fix it or not):
Do rename it back to MotorController.cpp from MotorController.cpp.txt
The behavior you're describing makes me think of a memory fragmentation issue : after a few executions, the memory (from lvgl, freertos,...) is too fragmented and it cannot find a memory segment big enough for its needs.
However, I've just tried to reproduce the issue on the latest commit on develop with no success for now. It doesn't mean that the issue was auto-magically fixed, but it probably "moved" elsewhere in the code and execution paths :/
EDIT : it happened to me too on this last commit : just need to open/close the app multiple time, and then to scroll down to display the next page, it crashes.
I can easily reproduce this issue : open/close SystemInfo 2-3 times, then, scroll to the 2nd screen of displayInfo aaand... crash!
I spent far too much hours trying to figure out the cause of this crash, with no luck :/ It looks like the stack of the Display task gets corrupted when it crashes and falls into HardFault_handler:
Here's the stacktrace when the screen is displayed correctly:
I've tried many things : analyze heap, stack and freertos heap/stack usage, disable all the code inside CreateScreen1() and CreateScreen2(), logging, breakpoints,.. but I cannot understand what's happening!
This sounds really weird , what happens of we were to start removing info from the screen?
I've tried to remove the content of the info screen completely, it would only display a black screen... and it still crashed!
And I have no clue why its hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is ok, check IRQ priority in FreeRTOS, disable some code until the problem disappears,...
I've tried to remove the content of the info screen completely, it would only display a black screen... and it still crashed!
And I have no clue why its hardfaulting that way. Next experiments: run the whole Screen/ScreenList/Settings/SystemInfo classes on a computer to check with memcheck that everything is ok, check IRQ priority in FreeRTOS, disable some code until the problem disappears,...
Well , Good Luck👍
This makes no sense:
develop
(79f0fcb07aa80eb70385223272e29f2ba5657bc8) -> No errorThis bug seems to be "moving" on each commit : it's present in 9f15fd1 and ff00873f974cb1400acf7721251167634e60107a (branch move-heap-to-static) but not on current develop (79f0fcb07aa80eb70385223272e29f2ba5657bc8) and this random commit in-between : 13e3463276114dff838fc8fe281754eecfbe9538.
I've identified 2 consecutive commits:
All I can say for now is that this issue is not caused by compiler optimizations (-O3) because it also happens in -Og.
Soooo... This bug is probably present for some time, but it probably corrupts memory that is sometimes critical, sometimes not.
I still have no idea how to debug this... any help is welcome!
EDIT: even more weird:
class Screen {
public:
explicit Screen(DisplayApp* app) : app {app} {
std::fill(bumper.begin(), bumper.end(), 0xaa);
}
//... private: std::array<uint8_t, 50> bumper;
PLEAAAASE help! Don't make me commit this! :D
Are any bits set in any of the fault status registers when the problem occurs?
@jonvmey CFSR is 0x20000 (INVSTATE) when HardFault_Handler is called HFSR is 0x40000000 (FORCED)
Logs from NRF SDK:
<error> hardfault: HARD FAULT at 0x00010E2E
<error> hardfault: R0: 0x20010000 R1: 0x40011008 R2: 0xE000EF30 R3: 0x00000010
<error> hardfault: R12: 0x20001F74 LR: 0x0001113D PSR: 0x61000000
<error> hardfault: Cause: The processor has attempted to execute an instruction that makes illegal use of the EPSR.
What does it mean? I don't know!
I'm pretty sure someone already found this, but does this help (see the answer): https://stackoverflow.com/questions/53253652/debugging-a-hard-fault-in-arm-cortex-m4
More info from : https://interrupt.memfault.com/blog/cortex-m-fault-debug
(gdb) p/a *(uint32_t[8] *)$msp
$2 = {0x20010000, 0x40011008, 0xe000ef30, 0x10 <_lv_gc_clear_roots+16>, 0x20001f74 <systemTask+1788>, 0x1113d <xPortStartScheduler+156>, 0x10e2e <vPortStartFirstTask+22>, 0x61000000}
(gdb) p/a *(uint32_t[8] *)$psp
$3 = {0x200080e8 <ucHeap+14016>, 0x2000e4ac, 0x2000db48, 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, 0x0 <_lv_gc_clear_roots>, 0xec7 <std::function<std::unique_ptr<Pinetime::Applications::Screens::Screen, std::default_delete<Pinetime::Applications::Screens::Screen> > ()>::operator()() const+12>, 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>, 0x0 <_lv_gc_clear_roots>}
So, if we look at MSP (Main Stack Pointer):
R0 = 0x20010000, R1 = 0x40011008, R2 = 0xe000ef30, R3 = 0x10 <_lv_gc_clear_roots+16>, R12 = 0x20001f74 <systemTask+1788>, LR = 0x1113d <xPortStartScheduler+156>, PC = 0x10e2e <vPortStartFirstTask+22>, xPSR = 0x61000000
-> This is the main stack, where the scheduler of FreeRTOS was called. Disassembly:
vPortStartFirstTask():
10e18: 4805 ldr r0, [pc, #20] ; (10e30 <vPortStartFirstTask+0x18>)
10e1a: 6800 ldr r0, [r0, #0]
10e1c: f380 8808 msr MSP, r0
10e20: b662 cpsie i
10e22: b661 cpsie f
10e24: f3bf 8f4f dsb sy
10e28: f3bf 8f6f isb sy
10e2c: df00 svc 0
10e2e: bf00 nop
10e30: 00000000 .word 0x00000000
And the coresponding instruction:
10e2e: bf00 nop
PSP (Process Stack Pointer) :
R0 = 0x200080e8 <ucHeap+14016>
R1 = 0x2000e4ac,
R2 = 0x2000db48,
R3 = 0x2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+408>
R12 = 0x0 <_lv_gc_clear_roots>,
LR = 0xec7 <std::function<std::unique_ptr<Pinetime::Applications::Screens::Screen, std::default_delete
According to the map file, 0x2600 is an instruction from Pinetime::Applications::Screens::SystemInfo::CreateScreen2()
which is the method called to create the 2nd screen of system info, exactly what I was trying to do at that time. Here is the dissassembly:
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
2468: e92d 4ff0 stmdb sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
246c: ed2d 8b02 vpush {d8}
2470: b099 sub sp, #100 ; 0x64
2472: 4681 mov r9, r0
2474: 460c mov r4, r1
2476: 690d ldr r5, [r1, #16]
_ZNKSt5arrayIiLj5EE4dataEv():
2478: 1d2b adds r3, r5, #4
_ZNKSt5arrayIiLj5EE3endEv():
247a: f105 0018 add.w r0, r5, #24
247e: 2200 movs r2, #0
_ZSt10accumulateIPKiiET0_T_S3_S2_():
2480: 4298 cmp r0, r3
2482: d003 beq.n 248c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x24>
2484: f853 1b04 ldr.w r1, [r3], #4
2488: 440a add r2, r1
248a: e7f9 b.n 2480 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x18>
_ZNK8Pinetime11Controllers10CircBufferILi5EE10GetAverageEv():
248c: 7e2b ldrb r3, [r5, #24]
248e: fb92 f3f3 sdiv r3, r2, r3
2492: 9314 str r3, [sp, #80] ; 0x50
_ZNK8Pinetime11Controllers7Battery7VoltageEv():
2494: edd5 7a07 vldr s15, [r5, #28]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
2498: 69e3 ldr r3, [r4, #28]
_ZNK8Pinetime7Drivers12WatchdogView11ResetReasonEv():
249a: 681b ldr r3, [r3, #0]
_ZNK8Pinetime7Drivers8Watchdog11ResetReasonEv():
249c: 681b ldr r3, [r3, #0]
249e: 2b08 cmp r3, #8
24a0: f200 80a8 bhi.w 25f4 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x18c>
24a4: e8df f003 tbb [pc, r3]
24a8: 979405a3 .word 0x979405a3
24ac: 91a09d9a .word 0x91a09d9a
24b0: a9 .byte 0xa9
24b1: 00 .byte 0x00
24b2: 4b53 ldr r3, [pc, #332] ; (2600 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x198>)
24b4: 9315 str r3, [sp, #84] ; 0x54
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
24b6: 68e3 ldr r3, [r4, #12]
_ZNK8Pinetime11Controllers8DateTime6UptimeEv():
24b8: 699a ldr r2, [r3, #24]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
24ba: 4e52 ldr r6, [pc, #328] ; (2604 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x19c>)
24bc: fba6 3602 umull r3, r6, r6, r2
24c0: 0c36 lsrs r6, r6, #16
24c2: 4b51 ldr r3, [pc, #324] ; (2608 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a0>)
24c4: fb03 2216 mls r2, r3, r6, r2
24c8: 4f50 ldr r7, [pc, #320] ; (260c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a4>)
24ca: fba7 3702 umull r3, r7, r7, r2
24ce: 0aff lsrs r7, r7, #11
24d0: f44f 6361 mov.w r3, #3600 ; 0xe10
24d4: fb03 2317 mls r3, r3, r7, r2
24d8: 4d4d ldr r5, [pc, #308] ; (2610 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1a8>)
24da: fba5 2503 umull r2, r5, r5, r3
24de: 096d lsrs r5, r5, #5
24e0: ebc5 1205 rsb r2, r5, r5, lsl #4
24e4: eba3 0382 sub.w r3, r3, r2, lsl #2
24e8: 9311 str r3, [sp, #68] ; 0x44
24ea: eebc 7ae7 vcvt.u32.f32 s14, s15
24ee: ed8d 7a0e vstr s14, [sp, #56] ; 0x38
24f2: f89d 2038 ldrb.w r2, [sp, #56] ; 0x38
24f6: ee08 2a10 vmov s16, r2
24fa: eeb8 7ac8 vcvt.f32.s32 s14, s16
24fe: ee77 7ac7 vsub.f32 s15, s15, s14
2502: ed9f 7a44 vldr s14, [pc, #272] ; 2614 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1ac>
2506: ee67 7a87 vmul.f32 s15, s15, s14
250a: eefc 7ae7 vcvt.u32.f32 s15, s15
250e: edcd 7a0e vstr s15, [sp, #56] ; 0x38
2512: f89d 2038 ldrb.w r2, [sp, #56] ; 0x38
2516: 920e str r2, [sp, #56] ; 0x38
2518: f037 f8b6 bl 39688 <lv_disp_get_default>
lv_scr_act():
251c: f02c fb0a bl 2eb34 <lv_disp_get_scr_act>
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
2520: 2100 movs r1, #0
2522: f040 f95b bl 427dc <lv_label_create>
2526: 4680 mov r8, r0
2528: 2101 movs r1, #1
252a: f040 f945 bl 427b8 <lv_label_set_recolor>
252e: 68e3 ldr r3, [r4, #12]
_ZNK8Pinetime11Controllers8DateTime3DayEv():
2530: 78da ldrb r2, [r3, #3]
2532: 9213 str r2, [sp, #76] ; 0x4c
_ZNK8Pinetime11Controllers8DateTime5MonthEv():
2534: 7899 ldrb r1, [r3, #2]
2536: 9112 str r1, [sp, #72] ; 0x48
_ZNK8Pinetime11Controllers8DateTime4YearEv():
2538: 8818 ldrh r0, [r3, #0]
253a: 900f str r0, [sp, #60] ; 0x3c
_ZNK8Pinetime11Controllers8DateTime5HoursEv():
253c: f893 c005 ldrb.w ip, [r3, #5]
2540: f8cd c040 str.w ip, [sp, #64] ; 0x40
_ZNK8Pinetime11Controllers8DateTime7MinutesEv():
2544: f893 a006 ldrb.w sl, [r3, #6]
_ZNK8Pinetime11Controllers8DateTime7SecondsEv():
2548: f893 b007 ldrb.w fp, [r3, #7]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
254c: 6960 ldr r0, [r4, #20]
254e: f00b fb01 bl db54 <Pinetime::Controllers::BrightnessController::ToString()>
2552: 9b15 ldr r3, [sp, #84] ; 0x54
2554: 930c str r3, [sp, #48] ; 0x30
2556: 900b str r0, [sp, #44] ; 0x2c
2558: 980e ldr r0, [sp, #56] ; 0x38
255a: 900a str r0, [sp, #40] ; 0x28
255c: ed8d 8a09 vstr s16, [sp, #36] ; 0x24
2560: f89d 3050 ldrb.w r3, [sp, #80] ; 0x50
2564: 9308 str r3, [sp, #32]
2566: 9911 ldr r1, [sp, #68] ; 0x44
2568: 9107 str r1, [sp, #28]
256a: 9506 str r5, [sp, #24]
256c: 9705 str r7, [sp, #20]
256e: 9604 str r6, [sp, #16]
2570: f8cd b00c str.w fp, [sp, #12]
2574: f8cd a008 str.w sl, [sp, #8]
2578: 9d10 ldr r5, [sp, #64] ; 0x40
257a: 9501 str r5, [sp, #4]
257c: 980f ldr r0, [sp, #60] ; 0x3c
257e: 9000 str r0, [sp, #0]
2580: 9b12 ldr r3, [sp, #72] ; 0x48
2582: 9a13 ldr r2, [sp, #76] ; 0x4c
2584: 4924 ldr r1, [pc, #144] ; (2618 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b0>)
2586: 4640 mov r0, r8
2588: f040 f88a bl 426a0 <lv_label_set_text_fmt>
258c: f037 f87c bl 39688 <lv_disp_get_default>
lv_scr_act():
2590: f02c fad0 bl 2eb34 <lv_disp_get_scr_act>
2594: 4601 mov r1, r0
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
2596: 2200 movs r2, #0
2598: 9200 str r2, [sp, #0]
259a: 4613 mov r3, r2
259c: 4640 mov r0, r8
259e: f02d f86d bl 2f67c <lv_obj_align>
_ZSt11make_uniqueIN8Pinetime12Applications7Screens5LabelEJiiRPNS1_10DisplayAppERP9_lv_obj_tEENSt9_MakeUniqIT_E15__single_objectEDpOT0_():
25a2: 6865 ldr r5, [r4, #4]
25a4: 2028 movs r0, #40 ; 0x28
25a6: f044 fd39 bl 4701c <operator new(unsigned int)>
25aa: 4604 mov r4, r0
25ac: f8cd 8000 str.w r8, [sp]
25b0: 462b mov r3, r5
25b2: 2204 movs r2, #4
25b4: 2101 movs r1, #1
25b6: f000 ffb9 bl 352c <Pinetime::Applications::Screens::Label::Label(unsigned char, unsigned char, Pinetime::Applications::DisplayApp*, _lv_obj_t*)>
_ZNSt10_Head_baseILj0EPN8Pinetime12Applications7Screens6ScreenELb0EEC4IRS4_EEOT_():
25ba: f8c9 4000 str.w r4, [r9]
Pinetime::Applications::Screens::SystemInfo::CreateScreen2():
25be: 4648 mov r0, r9
25c0: b019 add sp, #100 ; 0x64
25c2: ecbd 8b02 vpop {d8}
25c6: e8bd 8ff0 ldmia.w sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
operator()():
25ca: 4b14 ldr r3, [pc, #80] ; (261c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b4>)
25cc: 9315 str r3, [sp, #84] ; 0x54
25ce: e772 b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25d0: 4b13 ldr r3, [pc, #76] ; (2620 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1b8>)
25d2: 9315 str r3, [sp, #84] ; 0x54
25d4: e76f b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25d6: 4b13 ldr r3, [pc, #76] ; (2624 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1bc>)
25d8: 9315 str r3, [sp, #84] ; 0x54
25da: e76c b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25dc: 4b12 ldr r3, [pc, #72] ; (2628 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c0>)
25de: 9315 str r3, [sp, #84] ; 0x54
25e0: e769 b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25e2: 4b12 ldr r3, [pc, #72] ; (262c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c4>)
25e4: 9315 str r3, [sp, #84] ; 0x54
25e6: e766 b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25e8: 4b11 ldr r3, [pc, #68] ; (2630 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1c8>)
25ea: 9315 str r3, [sp, #84] ; 0x54
25ec: e763 b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25ee: 4b11 ldr r3, [pc, #68] ; (2634 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1cc>)
25f0: 9315 str r3, [sp, #84] ; 0x54
25f2: e760 b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25f4: 4b10 ldr r3, [pc, #64] ; (2638 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1d0>)
25f6: 9315 str r3, [sp, #84] ; 0x54
25f8: e75d b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
25fa: 4b10 ldr r3, [pc, #64] ; (263c <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x1d4>)
25fc: 9315 str r3, [sp, #84] ; 0x54
25fe: e75a b.n 24b6 <Pinetime::Applications::Screens::SystemInfo::CreateScreen2()+0x4e>
2600: 0004c778 .word 0x0004c778
2604: c22e4507 .word 0xc22e4507
2608: 00015180 .word 0x00015180
260c: 91a2b3c5 .word 0x91a2b3c5
2610: 88888889 .word 0x88888889
2614: 42c80000 .word 0x42c80000
2618: 0004c780 .word 0x0004c780
261c: 0004c750 .word 0x0004c750
2620: 0004c754 .word 0x0004c754
2624: 0004c75c .word 0x0004c75c
2628: 0004c764 .word 0x0004c764
262c: 0004c768 .word 0x0004c768
2630: 0004c770 .word 0x0004c770
2634: 0004c774 .word 0x0004c774
2638: 0004c744 .word 0x0004c744
263c: 0004c748 .word 0x0004c748
And the corresponding instruction :
2600: 0004c778 .word 0x0004c778
What is it?
In my opinion, it makes more sense to analyze the PSP, as the display task is running at that time and PC correspond exactly to what the firmware was supposed to do at that time. Now, why does this instruction at 0x2600 crashes the CPU?
0x2600 doesn't contain an instruction it contains data, so it makes sense that executing it would cause a fault. The real question is how is 0x2600 getting into the PC?
The real question is how is 0x2600 getting into the PC? The answer might be 42 ;-D
Ok, I've dug into freertos, irq priorities, task priorities,... everything looks good to me.
And then, I looked at the code... for the 1000th time... and here's what I found:
DisplayApp::currentScreen
points to the current screen, allocated on the heap.DisplayApp::currentScreen
points to that instance of Settings
List
as member variableDisplayApp::StartApp()
is calledDisplayApp::currentScreen
to create a new app (the one you selected in the menu)DisplayApp::lv_taskl_handler
List::OnButtonEvent()
DisplayApp::StartApp()
DisplayAp::LoadApp()
currentScreen.reset(nullptr);
You know what's scary? ApplicationList and Settings work this way since 1.0!
Oh... Yeah, that's bad. Definitely use-after-free issues in some (all?) of the "nested" application constructs.
Would it make sense for DisplayApp
to have something like a stack of applications for situations like this? Then entering a nested application from another would push the new app onto the stack and make it the currentScreen
and exiting a nested application would pop it off of the stack and restore the parent (now the top of the stack) to the currentScreen
.
Yes, I think so. This pattern is used by ApplicationList and Settings to load an app when the entry is selected.
We have a design issue here : we have a unique_ptr AND a raw ptr pointing to the app:
this
that is stored in user_data
of the button when creating the app. In fact, the app is leaking a ptr to itself in the constructorIn this case, we are calling the app using the raw pointer, it calls DisplayApp which destroys the app using the unique_ptr... It's "the snake that bites its tail"...
Would it make sense for DisplayApp to have something like a stack of applications for situations like this?
That's an idea. I tried to avoid nested app to keep the memory usage under control and to keep the code simple, but we can think about that.
Should be fixed in https://github.com/JF002/InfiniTime/pull/415
For those interested, here is a simple project that reproduces the issue that was fixed:
A.h:
#pragma once
#include <memory>
#include "B.h"
class A {
public:
A() {
ptr = std::make_unique<B>(this);
}
void Test() {
std::cout << "A::Test" << std::endl;
ptr->Test();
}
void Delete() {
std::cout << "A::Delete()" << std::endl;
ptr.reset(nullptr);
}
private:
std::unique_ptr<B> ptr;
};
B.h:
#pragma once
class A;
class B {
public:
B(A* a);
~B();
void Test();
private:
A* ptr;
int tt = 0;
};
B.cpp:
#include <iostream>
#include "B.h"
#include "A.h"
B::B(A* a) : ptr{a} {
}
void B::Test() {
std::cout << "B::Test" << std::endl;
ptr->Delete();
std::cout << "Is B alive?" << std::endl;
tt++;
}
B::~B() {
std::cout << "B::~B() " << std::to_string(tt) << std::endl;
}
main.cpp:
#include <iostream>
#include "A.h"
int main() {
std::cout << "Hello, World!" << std::endl;
A a;
a.Test();
return 0;
}
Valgrind is not happy with that code:
$ valgrind ./untitled1
==24823== Memcheck, a memory error detector
==24823== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==24823== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==24823== Command: ./untitled1
==24823==
Hello, World!
A::Test
B::Test
A::Delete()
B::~B() 0
Is B alive?
==24823== Invalid read of size 4
==24823== at 0x109A25: B::Test() (B.cpp:13)
==24823== by 0x10943D: A::Test() (A.h:13)
==24823== by 0x109260: main (main.cpp:8)
==24823== Address 0x4d9c0c8 is 8 bytes inside a block of size 16 free'd
==24823== at 0x48429AB: operator delete(void*, unsigned long) (vg_replace_malloc.c:814)
==24823== by 0x1096B4: std::default_delete<B>::operator()(B*) const (unique_ptr.h:85)
==24823== by 0x109770: std::__uniq_ptr_impl<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:182)
==24823== by 0x109E10: std::unique_ptr<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:456)
==24823== by 0x109CF6: A::Delete() (A.h:18)
==24823== by 0x1099F8: B::Test() (B.cpp:10)
==24823== by 0x10943D: A::Test() (A.h:13)
==24823== by 0x109260: main (main.cpp:8)
==24823== Block was alloc'd at
==24823== at 0x483FF3F: operator new(unsigned long) (vg_replace_malloc.c:417)
==24823== by 0x109565: std::_MakeUniq<B>::__single_object std::make_unique<B, A*>(A*&&) (unique_ptr.h:962)
==24823== by 0x10939E: A::A() (A.h:8)
==24823== by 0x109254: main (main.cpp:7)
==24823==
==24823== Invalid write of size 4
==24823== at 0x109A2F: B::Test() (B.cpp:13)
==24823== by 0x10943D: A::Test() (A.h:13)
==24823== by 0x109260: main (main.cpp:8)
==24823== Address 0x4d9c0c8 is 8 bytes inside a block of size 16 free'd
==24823== at 0x48429AB: operator delete(void*, unsigned long) (vg_replace_malloc.c:814)
==24823== by 0x1096B4: std::default_delete<B>::operator()(B*) const (unique_ptr.h:85)
==24823== by 0x109770: std::__uniq_ptr_impl<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:182)
==24823== by 0x109E10: std::unique_ptr<B, std::default_delete<B> >::reset(B*) (unique_ptr.h:456)
==24823== by 0x109CF6: A::Delete() (A.h:18)
==24823== by 0x1099F8: B::Test() (B.cpp:10)
==24823== by 0x10943D: A::Test() (A.h:13)
==24823== by 0x109260: main (main.cpp:8)
==24823== Block was alloc'd at
==24823== at 0x483FF3F: operator new(unsigned long) (vg_replace_malloc.c:417)
==24823== by 0x109565: std::_MakeUniq<B>::__single_object std::make_unique<B, A*>(A*&&) (unique_ptr.h:962)
==24823== by 0x10939E: A::A() (A.h:8)
==24823== by 0x109254: main (main.cpp:7)
==24823==
==24823==
==24823== HEAP SUMMARY:
==24823== in use at exit: 0 bytes in 0 blocks
==24823== total heap usage: 3 allocs, 3 frees, 73,744 bytes allocated
==24823==
==24823== All heap blocks were freed -- no leaks are possible
==24823==
==24823== For lists of detected and suppressed errors, rerun with: -s
==24823== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-)
I'll use the opportunity to reference https://liberapay.com/JF002, there's a nice way to support the project and its lead maintainer.
@JF002 Wow, that was some nasty bug, and must have been a pain to troubleshoot. Thank you so much for tracking this down, finding the root cause, and fixing it :-)
Oooh yeah, I spent way too much time on this issue! But I'm happy I found the issue and fixed it! Thanks for your support !
I've had frequent crashes since yesterday merges. Here is how to reliably reproduce the issue:
https://user-images.githubusercontent.com/69880003/116891847-b9bb7a80-ac1e-11eb-9d82-254a69eaa988.mp4
I've bisected the problem to commit 9f15fd1820e620e42fb717fb5d692d44a3f0f70d. However given its content, I don't understand how it could create this crash. Maybe it's a problem with some completely unrelated part of the code which only becomes visible after this commit due to some timing or memory position constraint?
Could someone with a dev kit and SWD access please see if they manage to get more info about what is creating the above crash?
In case this bug is memory location dependent and strings with file path in the firmware change the behavior, here is my version of the above commit.
Thanks a lot for your help!