apache / nuttx

Apache NuttX is a mature, real-time embedded operating system (RTOS)
https://nuttx.apache.org/
Apache License 2.0

unwind_frame has issues on Arm Cortex-M7 #12687

Closed gneworld closed 3 months ago

gneworld commented 4 months ago

image

During the backtrace process the system freezes, and the anomaly shown in the image above appears: lr is 0xFFFFFFFF and pc is 0xFFFFFFFC.

image

When I add this conditional check, the problem disappears. So is this a real issue, and if not, what should I do to avoid it?

Now let's simplify the problem with the hello app. I find that when it is built with -O3, the hello app freezes during backtrace, but when it is built with -O0 it works (although with a wrong pc value). So does -funwind-tables conflict with -O3 in some cases?

int main(int argc, FAR char *argv[])
{
  _alert("sched_dumpstack ....\n");
  sched_dumpstack(0);
  _alert("sched_dumpstack ok\n");
  return 0;
}

image

acassis commented 4 months ago

@gneworld it was probably caused by some function returning -1 into LR, which would explain this 0xFFFFFFFF value.

Hi @anchao could you please help? I think you added support for it, right?

anjiahao1 commented 4 months ago

@gneworld

image

try it

anjiahao1 commented 4 months ago

Perhaps the above method is not a good one. The reason for this problem is that the idle thread needs to clear lr before calling nx_start; otherwise the unwind backtrace is likely to go wrong, because the value of lr is indeterminate before __start.

gneworld commented 4 months ago

> Perhaps the above method is not a good one. The reason for this problem is that the idle thread needs to clear lr before calling nx_start; otherwise the unwind backtrace is likely to go wrong, because the value of lr is indeterminate before __start.

@anjiahao1 but why does -O0 work correctly, while -O3 does not?

anchao commented 4 months ago

> Hi @anchao could you please help? I think you added support for it, right?

@acassis @anjiahao1 is more familiar with the details of the unwind-table backtrace than I am. @anjiahao1 @gneworld Aren't you guys on the same floor? Why not confirm the issue offline?

anjiahao1 commented 4 months ago

The root cause is that at reset the values of the general-purpose registers are usually not fixed, so lr needs to be set to 0 before calling nx_start, which requires modifying a lot of arch/chip code.
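
To illustrate why a zeroed lr gives the unwinder a clean stopping point, here is a minimal host-side sketch. It is a simulation, not NuttX code: `walk_chain`, `enum walk_result`, and the text-range bounds are hypothetical names introduced for illustration, and the range check is only one possible guard. Each array element stands for the return address recovered while unwinding one frame, and the last element is whatever lr held when the entry function was entered.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical outcome codes for the simulated walk */

enum walk_result
{
  WALK_DONE,     /* Reached a zero return address: clean end of chain */
  WALK_BAD_PC,   /* Return address lies outside the text range */
  WALK_TOO_DEEP  /* Ran out of frames without finding a terminator */
};

/* Walk a simulated chain of saved return addresses.
 *
 * saved_lr[i] is the return address recovered for frame i. A value of 0
 * marks the outermost frame. text_start/text_end are assumed bounds of
 * executable code ([text_start, text_end)). On return, *depth holds the
 * number of frames that were accepted before the walk stopped.
 */

static enum walk_result walk_chain(const uint32_t *saved_lr, size_t n,
                                   uint32_t text_start, uint32_t text_end,
                                   size_t *depth)
{
  size_t i;

  for (i = 0; i < n; i++)
    {
      uint32_t pc = saved_lr[i] & ~UINT32_C(1); /* Drop the Thumb bit */

      if (saved_lr[i] == 0)
        {
          *depth = i;
          return WALK_DONE;
        }

      if (pc < text_start || pc >= text_end)
        {
          *depth = i;
          return WALK_BAD_PC;
        }
    }

  *depth = n;
  return WALK_TOO_DEEP;
}
```

With a chain ending in 0 the walk stops cleanly; with a chain ending in 0xFFFFFFFF it can only stop if some extra guard (here the assumed text-range check) rejects the address.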

anchao commented 4 months ago

  1. This seems to be expected. You need to confirm with the vendor the behavior of RAR (Reset All Registers) when lockstep is disabled, which is fixed at the design phase.
  2. Zephyr does something similar; I think initializing the registers is necessary: https://github.com/zephyrproject-rtos/zephyr/pull/20473

ARM_ECM_0690721_Cortex_M33_DCLS.pdf


anchao commented 4 months ago

https://developer.arm.com/documentation/101773/0001/Functional-Description/CPU?lang=en


acassis commented 4 months ago

> > Hi @anchao could you please help? I think you added support for it, right?
>
> @acassis @anjiahao1 is more familiar with the details of the unwind-table backtrace than I am. @anjiahao1 @gneworld Aren't you guys on the same floor? Why not confirm the issue offline?

@anchao I think it is fine if they can sort it out in the same room, but please do report the details here so that more people can see what the issue was, how the root cause was discovered, and why "that new commit" is the right solution :-)

lywind commented 4 months ago

> 1. This seems to be expected. You need to confirm with the vendor the behavior of RAR (Reset All Registers) when lockstep is disabled, which is fixed at the design phase.
> 2. Zephyr does something similar; I think initializing the registers is necessary: arch: arm: Rewrite Cortex-R reset vector function. zephyrproject-rtos/zephyr#20473
>
> ARM_ECM_0690721_Cortex_M33_DCLS.pdf

In fact, this is a Cortex-M7 MCU, which uses the Armv7E-M architecture, and it does not have LOCKSTEP or RAR configurations. So I think it might not be a DCLS problem.

anchao commented 4 months ago

> In fact, this is a Cortex-M7 MCU, which uses the Armv7E-M architecture, and it does not have LOCKSTEP or RAR configurations. So I think it might not be a DCLS problem.

DCLS is configurable on Cortex-M7. I just suspect that the case they are facing is an issue on the lock-step core.

https://developer.arm.com/Processors/Cortex-M7


lywind commented 4 months ago

> > In fact, this is a Cortex-M7 MCU, which uses the Armv7E-M architecture, and it does not have LOCKSTEP or RAR configurations. So I think it might not be a DCLS problem.
>
> DCLS is configurable on Cortex-M7. I just suspect that the case they are facing is an issue on the lock-step core.
>
> https://developer.arm.com/Processors/Cortex-M7

Double-checked, and it's confirmed that LOCKSTEP and RAR are both enabled. I'm also wondering why the function unwind_find_entry(frame->pc) does not return NULL when frame->pc == 0xFFFFFFFC. Obviously 0xFFFFFFFC is beyond __exidx_end.
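
One possible explanation, sketched below under stated assumptions: an Arm EHABI index table records only function start addresses, not end addresses, so a lookup that picks the last entry whose start is at or below pc will match any address past the final function. Note also that __exidx_end marks the end of the index table itself rather than the end of the code it describes. The code is not the NuttX implementation of unwind_find_entry; `struct idx_entry`, `find_entry`, and `find_entry_checked` are hypothetical, and the entries hold absolute addresses for simplicity, whereas real tables store prel31 offsets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified index entry (assumption: absolute start address instead of
 * the prel31 offset used by real .ARM.exidx tables).
 */

struct idx_entry
{
  uint32_t fn_start;  /* Start address of the function */
  uint32_t content;   /* Unwind instructions or a reference to them */
};

/* Return the last entry whose start address is <= pc, or NULL if pc is
 * below the first entry. The table must be sorted by fn_start.
 */

static const struct idx_entry *find_entry(const struct idx_entry *tab,
                                          size_t n, uint32_t pc)
{
  size_t lo = 0;
  size_t hi = n;

  assert(tab != NULL);

  if (n == 0 || pc < tab[0].fn_start)
    {
      return NULL;
    }

  /* Invariant: tab[lo].fn_start <= pc, and every entry at index >= hi
   * starts above pc.
   */

  while (hi - lo > 1)
    {
      size_t mid = lo + (hi - lo) / 2;

      if (tab[mid].fn_start <= pc)
        {
          lo = mid;
        }
      else
        {
          hi = mid;
        }
    }

  return &tab[lo];
}

/* Hypothetical guarded variant: also reject any pc at or beyond the end
 * of the text region, which the plain lookup cannot detect by itself.
 */

static const struct idx_entry *find_entry_checked(const struct idx_entry *tab,
                                                  size_t n, uint32_t pc,
                                                  uint32_t text_end)
{
  if (pc >= text_end)
    {
      return NULL;
    }

  return find_entry(tab, n, pc);
}
```

In this sketch, a pc of 0xFFFFFFFC resolves to the last table entry with the plain lookup and to NULL with the guarded one, provided text_end is set to the real end of executable code.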

lywind commented 4 months ago

> @gneworld it was probably caused by some function returning -1 into LR, which would explain this 0xFFFFFFFF value.
>
> Hi @anchao could you please help? I think you added support for it, right?

The reason 0xFFFFFFFF is on the stack is that the compiler treats __start as a normal function and pushes LR onto the stack.

image

And when the core boots from reset, the core sets LR to 0xFFFFFFFF.

image

See B1.5.5 of the Arm®v7-M Architecture Reference Manual.

anchao commented 4 months ago

I'm not sure whether the naked attribute could avoid this issue; it ensures that the unwind extab does not contain any push content:

diff --git a/include/nuttx/init.h b/include/nuttx/init.h
index af3dce335f..98b9ba68f8 100644
--- a/include/nuttx/init.h
+++ b/include/nuttx/init.h
@@ -98,7 +98,7 @@ EXTERN uint8_t g_nx_initstate;  /* See enum nx_initstate_e */

 /* OS entry point called by boot logic */

-void nx_start(void);
+void nx_start(void) noreturn_function naked_function;

 #undef EXTERN
 #ifdef __cplusplus

gneworld commented 4 months ago

@anchao Unfortunately, this change cannot solve the problem.

anjiahao1 commented 3 months ago

Hello all, I created a PR to fix it: https://github.com/apache/nuttx/pull/12787