Open LudwigKnuepfer opened 10 years ago
just to collect all traces:
Core was generated by `../RIOT/examples/ccn-lite-client/bin/native/ccn-lite-client.elf grid5x5_c1 -t 4'.
Program terminated with signal 8, Arithmetic exception.
#0 0x4a127baf in vfprintf () from /lib/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-20.fc19.i686
(gdb) bt
#0 0x4a127baf in vfprintf () from /lib/libc.so.6
#1 0x4a14e193 in vsnprintf () from /lib/libc.so.6
#2 0x0804a8fc in printf ()
#3 0x0805017f in thread_print_all ()
#4 0x0804f61c in handle_input_line ()
#5 0x0804f51f in shell_run ()
#6 0x0804ef4d in riot_ccn_runner ()
#7 0x0804ef83 in main ()
Oups... little FIX ME FIRST which wasn't fixed... I'll try to look at it asap.
I'd say this is a won't fix and we close the issue. Anyone against?
I'm against closing this one, as this is fixable (it just takes some heavy investigating). Just closing it with wontfix is a very lazy way of dealing with actual problems ;-).
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.
I was about to open an issue when I saw this. On current master (e768a85f62) tests/thread_float
https://github.com/RIOT-OS/RIOT/tree/master/tests/thread_float raises a floating point exception. I'm thinking this is the underlying issue experienced in #15870 and #15878.
My versions:
Operating System Environment
----------------------------
Operating System: "Manjaro Linux"
Kernel: Linux 5.10.19-1-MANJARO x86_64 unknown
System shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)
make's shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)
Installed compiler toolchains
-----------------------------
native gcc: gcc (GCC) 10.2.0
arm-none-eabi-gcc: arm-none-eabi-gcc (Arch Repository) 10.2.0
avr-gcc: missing
mips-mti-elf-gcc: missing
msp430-elf-gcc: missing
riscv-none-elf-gcc: missing
riscv64-unknown-elf-gcc: missing
riscv-none-embed-gcc: missing
xtensa-esp32-elf-gcc: missing
xtensa-esp8266-elf-gcc: missing
clang: clang version 11.1.0
Installed compiler libs
-----------------------
arm-none-eabi-newlib: "4.1.0"
mips-mti-elf-newlib: missing
msp430-elf-newlib: missing
riscv-none-elf-newlib: missing
riscv64-unknown-elf-newlib: missing
riscv-none-embed-newlib: missing
xtensa-esp32-elf-newlib: missing
xtensa-esp8266-elf-newlib: missing
avr-libc: missing (missing)
Installed development tools
---------------------------
ccache: ccache version 4.2
cmake: cmake version 3.19.6
cppcheck: missing
doxygen: 1.9.1
git: git version 2.30.1
make: GNU Make 4.3
openocd: Open On-Chip Debugger 0.10.0
python: Python 3.9.2
python2: missing
python3: Python 3.9.2
flake8: error: /usr/bin/python3: No module named flake8
coccinelle: missing
I think I finally found what's going on with the corrupted FPU registers on native (I'm printing the ftag
before every function call):
Found trace frame 29, tracepoint 2
$162 = "setcontext"
$163 = 0xffff
Found trace frame 30, tracepoint 5
$164 = "native_isr_entry"
$165 = 0xffff
Found trace frame 31, tracepoint 1
$166 = "makecontext"
$167 = 0xffff
Found trace frame 32, tracepoint 6
$168 = void
$169 = 0x1555 <-- _native_sig_leave_tramp
Found trace frame 33, tracepoint 4
$170 = "swapcontext"
$171 = 0x1555 <-- This `swapcontext` saves corrupted FPU state
Found trace frame 34, tracepoint 7
$172 = "native_irq_handler"
$173 = 0xffff
Found trace frame 35, tracepoint 2
$174 = "setcontext"
$175 = 0xffff
Found trace frame 36, tracepoint 1
$176 = "makecontext"
$177 = 0xffff
Found trace frame 37, tracepoint 4
$178 = "swapcontext"
$179 = 0xffff
AFAIU although the Linux Kernel avoids doing FPU operations, it might still perform some. In any case, the Kernel does not care about user space FPU states.
On a "real" native IRQ (not the "RIOT" one we artificially create with makecontext
), the native
process will call a trampoline function (_native_sig_leave_tramp
) to invoke the "RIOT" ISR (native_handle_irq
). At the begininng of _native_sig_leave_tramp
, the FPU regs are already corrupted because the kernel does not restore user space context. But for whatever reason we call swapcontext
there, which successfully copies the corrupted FPU state into one of the thread stacks. At some point, setcontext
or swapcontext
will set the FPU register in a wrong state, which will eventually raise SIGFPE.
IMO the solution is to use setcontext
instead of swapcontext
in the trampoline function. But this would require to keep the state of the last thread somewhere.
MO the solution is to use
setcontext
instead ofswapcontext
in the trampoline function. But this would require to keep the state of the last thread somewhere.
is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?
is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?
The corruption takes place between the end of the signal handler (_native_isr_entry
) and the beginning of the trampoline function (_native_sig_leave_tramp
).
Probably calling getcontext
in the signal handler and setcontext
in the trampoline should do the trick.
hmmmm digging deeper into the topic, it seems the context
arg of the signal handler stores the last valid FPU register state. In this patch they set the FP regs manually from uc_mcontext
. Maybe that's the way to go.
In this patch they set the FP regs manually from
uc_mcontext
.
I think I need more context - how will the LoongArch UAPI help us here? Or do you suggest that this is indeed a bug in the Linux x86 UAPI and the kernel needs to be fixed here?
When the FPU is used when an asynchronous context switch occurs, something goes wrong. Something (as far as quick testing could reveal): either the stack gets corrupted or a floating point exception occurs.
Test case: test_irq with a float instead of an int for the thread local variable (may need a few runs).