RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1
4.88k stars 1.98k forks source link

native not float safe #495

Open LudwigKnuepfer opened 10 years ago

LudwigKnuepfer commented 10 years ago

When the FPU is used when an asynchronous context switch occurs, something goes wrong. Something (as far as quick testing could reveal): either the stack gets corrupted or a floating point exception occurs.

Test case: test_irq with a float instead of an int for the thread local variable (may need a few runs).

mehlis commented 10 years ago

just to collect all traces:

Core was generated by `../RIOT/examples/ccn-lite-client/bin/native/ccn-lite-client.elf grid5x5_c1 -t 4'.
Program terminated with signal 8, Arithmetic exception.
#0  0x4a127baf in vfprintf () from /lib/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-20.fc19.i686
(gdb) bt
#0  0x4a127baf in vfprintf () from /lib/libc.so.6
#1  0x4a14e193 in vsnprintf () from /lib/libc.so.6
#2  0x0804a8fc in printf ()
#3  0x0805017f in thread_print_all ()
#4  0x0804f61c in handle_input_line ()
#5  0x0804f51f in shell_run ()
#6  0x0804ef4d in riot_ccn_runner ()
#7  0x0804ef83 in main ()
kYc0o commented 8 years ago

Oups... little FIX ME FIRST which wasn't fixed... I'll try to look at it asap.

kYc0o commented 6 years ago

I'd say this is a won't fix and we close the issue. Anyone against?

miri64 commented 6 years ago

I'm against closing this one, as this is fixable (it just takes some heavy investigating). Just closing it with wontfix is a very lazy way of dealing with actual problems ;-).

stale[bot] commented 5 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

miri64 commented 5 years ago

See https://github.com/RIOT-OS/RIOT/issues/495#issuecomment-405560534 and also https://github.com/RIOT-OS/RIOT/pull/11921#issuecomment-515882663 ;-)

leandrolanzieri commented 3 years ago

I was about to open an issue when I saw this. On current master (e768a85f62) tests/thread_floathttps://github.com/RIOT-OS/RIOT/tree/master/tests/thread_float raises a floating point exception. I'm thinking this is the underlying issue experienced in #15870 and #15878.

My versions:

Operating System Environment
----------------------------
         Operating System: "Manjaro Linux" 
                   Kernel: Linux 5.10.19-1-MANJARO x86_64 unknown
             System shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)
             make's shell: GNU bash, version 5.1.0(1)-release (x86_64-pc-linux-gnu)

Installed compiler toolchains
-----------------------------
               native gcc: gcc (GCC) 10.2.0
        arm-none-eabi-gcc: arm-none-eabi-gcc (Arch Repository) 10.2.0
                  avr-gcc: missing
         mips-mti-elf-gcc: missing
           msp430-elf-gcc: missing
       riscv-none-elf-gcc: missing
  riscv64-unknown-elf-gcc: missing
     riscv-none-embed-gcc: missing
     xtensa-esp32-elf-gcc: missing
   xtensa-esp8266-elf-gcc: missing
                    clang: clang version 11.1.0

Installed compiler libs
-----------------------
     arm-none-eabi-newlib: "4.1.0"
      mips-mti-elf-newlib: missing
        msp430-elf-newlib: missing
    riscv-none-elf-newlib: missing
riscv64-unknown-elf-newlib: missing
  riscv-none-embed-newlib: missing
  xtensa-esp32-elf-newlib: missing
xtensa-esp8266-elf-newlib: missing
                 avr-libc: missing (missing)

Installed development tools
---------------------------
                   ccache: ccache version 4.2
                    cmake: cmake version 3.19.6
                 cppcheck: missing
                  doxygen: 1.9.1
                      git: git version 2.30.1
                     make: GNU Make 4.3
                  openocd: Open On-Chip Debugger 0.10.0
                   python: Python 3.9.2
                  python2: missing
                  python3: Python 3.9.2
                   flake8: error: /usr/bin/python3: No module named flake8
               coccinelle: missing
jia200x commented 1 year ago

I think I finally found what's going on with the corrupted FPU registers on native (I'm printing the ftag before every function call):

Found trace frame 29, tracepoint 2
$162 = "setcontext"
$163 = 0xffff
Found trace frame 30, tracepoint 5
$164 = "native_isr_entry"
$165 = 0xffff
Found trace frame 31, tracepoint 1
$166 = "makecontext"
$167 = 0xffff
Found trace frame 32, tracepoint 6
$168 = void
$169 = 0x1555                      <-- _native_sig_leave_tramp
Found trace frame 33, tracepoint 4
$170 = "swapcontext"
$171 = 0x1555                      <-- This `swapcontext` saves corrupted FPU state
Found trace frame 34, tracepoint 7
$172 = "native_irq_handler"
$173 = 0xffff
Found trace frame 35, tracepoint 2
$174 = "setcontext"
$175 = 0xffff
Found trace frame 36, tracepoint 1
$176 = "makecontext"
$177 = 0xffff
Found trace frame 37, tracepoint 4
$178 = "swapcontext"
$179 = 0xffff

AFAIU although the Linux Kernel avoids doing FPU operations, it might still perform some. In any case, the Kernel does not care about user space FPU states. On a "real" native IRQ (not the "RIOT" one we artificially create with makecontext), the native process will call a trampoline function (_native_sig_leave_tramp) to invoke the "RIOT" ISR (native_handle_irq). At the begininng of _native_sig_leave_tramp, the FPU regs are already corrupted because the kernel does not restore user space context. But for whatever reason we call swapcontext there, which successfully copies the corrupted FPU state into one of the thread stacks. At some point, setcontext or swapcontext will set the FPU register in a wrong state, which will eventually raise SIGFPE.

IMO the solution is to use setcontext instead of swapcontext in the trampoline function. But this would require to keep the state of the last thread somewhere.

kaspar030 commented 1 year ago

MO the solution is to use setcontext instead of swapcontext in the trampoline function. But this would require to keep the state of the last thread somewhere.

is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?

jia200x commented 1 year ago

is there any point in the signal handling where the old context hasn't been corrupted, and can be stored?

The corruption takes place between the end of the signal handler (_native_isr_entry) and the beginning of the trampoline function (_native_sig_leave_tramp).

Probably calling getcontext in the signal handler and setcontext in the trampoline should do the trick.

jia200x commented 1 year ago

hmmmm digging deeper into the topic, it seems the context arg of the signal handler stores the last valid FPU register state. In this patch they set the FP regs manually from uc_mcontext. Maybe that's the way to go.

benpicco commented 1 year ago

In this patch they set the FP regs manually from uc_mcontext.

I think I need more context - how will the LoongArch UAPI help us here? Or do you suggest that this is indeed a bug in the Linux x86 UAPI and the kernel needs to be fixed here?