Quuxplusone / LLVMBugzillaTest


C11 atomics always yielding library call with -mcpu=cortex-m0plus #42573

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR43603
Status NEW
Importance P enhancement
Reported by Marian Buschsieweke (marian.buschsieweke@ovgu.de)
Reported on 2019-10-08 04:03:39 -0700
Last modified on 2019-10-10 01:10:52 -0700
Version 9.0
Hardware Other other
CC blitzrakete@gmail.com, dgregor@apple.com, efriedma@quicinc.com, erik.pilkington@gmail.com, llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
Using this minimum example code:

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
#include <stdatomic.h>

void foo(void);
extern atomic_int_least16_t bar;

void foo(void)
{
    atomic_store_explicit(&bar, 0x1337, memory_order_relaxed);
}
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compiling with "clang -mcpu=cortex-m3 -mlittle-endian -mthumb -mfloat-abi=soft -target arm-none-eabi -Weverything -Werror -c -o test.o test.c" works fine.

Compiling with "clang -mcpu=cortex-m0plus -mlittle-endian -mthumb -mfloat-abi=soft -target arm-none-eabi -Weverything -Werror -c -o test.o test.c" yields:

> test.c:8:2: error: large atomic operation may incur significant performance
>       penalty [-Werror,-Watomic-alignment]
>         atomic_store_explicit(&bar, 0x1337, memory_order_relaxed);
>         ^
> /usr/lib/clang/9.0.0/include/stdatomic.h:118:31: note: expanded from macro
>       'atomic_store_explicit'
> #define atomic_store_explicit __c11_atomic_store
>                               ^
> 1 error generated.

However, compiling with "arm-none-eabi-gcc -mcpu=cortex-m0plus -mlittle-endian -mthumb -mfloat-abi=soft -Wall -Wextra -pedantic -Werror -c -o test.o test.c" works fine and generates:

> arm-none-eabi-objdump -d test.o
>
> test.o:     file format elf32-littlearm
>
>
> Disassembly of section .text:
>
> 00000000 <foo>:
>    0: b580        push    {r7, lr}
>    2: b082        sub sp, #8
>    4: af00        add r7, sp, #0
>    6: 4b07        ldr r3, [pc, #28]   ; (24 <foo+0x24>)
>    8: 607b        str r3, [r7, #4]
>    a: 1cbb        adds    r3, r7, #2
>    c: 4a06        ldr r2, [pc, #24]   ; (28 <foo+0x28>)
>    e: 801a        strh    r2, [r3, #0]
>   10: 1cbb        adds    r3, r7, #2
>   12: 2200        movs    r2, #0
>   14: 5e9b        ldrsh   r3, [r3, r2]
>   16: b29a        uxth    r2, r3
>   18: 687b        ldr r3, [r7, #4]
>   1a: 801a        strh    r2, [r3, #0]
>   1c: 46c0        nop         ; (mov r8, r8)
>   1e: 46bd        mov sp, r7
>   20: b002        add sp, #8
>   22: bd80        pop {r7, pc}
>   24: 00000000    .word   0x00000000
>   28: 00001337    .word   0x00001337

Therefore, I assume that the atomic store can be implemented without resorting
to a library call to __atomic_store_2().
Quuxplusone commented 4 years ago
I think that's a gcc bug.

If you write something like the following with gcc:

void foo2(void)
{
    atomic_fetch_add_explicit(&bar, 0x1337, memory_order_relaxed);
}

it produces the following:

foo2:
        push    {r4, lr}
        ldr     r1, .L5
        ldr     r0, .L5+4
        bl      __sync_fetch_and_add_2
        pop     {r4, pc}

Meaning, it's assuming that there's some lock-free atomic sequence that can
implement __sync_fetch_and_add_2. But there is no such sequence in a baremetal
environment; cortex-m0 doesn't have the ll/sc instructions.  (Well, you can
technically make an arbitrary sequence "atomic" by disabling interrupts, but I
don't think that really counts.)
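For illustration, the read-modify-write that __sync_fetch_and_add_2 has to perform decomposes into separate load, add, and store steps (a sketch in portable C; the function name is made up for illustration, and the real helper would be target-specific):

```c
#include <stdint.h>

/* Sketch: the separate steps a 16-bit fetch-and-add needs on a core
 * without ll/sc instructions. An interrupt arriving between the load
 * and the store can observe or clobber the half-finished update, so
 * the sequence is not atomic by itself. */
uint16_t fetch_and_add_2_sketch(volatile uint16_t *ptr, uint16_t val)
{
    uint16_t old = *ptr;            /* load   (e.g. ldrh)        */
    *ptr = (uint16_t)(old + val);   /* add + store (adds, strh)  */
    return old;                     /* return the previous value */
}
```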
Quuxplusone commented 4 years ago
Thanks for the reply.

The Cortex-M0+ is an embedded processor (a single-core, single-threaded CPU) on
which the only source of concurrency is interrupts. To my understanding, in
this context "atomic_store_explicit(&bar, 0x1337, memory_order_relaxed);" is
already implemented correctly if a single machine instruction performs the
store, as no ISR can interrupt in the middle of that instruction.

Resorting to "__sync_fetch_and_add_2" as GCC does for
"atomic_fetch_add_explicit(&bar, 0x1337, memory_order_relaxed);" is also
correct to my understanding, as more than one machine instruction is needed to
implement that.

> (Well, you can technically make an arbitrary sequence "atomic" by disabling
> interrupts, but I don't think that really counts.)

This is exactly how __sync_fetch_and_add_2() is implemented in the context of
RIOT [1]. But the use of __atomic_store_2() should not be required for properly
aligned stores - so why is it still used?

[1]: https://riot-os.org/
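A sketch of that interrupt-masking approach in C (irq_save() and irq_restore() are hypothetical stand-ins for the platform's interrupt-masking primitives, not RIOT's actual API; on a Cortex-M0+ they would read and write the PRIMASK register):

```c
#include <stdint.h>

/* Hypothetical stand-ins for the platform's interrupt-masking
 * primitives; on a Cortex-M0+ they would use cpsid i / msr PRIMASK.
 * Here they are no-ops so the sketch stays portable. */
static unsigned irq_save(void)          { return 0u; }
static void     irq_restore(unsigned s) { (void)s;   }

/* Sketch: a 16-bit fetch-and-add made "atomic" on a single-core MCU
 * by masking interrupts around the read-modify-write, mirroring what
 * an out-of-line __sync_fetch_and_add_2() helper could do. */
uint16_t fetch_and_add_2_irqsafe(volatile uint16_t *ptr, uint16_t val)
{
    unsigned state = irq_save();    /* mask interrupts            */
    uint16_t old = *ptr;
    *ptr = (uint16_t)(old + val);
    irq_restore(state);             /* restore previous mask      */
    return old;
}
```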
Quuxplusone commented 4 years ago
If a lock-free sequence does not exist for all possible atomic operations of a
given width, the compiler must generate a call to __atomic_* for all atomic
operations of that width.  Otherwise, the locking mechanism in libatomic breaks.

The problem with arm-none-eabi is that we don't really know much about where
the code is actually going to run.  For arm-pc-linux-eabi, specifically,
__sync_val_compare_and_swap_N where N=1,2,4,8 is implemented using a special
kernel-assisted sequence.  But we don't really want to make the same assumption
for every baremetal target.

Maybe we could control it with a flag.
Quuxplusone commented 4 years ago
> The problem with arm-none-eabi is that we don't really know much about where
> the code is actually going to run.

Mind the "-mcpu=cortex-m0plus" flag that was given during compilation - so it
is already known that the code can simply disable interrupts, as there is no
distinction between "user mode" and "kernel mode" on that platform. This means
that, to the best of my knowledge, the most efficient implementation of
__sync_fetch_and_add_2() and friends is to simply disable interrupts for the
duration of the function. Also, GCC already effectively requires the
implementations to disable interrupts, as otherwise concurrent lock-free
operations might interfere with the operations implemented by
__sync_fetch_and_add_2() and friends.

So: Why not just also require that implementations of __sync_fetch_and_add_2()
and friends disable interrupts on single-threaded, single-core embedded CPUs
(no MMU, no kernel/user mode split)?
Quuxplusone commented 4 years ago

In case my arguments for simply assuming that __sync_fetch_and_add_2() and all its friends are not interruptible on bare-metal targets are not convincing: I'd very happily take a compiler flag to tell clang that the library functions do indeed disable interrupts.

Btw (but a bit off topic): if there were a flag to let clang emit code that disables interrupts before the atomic operations and restores the previous interrupt state afterwards, this could also be very interesting for embedded developers. It would allow the compiler to group a number of atomic operations that can only be implemented by disabling interrupts, and it would reduce ROM size a bit, as interrupts would be disabled and restored less often. A flag to limit the number of atomic ops grouped together would be needed, though, to bound the worst-case interrupt latency for those wanting hard real-time capabilities.