ARM-software / LLVM-embedded-toolchain-for-Arm

A project dedicated to building LLVM toolchain for 32-bit Arm embedded targets.
Apache License 2.0

Compiling arm run time libraries based on newlib selects inferior memcpy-stub.c instead of memcpy-armv7a.S #444

Closed klaus1212 closed 1 month ago

klaus1212 commented 1 month ago

Hi,

I am using the 18.1.3 release of LLVM and LLVM-embedded-toolchain-for-Arm, together with newlib-4.3.0. I am building the following library variants for a bare-metal 32-bit Arm Cortex-A9 target on Ubuntu. For reference, I have attached a git patch with the changes needed for this setup: LLVM-embedded-toolchain-for-Arm.patch

add_library_variants_for_cpu(
    armv7a
    SUFFIX hard_neon
    COMPILE_FLAGS "-mfloat-abi=hard -march=armv7a -mfpu=neon"
    MULTILIB_FLAGS "--target=armv7-none-unknown-eabihf -mfpu=neon"
    QEMU_MACHINE "none"
    QEMU_CPU "cortex-a8"
    QEMU_PARAMS "-m 1G"
    BOOT_FLASH_ADDRESS 0x00000000
    BOOT_FLASH_SIZE 0x1000
    FLASH_ADDRESS 0x20000000
    FLASH_SIZE 0x1000000
    RAM_ADDRESS 0x21000000
    RAM_SIZE 0x1000000
    STACK_SIZE 4K
)
add_library_variants_for_cpu(
    armv7a
    SUFFIX thumb_hard_neon
    COMPILE_FLAGS "-mfloat-abi=hard -march=armv7a -mfpu=neon"
    MULTILIB_FLAGS "--target=thumbv7-none-unknown-eabihf -mfpu=neon"
    QEMU_MACHINE "none"
    QEMU_CPU "cortex-a8"
    QEMU_PARAMS "-m 1G"
    BOOT_FLASH_ADDRESS 0x00000000
    BOOT_FLASH_SIZE 0x1000
    FLASH_ADDRESS 0x20000000
    FLASH_SIZE 0x1000000
    RAM_ADDRESS 0x21000000
    RAM_SIZE 0x1000000
    STACK_SIZE 4K
)

My problem is that the run-time libraries seem to select an inferior memcpy implementation from newlib: memcpy-stub.c.

I determined it to be inferior by benchmarking memcpy. With our current setup, using GCC and newlib-based run-time libraries, memcpy is almost 6x faster than in an identical test with clang and newlib-based run-time libraries. This appears to be because our GCC-built run-time libraries select #include "memcpy-armv7a.S", whereas our clang-built run-time libraries select memcpy-stub.c.
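For anyone who wants to reproduce a comparison of this kind, a minimal timing harness could look like the sketch below. It is not the test used in this report: the function name memcpy_throughput is hypothetical, and it assumes a working clock() and malloc(), which a bare-metal target may not provide (a hardware cycle counter would be the usual substitute there).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: copies `size` bytes `iterations` times and returns
 * the throughput in bytes per second. Returns -1.0 if allocation fails,
 * the copy is wrong, or the elapsed time is too small to measure. */
static double memcpy_throughput(size_t size, unsigned iterations)
{
    assert(size > 0 && iterations > 0);

    unsigned char *src = malloc(size);
    unsigned char *dst = malloc(size);
    if (src == NULL || dst == NULL) {
        free(src);
        free(dst);
        return -1.0;
    }
    memset(src, 0xA5, size);

    clock_t start = clock();
    for (unsigned i = 0; i < iterations; ++i) {
        memcpy(dst, src, size);
        /* Volatile read of the result, to discourage the compiler from
         * eliding the copy. */
        volatile unsigned char sink = dst[size - 1];
        (void)sink;
    }
    clock_t end = clock();

    int copy_ok = memcmp(dst, src, size) == 0;
    free(src);
    free(dst);

    double seconds = (double)(end - start) / CLOCKS_PER_SEC;
    if (!copy_ok || seconds <= 0.0)
        return -1.0;
    return (double)size * (double)iterations / seconds;
}
```

Building the same file against the GCC-built and the clang-built libc and comparing the two return values for the same size and iteration count gives a ratio comparable to the 6x figure quoted above.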

I have traced why memcpy-stub.c is selected. The choice is made in repos\newlib\newlib\libc\machine\arm\memcpy.S.

This happens because repos\newlib\newlib\include\arm-acle-compat.h is included with __ARM_ARCH already defined.

When __ARM_ARCH is already defined, arm-acle-compat.h does not define __ARM_FEATURE_UNALIGNED.

When __ARM_FEATURE_UNALIGNED is not defined, repos\newlib\newlib\libc\machine\arm\memcpy.S does not #include "memcpy-armv7a.S", and the implementation in memcpy-stub.c is used instead.
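The selection chain described above can be modelled as a small C function. This is a simplified sketch, not the newlib source: the names select_memcpy_impl and memcpy_impl_name are hypothetical, and the architecture-version condition is an assumption added for illustration. Only the __ARM_FEATURE_UNALIGNED condition comes from this report, and the real memcpy.S may test further conditions.

```c
#include <assert.h>
#include <stdbool.h>

/* The two outcomes discussed in this report. */
enum memcpy_impl {
    MEMCPY_STUB,   /* generic implementation in memcpy-stub.c */
    MEMCPY_ARMV7A  /* optimised implementation in memcpy-armv7a.S */
};

/* Hypothetical model of the include decision.
 * arm_arch          : value of __ARM_ARCH as predefined by the compiler
 * feature_unaligned : true if __ARM_FEATURE_UNALIGNED is defined
 * Assumption: the optimised variant also requires __ARM_ARCH >= 7. */
static enum memcpy_impl select_memcpy_impl(int arm_arch, bool feature_unaligned)
{
    assert(arm_arch > 0);
    if (arm_arch >= 7 && feature_unaligned)
        return MEMCPY_ARMV7A;
    return MEMCPY_STUB;
}

/* Maps an outcome to the file name used in this report. */
static const char *memcpy_impl_name(enum memcpy_impl impl)
{
    return impl == MEMCPY_ARMV7A ? "memcpy-armv7a.S" : "memcpy-stub.c";
}
```

Under this model, select_memcpy_impl(7, false) yields MEMCPY_STUB, which matches the clang build described here: __ARM_ARCH is defined, but __ARM_FEATURE_UNALIGNED is not.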

I have not been able to determine for certain what defines __ARM_ARCH, which is key to all of this. I am therefore hoping someone here knows whether the above behavior is intentional, or how I can set up my run-time libraries to use memcpy-armv7a.S.

smithp35 commented 1 month ago

The __ARM_ARCH macro is defined by the compiler in https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/ARM.cpp#L740. The __ARM_FEATURE_UNALIGNED macro is also defined by the compiler, in https://github.com/llvm/llvm-project/blob/main/clang/lib/Basic/Targets/ARM.cpp#L787, but only when -munaligned-access is selected.

So it looks like you may need to add -munaligned-access to the compilation flags for the library. I'm not sure at the moment whether adding it to the multilib flags will work, as I'm not sure unaligned access can currently be used as a parameter for multilib selection.
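One way to confirm what a given set of compile flags actually predefines is to build a small probe with those flags. The sketch below is only an illustration: the function names unaligned_access_enabled and report_arm_macros are hypothetical, and it assumes printf is usable in the environment where the probe runs.

```c
#include <stdio.h>

/* Returns 1 when the compiler predefined __ARM_FEATURE_UNALIGNED for this
 * translation unit, 0 otherwise. */
static int unaligned_access_enabled(void)
{
#if defined(__ARM_FEATURE_UNALIGNED)
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical helper: prints the two macros discussed in this thread. */
static void report_arm_macros(void)
{
#if defined(__ARM_ARCH)
    printf("__ARM_ARCH = %d\n", (int)__ARM_ARCH);
#else
    printf("__ARM_ARCH is not defined\n");
#endif
    printf("__ARM_FEATURE_UNALIGNED is %s\n",
           unaligned_access_enabled() ? "defined" : "not defined");
}
```

Compiling this once with the library's current COMPILE_FLAGS and once with -munaligned-access added should show whether __ARM_FEATURE_UNALIGNED changes between the two builds.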

For a general toolchain, if we only have one library variant for unaligned access, I think we'd want to compile without unaligned access for maximum compatibility (many systems disable unaligned access).

klaus1212 commented 1 month ago

I have tried setting -munaligned-access. It improves things, since we are no longer running the implementation in memcpy-stub.c.

However now libc.a contains an

When we compile our source code with clang against the above run-time library (libc), clang somehow maps our memcpy calls to __aeabi_memcpy (libc_a-aeabi_memcpy-armv7a.o). This is an issue because __aeabi_memcpy does not use the Arm vectorization instructions, and is therefore slower than it could be if it exploited them.

I have raised the issue on the LLVM forum; see https://discourse.llvm.org/t/setting-mcpu-cortex-a9-mfpu-neon-for-arm-target-does-not-make-clang-pick-memcpy-optimized-for-the-co-processor/79336/5