ARM-software / optimized-routines

Optimized implementations of various library functions for ARM architecture processors

[aarch64] memset without SIMD? #25

Open SciresM opened 4 years ago

SciresM commented 4 years ago

Hi there!

I'm writing aarch64 code intended for an EL1 environment with no access to standard library/etc. Unfortunately this means there's no guarantee the NEON registers are usable without taking an exception.

The memcpy/memmove/memcmp implementations appear to have separate SIMD and non-SIMD variants, but memset unconditionally uses Q0.

Are there any plans to provide an optimized memset for the case where SIMD is unavailable? I've considered saving Q0 and CPACR_EL1 to the stack, setting the EL1 enable bit in CPACR_EL1 at the start of memset, and branching to a restoration stub instead of returning, but this seems likely to perform worse than just doing stp with general-purpose registers.

Any/all guidance would be appreciated.
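
For readers unfamiliar with the register mentioned above, here is a rough sketch of the bit manipulation involved, assuming the architectural layout in which CPACR_EL1.FPEN occupies bits [21:20]. The helper names `fpen_allows_el1` and `fpen_set_el1_enable` are hypothetical and not part of optimized-routines. Real code would read and write the register with `mrs`/`msr` and follow the write with an `isb`, which is omitted here so the logic stays testable on any host.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: CPACR_EL1.FPEN is the two-bit field at bits [21:20]. */
#define CPACR_EL1_FPEN_SHIFT 20
#define CPACR_EL1_FPEN_MASK  (UINT64_C(3) << CPACR_EL1_FPEN_SHIFT)

/* Hypothetical helper: true if FP/SIMD instructions at EL1 are not trapped.
   FPEN == 0b01 (trap EL0 only) and 0b11 (no trapping) both allow EL1 use;
   0b00 and 0b10 trap EL1 accesses. */
static inline bool fpen_allows_el1(uint64_t cpacr)
{
    uint64_t fpen = (cpacr & CPACR_EL1_FPEN_MASK) >> CPACR_EL1_FPEN_SHIFT;
    return (fpen & 1) != 0;
}

/* Hypothetical helper: returns a CPACR_EL1 value with the low FPEN bit set.
   Note that if FPEN was 0b10 this yields 0b11, which also enables EL0,
   so the caller must restore the saved value afterwards. */
static inline uint64_t fpen_set_el1_enable(uint64_t cpacr)
{
    return cpacr | (UINT64_C(1) << CPACR_EL1_FPEN_SHIFT);
}
```

The save, modify, and restore sequence around every memset call is where the extra cost described above would come from.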

Wilco1 commented 4 years ago

Hi,

Generally our goal is to provide the fastest possible routines, so they will use SIMD if that happens to be fastest on most implementations. Many string functions use SIMD instructions, so adding separate scalar versions would be a lot of work...

I don't think it makes much sense to swap state - the best option is to avoid calling any string functions in your code, and as a workaround you could write minimal implementations of any you do need.

There used to be a scalar memset in newlib:
https://github.com/bminor/newlib/commit/080e96f57c2ce2845dd160e4cf32d4f9f15b1a68#diff-268e268458e7e8d2131d5be82cfec144
That's a very old implementation, so it is not as optimized as recent ones, but it does not use any SIMD instructions, so it might be good enough for your purpose.
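
As an illustration of the "minimal implementation" workaround, here is a sketch of a scalar memset in C. It is not the newlib routine linked above and not code from optimized-routines; the name `scalar_memset` and the `word_t` typedef are hypothetical. It assumes GCC or Clang (for the `may_alias` attribute). To keep the compiler itself from emitting SIMD instructions or turning the loops back into a `memset` call, it would probably need to be built with options such as `-mgeneral-regs-only` and `-ffreestanding` (and, on GCC, `-fno-tree-loop-distribute-patterns`).

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: allows 64-bit stores into arbitrary memory
   without violating strict-aliasing rules. */
typedef uint64_t __attribute__((__may_alias__)) word_t;

/* Hypothetical minimal memset that uses only general-purpose registers. */
void *scalar_memset(void *dst, int c, size_t n)
{
    unsigned char *p = dst;
    const unsigned char byte = (unsigned char)c;
    /* Replicate the fill byte into all eight bytes of a 64-bit word. */
    const uint64_t pattern = UINT64_C(0x0101010101010101) * byte;

    /* Head: byte stores until the pointer is 8-byte aligned. */
    while (n > 0 && ((uintptr_t)p & 7) != 0) {
        *p++ = byte;
        n--;
    }

    /* Body: two adjacent 64-bit stores per iteration, which an AArch64
       compiler may merge into a single stp of general registers. */
    word_t *w = (word_t *)p;
    while (n >= 16) {
        w[0] = pattern;
        w[1] = pattern;
        w += 2;
        n -= 16;
    }
    if (n >= 8) {
        *w++ = pattern;
        n -= 8;
    }

    /* Tail: remaining 0 to 7 bytes. */
    p = (unsigned char *)w;
    while (n > 0) {
        *p++ = byte;
        n--;
    }
    return dst;
}
```

This will be slower than a tuned assembly routine, particularly for small and unaligned sizes, but it never touches Q0 or any other SIMD register when compiled as described.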