support FORTIFY _chk variants of string functions

i'm one of the bionic maintainers, and i'm interested in moving us over to some of the arm-optimized-routines string routines for next year's release. even if i wasn't, we'd want the MTE/SVE stuff sooner or later :-)

one question i had was whether you'd consider adding the FORTIFY _chk variants directly?

note: on Android, FORTIFY is basically always on. all parts of the OS must be built with FORTIFY, and we've recently made it the default for apps too (though they're allowed to opt-out if they want). so the _chk variants are an important performance case for Android, where they probably aren't for other OSes.

we have existing cases where we

effectively duplicated the code: https://android.googlesource.com/platform/bionic/+/master/libc/arch-arm/cortex-a15/bionic/__strcat_chk.S
added a separate routine: https://android.googlesource.com/platform/bionic/+/master/libc/arch-arm/cortex-a15/bionic/memset.S
fall through to the optimized code if the trivial FORTIFY check passes: https://android.googlesource.com/platform/bionic/+/master/libc/arch-x86_64/string/sse2-memset-slm.S (ignore the fact that that's x86-64 --- it was just the first example i found! the most obvious arm64 one currently is a lot less obvious, but it's cortex-a55/bionic/__strcpy_chk.S if you're interested.)

(unfortunately Google has a blanket ban on signing any CLAs that have copyright assignments, so i can't send you patches...)

from the report it's not clear what issues you have in android, i don't see how asm for _chk functions can be justified.

ideally _FORTIFY_SOURCE would not need any libc support or new abi symbols (there are already too many string function entry points, adding more reduces maintainability which impacts security) see e.g. https://git.2f30.org/fortify-headers/ how this can be done.

if a header only solution does not work then the implementation should be just c code wrappers around standard functions, not asm code changes (those are already hard to maintain and test for the supported inputs and configurations).

if you have performance issues then first look at the details of the implementation (e.g. avoid multiple ifunc indirections), you can try to static link the _chk functions into the callers to avoid a plt indirection (but i don't think it helps much). I will only consider changing string function asm for this if there are demonstrated benefits.

I believe any performance issues are likely Bionic specific. So first I'd like to understand what performance issues you see that require assembly implementations and cannot be optimized in the Bionic fortify implementation. Which functions specifically, and Arm or AArch64?

A quick look in Bionic shows that fortify.cpp has many slow byte loops rather than calling strlen or memcpy. It even places fortify checks inside critical loops! It creates complex functions which need to push/pop callee-saves and allocate stack rather than just using tailcalls (both to error functions and the underlying string function). There is also overuse of ifuncs and interworking veneers in Bionic when they are unnecessary.

Most of the simple _chk implementions can and should be inlined in the compiler. This avoids all overheads since you can now optimize the check at each callsite. But even when they are not inlined, most _chk functions should not be more than a compare and branch without creating a stack frame or going through ifuncs or interworking veneers.

Note many generic string functions are also slow, eg. strcat uses 2 slow byte-by-byte loops instead of the obvious strlen+strcpy. I think this should be fixed before considering adding an assembler implementation! I see many __strcat_chk.S and __strcpy_chk.S files (which appear mostly identical copies - why?), but is there any evidence they are actually improving performance over a decent C implementation?

So in my view we should avoid unnecessary proliferation of new entry points and expensive to maintain assembly code without sound reasons.

(sorry, somehow missed this two years ago!)

I believe any performance issues are likely Bionic specific. So first I'd like to understand what performance issues you see that require assembly implementations and cannot be optimized in the Bionic fortify implementation. Which functions specifically, and Arm or AArch64?

well, "bionic specific" is "all 3 billion Android devices", so it's not a bad use of your time :-)

i don't really care about arm32 any more (it's hard to justify much work there, as everyone's busily trying to move to arm64), and in fact haven't cleaned up all the historical SoC-vendor-supplied assembler there. arm64 on the other hand is (as of this week; we'd missed a couple, perhaps because this bug stalled) now fully using code from [the most recent tagged release] from this repository.

A quick look in Bionic shows that fortify.cpp has many slow byte loops rather than calling strlen or memcpy. It even places fortify checks inside critical loops! It creates complex functions which need to push/pop callee-saves and allocate stack rather than just using tailcalls (both to error functions and the underlying string function). There is also overuse of ifuncs and interworking veneers in Bionic when they are unnecessary.

those horror shows are actually the best examples of the "we don't really have a good way to implement this one without custom assembler" functions. (though if you have specific ways to improve the C implementations that generate better code without breaking them, i'm all ears for that too.)

for the stuff that matters the most such as memset, we historically had a memset_chk that was just the comparison and a jump to the failure code (which is in C because "who cares?"), which would then fall through to the SoC vendor's assembler. right now we have https://cs.android.com/android/platform/superproject/+/master:bionic/libc/arch-arm64/string/__memset_chk.S which is "probably fine", though obviously we could save that branch on every memset if this was in the arm-optimized-routines code.

but, yes, there is an open high-level question of "does dynamic profiling show we end up in the horror show routines often?" and i don't think anyone has that data. (but that does sound like something Arm might want to look at, since fortify is always on for all Android devices.)

Most of the simple _chk implementions can and should be inlined in the compiler. This avoids all overheads since you can now optimize the check at each callsite.

we do that already. that's what all the cruft in the headers is for. you only end up calling the actual _chk functions if you don't know the values until runtime.

But even when they are not inlined, most _chk functions should not be more than a compare and branch without creating a stack frame or going through ifuncs or interworking veneers.

aye, but someone needs to do that work, and as long as we can't fall through, we end up needing a branch in the non-failure case too.

Note many generic string functions are also slow, eg. strcat uses 2 slow byte-by-byte loops instead of the obvious strlen+strcpy. I think this should be fixed before considering adding an assembler implementation! I see many __strcat_chk.S and __strcpy_chk.S files (which appear mostly identical copies - why?), but is there any evidence they are actually improving performance over a decent C implementation?

afaict, everything in this paragraph refers to the old arm32 cruft. i've just been ignoring it because i agree that that's not worth anybody's time. (and nor is the disruption of switching it over to arm-optimized-routines.)

@gburgessiv fyi

ARM-software / optimized-routines

support FORTIFY _chk variants of string functions #27