[sysvabi64] document requirement for bti c in more detail

nsz-arm commented 1 year ago

the text currently has

"An executable or shared library that supports BTI must have a bti c instruction at the start of any entry that might be called indirectly."

but it's not clear if compilers should consider potential linker-inserted veneers that use an indirect call/jump, or if the linker should ensure that when a veneer is inserted it does not break bti compatibility.

(gcc+ld.bfd made a different choice than llvm+lld)

see discussion at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106671

smithp35 commented 1 year ago

Maybe worth covering ELF in addition to sysvabi, as this would also affect bare-metal (pac-bti M), which would presumably have the same problem if GCC was not emitting BTIs for functions that could require a stub.

My reading was that, without a specific exception for linker-created veneers/stubs, code-generators had to assume that one might be created and generate code as if one could be inserted. I can remember clang always generating BTI instructions as it couldn't assume that the linker would never generate an indirect branch.

I think it is to the benefit of security to have fewer BTIs, so having linker stubs that are BTI aware is an overall improvement and is likely the preferred direction of travel. I think it is worth a wider discussion as, IMO, to make GCC behaviour not a bug we would have to add a specific requirement for linkers to be BTI aware in the ABI, and no such requirement exists at the moment.

Assuming we can get agreement to add the requirement, I'm wondering whether anything needs doing about the transition. As I understand it:

GCC objects + (BFD prior to 106671 or LLD) are at risk of an indirect jump to a non-BTI compatible function.
Clang objects always have BTI so are safe with either linker.

I'm not sure if there is anything we can do, as a BTI aware linker will work with both. The only failing case is an older linker with objects that contain non-BTI compatible functions.

The other thing we may want to address is whether there is any additional marking we can do to make your optimisation possible without disassembling the binary.

MaskRay commented 3 months ago

Functions with LR signing get PACI[AB]SP{,PC}, which acts as an implicit BTI. If PACI[AB]SP is absent (leaf functions, or when PAuth is not enabled), Clang adds "bti c" to every candidate function in case range extension thunks (aka veneers aka stubs) are needed (https://reviews.llvm.org/D99417). This keeps it compatible with LLD and with GNU ld before https://sourceware.org/bugzilla/show_bug.cgi?id=30076.

I assume that the LLD work is planned and Clang will eventually remove the "bti c" (BTW -fbinutils-version= exists if compatibility with older GNU ld versions is needed). Is there more information about the double veneer scheme used by GNU ld? Do we need a new relocation to mark "bti c"? (If there is concern with a new relocation type, NONE with a custom addend might be utilized.)

smithp35 commented 3 months ago

We've got an idea of where we want to go with this. I've been wanting to have an implementation in LLD ready before publishing, but have not been able to find time to do this.

The change that needs making should make clear the requirements for code-generators and static linkers. The prevailing opinion within Arm is that we would like to enable code-generators to omit BTI if they can prove that the function will never be called indirectly (GCC behaviour). A static linker may therefore not assume that all indirect branch targets have a BTI compatible landing pad.

A "BTI compatible" thunk either doesn't use an indirect branch (it is a chain of direct branches) or is split into two parts: the indirect branch, and a "header" that contains a BTI c and ends with a direct branch. Something like:

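caller:
  bl   thunk_to_foo                      // direct call to the in-range thunk
  ...
thunk_to_foo:
  adrp x16, foo_bti_header               // page address of the header
  add  x16, x16, :lo12:foo_bti_header    // add the low 12 bits of the address
  br   x16                               // indirect branch, needs a landing pad
  ...
foo_bti_header:
  bti  c                                 // landing pad for the br x16 above
  b    foo                               // direct branch, so foo needs no BTI
  ...
foo: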

The "header" has a range limit (it must be within ±128 MiB of foo, the reach of its direct branch), and is essentially an alternative entry point for indirect calls. The presence of this alternative entry point undoes the compiler's hard work in omitting the BTI, but it will only be added if necessary.

As these "BTI compatible" thunks are larger and slower than normal we would want to only generate these when necessary. GNU ld has decided to disassemble the code at the destination. While this is an option, and is the most precise solution, if there are a lot of thunks then this could affect linker performance. If there are only a few then it probably doesn't matter.
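To make "disassemble the code at the destination" concrete, here is a minimal sketch of that check in Python. It only compares the first instruction word against the A64 encodings of bti c, bti jc, PACIASP and PACIBSP; the function name is hypothetical, and the sketch is deliberately conservative (it ignores bti j and the PACI[AB]SPPC forms). It is not a description of what GNU ld actually does.

```python
import struct

# A64 encodings (instructions are always little-endian 32-bit words).
BTI_C = 0xD503245F
BTI_JC = 0xD50324DF
PACIASP = 0xD503233F  # implicit landing pad for calls
PACIBSP = 0xD503237F  # implicit landing pad for calls

# Conservative set: instructions this sketch accepts as a "bti c"-style
# landing pad at a function entry.
CALL_LANDING_PADS = {BTI_C, BTI_JC, PACIASP, PACIBSP}


def starts_with_call_landing_pad(code: bytes) -> bool:
    """Return True if the first instruction of `code` is a known landing pad.

    `code` is assumed to hold the raw bytes at the branch destination.
    """
    if len(code) < 4:
        return False  # not even one instruction to inspect
    (first_word,) = struct.unpack_from("<I", code, 0)
    return first_word in CALL_LANDING_PADS
```

A linker following this approach would only need the two-part thunk when the check returns False for the destination.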

I am hoping that I can find some heuristics that would let a linker decide based on symbol information, so that the need for disassembly is lessened. Assuming GCC's implementation doesn't already break this, it could be possible to say that eliding BTI is only permitted for symbols with STB_LOCAL binding. This would reduce the number of candidates for which a static linker would need to disassemble to check for a BTI (or it could just assume they don't have one).
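As a rough illustration of the symbol-based heuristic, the sketch below assumes the proposed rule (BTI elision only permitted for STB_LOCAL symbols) were actually adopted, which it has not been at this point. The Symbol class and function name are hypothetical; only the STB_* values come from the ELF specification.

```python
from dataclasses import dataclass

# Symbol binding values from the ELF specification.
STB_LOCAL = 0
STB_GLOBAL = 1
STB_WEAK = 2


@dataclass
class Symbol:
    """Hypothetical, minimal view of an ELF symbol for this sketch."""
    name: str
    binding: int


def may_lack_bti(sym: Symbol) -> bool:
    """Under the assumed rule, only local symbols may have their BTI elided.

    Global and weak symbols could then be treated as having a landing pad
    without inspecting the code at the destination.
    """
    return sym.binding == STB_LOCAL
```

For symbols where this returns True, the linker could either disassemble the destination or simply assume there is no BTI and emit the two-part thunk.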

nsz-arm commented 3 months ago

additional details: multiple calls can share the same thunk, and multiple thunks may share the same 'header'. sometimes the header is already within reach of a call (even though the call target is not), and then the header is called directly. in that case it would not even need a bti c unless it is shared with an indirect thunk, but bfd ld does not avoid the bti c here.

iirc the veneers are aligned up to an 8-byte boundary so that branches and branch targets are not too close, and thus a chain of single branches could take 8 bytes per veneer instead of just 4. such a design would avoid any bti, so it could be safer, and would still be less code if the distances are not too big (<= 3 direct jumps away). this was not tried in bfd ld.

Wilco1 commented 3 months ago

Yes, if veneer insertion was a bit smarter, it could handle all ranges up to ±256 MiB using a single direct branch, or ±384 MiB using 2 direct branches. For even larger binaries it isn't worth worrying about avoiding the BTI header (since the extra size is negligible), and you could delay the final decision on the target of the indirect branch until late during relocation, when disassembly will be cheaply available.
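The range arithmetic can be sketched as below. This is an idealised model: it assumes every B/BL reaches just under 128 MiB in either direction and that a veneer can be placed anywhere along the way, which (per the placement restrictions discussed in this thread) real linkers cannot always do. The function names and the two-veneer cut-off are hypothetical.

```python
MIB = 1 << 20
# B/BL encode a signed 26-bit word offset: [-2**27, 2**27 - 4] bytes.
# The sketch uses the smaller forward reach for both directions.
BRANCH_REACH = (1 << 27) - 4


def direct_veneers_needed(distance: int) -> int:
    """Number of intermediate direct-branch veneers between call and target."""
    hops = -(-abs(distance) // BRANCH_REACH)  # ceiling division
    return max(hops - 1, 0)


def choose_thunk(distance: int, max_direct_veneers: int = 2) -> str:
    """Pick a thunk style; the cut-off of 2 veneers is an assumed policy."""
    veneers = direct_veneers_needed(distance)
    if veneers == 0:
        return "no thunk"
    if veneers <= max_direct_veneers:
        return "direct chain"  # no indirect branch, so no BTI is needed
    return "indirect thunk with bti header"
```

For example, a call 200 MiB away needs one direct veneer and a call 300 MiB away needs two; anything further falls back to the indirect thunk plus BTI header in this model.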

smithp35 commented 3 months ago

LLD can do a limited form of inserting 1 direct branch, but due to restrictions on the placement of the branch it doesn't get the full 128 MiB extra range.

Inserting a chain of branches could be possible but it would add quite a bit of complexity to the existing implementation as there are limited points where the linker can insert the branch, as well as needing to insert thunks across output section boundaries.

The additional, unneeded BTI headers could be used as a landing pad by an attacker, but it would still be fewer landing pads than if the compiler always added BTI. I'll have a think about that when doing the LLD implementation.

ARM-software / abi-aa

[sysvabi64] document requirement for bti c in more detail #196