New code-gen options for retpolines and straight line speculation

Quuxplusone commented 2 years ago


Bugzilla Link	PR52323
Status	NEW
Importance	P enhancement
Reported by	Andrew Cooper (andrew.cooper3@citrix.com)
Reported on	2021-10-26 08:55:39 -0700
Last modified on	2021-11-22 14:00:07 -0800
Version	unspecified
Hardware	PC Linux
CC	andrew.cooper3@citrix.com, blitzrakete@gmail.com, chandlerc@gmail.com, dgregor@apple.com, efriedma@quicinc.com, erik.pilkington@gmail.com, jyknight@google.com, llvm-bugs@lists.llvm.org, manojgupta@google.com, ndesaulniers@google.com, pageexec@gmail.com, pengfei.wang@intel.com, richard-llvm@metafoo.co.uk, rnk@google.com
Fixed by commit(s)
Attachments
Blocks	PR4068
Blocked by
See also

Hello

[FYI, this is being cross-requested of GCC too]

Linux and other kernel-level software make use of -mindirect-branch=thunk-extern to be able to alter the handling of indirect branches at boot. It turns out to be advantageous to inline the thunks when retpoline is not in use. https://lore.kernel.org/lkml/20211026120132.613201817@infradead.org/ is some infrastructure to make this work.

In some cases, we want to be able to inline an lfence; jmp *%reg thunk. This is fine for the low 8 registers, but not for %r{8..15}, where the REX prefix pushes the replacement size to 6 bytes.

It would be very useful to have a code-gen option to write out call %cs:__x86_indirect_thunk_r{8..15} where the redundant %cs prefix will increase the instruction length to 6, allowing the non-retpoline form to be inlined.

Relatedly, x86 straight line speculation has been discussed before, but without any action taken. It would be helpful to have a code gen option which would emit int3 following any ret instruction, and any indirect jump, as neither of these two cases have following architectural execution.

The reason these two are related is that if both options are in use, we want an extra byte of replacement space to be able to inline lfence; jmp *%reg; int3.
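The byte counts above can be checked with a small sketch. It assumes the standard x86-64 encodings (`call rel32` is E8 plus a 4-byte displacement, `lfence` is 0F AE E8, `jmp *%reg` is FF /4 with a REX.B prefix for %r8..%r15, `int3` is CC, and 2E is the %cs segment override). The helper names are made up for illustration.

```python
# Sketch only: helper names are hypothetical; the byte values are the
# standard x86-64 encodings of the instructions discussed above.
LFENCE = bytes([0x0F, 0xAE, 0xE8])
INT3 = bytes([0xCC])
CS_PREFIX = bytes([0x2E])


def jmp_reg(reg: int) -> bytes:
    """Encode `jmp *%reg` (FF /4) for register number 0..15."""
    rex = bytes([0x41]) if reg >= 8 else b""  # REX.B selects %r8..%r15
    return rex + bytes([0xFF, 0xE0 | (reg & 7)])


def thunk_call(cs_prefix: bool) -> bytes:
    """`call __x86_indirect_thunk_<reg>` with a placeholder rel32."""
    return (CS_PREFIX if cs_prefix else b"") + bytes([0xE8, 0, 0, 0, 0])


def inline_seq(reg: int, sls: bool) -> bytes:
    """`lfence; jmp *%reg`, optionally followed by `int3`."""
    return LFENCE + jmp_reg(reg) + (INT3 if sls else b"")


if __name__ == "__main__":
    for reg, name in ((0, "rax"), (11, "r11")):
        for cs in (False, True):
            for sls in (False, True):
                space = len(thunk_call(cs))
                need = len(inline_seq(reg, sls))
                print(f"%{name} cs={cs} sls={sls}: "
                      f"space={space} need={need} fits={need <= space}")
```

With both options in use, %r11 needs 7 bytes while the %cs-prefixed call only provides 6, which is where the request for an extra byte of replacement space comes from.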

Third, Clang has been observed to emit conditional tail calls as Jcc __x86_indirect_thunk_*. This is a 6 byte source size, but needs up to 9 bytes of space for inlining, including an int3 for straight line speculation reasons (see https://lore.kernel.org/lkml/20211026120310.359986601@infradead.org/ for full details). It might be enough to simply prohibit an optimisation like this when trying to pad retpolines for inlineability.
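A rough sketch of where the 6 and 9 byte figures can come from. It assumes the inlined form is an inverted short Jcc that skips over `lfence; jmp *%reg; int3`; that layout and the helper names are assumptions for illustration, and the linked patch is the authoritative description.

```python
# Sketch only: hypothetical helpers built on standard x86-64 encodings.
LFENCE = bytes([0x0F, 0xAE, 0xE8])
INT3 = bytes([0xCC])


def jmp_reg(reg: int) -> bytes:
    """Encode `jmp *%reg` (FF /4) for register number 0..15."""
    rex = bytes([0x41]) if reg >= 8 else b""  # REX.B selects %r8..%r15
    return rex + bytes([0xFF, 0xE0 | (reg & 7)])


def jcc_rel32(cc: int) -> bytes:
    """`Jcc rel32` (0F 80+cc) with a placeholder displacement."""
    return bytes([0x0F, 0x80 | cc, 0, 0, 0, 0])


def jcc_rel8(cc: int, disp: int) -> bytes:
    """`Jcc rel8` (70+cc) with the given 8-bit displacement."""
    return bytes([0x70 | cc, disp & 0xFF])


def inline_jcc(cc: int, reg: int, sls: bool) -> bytes:
    """Inverted short Jcc over `lfence; jmp *%reg` (plus `int3` if sls)."""
    body = LFENCE + jmp_reg(reg) + (INT3 if sls else b"")
    # Flipping the low bit of the condition code inverts it (je <-> jne).
    return jcc_rel8(cc ^ 1, len(body)) + body


if __name__ == "__main__":
    JE = 0x4
    print("source size:", len(jcc_rel32(JE)))
    print("inlined, %r11, with int3:", len(inline_jcc(JE, 11, True)))
```

Under these assumptions the worst case (a high register with the trailing int3) is 2 + 3 + 3 + 1 = 9 bytes against a 6 byte Jcc rel32.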

Quuxplusone commented 2 years ago

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102952 for GCC cross-request.

Quuxplusone commented 2 years ago

It looks like GCC has added support for -mindirect-branch-cs-prefix:

https://gcc.gnu.org/git/gitweb.cgi?p=gcc.git;h=2196a681d7810ad8b227bf983f38ba716620545e

This is being used when available in the Linux kernel:

https://lore.kernel.org/lkml/20211118185421.GK174703@worktop.programming.kicks-ass.net/

Quuxplusone commented 2 years ago

(In reply to Andrew Cooper from comment #0)

> Relatedly, x86 straight line speculation has been discussed before, but without any action taken. It would be helpful to have a code gen option which would emit int3 following any ret instruction, and any indirect jump, as neither of these two cases have following architectural execution.

Is there documentation somewhere describing this mitigation? In particular:

What unconditional branches can lead to straight-line speculation?
What instructions can be used to stop speculation? (Is int3 actually effective? Are there other instructions that would also work?)

Quuxplusone commented 2 years ago

(In reply to Eli Friedman from comment #3)

> (In reply to Andrew Cooper from comment #0)
>
> > Relatedly, x86 straight line speculation has been discussed before, but without any action taken. It would be helpful to have a code gen option which would emit int3 following any ret instruction, and any indirect jump, as neither of these two cases have following architectural execution.
>
> Is there documentation somewhere describing this mitigation? In particular:
>
> What unconditional branches can lead to straight-line speculation?

For AMD, it is discussed here https://developer.amd.com/wp-content/resources/Managing-Speculation-on-AMD-Processors.pdf, mitigation G-5 on the final page:

Place an LFENCE after an indirect branch instruction (RET, JMP reg or mem, CALL reg or mem) to help prevent possible sequential speculation.

For Intel, notes are included in SDM Vol2 for the CALL and JMP instructions:

Certain situations may lead to the next sequential instruction after a near indirect CALL being speculatively executed. If software needs to prevent this (e.g., in order to prevent a speculative execution side channel), then an LFENCE instruction opcode can be placed after the near indirect CALL in order to block speculative execution.

> What instructions can be used to stop speculation? (Is int3 actually effective? Are there other instructions that would also work?)

As you can see, LFENCE is the official recommendation. It is about the only option for halting speculation which is safe to actually execute, and doesn't otherwise impact program state.

CALL has architectural execution following it. However, the code following a CALL instruction is typically preservation of the return value and a pile of dead registers wanting reloading, and is typically not a pointer dereference involving a callee-clobbered register. Therefore, CALLs are unlikely to have subsequent instructions which are vulnerable to speculative type confusion, and are therefore uninteresting to protect.

JMP and RET are different. They are followed by arbitrary unrelated basic blocks, which could contain anything.

We could use LFENCE everywhere. However, as we don't architecturally execute the instruction, we don't care about architectural side effects. Basically any instruction which causes a decode exception, or is microcoded, halts speculation. INT3 is safe to use, and is 1/3 of the length of LFENCE, so has less of an impact on code size.
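To illustrate the policy argued for here (INT3 after RET and indirect JMP, nothing after CALL), below is a minimal sketch of a checker over an already-disassembled listing. The listing format (one AT&T-syntax instruction per string) and the function names are assumptions made up for this example; it is not how any compiler or existing tool implements this.

```python
# Sketch only: works on a hypothetical pre-decoded listing, not raw bytes.

def is_sls_site(insn: str) -> bool:
    """True for `ret` and for indirect `jmp` (AT&T operand starting with '*').

    CALL is deliberately not treated as a site, matching the argument above.
    """
    parts = insn.split()
    if not parts:
        return False
    mnemonic = parts[0]
    if mnemonic in ("ret", "retq"):
        return True
    return (mnemonic in ("jmp", "jmpq")
            and len(parts) > 1
            and parts[1].startswith("*"))


def unprotected(insns: list) -> list:
    """Indices of ret / indirect jmp not immediately followed by int3."""
    return [
        i for i, insn in enumerate(insns)
        if is_sls_site(insn)
        and (i + 1 >= len(insns) or insns[i + 1].split()[:1] != ["int3"])
    ]


if __name__ == "__main__":
    listing = ["mov %rdi,%rax", "jmp *%rax", "int3", "call *%r11", "ret"]
    print(unprotected(listing))  # the final `ret` has no int3 after it
```

A real check would have to work on decoded machine code, since scanning raw bytes for C3 or CC would also match immediates and displacements.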

Quuxplusone commented 2 years ago

(In reply to Andrew Cooper from comment #4)
> CALL has architectural execution following it.  However, the code following
> a CALL instruction is typically preservation of the return value and a pile
> of dead registers wanting reloading, and is typically not a pointer
> dereference involving a callee-clobbered register.

I'm a bit skeptical of heuristics like this; it's making very specific
assumptions about how the compiler generates code, which might not hold for
different codebases and/or optimizations.

> We could use LFENCE everywhere.  However, as we don't architecturally
> execute the instruction, we don't care about architectural side effects.
> Basically any instruction which causes a decode exception, or is microcoded,
> halts speculation.  INT3 is safe to use, and is 1/3 of the length of LFENCE,
> so has less of an impact on code size.

It looks like the current version of the Intel manual actually explicitly
mentions INT3, so I guess that's fine.

Quuxplusone commented 2 years ago

(In reply to Eli Friedman from comment #5)
> It looks like the current version of the Intel manual actually explicitly
> mentions INT3, so I guess that's fine.
Ah great - I'd missed that update coming through.  I'll pester the other guys
to document it too.

> > CALL has architectural execution following it.  However, the code following
> > a CALL instruction is typically preservation of the return value and a pile
> > of dead registers wanting reloading, and is typically not a pointer
> > dereference involving a callee-clobbered register.
>
> I'm a bit skeptical of heuristics like this; it's making very specific
> assumptions about how the compiler generates code, which might not hold for
> different codebases and/or optimizations.
Nevertheless, protecting JMP/RET with an INT3 is easy and cheap, while
protecting CALL with LFENCE is very much not, and the risk profiles of the
code are very different.

My gut feeling is that anyone wanting protection in the CALL case would
probably be using Speculative Load Hardening instead.

Quuxplusone / LLVMBugzillaTest

New code-gen options for retpolines and straight line speculation #51290