Closed narpfel closed 8 months ago
cc @bjorn3
I suspect the sequence may be technically in violation of the redzone constraints of the ABI and we may need to move `rsp` downward in steps... if we move `rsp` downward all at once, before the probes, we risk putting it in other valid (unrelated) memory and then an async interruption (signal handler or whatnot) clobbers things. So perhaps the spec-compliant sequence is
sub rsp, 0x1000
mov dword [rsp], esp
sub rsp, 0x1000
mov dword [rsp], esp
...
sub rsp, 0xe20
mov dword [rsp], esp
which is the literal unroll of the probe-loop. What do you think @bjorn3 / @afonso360 ?
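For illustration, here is a minimal sketch of that unroll for an arbitrary frame size, assuming a 4 KiB guard page (matching `probestack_size_log2=12`) and the 20000-byte frame from the reported test case. `moving_sp_probes` is a hypothetical helper that only renders assembly text; it is not Cranelift's emission code.

```rust
/// Hypothetical helper: renders a "moving sp" probe sequence as assembly text.
/// Each step allocates at most one guard page, then touches the new top of stack.
fn moving_sp_probes(frame_size: u32, guard_size: u32) -> Vec<String> {
    assert!(guard_size > 0, "guard size must be non-zero");
    let mut lines = Vec::new();
    let mut remaining = frame_size;
    while remaining > 0 {
        // Never move rsp by more than one guard page between probes.
        let step = remaining.min(guard_size);
        lines.push(format!("sub rsp, {:#x}", step));
        // 4-byte store, hence the 32-bit register `esp`.
        lines.push("mov dword [rsp], esp".to_string());
        remaining -= step;
    }
    lines
}

fn main() {
    for line in moving_sp_probes(20000, 0x1000) {
        println!("{line}");
    }
}
```

With these inputs the final step allocates the remainder `0xe20`, since 20000 = 4 * 0x1000 + 0xe20.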
> we may need to move `rsp` downward in steps...
This is basically how LLVM does it:
00000000000075e0 <_ZN7project4main17ha1977755d345ddbbE>:
75e0: 48 81 ec 00 10 00 00 sub rsp,0x1000
75e7: 48 c7 04 24 00 00 00 mov QWORD PTR [rsp],0x0
75ee: 00
75ef: 48 81 ec 00 10 00 00 sub rsp,0x1000
75f6: 48 c7 04 24 00 00 00 mov QWORD PTR [rsp],0x0
75fd: 00
75fe: 48 81 ec 00 10 00 00 sub rsp,0x1000
7605: 48 c7 04 24 00 00 00 mov QWORD PTR [rsp],0x0
760c: 00
760d: 48 81 ec 00 10 00 00 sub rsp,0x1000
7614: 48 c7 04 24 00 00 00 mov QWORD PTR [rsp],0x0
761b: 00
761c: 48 81 ec a0 0d 00 00 sub rsp,0xda0
7623: 48 81 c4 a0 4d 00 00 add rsp,0x4da0
762a: c3 ret
762b: 0f 1f 44 00 00 nop DWORD PTR [rax+rax*1+0x0]
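As a sanity check on the listing, a quick sketch of the arithmetic: the five `sub` immediates should add up to the `add rsp,0x4da0` in the epilogue. The 20000-byte frame is taken from the reported test case and the 128-byte red zone from the System V ABI; that LLVM leaves the last 128 bytes of the frame in the red zone here is only an inference from the numbers, not something the listing states. `total_allocated` is a made-up helper name.

```rust
/// Sums the immediates of a run of `sub rsp, imm` instructions.
fn total_allocated(subs: &[u32]) -> u32 {
    subs.iter().sum()
}

fn main() {
    // Immediates of the five `sub rsp` instructions in the listing.
    let allocated = total_allocated(&[0x1000, 0x1000, 0x1000, 0x1000, 0xda0]);
    // Matches the single `add rsp,0x4da0` before `ret`.
    assert_eq!(allocated, 0x4da0);
    // Assumption: a 20000-byte local with the last 128 bytes left in the red zone.
    assert_eq!(allocated, 20000 - 128);
    println!("allocated {allocated} bytes ({allocated:#x})");
}
```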
> which is the literal unroll of the probe-loop. What do you think @bjorn3 / @afonso360 ?
Yeah, I think this makes sense. I also checked what clang generates for AArch64 / RISC-V, and it does the same thing, so we might also have to update those backends.
> I suspect the sequence may be technically in violation of the redzone constraints of the ABI and we may need to move `rsp` downward in steps
The current instruction sequence should be fine with respect to the redzone constraints, right? When the `mov` runs there is no signal handler running, so no clobbering of the signal handler stack. And the signal handler clobbering the written data is fine, as we never read it again without a write in between.
In practice things should play out as you say, yes (so there isn't a "real" correctness bug or possibility of corruption here, AFAICT). But the ABI doc explicitly defines the redzone and Valgrind here is interpreting the stores as ordinary stack-frame stores (that would presumably contain data we want to preserve), I guess. The spec doesn't explicitly say anywhere that code must not write below `rsp - 128`, as far as I have found, but I guess it could be inferred from the description of stack frame locations together with a conservative "any store to the stack is to a stack frame" interpretation. IMHO it's best to be a bit conservative here and LLVM apparently thought the same...
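To make the distinction concrete, here is a rough sketch, assuming the 128-byte red zone of the System V x86-64 ABI and a 4 KiB guard page. The function names are made up for illustration, and the model assumes one probe per started guard page, which may not match the exact probe count Cranelift emits. It only tracks how far below `rsp` each probe store lands at the moment it executes.

```rust
/// Size of the red zone below `rsp` in the System V x86-64 ABI.
const RED_ZONE: u64 = 128;

/// Number of guard pages needed to cover the frame (rounded up).
fn probe_count(frame_size: u64, guard_size: u64) -> u64 {
    (frame_size + guard_size - 1) / guard_size
}

/// "Probe first, allocate later": `rsp` stays put, so the n-th probe
/// lands n guard pages below it.
fn probe_first_distances(frame_size: u64, guard_size: u64) -> Vec<u64> {
    (1..=probe_count(frame_size, guard_size))
        .map(|n| n * guard_size)
        .collect()
}

/// "Moving sp": every probe is preceded by a `sub rsp`, so the store
/// always targets `[rsp]` itself, i.e. distance 0.
fn moving_sp_distances(frame_size: u64, guard_size: u64) -> Vec<u64> {
    vec![0; probe_count(frame_size, guard_size) as usize]
}

/// Counts the probes that land below the red zone.
fn outside_red_zone(distances: &[u64]) -> usize {
    distances.iter().filter(|&&d| d > RED_ZONE).count()
}

fn main() {
    let (frame, guard) = (20000, 0x1000);
    let probe_first = outside_red_zone(&probe_first_distances(frame, guard));
    let moving_sp = outside_red_zone(&moving_sp_distances(frame, guard));
    println!("probe first: {probe_first} stores below the red zone");
    println!("moving sp:   {moving_sp} stores below the red zone");
}
```

Under this model every probe of the probe-first strategy lands far below `rsp - 128`, while the moving-sp strategy never stores below `rsp` at all, which is the property the conservative reading of the ABI asks for.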
LLVM’s stack probing was apparently implemented in D68720, derived from the implementation in GCC (as per this article), and the discussion there links to the GCC mailing list, which has some insights into why that specific strategy was chosen:
https://gcc.gnu.org/pipermail/gcc-patches/2017-June/477152.html:
> Most ports first probe by pages for whatever space is requested, then after all probing is done, they actually allocate space. This runs afoul of valgrind in various unpleasant ways (including crashing valgrind on two targets).
> Only x86-linux currently uses a "moving sp" allocation and probing strategy. ie, it actually allocates space, then probes the space.
--
> After much poking around I concluded that we really need to implement allocation and probing via a "moving sp" strategy. Probing into unallocated areas runs afoul of valgrind, so that's a non-starter.
So both LLVM and GCC explicitly cite “we want to please valgrind” as a reason for their implementation strategy.
I tried `rustc_codegen_cranelift` on some of my projects, and found that even though the binaries appeared to run normally, they produced errors and segfaults in `valgrind`. Looking at the disassembly, it appeared that valgrind doesn’t like the way Cranelift performs stack probing.
.clif Test Case
This is the most minimal Rust code that I came up with:
which generates the following `.clif` file:
output file `main.clif/_ZN4main4main17hf30ba8656d3abcbbE.unopt.clif`
generated by `rustc -Z codegen-backend=cranelift src/main.rs --emit=llvm-ir`
```clif
set opt_level=none
set tls_model=elf_gd
set libcall_call_conv=isa_default
set probestack_size_log2=12
set probestack_strategy=inline
set bb_padding_log2_minus_one=0
set regalloc_checker=0
set regalloc_verbose_logs=0
set enable_alias_analysis=1
set enable_verifier=0
set is_pic=1
set use_colocated_libcalls=0
set enable_float=1
set enable_nan_canonicalization=0
set enable_pinned_reg=0
set enable_atomics=1
set enable_safepoints=0
set enable_llvm_abi_extensions=1
set unwind_info=1
set preserve_frame_pointers=0
set machine_code_cfg_info=0
set enable_probestack=1
set probestack_func_adjusts_sp=0
set enable_jump_tables=1
set enable_heap_access_spectre_mitigation=1
set enable_table_access_spectre_mitigation=1
set enable_incremental_compilation_cache_checks=0
target x86_64 has_sse3=1 has_ssse3=1 has_sse41=1 has_sse42=1 has_avx=0 has_avx2=0 has_fma=0 has_avx512bitalg=0 has_avx512dq=0 has_avx512vl=0 has_avx512vbmi=0 has_avx512f=0 has_popcnt=1 has_bmi1=0 has_bmi2=0 has_lzcnt=0

function u0:8() system_v {
; symbol _ZN4main4main17hf30ba8656d3abcbbE
; instance Instance { def: Item(DefId(0:3 ~ main[b61b]::main)), args: [] }
; abi FnAbi { args: [], ret: ArgAbi { layout: TyAndLayout { ty: (), layout: Layout { size: Size(0 bytes), align: AbiAndPrefAlign { abi: Align(1 bytes), pref: Align(8 bytes) }, abi: Aggregate { sized: true }, fields: Arbitrary { offsets: [], memory_index: [] }, largest_niche: None, variants: Single { index: 0 }, max_repr_align: None, unadjusted_abi_align: Align(1 bytes) } }, mode: Ignore }, c_variadic: false, fixed_count: 0, conv: Rust, can_unwind: true }

; kind loc.idx param pass mode ty
; zst _0 () 0b 1, 8 align=8,offset=
; ret _0 - Ignore ()

; kind local ty size align (abi,pref)
; stack _1 [u32; 5000_usize] 20000b 4, 4 storage=ss0

    ss0 = explicit_slot 20000

block0:
    nop
    jump block1

block1:
    nop
;
; return
    return
}
```
Steps to Reproduce
Expected Results
When run in valgrind, this program should not produce any errors.
Actual Results
valgrind complains about out-of-bounds stack writes and then lets the program segfault on a write to an unmapped address:
valgrind output
```console
$ RUSTFLAGS="-Z codegen-backend=cranelift" cargo build
   Compiling project v0.1.0 (/tmp/project)
    Finished dev [unoptimized + debuginfo] target(s) in 0.23s
$ ./target/debug/project
$ echo $?
0
$ valgrind ./target/debug/project
==9258== Memcheck, a memory error detector
==9258== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==9258== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==9258== Command: ./target/debug/project
==9258==
==9258== Invalid write of size 4
==9258==    at 0x10F5FF: project::main (main.rs:1)
==9258==    by 0x10F68F: core::ops::function::FnOnce::call_once (function.rs:250)
==9258==    by 0x10F673: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154)
==9258==    by 0x10F720: std::rt::lang_start::{{closure}} (rt.rs:167)
==9258==    by 0x1260A6: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284)
==9258==    by 0x1260A6: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:552)
==9258==    by 0x1260A6: try
```
Versions and Environment
valgrind 3.21.0 (I realise this is not the current version, but I didn’t find anything related to stacks in the changelog for 3.22.0 in case this is a false positive in valgrind.)
Extra Info
This is the disassembly of `main`:
valgrind doesn’t like that the stack is written to before the stack pointer is moved.
Inline stack probing was introduced in #4747. Only unrolled stack probing is problematic; the probe loop is okay for valgrind because it moves the stack pointer before each write.