Cranelift: inline probestack makes valgrind unhappy

narpfel commented 8 months ago

I tried rustc_codegen_cranelift on some of my projects, and found that even though the binaries appeared to run normally, they produced errors and segfaults in valgrind. Looking at the disassembly, it appeared that valgrind doesn’t like the way Cranelift performs stack probing.

`.clif` Test Case

This is the most minimal Rust code that I came up with:

fn main() {
    let _xs: [u32; 5_000];
}

which generates the following .clif file:

output file `main.clif/_ZN4main4main17hf30ba8656d3abcbbE.unopt.clif`

generated by `rustc -Z codegen-backend=cranelift src/main.rs --emit=llvm-ir` ```clif set opt_level=none set tls_model=elf_gd set libcall_call_conv=isa_default set probestack_size_log2=12 set probestack_strategy=inline set bb_padding_log2_minus_one=0 set regalloc_checker=0 set regalloc_verbose_logs=0 set enable_alias_analysis=1 set enable_verifier=0 set is_pic=1 set use_colocated_libcalls=0 set enable_float=1 set enable_nan_canonicalization=0 set enable_pinned_reg=0 set enable_atomics=1 set enable_safepoints=0 set enable_llvm_abi_extensions=1 set unwind_info=1 set preserve_frame_pointers=0 set machine_code_cfg_info=0 set enable_probestack=1 set probestack_func_adjusts_sp=0 set enable_jump_tables=1 set enable_heap_access_spectre_mitigation=1 set enable_table_access_spectre_mitigation=1 set enable_incremental_compilation_cache_checks=0 target x86_64 has_sse3=1 has_ssse3=1 has_sse41=1 has_sse42=1 has_avx=0 has_avx2=0 has_fma=0 has_avx512bitalg=0 has_avx512dq=0 has_avx512vl=0 has_avx512vbmi=0 has_avx512f=0 has_popcnt=1 has_bmi1=0 has_bmi2=0 has_lzcnt=0 function u0:8() system_v { ; symbol _ZN4main4main17hf30ba8656d3abcbbE ; instance Instance { def: Item(DefId(0:3 ~ main[b61b]::main)), args: [] } ; abi FnAbi { args: [], ret: ArgAbi { layout: TyAndLayout { ty: (), layout: Layout { size: Size(0 bytes), align: AbiAndPrefAlign { abi: Align(1 bytes), pref: Align(8 bytes) }, abi: Aggregate { sized: true }, fields: Arbitrary { offsets: [], memory_index: [] }, largest_niche: None, variants: Single { index: 0 }, max_repr_align: None, unadjusted_abi_align: Align(1 bytes) } }, mode: Ignore }, c_variadic: false, fixed_count: 0, conv: Rust, can_unwind: true } ; kind loc.idx param pass mode ty ; zst _0 () 0b 1, 8 align=8,offset= ; ret _0 - Ignore () ; kind local ty size align (abi,pref) ; stack _1 [u32; 5000_usize] 20000b 4, 4 storage=ss0 ss0 = explicit_slot 20000 block0: nop jump block1 block1: nop ; ; return return } ```

Steps to Reproduce

$ rustc -Z codegen-backend=cranelift src/main.rs
$ valgrind ./main

Expected Results

When run in valgrind, this program should not produce any errors.

Actual Results

valgrind complains about out-of-bounds stack writes and then lets the program segfault on a write to an unmapped address:

valgrind output

```console $ RUSTFLAGS="-Z codegen-backend=cranelift" cargo build Compiling project v0.1.0 (/tmp/project) Finished dev [unoptimized + debuginfo] target(s) in 0.23s $ ./target/debug/project $ echo $? 0 $ valgrind ./target/debug/project ==9258== Memcheck, a memory error detector ==9258== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al. ==9258== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info ==9258== Command: ./target/debug/project ==9258== ==9258== Invalid write of size 4 ==9258== at 0x10F5FF: project::main (main.rs:1) ==9258== by 0x10F68F: core::ops::function::FnOnce::call_once (function.rs:250) ==9258== by 0x10F673: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154) ==9258== by 0x10F720: std::rt::lang_start::{{closure}} (rt.rs:167) ==9258== by 0x1260A6: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284) ==9258== by 0x1260A6: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:552) ==9258== by 0x1260A6: try + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142) ==9258== by 0x1260A6: {closure#2} (rt.rs:148) ==9258== by 0x1260A6: do_call (panicking.rs:552) ==9258== by 0x1260A6: try (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind (panic.rs:142) ==9258== by 0x1260A6: std::rt::lang_start_internal (rt.rs:148) ==9258== by 0x10F6F5: std::rt::lang_start (rt.rs:166) ==9258== by 0x10F5F5: main (in /tmp/project/target/debug/project) ==9258== Address 0x1ffeffe5d0 is on thread 1's stack ==9258== 4096 bytes below stack pointer ==9258== ==9258== Invalid write of size 4 ==9258== at 0x10F606: project::main (main.rs:1) ==9258== by 0x10F68F: core::ops::function::FnOnce::call_once (function.rs:250) ==9258== by 0x10F673: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154) ==9258== by 0x10F720: std::rt::lang_start::{{closure}} (rt.rs:167) ==9258== by 0x1260A6: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284) ==9258== by 0x1260A6: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:552) ==9258== by 0x1260A6: try + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142) ==9258== by 0x1260A6: {closure#2} (rt.rs:148) ==9258== by 0x1260A6: do_call (panicking.rs:552) ==9258== by 0x1260A6: try (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind (panic.rs:142) ==9258== by 0x1260A6: std::rt::lang_start_internal (rt.rs:148) ==9258== by 0x10F6F5: std::rt::lang_start (rt.rs:166) ==9258== by 0x10F5F5: main (in /tmp/project/target/debug/project) ==9258== Address 0x1ffeffd5d0 is on thread 1's stack ==9258== 8192 bytes below stack pointer ==9258== ==9258== Invalid write of size 4 ==9258== at 0x10F60D: project::main (main.rs:1) ==9258== by 0x10F68F: core::ops::function::FnOnce::call_once (function.rs:250) ==9258== by 0x10F673: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154) ==9258== by 0x10F720: std::rt::lang_start::{{closure}} (rt.rs:167) ==9258== by 0x1260A6: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284) ==9258== by 0x1260A6: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:552) ==9258== by 0x1260A6: try + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142) ==9258== by 0x1260A6: {closure#2} (rt.rs:148) ==9258== by 0x1260A6: do_call (panicking.rs:552) ==9258== by 0x1260A6: try (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind (panic.rs:142) ==9258== by 0x1260A6: std::rt::lang_start_internal (rt.rs:148) ==9258== by 0x10F6F5: std::rt::lang_start (rt.rs:166) ==9258== by 0x10F5F5: main (in /tmp/project/target/debug/project) ==9258== Address 0x1ffeffc5d0 is not stack'd, malloc'd or (recently) free'd ==9258== ==9258== ==9258== Process terminating with default action of signal 11 (SIGSEGV): dumping core ==9258== Access not within mapped region at address 0x1FFEFFC5D0 ==9258== at 0x10F60D: project::main (main.rs:1) ==9258== by 0x10F68F: core::ops::function::FnOnce::call_once (function.rs:250) ==9258== by 0x10F673: std::sys_common::backtrace::__rust_begin_short_backtrace (backtrace.rs:154) ==9258== by 0x10F720: std::rt::lang_start::{{closure}} (rt.rs:167) ==9258== by 0x1260A6: call_once<(), (dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (function.rs:284) ==9258== by 0x1260A6: do_call<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panicking.rs:552) ==9258== by 0x1260A6: try + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe)> (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind<&(dyn core::ops::function::Fn<(), Output=i32> + core::marker::Sync + core::panic::unwind_safe::RefUnwindSafe), i32> (panic.rs:142) ==9258== by 0x1260A6: {closure#2} (rt.rs:148) ==9258== by 0x1260A6: do_call (panicking.rs:552) ==9258== by 0x1260A6: try (panicking.rs:516) ==9258== by 0x1260A6: catch_unwind (panic.rs:142) ==9258== by 0x1260A6: std::rt::lang_start_internal (rt.rs:148) ==9258== by 0x10F6F5: std::rt::lang_start (rt.rs:166) ==9258== by 0x10F5F5: main (in /tmp/project/target/debug/project) ==9258== If you believe this happened as a result of a stack ==9258== overflow in your program's main thread (unlikely but ==9258== possible), you can try to increase the size of the ==9258== main thread stack using the --main-stacksize= flag. ==9258== The main thread stack size used in this run was 8388608. ==9258== ==9258== HEAP SUMMARY: ==9258== in use at exit: 85 bytes in 3 blocks ==9258== total heap usage: 10 allocs, 7 frees, 2,157 bytes allocated ==9258== ==9258== LEAK SUMMARY: ==9258== definitely lost: 0 bytes in 0 blocks ==9258== indirectly lost: 0 bytes in 0 blocks ==9258== possibly lost: 0 bytes in 0 blocks ==9258== still reachable: 85 bytes in 3 blocks ==9258== suppressed: 0 bytes in 0 blocks ==9258== Rerun with --leak-check=full to see details of leaked memory ==9258== ==9258== For lists of detected and suppressed errors, rerun with: -s ==9258== ERROR SUMMARY: 4 errors from 3 contexts (suppressed: 0 from 0) [1] 9258 segmentation fault (core dumped) valgrind ./target/debug/project ```

Versions and Environment

$ rustc --version --verbose -Z codegen-backend=cranelift
rustc 1.75.0-nightly (75b064d26 2023-11-01)
binary: rustc
commit-hash: 75b064d26970ca8e7a487072f51835ebb057d575
commit-date: 2023-11-01
host: x86_64-unknown-linux-gnu
release: 1.75.0-nightly
Cranelift version: 0.101.2

valgrind 3.21.0 (I realise this is not the current version, but I didn’t find anything related to stacks in the changelog for 3.22.0 in case this is a false positive in valgrind.)

Extra Info

This is the disassembly of main:

00000000000075fb <_ZN7project4main17ha1977755d345ddbbE>:
    75fb:   55                      push   rbp
    75fc:   48 89 e5                mov    rbp,rsp
    75ff:   89 a4 24 00 f0 ff ff    mov    DWORD PTR [rsp-0x1000],esp
    7606:   89 a4 24 00 e0 ff ff    mov    DWORD PTR [rsp-0x2000],esp
    760d:   89 a4 24 00 d0 ff ff    mov    DWORD PTR [rsp-0x3000],esp
    7614:   89 a4 24 00 c0 ff ff    mov    DWORD PTR [rsp-0x4000],esp
    761b:   89 a4 24 00 b0 ff ff    mov    DWORD PTR [rsp-0x5000],esp
    7622:   48 81 ec 20 4e 00 00    sub    rsp,0x4e20
    7629:   48 81 c4 20 4e 00 00    add    rsp,0x4e20
    7630:   48 89 ec                mov    rsp,rbp
    7633:   5d                      pop    rbp
    7634:   c3                      ret

valgrind doesn’t like that the stack is written to before the stack pointer is moved.

Inline stack probing was introduced in #4747. Only unrolled stack probing is problematic, the loop is okay for valgrind as the stack pointer is moved before the write.

cfallin commented 8 months ago

cc @bjorn3

I suspect the sequence may be technically in violation of the redzone constraints of the ABI and we may need to move rsp downward in steps... if we move rsp downward all at once, before the probes, we risk putting it in other valid (unrelated) memory and then an async interruption (signal handler or whatnot) clobbers things. So perhaps the spec-compliant sequence is

sub rsp, 0x1000
mov dword [rsp], rsp
sub rsp, 0x1000
mov dword [rsp], rsp
...
sub rsp, 0xe20
mov dword [rsp], rsp

which is the literal unroll of the probe-loop. What do you think @bjorn3 / @afonso360 ?

narpfel commented 8 months ago

we may need to move rsp downward in steps...

This is basically how LLVM does it:

00000000000075e0 <_ZN7project4main17ha1977755d345ddbbE>:
    75e0:   48 81 ec 00 10 00 00    sub    rsp,0x1000
    75e7:   48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
    75ee:   00 
    75ef:   48 81 ec 00 10 00 00    sub    rsp,0x1000
    75f6:   48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
    75fd:   00 
    75fe:   48 81 ec 00 10 00 00    sub    rsp,0x1000
    7605:   48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
    760c:   00 
    760d:   48 81 ec 00 10 00 00    sub    rsp,0x1000
    7614:   48 c7 04 24 00 00 00    mov    QWORD PTR [rsp],0x0
    761b:   00 
    761c:   48 81 ec a0 0d 00 00    sub    rsp,0xda0
    7623:   48 81 c4 a0 4d 00 00    add    rsp,0x4da0
    762a:   c3                      ret
    762b:   0f 1f 44 00 00          nop    DWORD PTR [rax+rax*1+0x0]

afonso360 commented 8 months ago

which is the literal unroll of the probe-loop. What do you think @bjorn3 / @afonso360 ?

Yeah, I think this makes sense. I also checked what clang generates for AArch64 / RISC-V, and it does the same thing, so we might also have to update those backends.

bjorn3 commented 8 months ago

I suspect the sequence may be technically in violation of the redzone constraints of the ABI and we may need to move rsp downward in steps

The current instruction sequence should be fine with respect to the redzone constraints, right? When the mov runs there is no signal handler running, so no clobbering of the signal handler stack. And the signal handler clobbering the written data is fine as don't never read it again without a write in between.

cfallin commented 8 months ago

In practice things should play out as you say, yes (so there isn't a "real" correctness bug or possibility of corruption here, AFAICT). But the ABI doc explicitly defines the redzone and Valgrind here is interpreting the stores as ordinary stack-frame stores (that would presumably contain data we want to preserve), I guess. The spec doesn't explicitly say anywhere that code must not write below rsp - 128, as far as I have found, but I guess it could be inferred from the description of stack frame locations together with a conservative "any store to the stack is to a stack frame" interpretation. IMHO it's best to be a bit conservative here and LLVM apparently thought the same...

narpfel commented 8 months ago

LLVM’s stack probing was apparently implemented in D68720, derived from the implementation in GCC (as per this article), and the discussion there links to the GCC mailing list, which has some insights why that specific strategy was chosen:

https://gcc.gnu.org/pipermail/gcc-patches/2017-June/477152.html:

Most ports first probe by pages for whatever space is requested, then after all probing is done, they actually allocate space. This runs afoul of valgrind in various unpleasant ways (including crashing valgrind on two targets).

Only x86-linux currently uses a "moving sp" allocation and probing strategy. ie, it actually allocates space, then probes the space.

--

After much poking around I concluded that we really need to implement allocation and probing via a "moving sp" strategy. Probing into unallocated areas runs afoul of valgrind, so that's a non-starter.

So both LLVM and GCC explicitly cite “we want to please valgrind” as a reason for their implementation strategy.

bytecodealliance / wasmtime