ParkMyCar / compact_str

A memory efficient string type that can store up to 24* bytes on the stack

Branch in string access #268

Closed GoldsteinE closed 1 year ago

GoldsteinE commented 1 year ago

Hi!

I compiled this simple function:

pub fn deref_compact_string(string: &CompactString) -> &str {
    string
}

It generated the following asm on my x86_64 laptop:

deref_compact_string:
 mov     rax, rdi
 movzx   edx, byte ptr [rdi + 23]   ; load the last byte (the discriminant)
 lea     ecx, [rdx + 64]            ; last_byte.wrapping_sub(LENGTH_MASK)
 movzx   esi, cl
 cmp     sil, 24
 mov     ecx, 24
; That’s actually branchless
 cmovb   ecx, esi                   ; min(inline length, MAX_SIZE)
 cmp     dl, -2                     ; -2 == 0xFE == HEAP_MASK
; But that’s a branch
 jne     .LBB1_1
 mov     rdx, qword ptr [rax + 8]   ; heap length
 mov     rax, qword ptr [rax]       ; heap pointer
 ret
.LBB1_1:
 movzx   edx, cl                    ; inline length
 ret

These two `mov`s after the branch look like they come from these two assignments: https://github.com/ParkMyCar/compact_str/blob/9e00e7d6fcfd36a6f2608f3b09211128af10d64c/compact_str/src/repr/mod.rs#L367-L370

So it seems that this `if` stopped compiling to a `cmov` at some point.
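
For context, the linked assignments boil down to roughly this (a reconstruction from the patch quoted later in this thread, with surrounding details elided):

// reconstructed sketch, not the full method
let pointer_ref = &mut pointer;
let length_ref = &mut length;

// Note from the crate: keeping the conditional simple (no `else`) is meant to
// let the compiler optimize this to a conditional-move instead of a branch
if last_byte == HEAP_MASK {
    *pointer_ref = heap_pointer;
    *length_ref = heap_length;
}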

ParkMyCar commented 1 year ago

Ah shoot, thanks for filing an issue! One question just to double-check: did you compile in release mode?

I'm wondering if it has something to do with the implementation of Deref, but regardless, ideally this would be branchless. I'll look into this soon :)
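
If it is the Deref impl, that's presumably just a thin forwarder along these lines (a hypothetical sketch, not necessarily the crate's actual code), so the codegen above should all come from the repr:

impl core::ops::Deref for CompactString {
    type Target = str;

    #[inline]
    fn deref(&self) -> &str {
        // all the real work happens behind as_str, in the repr
        self.as_str()
    }
}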

GoldsteinE commented 1 year ago

> Ah shoot, thanks for filing an issue! One question just to double-check: did you compile in release mode?

Yeah, it was obtained via cargo asm, which compiles in release mode by default.

GoldsteinE commented 1 year ago

I tried to apply this trick:

let c = usize::from(last_byte == HEAP_MASK);
// index 0 points at throwaway slots, index 1 at the real variables;
// the stores are then unconditional and the selection is just indexing
let ptr_or_void = [&mut std::ptr::null(), pointer_ref];
let length_or_void = [&mut 0, length_ref];
*ptr_or_void[c] = heap_pointer;
*length_or_void[c] = heap_length;

It generates assembly that is branchless but otherwise horrible, so it's probably slower:

compact_str::deref_compact_string:
 movzx   ecx, byte ptr [rdi + 23]
 mov     qword ptr [rsp - 64], rdi
 mov     rax, qword ptr [rdi]
 lea     edx, [rcx + 64]
 movzx   edx, dl
 cmp     dl, 24
 mov     esi, 24
 cmovb   esi, edx
 movzx   edx, sil
 mov     qword ptr [rsp - 56], rdx
 mov     rdx, qword ptr [rdi + 8]
; spill the two arrays of pointers to the stack
 lea     rsi, [rsp - 16]
 mov     qword ptr [rsp - 48], rsi
 lea     rsi, [rsp - 64]
 mov     qword ptr [rsp - 40], rsi
 lea     rsi, [rsp - 8]
 mov     qword ptr [rsp - 32], rsi
 lea     rsi, [rsp - 56]
 mov     qword ptr [rsp - 24], rsi
 xor     esi, esi
 cmp     cl, -2
 sete    sil                        ; c = (last_byte == HEAP_MASK)
 mov     rcx, qword ptr [rsp + 8*rsi - 48]
 mov     qword ptr [rcx], rax
 mov     rax, qword ptr [rsp + 8*rsi - 32]
 mov     qword ptr [rax], rdx
 mov     rax, qword ptr [rsp - 64]
 mov     rdx, qword ptr [rsp - 56]
 ret

I wonder if there is a way to tell it that writing into the void can be safely optimized out.
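
For anyone who wants to experiment with the trick outside the crate, here is a self-contained version with stand-in parameters (the function and the constant are hypothetical; the value 0xFE matches the cmp dl, -2 in the assembly above):

use std::ptr;

const HEAP_MASK: u8 = 0xFE; // stand-in for the crate's heap discriminant

fn select(
    last_byte: u8,
    mut pointer: *const u8,
    mut length: usize,
    heap_pointer: *const u8,
    heap_length: usize,
) -> (*const u8, usize) {
    let c = usize::from(last_byte == HEAP_MASK);
    // throwaway targets for the "not heap" case
    let mut void_ptr: *const u8 = ptr::null();
    let mut void_len: usize = 0;
    let ptr_or_void = [&mut void_ptr, &mut pointer];
    let length_or_void = [&mut void_len, &mut length];
    // the stores are unconditional; the heap/stack choice is just the index `c`
    *ptr_or_void[c] = heap_pointer;
    *length_or_void[c] = heap_length;
    (pointer, length)
}

fn main() {
    let heap = *b"hello";
    let (p, l) = select(HEAP_MASK, ptr::null(), 5, heap.as_ptr(), heap.len());
    assert_eq!((p, l), (heap.as_ptr(), heap.len()));
    let (p, l) = select(0, ptr::null(), 5, heap.as_ptr(), heap.len());
    assert!(p.is_null() && l == 5);
}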

GoldsteinE commented 1 year ago

I was curious whether cmov actually performs any better, so I inline-asm’ed it and it kinda does.

Patch

```diff
diff --git a/compact_str/src/repr/mod.rs b/compact_str/src/repr/mod.rs
index e6a0e811..8604d387 100644
--- a/compact_str/src/repr/mod.rs
+++ b/compact_str/src/repr/mod.rs
@@ -3,6 +3,7 @@ use core::{
     mem,
     ptr,
 };
+use std::arch::asm;
 use std::borrow::Cow;
 
 #[cfg(feature = "bytes")]
@@ -353,20 +354,23 @@ impl Repr {
 
         let mut pointer = self as *const Self as *const u8;
         let heap_pointer = self.0 as *const u8;
-        let pointer_ref = &mut pointer;
-
         // initially has the value of the stack length, conditionally becomes the heap length
         let mut length = core::cmp::min((last_byte.wrapping_sub(LENGTH_MASK)) as usize, MAX_SIZE);
         let heap_length = self.1;
-        let length_ref = &mut length;
 
         // our discriminant is stored in the last byte and denotes stack vs heap
         //
         // Note: We should never add an `else` statement here, keeping the conditional simple allows
         // the compiler to optimize this to a conditional-move instead of a branch
-        if last_byte == HEAP_MASK {
-            *pointer_ref = heap_pointer;
-            *length_ref = heap_length;
+        unsafe {
+            asm!(
+                "cmp {0}, {1}",
+                "cmove {2}, {3}",
+                "cmove {4}, {5}",
+                in(reg_byte) last_byte, in(reg_byte) HEAP_MASK,
+                inout(reg) pointer, in(reg) heap_pointer,
+                inout(reg) length, in(reg) heap_length,
+            );
         }
 
         // SAFETY: We know the data is valid, aligned, and part of the same contiguous allocated
```
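
The cmp + cmove pattern pulled out into a minimal standalone function, in case someone wants to play with it in isolation (a sketch with hypothetical names, x86_64 only, not the crate's API):

use std::arch::asm;

/// Returns `on_eq` if `tag == mask`, otherwise `fallback`, via a conditional
/// move instead of a branch.
fn select_if_eq(tag: u8, mask: u8, mut fallback: u64, on_eq: u64) -> u64 {
    unsafe {
        asm!(
            "cmp {0}, {1}",
            "cmove {2}, {3}",
            in(reg_byte) tag,
            in(reg_byte) mask,
            inout(reg) fallback,
            in(reg) on_eq,
            options(pure, nomem, nostack),
        );
    }
    fallback
}

fn main() {
    assert_eq!(select_if_eq(0xFE, 0xFE, 1, 2), 2); // tags equal: take `on_eq`
    assert_eq!(select_if_eq(0x00, 0xFE, 1, 2), 1); // tags differ: keep `fallback`
}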

Most benchmarks didn’t improve significantly, but there’s one where criterion detected an improvement:

u128::to_compact_string/12345678909876543210123456789
                        time:   [31.835 ns 32.029 ns 32.283 ns]
                        change: [-4.4597% -2.8194% -1.1233%] (p = 0.00 < 0.05)
                        Performance has improved.

and some where improvement is within noise threshold:

A lot of criterion output

```
String::to_compact_string/hello world!
                        time:   [8.5981 ns 8.7600 ns 8.9191 ns]
                        change: [+0.1533% +2.0519% +4.0135%] (p = 0.04 < 0.05)
                        Change within noise threshold.

char::to_compact_string/a
                        time:   [9.4501 ns 9.4793 ns 9.5136 ns]
                        change: [-2.7346% -1.5083% -0.3400%] (p = 0.01 < 0.05)
                        Change within noise threshold.

bool::to_compact_string/true
                        time:   [2.6392 ns 2.6565 ns 2.6756 ns]
                        change: [-3.7504% -2.3054% -0.6976%] (p = 0.00 < 0.05)
                        Change within noise threshold.

isize::to_compact_string/-9999999
                        time:   [13.613 ns 13.791 ns 13.976 ns]
                        change: [-3.7152% -2.0093% -0.2719%] (p = 0.02 < 0.05)
                        Change within noise threshold.

u32::to_compact_string/54321
                        time:   [8.5940 ns 8.6496 ns 8.7202 ns]
                        change: [-1.9642% -1.0126% -0.0641%] (p = 0.04 < 0.05)
                        Change within noise threshold.
```

I’ll try to look into it more later.

GoldsteinE commented 1 year ago

Might be related to this issue: https://github.com/rust-lang/rust/issues/53823

The timing is off, though: that problem has existed since Rust 1.25, and AFAIU the branchless deref was still working after that point?