Closed GoldsteinE closed 1 year ago
Ah shoot, thanks for filing an issue! One question just to double check, did you compile in release mode?
I'm wondering if it has something to do with the implementation of Deref, but regardless ideally this would be branchless. I'll look into this soon :)
> Ah shoot, thanks for filing an issue! One question just to double check, did you compile in release mode?
Yeah, it was obtained via cargo asm, which compiles in release mode by default.
I tried to apply this trick:
let c = usize::from(last_byte == HEAP_MASK);
let ptr_or_void = [&mut std::ptr::null(), pointer_ref];
let length_or_void = [&mut 0, length_ref];
*ptr_or_void[c] = heap_pointer;
*length_or_void[c] = heap_length;
It generates assembly that is branchless but otherwise horrible, so it is probably slower:
compact_str::deref_compact_string:
    movzx ecx, byte ptr [rdi + 23]
    mov qword ptr [rsp - 64], rdi
    mov rax, qword ptr [rdi]
    lea edx, [rcx + 64]
    movzx edx, dl
    cmp dl, 24
    mov esi, 24
    cmovb esi, edx
    movzx edx, sil
    mov qword ptr [rsp - 56], rdx
    mov rdx, qword ptr [rdi + 8]
    lea rsi, [rsp - 16]
    mov qword ptr [rsp - 48], rsi
    lea rsi, [rsp - 64]
    mov qword ptr [rsp - 40], rsi
    lea rsi, [rsp - 8]
    mov qword ptr [rsp - 32], rsi
    lea rsi, [rsp - 56]
    mov qword ptr [rsp - 24], rsi
    xor esi, esi
    cmp cl, -2
    sete sil
    mov rcx, qword ptr [rsp + 8*rsi - 48]
    mov qword ptr [rcx], rax
    mov rax, qword ptr [rsp + 8*rsi - 32]
    mov qword ptr [rax], rdx
    mov rax, qword ptr [rsp - 64]
    mov rdx, qword ptr [rsp - 56]
    ret
I wonder if there is a way to tell it that writing into the void can be safely optimized out.
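For anyone who wants to poke at this outside the crate, here is a minimal self-contained sketch of the trick next to a plain conditional overwrite. The function names, the `Parts` tuple signature, and the `HEAP_MASK` value (0xFE, inferred from the `cmp cl, -2` in the listing) are assumptions for illustration, not the crate's actual code.

```rust
// Assumed value, inferred from `cmp cl, -2` in the listing above.
const HEAP_MASK: u8 = 0xFE;

// Hypothetical stand-in for the (pointer, length) pair that deref produces.
type Parts = (*const u8, usize);

/// Baseline: plain conditional overwrite. The compiler may emit a branch or a cmov.
fn select_if(last_byte: u8, inline: Parts, heap: Parts) -> Parts {
    let (mut pointer, mut length) = inline;
    if last_byte == HEAP_MASK {
        pointer = heap.0;
        length = heap.1;
    }
    (pointer, length)
}

/// The trick: always perform the write, but aim it at a dummy slot
/// ("the void") when the string is not heap-allocated.
fn select_void(last_byte: u8, inline: Parts, heap: Parts) -> Parts {
    let (mut pointer, mut length) = inline;
    let (mut void_ptr, mut void_len): Parts = (std::ptr::null(), 0);

    // 0 selects the dummy slot, 1 selects the real destination.
    let c = usize::from(last_byte == HEAP_MASK);
    let ptr_or_void = [&mut void_ptr, &mut pointer];
    let length_or_void = [&mut void_len, &mut length];

    *ptr_or_void[c] = heap.0;
    *length_or_void[c] = heap.1;
    (pointer, length)
}
```

Both functions return the same result for every `last_byte`; the only difference is how the selection is expressed, which is what makes it convenient to compare their output in cargo asm.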
I was curious whether cmov actually performs any better, so I inline-asm’ed it and it kinda does.
Most benchmarks didn’t significantly improve, but there’s one where criterion detected improvement:
u128::to_compact_string/12345678909876543210123456789
time: [31.835 ns 32.029 ns 32.283 ns]
change: [-4.4597% -2.8194% -1.1233%] (p = 0.00 < 0.05)
Performance has improved.
and some where improvement is within noise threshold:
I’ll try to look into it more later.
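A rough sketch of what forcing a cmov through inline asm can look like, for anyone who wants to repeat the experiment. `select_cmov` is a hypothetical helper and not code from compact_str or from the benchmarked patch; the non-x86_64 fallback is only there so the snippet builds on other targets.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::asm;

/// Returns `if_true` when `cond` holds, otherwise `if_false`,
/// using an explicit `cmovne` instead of relying on the optimizer.
#[cfg(target_arch = "x86_64")]
fn select_cmov(cond: bool, if_true: usize, if_false: usize) -> usize {
    let flag = cond as u32;
    let mut out = if_false;
    // SAFETY: the asm only reads and writes the listed registers and the
    // flags register; it does not touch memory or the stack.
    unsafe {
        asm!(
            "test {flag:e}, {flag:e}",  // set ZF = (flag == 0)
            "cmovne {out}, {val}",      // if flag != 0, out = if_true
            flag = in(reg) flag,
            out = inout(reg) out,
            val = in(reg) if_true,
            options(pure, nomem, nostack),
        );
    }
    out
}

/// Portable fallback (assumption: only needed so the sketch compiles elsewhere).
#[cfg(not(target_arch = "x86_64"))]
fn select_cmov(cond: bool, if_true: usize, if_false: usize) -> usize {
    if cond { if_true } else { if_false }
}
```

Calling it once for the pointer and once for the length, with `cond` being the heap check, gives a deref without a branch by construction. Whether that beats what LLVM generates on its own still has to be measured.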
Might be related to this issue: https://github.com/rust-lang/rust/issues/53823
The timing is off though. This problem exists from 1.25 onwards, and AFAIU branchless deref was working after that moment?
Hi!
I compiled this simple function:
It generated the following asm on my x86_64 laptop:
These two movs after the branch feel like they come from these two assignments: https://github.com/ParkMyCar/compact_str/blob/9e00e7d6fcfd36a6f2608f3b09211128af10d64c/compact_str/src/repr/mod.rs#L367-L370
So it seems that this if stopped compiling to cmov at some point.