ParkMyCar / compact_str

A memory efficient string type that can store up to 24* bytes on the stack
MIT License
557 stars 41 forks source link

Shave an instruction off `len()` #357

Open overlookmotel opened 5 months ago

overlookmotel commented 5 months ago

I believe it's possible to shave an instruction off len() (and probably as_slice() too) if static string had same discriminant as heap.

Static string could instead be represented as a heap string with capacity 0, which I assume isn't otherwise a possible state. Then:

const HEAP_MASK_AFTER_SUB: usize = (HEAP_MASK as usize)
  .wrapping_sub(LENGTH_MASK as usize);

pub fn len(&self) -> usize {
  let len_heap = ensure_read(self.1);
  let last_byte = self.last_byte() as usize;
  let mut len = last_byte.wrapping_sub(LENGTH_MASK as usize);
  let is_heap = len == HEAP_MASK_AFTER_SUB;
  len = len.min(MAX_SIZE);
  if is_heap {
    len = len_heap;
  }
  len
}

On 64-bit machines, MAX_SIZE and HEAP_MASK_AFTER_SUB are the same (24). So this shaves off a cmp instruction as len == HEAP_MASK_AFTER_SUB and .min(MAX_SIZE) both work off the result of a single cmp, rather than requiring 2. And (if I'm reading the ASM right) it also uses one less register.

https://godbolt.org/z/e7Kc4PMTE

Quite possibly handling the special case of zero-capacity heap strings (which are actually static strings) would introduce costs elsewhere, but I'm not familiar enough with the code to say. Can anyone advise if that's likely to negate the gain?

NobodyXu commented 5 months ago

It would probably affect as_str, as_bytes, reserve and as_mut_bytes, which is where the static str repr is special cased

overlookmotel commented 5 months ago

Thanks for coming back so swiftly. Do you think this is worth further investigation or, does it sound like it's not going to be viable for that reason?

NobodyXu commented 5 months ago

I think it worths giving it a shot

overlookmotel commented 5 months ago

OK cool. I'll take a closer look when I get some time.

Kijewski commented 5 months ago

This is great! In your variant the ensure_read() optimization barrier can also be removed, probably because the compiler understands your intentions better than it does with the current code:

pub fn len_new_new(c: &C) -> usize {
    let last_byte = c.5 as usize;
    let mut len = last_byte.wrapping_sub(LENGTH_MASK as usize);
    let is_heap = len == HEAP_MASK_AFTER_SUB;
    len = len.min(MAX_SIZE);
    if is_heap {
        len = c.1;
    }
    len
}
example::len_new:
        movq    8(%rdi), %rcx
        movzbl  23(%rdi), %edx
        addq    $-192, %rdx
        cmpq    $24, %rdx
        movl    $24, %eax
        cmovbq  %rdx, %rax
        cmoveq  %rcx, %rax
        retq

example::len_new_new:
        movzbl  23(%rdi), %ecx
        addq    $-192, %rcx
        cmpq    $24, %rcx
        movl    $24, %eax
        cmovbq  %rcx, %rax
        cmoveq  8(%rdi), %rax
        retq

Without the ensure_read() blackbox the compiler might optimize better down the line, e.g. omitting repeated calls to this method, because it can tell that the code is pure.

overlookmotel commented 5 months ago

Bingo! That's 2 less instructions, and 2 less registers.

I think I'd misunderstood the purpose of ensure_read(). I had thought the point of it was to kick off loading len_heap from memory as soon as possible, so that by the time you get to the cmoveq it may be ready in a register and so no stall while waiting for it to load.

Is that not the point of it at all?

And, if I may, another basic question: Aside from reducing instructions, is minimising the number of registers a function uses a useful thing to do?

(NB I'm new to all this reading the runes of ASM business, so really feeling my way here. Any pointers much appreciated.)

NobodyXu commented 5 months ago

And, if I may, another basic question: Aside from reducing instructions, is minimising the number of registers a function uses a useful thing to do?

It depends though, if it uses callee-saved register then I think it could be useful.

Looking at example::len_new_new posted by @Kijewski it already looks quite optimal to me.

mcronce commented 2 months ago

Also, if any call to the function is inlined, those saved registers can potentially have a positive impact on the caller

ParkMyCar commented 1 month ago

@overlookmotel this is great, thank you!

Static string could instead be represented as a heap string with capacity 0, which I assume isn't otherwise a possible state.

Yes! This is an idea I've had bouncing around for a bit, I think there are some optimizations we can make if we modify the discriminant a bit.

I'll whip up a PR this weekend, thanks again for the idea!

overlookmotel commented 1 month ago

Thanks for taking it up. Glad you like it. I've learned a lot from the bit-fiddling tricks this library uses, so glad to contribute an idea (even if not the implementation!)