fitzgen / bumpalo

A fast bump allocation arena for Rust
https://docs.rs/bumpalo
Apache License 2.0
1.41k stars 111 forks source link

Minimum alignment #241

Open CeleritasCelery opened 6 months ago

CeleritasCelery commented 6 months ago

I was reading your blog post while looking through the code and noticed that it did not utilize the "minimum alignment trick" mentioned at the end. I wanted to see how much difference it made. I implemented it and run the benchmarks but didn't see any difference in the "small" and "big" ones. The small one obviously won't benefit from alignment and the big benchmark is dominated by the overhead of copying. However I created a medium benchmark with word sized values.

#[derive(Default)]
struct Medium(usize);

In this benchmark I saw 12% throughput improvement when using minimum alignment. There is a similar improvement for allocations up to 8 words in size (much larger than that and the copying overhead starts to dominate the benchmark).

This seems like a sizable improvement. But I also understand that maybe not everyone wants to waste the extra memory if they are doing small or odd-sized allocations. Would you be open to adding this as an optional feature? Updating the code is fairly easy, but the harder part would be updating the tests, since many of them (like quickcheck) hardcode the assumption that there is no min alignment. Let me know what you think. I would submit the PR if this was something you were interested in.

fitzgen commented 6 months ago

FWIW, it isn't just the overhead of wasted space that is a potential downside here. Bumpalo also supports use cases where, if care is taken on the types of things allocated within the arena, you can read the allocated chunks out from the arena and use them directly for things like zero-copy serialization/deserialization. Inserting undefined padding bytes can break that use case.


I would expect that, for scenarios where we are allocating a type of a particular, fixed layout (eg using bump.alloc(value) and not bump.alloc_layout(some_dynamic_layout)) that the cost of aligning the bump pointer would be minimal. Have you dug into any disassemblies to confirm that there isn't something somewhere that is not being inlined or something like that?

That said, I suppose shaving off ~2 instructions from a sequence of 10-15 instructions could very well lead to a 12% speed up.

I'm not necessarily opposed to something like this, but at the same time I don't want to proliferate cargo features endlessly. They have maintenance cost, especially with regards to testing and correctness (as you noted). One thing I'd like to see worked out before going too much deeper into this would be that story.

CeleritasCelery commented 6 months ago

Those are all good considerations. Though as I am sure you already know using min alignment does not preclude reading allocated chunks directly, it just means that if you do so all allocations need to meet the minimum alignment.

For maintainability, I was thinking we could define a constant the specifies the minimum alignment.

#[cfg(feature = "min_align")]
pub const MIN_ALIGN: usize = core::mem::align_of::<usize>();
#[cfg(not(feature = "min_align"))]
pub const MIN_ALIGN: usize = core::mem::align_of::<u8>();

the only spot in non-test code that we would change is in try_alloc_layout_fast from this:

let ptr = ptr.wrapping_sub(layout.size());
let aligned_ptr = round_mut_ptr_down_to(ptr, layout.align());

to this:

#[cfg(not(feature = "min_align"))]
let aligned_ptr = {
    let ptr = ptr.wrapping_sub(layout.size());
    round_mut_ptr_down_to(ptr, layout.align())
};

#[cfg(feature = "min_align")]
let aligned_ptr = {
    let size = (layout.size() + MIN_ALIGN - 1) & !(MIN_ALIGN - 1);
    let ptr = ptr.sub(size);
    if layout.align() > MIN_ALIGN {
        round_mut_ptr_down_to(ptr, layout.align())
    } else {
        ptr
    }
};

Since the old behavior is equivalent to MIN_ALIGN=1 we would make the tests and quickcheck "MIN_ALIGN aware" using that constant, so they could handle any valid value. Though in practice it would only be 1 or word sized.

However this feature does rely on the compiler constant propagating layout. If I use alloc_layout and black box Layout so that it can't be constant propagated we see about a 15% performance regression with the min_align feature.

fn alloc_layout<T: Default>(n: usize) {
    let arena = bumpalo::Bump::with_capacity(n * std::mem::size_of::<T>());
    for _ in 0..n {
        let arena = black_box(&arena);
        let layout = std::alloc::Layout::new::<T>();
        let ptr = arena.alloc_layout(black_box(layout));
        unsafe {ptr.as_ptr().write(Default::default())};
        black_box(ptr);
    }
}

This is because the compiler is not able to inline layout.size() or layout.align() and so can't eliminate the extra code. I expect that layout not getting inlined would be rare in the real world, but it is a consideration none the less. The performance gain here relies on the compiler being able to making that optimization.

fitzgen commented 6 months ago

the only spot in non-test code that we would change is in try_alloc_layout_fast from this:

I think this could benefit from being factored out into two different versions of a helper function that is marked #[inline].

make the tests and quickcheck "MIN_ALIGN aware"

This sounds good to me.