fitzgen / bumpalo

A fast bump allocation arena for Rust
https://docs.rs/bumpalo
Apache License 2.0

Allocating and iterating forward? #101

Closed cemeyer closed 3 years ago

cemeyer commented 3 years ago

Hi,

Is there any reason Bumpalo allocates from high to low addresses? Or could the bump direction be a policy that bumpalo is parameterized on?

I would like to use something like bumpalo as an abstraction for a zero-copy message-encoding allocator. I have a proof-of-concept using Vec<Box<[u64]>> today, with similar mut borrows for allocations (but much less intelligent resizing-in-place / new chunk sizing logic). If I can ditch my unsafe code for something with much smarter minds behind it, I’d really prefer that!
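
To give a flavor of that proof of concept's shape, a stripped-down sketch (not the real code, with the sizing logic simplified and most of the unsafe elided) might look like:

    struct U64Arena {
        chunks: Vec<Box<[u64]>>,
        used: usize, // u64 words consumed in the last chunk
    }

    impl U64Arena {
        fn new() -> Self {
            U64Arena { chunks: Vec::new(), used: 0 }
        }

        fn alloc(&mut self, words: usize) -> *mut u64 {
            // Naive chunk sizing: start a fresh chunk whenever the current
            // one can't fit the request (no in-place resizing, no doubling).
            let fits = self
                .chunks
                .last()
                .map_or(false, |c| self.used + words <= c.len());
            if !fits {
                self.chunks.push(vec![0u64; words.max(1024)].into_boxed_slice());
                self.used = 0;
            }
            let chunk = self.chunks.last_mut().unwrap();
            let p = unsafe { chunk.as_mut_ptr().add(self.used) };
            self.used += words;
            p
        }
    }

The Vec<Box<[u64]>> layering matters because the boxed chunks never move when the outer Vec reallocates, so pointers into older chunks stay valid.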

I think the only real change needed would be to adjust the internal ptr cursor initialization and replace some checked_sub with checked_add. But I don’t want to fork the crate just to do that.

Thoughts? I looked for closed issues and didn’t see the question raised earlier. Thanks!

Edit: I found https://fitzgeraldnick.com/2019/11/01/always-bump-downwards.html and am starting to read it, but have not finished yet.

Edit2: The post's additional branch for bumping up is only needed if you allow huge, nonsensical alignments in tandem with backing chunks that end arbitrarily close to usize::MAX as *mut _. If you limit alignments to reasonable values (like 64, which is the most you need for AVX-512, or at most maybe 4 kB) and ensure your allocated chunks end at or below usize::MAX - MAX_ALIGN, you can skip the extra condition on each individual allocation (sketched below).

(My requirements happen to be even less general than that -- all allocations have the same u64 alignment, and we could monomorphize on that, so the alignment branches would go away anyway. But I understand that isn't generally true for Bumpalo consumers. A version of Bumpalo that monomorphized on some T might be generally useful?) Thanks!
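
To make that concrete, here is a minimal, hypothetical sketch of bumping up with a capped MAX_ALIGN and chunks that are guaranteed to end at or below usize::MAX - MAX_ALIGN; this is illustrative only, not bumpalo's code:

    const MAX_ALIGN: usize = 64;

    pub struct BumpUp {
        ptr: *mut u8, // next free byte in the current chunk
        end: *mut u8, // one past the last byte of the current chunk
    }

    impl BumpUp {
        pub unsafe fn alloc_bounded(&mut self, size: usize, align: usize) -> *mut u8 {
            debug_assert!(align > 0 && align <= MAX_ALIGN);
            debug_assert!(align.is_power_of_two());
            // Assumed invariant: the chunk ends at or below
            // usize::MAX - MAX_ALIGN, so rounding the bump pointer up by at
            // most MAX_ALIGN - 1 cannot wrap around.
            debug_assert!(self.end as usize <= usize::MAX - MAX_ALIGN);

            let aligned = (self.ptr as usize + (align - 1)) & !(align - 1);

            // Adding the size can still overflow for absurd sizes, so that
            // check stays.
            let new_ptr = match aligned.checked_add(size) {
                Some(p) => p,
                None => return std::ptr::null_mut(),
            };
            if new_ptr > self.end as usize {
                // Didn't have enough capacity!
                return std::ptr::null_mut();
            }
            self.ptr = new_ptr as *mut u8;
            aligned as *mut u8
        }
    }

With those two assumed invariants, the alignment round-up cannot wrap, so only the size overflow check and the capacity check remain.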

Edit3: I think you could also just use a different alignment algorithm when bumping up in the general case and avoid the extra branch. Here is the original algorithm described in the blog post:

    pub unsafe fn alloc(
        &mut self,
        size: usize,
        align: usize,
    ) -> *mut u8 {
        debug_assert!(align > 0);
        debug_assert!(align.is_power_of_two());

        let ptr = self.ptr as usize;

        // Round the bump pointer up to the requested
        // alignment.
        let aligned =
            try_null!(ptr.checked_add(align - 1)) & !(align - 1);

        let new_ptr = try_null!(aligned.checked_add(size));

        let end = self.end as usize;
        if new_ptr > end {
            // Didn't have enough capacity!
            return std::ptr::null_mut();
        }

        self.ptr = new_ptr as *mut u8;
        aligned as *mut u8
    }
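
(For context, try_null! is the post's small helper that bails out with a null pointer when a checked operation returns None; something along the lines of:)

    macro_rules! try_null {
        ($e:expr) => {
            match $e {
                Some(v) => v,
                None => return std::ptr::null_mut(),
            }
        };
    }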

Here is a similar algorithm that avoids the extra alignment check while bumping up:

    pub unsafe fn alloc2(
        &mut self,
        size: usize,
        align: usize,
    ) -> *mut u8 {
        debug_assert!(align > 0);
        debug_assert!(align.is_power_of_two());
        debug_assert!(size <= isize::MAX as _ && align <= isize::MAX as _);

        let ptr = self.ptr as usize;

        // Round the bump pointer down to the requested
        // alignment. ptr is non-null, so we can subtract one
        // without underflow. The resulting aligned value is always
        // exactly 'align' below the pointer we will return.
        let aligned = (ptr - 1) & !(align - 1);
        // Size and align must be <= isize::MAX in Rust; their addition cannot overflow usize.
        let new_ptr = try_null!(aligned.checked_add(size + align));

        let end = self.end as usize;
        if new_ptr > end {
            // Didn't have enough capacity!
            return std::ptr::null_mut();
        }
        self.ptr = new_ptr as *mut u8;
        // If new_ptr didn't overflow, this cannot overflow either:
        (aligned + align) as *mut u8
    }

And indeed, Godbolt shows it has two branches instead of three, the same count as bumping down. It is still a handful of instructions longer than bumping down (13 vs. 17), but none of the extra instructions are branches.

The essential insight is that (ptr + (A - 1)) & !(A - 1) == ((ptr - 1) & !(A - 1)) + A for any ptr >= 1 and power-of-two A, thanks to modular arithmetic.
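
A quick brute-force sanity check of that identity (my own throwaway test, not from the post):

    fn main() {
        for align in [1usize, 2, 4, 8, 16, 64, 4096] {
            for ptr in 1usize..10_000 {
                let round_up = (ptr + (align - 1)) & !(align - 1);
                let down_then_bump = ((ptr - 1) & !(align - 1)) + align;
                assert_eq!(round_up, down_then_bump);
            }
        }
        println!("identity holds for all sampled ptr/align pairs");
    }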

As far as gaming the microbenchmark goes, both bumping up and bumping down might benefit from LLVM laying out the non-error paths in straight-line order? But maybe it knows more about optimizing x86 branches than I do.

fitzgen commented 3 years ago

Looks like you found the discussion of why bumpalo bumps downwards. :+1:

I'm not interested in adding some kind of flag to bumpalo to toggle upwards vs downwards bumping, as this would induce higher maintenance overhead, add more chances for bugs to arise (which would, in turn, require testing more combinations), and, finally, complicate the user-facing API.

Maintaining a special-purpose bump arena for your specialized use case seems reasonable, or you can use bumpalo version 2.X, from before we switched to bumping downwards, if it fits your needs. Note that I won't be maintaining the 2.X crate, however, so I won't be publishing bug fix releases for it or anything like that. Forking the 2.X version and customizing it for your needs also seems reasonable.

Good luck!

cemeyer commented 3 years ago

Totally makes sense -- I wasn't proposing optional bump direction, but actually switching (back) to bumping up. But I think bumping down might be fine for my needs, anyway. :-)