fitzgen / bumpalo

A fast bump allocation arena for Rust
https://docs.rs/bumpalo
Apache License 2.0

Ability to create Bump from pre-existing memory allocation? #237

Closed by overlookmotel 6 months ago

overlookmotel commented 7 months ago

Related to #100.

Would you consider adding an unsafe method (called e.g. from_raw_parts) to create a Bump from an existing memory allocation?

In comparison to the request in #100, the caller would have to promise that the allocation was made by the global allocator, with the alignment that Bumpalo requires, and that the allocation now becomes owned by the Bump. So no changes would be required to Bumpalo's internals.

One circumstance where this would be useful is Rust-Javascript interop, where it'd enable re-using the same Bump for each call into Rust, rather than calling into the global allocator to create a fresh one each time.

fitzgen commented 7 months ago

This could make sense, but I'm not sure your use case actually requires this functionality?

One circumstance where this would be useful is Rust-Javascript interop, where it'd enable re-using the same Bump for each call into Rust, rather than calling into the global allocator to create a fresh one each time.

You should be able to call bump.reset() and it will retain the largest chunk that it allocated (freeing the smaller chunks). This means you should reach a steady state pretty quickly.
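
For example, a minimal sketch of that reuse pattern (illustrative code, not from oxc; `handle_calls` is a made-up function):

```rust
use bumpalo::Bump;

// Keep one Bump alive across calls and reset it between them. `reset()` frees
// all but the largest chunk, so after a few calls the arena reaches a steady
// state and stops hitting the global allocator for typical inputs.
fn handle_calls(inputs: &[&str]) {
    let mut bump = Bump::new();
    for input in inputs {
        // Allocate whatever this call needs into the arena...
        let stored: &str = bump.alloc_str(input);
        let _ = stored.len();
        // ...then hand the memory back in bulk before the next call.
        bump.reset();
    }
}
```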

The other concern is re-entry: JS calls Wasm calls JS calls Wasm again. In this scenario you explicitly don't want to be reusing the same memory because you could trample on an earlier call frame's data if you did Bump::from_raw_parts on a single allocation that is always reused. The former bump.reset() solution avoids this via the type system: it avoids unsafe code and forces you to prove uniqueness of borrows in order to call bump.reset().

overlookmotel commented 6 months ago

Sorry I've been so slow to come back to you. Your response on #234 has kicked off a lot of thoughts (and I've had to read up a lot to get to a point of understanding).

To give some context, this is what I'm trying to do: https://github.com/oxc-project/oxc/issues/2409

That's very long, so to boil it down a bit:

Oxc parser is a JavaScript parser written in Rust. We're working on NodeJS bindings for it, using napi-rs. JS ASTs are huge, heavily nested structs, and the serialization/deserialization required to transfer them from Rust to JS (currently via serde JSON) is a massive performance hit - roughly 10x the cost of the parsing itself.

Oxc stores the entire AST in a bumpalo Bump. So I have a POC which instead pulls the chunks out of the Bump and sends them to JS as buffers via napi-rs. This involves no memory copying, so is very fast. Then there's a deserializer on JS side which understands the memory layout of the native Rust types, and can convert them to JS objects.
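
Roughly, the POC does something like this (an illustrative sketch, not the actual Oxc code; `chunk_ranges` is a made-up helper):

```rust
use bumpalo::Bump;

// After the parser has allocated the AST into the arena, walk the arena's
// chunks and expose each one as a (pointer, length) pair that a binding layer
// such as napi-rs can hand to JS as a buffer, without copying the bytes.
fn chunk_ranges(bump: &mut Bump) -> Vec<(*const u8, usize)> {
    bump.iter_allocated_chunks()
        .map(|chunk| (chunk.as_ptr() as *const u8, chunk.len()))
        .collect()
}
```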

It works, and performance is good, but there are a couple of sticking points:

  1. To make the JS deserializer really efficient, we ideally want the arena to be a single chunk, aligned to a 4 GiB boundary. We can accept an upper size limit of 4 GiB too. Then converting a 64-bit pointer to an offset in the chunk buffer is as simple as taking the lower 32 bits of the pointer (see the sketch after this list).
  2. The parser will process 100s of files one after another. It would be preferable to reuse one bump, rather than create a new one for each parser run.
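
To illustrate point 1, a rough sketch of that pointer-to-offset conversion, assuming the single-chunk, 4 GiB-aligned, at-most-4 GiB layout described above (`ptr_to_offset` is just an illustrative name):

```rust
// With the chunk base aligned to a 4 GiB boundary and the chunk no larger than
// 4 GiB, a pointer's offset within the chunk is just its low 32 bits.
fn ptr_to_offset<T>(ptr: *const T) -> u32 {
    // Equivalent to `ptr - chunk_base` under the alignment/size assumptions above.
    (ptr as usize & 0xFFFF_FFFF) as u32
}
```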

The solution I have in mind is:

  1. Create a WebAssembly.Memory on JS side.
  2. Pass a mutable reference to that memory block into Rust (napi-rs handles that).
  3. Initialize a Bump using that memory block.
  4. Wrap it in ManuallyDrop so it doesn't free the memory on drop.
  5. Parser allocates AST into the Bump.
  6. Return control to JS which can read the contents of the arena from the WASM Memory object.
  7. Re-use same WASM Memory for the next call.

NB: Not actually using WASM at this point, but WebAssembly.Memory is useful as it happens to have the 4 GiB alignment property that I'm after.

The only missing element for this to work is step (3), which would need a Bump::from_raw_parts-type API.
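
To sketch what I mean (the `bump_from_raw_parts` stub below stands in for the proposed constructor, which does not exist in bumpalo; only `Bump`, `ManuallyDrop`, and `alloc_str` are real API here):

```rust
use bumpalo::Bump;
use std::mem::ManuallyDrop;

// Placeholder for the proposed `Bump::from_raw_parts`-style constructor.
// It does NOT exist in bumpalo; it is stubbed out only so the steps read as code.
unsafe fn bump_from_raw_parts(_ptr: *mut u8, _len: usize) -> Bump {
    unimplemented!("proposed API, not implemented in bumpalo")
}

// Steps 3-6 of the scheme, with `ptr`/`len` describing memory owned by JS
// (e.g. the WebAssembly.Memory buffer passed in via napi-rs).
unsafe fn parse_with_external_memory(ptr: *mut u8, len: usize) {
    // Step 3: initialize a Bump over the JS-owned block (the missing API).
    // Step 4: wrap in ManuallyDrop so dropping it never frees JS-owned memory.
    let bump = ManuallyDrop::new(bump_from_raw_parts(ptr, len));
    // Step 5: the parser allocates the AST into the arena.
    let _node: &str = bump.alloc_str("AST nodes would go here");
    // Step 6: return to JS, which reads the arena contents out of the same memory.
}
```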

It may well be that in long term Oxc should look at replacing bumpalo with a custom arena allocator tuned for our particular use case. But for purposes of developing a prototype, it'd be preferable to avoid doing that work upfront. I'm aware writing an allocator is no small task!

If the above makes sense, and you'd be willing to receive a PR to add Bump::from_raw_parts, I'd be happy to work that up.

If you see problems with what I'm suggesting, or a better way to tackle the problem, would also very much appreciate your thoughts.

fitzgen commented 6 months ago

Orthogonal to this discussion but:

NB: Not actually using WASM at this point, but WebAssembly.Memory is useful as it happens to have the 4 GiB alignment property that I'm after.

FWIW, unless this is a Node-specific property, there is no guarantee that Wasm memory is 4GiB-aligned. The maximum size of a 32-bit Wasm memory is 4GiB, but that doesn't have anything to do with its alignment.


(Writing a proper, on topic reply now)

overlookmotel commented 6 months ago

FWIW, unless this is a Node-specific property, there is no guarantee that Wasm memory is 4GiB-aligned. The maximum size of a 32-bit Wasm memory is 4GiB, but that doesn't have anything to do with its alignment.

Yes, point taken. But it appears to be reliably the case in NodeJS. I assume this is probably motivated by being able to efficiently convert 32-bit pointers in WASM into "real" 64-bit pointers.

This property may not always hold in future, and we should probably replace it later on, or provide a less efficient fallback deserializer which doesn't rely on it. But for a prototype it seems like a good starting point.

fitzgen commented 6 months ago

Do you need to allocate the memory that way? You can't let bumpalo do it via Bump::with_capacity(...)?

Because while I could imagine a Bump::into_raw and Bump::from_raw pair of methods, they would likely have the following constraints:

  1. It is only generally safe to call Bump::from_raw with the result of a previous call to Bump::into_raw. It would assume that the given thing is a pointer to an initialized ChunkFooter, which allows the Bump to figure out how much capacity there is in the chunk and things like that. That is, it would not be safe to call it with a pointer to a random allocation, especially not one managed by some other system, like V8's garbage collector.

  2. Relatedly, each chunk in a Bump would still always use the global allocator configured in the Rust program to deallocate the chunk. So, if you gave Bump::from_raw some chunk of memory that was allocated by V8's allocator, the Bump would try and pass that to Rust's global allocator for deallocation.

I guess, even given those constraints, if you were careful never to actually do anything that deallocates the bump's chunk, you could make something that worked with your scheme:

  • ✨ somehow ✨ write a ChunkFooter into your custom memory chunk from V8.
  • Use ManuallyDrop<Bump> to ensure that the drop never tries to deallocate the chunk (even when unwinding due to panics or whatever)
  • Only use the try_alloc family of methods to ensure that you only ever use your one chunk from V8
  • Audit any other Bump method you ever call to make sure it won't try and allocate new chunks or deallocate your V8 chunk

This could all work, but does seem pretty fragile to me. Especially the ChunkFooter bit. I wouldn't want to give any guarantees that implementation details this would rely on wouldn't change in new minor versions of bumpalo. But if you pinned exact versions and audited dependency updates, it could work.


Long term, I don't want to always use the global allocator for Bump. I want it to be parameterized over a generic type parameter A: Allocator that defaults to the global allocator. In this future, you would be able to implement a V8WasmMemory allocator or whatever and use this with Bump. This should work well and would be a whole lot less fragile. But this all depends on a stable allocator trait in std.

But now that I say that, we could prototype this scheme with the allocator_api2 crate, which provides a polyfill for the allocator trait that will eventually go into std in some form. If we were able to get the prototype to a place where

then I could consider merging it. But I really want to emphasize that this would have to be structured to be as uninvasive as possible and not affect all of bumpalo's existing users before I would consider that.
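
To give a rough sense of the shape, such an allocator might look something like this against allocator_api2's trait (illustrative names only; it glosses over grow/shrink, thread-safety, and everything else a real implementation would need, and assumes the region's base is already aligned at least as strictly as any requested layout):

```rust
use allocator_api2::alloc::{AllocError, Allocator};
use core::alloc::Layout;
use core::cell::Cell;
use core::ptr::NonNull;

// A toy allocator over an externally owned memory region (e.g. a
// WebAssembly.Memory buffer), roughly the shape a "V8WasmMemory" allocator
// could take if Bump were generic over `A: Allocator`.
struct ExternalRegion {
    base: NonNull<u8>,
    len: usize,
    used: Cell<usize>,
}

unsafe impl Allocator for ExternalRegion {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        // Round the current offset up to the requested alignment, then claim
        // `layout.size()` bytes if they fit within the region.
        let start = self.used.get();
        let aligned = start.checked_add(layout.align() - 1).ok_or(AllocError)? & !(layout.align() - 1);
        let end = aligned.checked_add(layout.size()).ok_or(AllocError)?;
        if end > self.len {
            return Err(AllocError); // region exhausted: hard failure, no fallback
        }
        self.used.set(end);
        // SAFETY: `aligned..end` was just checked to lie within the region.
        let ptr = unsafe { NonNull::new_unchecked(self.base.as_ptr().add(aligned)) };
        Ok(NonNull::slice_from_raw_parts(ptr, layout.size()))
    }

    unsafe fn deallocate(&self, _ptr: NonNull<u8>, _layout: Layout) {
        // The embedder (JS) owns the region and frees it; nothing to do here.
    }
}
```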

Is this something you are interested in exploring?

overlookmotel commented 6 months ago

Thanks for giving this consideration and replying at length. Really appreciate it.

Do you need to allocate the memory that way? You can't let bumpalo do it via Bump::with_capacity(...)?

No I don't need to, but...

The main problem is that Bump::with_capacity doesn't allow requesting a specific alignment, so I can't get the 4 GiB alignment I would like.

Also:

In my scheme, JS is the ultimate "driver" of the whole process, so in my mind it makes sense for JS to own the memory and rely on JS's garbage collector to free it.

There's also a technical limitation: Node-API (the NodeJS-native interop layer) has a very annoying problem with objects created in native code. If you allocate an object/block of memory in Rust and pass a reference to it to JS, then once all references to it get garbage collected, JS calls the finalizer in Rust to say "I'm done with it, you can free it now." BUT... it doesn't actually happen like that. The finalizer does not get called until the next turn of JS's event loop, which can be much later if the JS code is all synchronous. Therefore user-facing JS APIs which make large allocations in Rust are a footgun - it's very easy to exhaust the system's memory by calling them in a synchronous loop.

Secondly, some of the JS-side deserializer may later be converted to WASM for speed, so using a WebAssembly.Memory from the start would leave the door open to experiment with that.

Because while I could imagine a Bump::into_raw and Bump::from_raw pair of methods, they would likely have the following constraints:

I was imagining Bump::from_raw a bit differently. I envisaged it as taking the raw block of memory provided and initializing a chunk footer itself. Basically, I imagined Bump::with_capacity but instead of calling alloc internally, it would use the pointer/size pair provided by the caller.

The caller would be responsible for ensuring the pointer is aligned correctly, and that the Bump doesn't get dropped.

Bump::into_raw isn't required for my use case, because the memory is borrowed from JS. But in any case I thought iter_allocated_chunks_raw would largely fulfil this function.

On the requirements:

  • ✨ somehow ✨ write a ChunkFooter into your custom memory chunk from V8.

See above. I'd hope from_raw would do this for me.

  • Use ManuallyDrop<Bump> to ensure that the drop never tries to deallocate the chunk (even when unwinding due to panics or whatever)

Yes, this is a pain, but necessary.

  • Only use the try_alloc family of methods to ensure that you only ever use your one chunk from V8
  • Audit any other Bump method you ever call to make sure it won't try and allocate new chunks or deallocate your V8 chunk

I was imagining creating a 4 GiB memory block to start with, and setting a maximum size of 4 GiB on the Bump with set_allocation_limit. I thought this would protect against any further allocations/re-allocations occurring.

We can accept a hard OOM failure if we exhaust the 4 GiB allocation. In practice, no sane JavaScript file would produce an AST anything approaching this size. And Oxc also runs in WASM, where 4 GiB is the limit anyway.
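
Concretely, something like this, using only existing bumpalo API (64-bit target assumed so 4 GiB fits in usize; `four_gib_limited_bump` is just an illustrative name):

```rust
use bumpalo::Bump;

// With the allocation limit set, the Bump refuses to request more than 4 GiB
// from the system allocator; exceeding the limit is a hard allocation failure
// rather than a new chunk being allocated behind our backs.
fn four_gib_limited_bump() -> Bump {
    const FOUR_GIB: usize = 4 * 1024 * 1024 * 1024;
    let bump = Bump::new();
    bump.set_allocation_limit(Some(FOUR_GIB));
    bump
}
```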

This could all work, but does seem pretty fragile to me. Especially the ChunkFooter bit. I wouldn't want to give any guarantees that implementation details this would rely on wouldn't change in new minor versions of bumpalo. But if you pinned exact versions and audited dependency updates, it could work.

I totally agree that trying to manually construct a ChunkFooter is a terrible idea! In any case, going that way wouldn't need any API support from bumpalo - if we're going that far, we might as well take the next step of just creating the required byte pattern and transmuting it to a Bump.

Is this something you are interested in exploring?

Honest answer: maybe.

At this point, I'm at the proof-of-concept stage with the AST transfer mechanism, so my priority is to knock something together fairly quickly and just see how it works. If it turns out the mechanism I've outlined is the most performant way, then yes, I may well be interested in coming back to "do it properly" and tackling the allocator_api2 approach you've outlined.

But for now I think I'd be better off either temporarily forking bumpalo or doing the horrid DIY ChunkFooter-and-transmute thing.

Or, if you'd be willing to accept a PR for from_raw along the lines of what I've suggested above, I'd obviously be happy to work one up. But from our conversation, I can see that's probably not the right way to do it, so I'd totally understand if you don't want to include that in bumpalo's API.

fitzgen commented 6 months ago

I was imagining creating a 4 GiB memory block to start with, and setting a maximum size of 4 GiB on the Bump with set_allocation_limit. I thought this would protect against any further allocations/re-allocations occurring.

Ah yes, it would. The allocation limit stuff slipped my mind.

fitzgen commented 6 months ago

Or, if you'd be willing to accept a PR for from_raw along the lines of what I've suggested above, obviously I'd be happy to work up a PR. But from our conversation, I can see that's probably not the right way to do it, so would totally understand if you don't want to include that in bumpalo's API.

Yeah, I'd rather go the custom allocator route.

overlookmotel commented 6 months ago

OK, fair enough. I'll go ahead with one of the hacky methods for now and see if it works. May return to this and try to implement custom allocators (within the requirements you've given) further down the road.

Thanks again for walking through this with me and exploring possibilities.