0xPolygonMiden / miden-vm

STARK-based virtual machine

Performance of current decorator representation #1497

Open bitwalker opened 5 days ago

bitwalker commented 5 days ago

Issue

While investigating 0xPolygonMiden/compiler#317, I discovered via flamegraph analysis that up to 30% of program runtime was being spent just dropping the vectors used to store decorators.

The conditions in which this occurs are:

The issue was noticed in our test suite, prior to the change in this repo to allocate Library instances in Arc and clone those rather than the actual Library. Even though the compiler was caching instances of Library and only cloning them (similar to what is now done in next), we still observed excessive time spent in drop_in_place.

Impact

However, even though the recent changes in next make this a non-issue for miden-stdlib (and I believe miden-base, though I'm not sure whether that library is also allocated with a 'static lifetime), for the time being it will affect any other library, even one allocated via Arc. The use of Arc does mean that clones are cheap and don't require dropping; however, the actual Library still gets dropped once the last referencing Arc is dropped. In one-shot, compiler-like scenarios this is a small difference, and the time spent dropping the data will still impact the user-perceived execution time of the compiler/assembler and executor.
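To make the cost profile concrete, here is a minimal, self-contained sketch (using hypothetical stand-in types, not the real Library/MastNode definitions) of why Arc makes clones cheap but does not eliminate the final drop:

use std::sync::Arc;

// Hypothetical stand-ins for illustration only; the real types are more involved.
struct Node {
    before_enter: Vec<u32>,
    after_exit: Vec<u32>,
}

struct Library {
    nodes: Vec<Node>,
}

fn main() {
    let lib = Arc::new(Library {
        nodes: (0..100_000u32)
            .map(|i| Node { before_enter: vec![i], after_exit: vec![i] })
            .collect(),
    });

    // Cloning the Arc is a single reference-count increment; no per-node work.
    let clone = Arc::clone(&lib);
    drop(clone); // dropping a clone is likewise just a decrement

    // Dropping the last Arc still runs the full drop glue: every Node, and both
    // of its Vecs, must be dropped individually.
    drop(lib);
}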

The compiler is also planning to switch to a rustup-like toolchain management approach to the libraries and tools it uses, which means that there will be no benefit to the miden-stdlib strategy of caching the Library allocations in a static. All libraries, regardless of whether they are fundamental or user-defined, will be treated the same, so this affects all programs, not just a subset.

Background

NOTE: The way decorators are managed was recently changed in next, but it may actually make the problem worse rather than better, as we now have two Vecs for decorators per node (to partition before-enter and after-exit decorators), rather than just one.

As mentioned above, the problem is that every MastNode must be dropped individually when dropping a MastForest. If each MastNode also carries two Vecs, both vectors must be dropped as well whenever they are non-empty, which is almost always the case when debug info is enabled.

More generally though, any heap-allocated type stored in a MastNode would present a similar issue. Vec is particularly bad because all of its elements must be dropped as well, but the problem is less about Vec specifically and more about the sheer amount of drop glue that gets executed when a MastForest is dropped - it ends up being a lot.

As programs grow in size, this problem becomes more egregious - each new function adds anywhere from tens to hundreds of nodes, so the time spent dropping things grows superlinearly.
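For reference, a simplified sketch of the kind of shape that produces this drop glue (the names approximate the current representation, but the real definitions differ):

// Simplified stand-ins for illustration; the real definitions are more involved.
type MastNodeId = u32;
type DecoratorId = u32;
type RpoDigest = [u64; 4];

pub struct MastForest {
    nodes: Vec<MastNode>,
}

pub enum MastNode {
    Join(JoinNode),
    // ... other node kinds elided
}

pub struct JoinNode {
    children: [MastNodeId; 2],
    digest: RpoDigest,
    // Two heap allocations per node; with debug info enabled these are usually
    // non-empty, so dropping a MastForest walks every node and frees both Vecs.
    before_enter: Vec<DecoratorId>,
    after_exit: Vec<DecoratorId>,
}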

Proposed Solution

Various refactorings of decorators and MAST are still ongoing, some of which may address this issue. However, I have a specific proposal that I think would also tie into those changes, so rather than wait to see how they play out, I wanted to outline one option here so that it can be considered alongside the totality of the other changes being made.

Specifically, I'd like to propose moving to a model where a MastForest (and all of its contents) is arena-allocated. A Library would consist not only of the same items it does today, but also of the arena used to allocate the contents of its MastForest. This is along the lines of what my original design for the MAST refactoring proposed, i.e. the "table-based" structure, which was based on the way the compiler structures its dataflow/control-flow graphs; the actual entities in those structures are arena-allocated. We didn't go that far with the implementation of the proposal, but that was primarily to keep the initial implementation simpler so that it could evolve a bit first.

The benefits of a region-based memory management scheme for MastForest are as follows:

/// A smart-pointer type for arena-allocated `MastNode` impls, with dynamically-checked borrow rules
///
/// NOTE: this sketch leans on unstable features (`allocator_api`, `coerce_unsized`, `unsize`)
pub struct MastNodeHandle<T: MastNode, A: Allocator> {
    arena: Arc<A>,
    data: NonNull<MastNodeObject<T>>,
}
impl<T: MastNode, A: Allocator> Clone for MastNodeHandle<T, A> {
    fn clone(&self) -> Self {
        Self { arena: Arc::clone(&self.arena), data: self.data }
    }
}
impl<T: MastNode, A: Allocator> MastNodeHandle<T, A> {
    pub fn id(&self) -> MastNodeId {
        let offset = core::mem::offset_of!(MastNodeObject<T>, id);
        // SAFETY: `data` points to a live `MastNodeObject<T>` owned by `arena`
        unsafe { *self.data.byte_add(offset).cast::<MastNodeId>().as_ref() }
    }
    pub fn borrow(&self) -> Ref<'_, T> {
        let offset = core::mem::offset_of!(MastNodeObject<T>, node);
        unsafe { self.data.byte_add(offset).cast::<RefCell<T>>().as_ref().borrow() }
    }
    pub fn borrow_mut(&self) -> RefMut<'_, T> {
        let offset = core::mem::offset_of!(MastNodeObject<T>, node);
        unsafe { self.data.byte_add(offset).cast::<RefCell<T>>().as_ref().borrow_mut() }
    }
}
/// Allow conversions like `MastNodeHandle<T, A>` to `MastNodeHandle<dyn MastNode, A>`
impl<T, U, A: Allocator> CoerceUnsized<MastNodeHandle<U, A>> for MastNodeHandle<T, A>
where
    T: ?Sized + Unsize<U>,
    U: ?Sized,
{}

/// The type allocated in the arena for a `MastNode` impl
struct MastNodeObject<T> {
    id: MastNodeId,
    node: RefCell<T>,
}

/// An abstraction over arena-allocated decorators might look similar to this,
/// all APIs expect a reference to the arena, but the `DecoratorList` itself is
/// basically a handle that the arena uses to access/manipulate the actual
/// data in the arena
pub struct DecoratorList {
    index: DecoratorIndex,
    len: u32,
}
impl DecoratorList {
    pub fn with_capacity(len: usize, arena: &mut impl MastArena) -> Self {
        arena.alloc_decorator_list(len)
    }
    pub fn as_slice<'a>(&self, arena: &'a impl MastArena) -> &'a [DecoratorId] {
        arena.get_decorators(self)
    }
    pub fn push(&mut self, decorator: Decorator, arena: &mut impl MastArena) -> DecoratorId {
        arena.push_decorator(self, decorator)
    }
}
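
// For illustration only (hypothetical usage; assumes some `MastArena` impl bound to `arena`):
// the `DecoratorList` itself is just a small (index, len) handle, and all of the
// actual storage lives in the arena.
//
//     let mut decorators = DecoratorList::with_capacity(2, &mut arena);
//     decorators.push(some_decorator, &mut arena);
//     for id in decorators.as_slice(&arena) {
//         // resolve `id` against the forest's decorator table as needed
//     }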

/// Rather than using a giant enumeration for nodes, we can use a trait
///
/// By implementing `Any`, we can cast `&dyn MastNode` to `&T`, as well as check
/// if a given `dyn MastNode` is an instance of a specific type, e.g. `node.is::<JoinNode>()`
pub trait MastNode: core::any::Any {
    fn to_pretty_print<'a>(&self, forest: &'a MastForest) -> impl PrettyPrint + 'a;
    fn domain(&self) -> Felt;
    fn digest(&self) -> RpoDigest;
    fn before_enter(&self) -> &DecoratorList;
    fn after_exit(&self) -> &DecoratorList;
}

/// We can define traits like this to abstract over things like executing a node, rather than
/// using a big enum dispatch table
pub trait ExecutableMastNode: MastNode {
    fn execute<H: Host>(&self, program: &MastForest, process: &mut Process<H>) -> Result<(), ExecutionError>;
}

/// An example node type (this is allocated in the arena, and never passed around by value)
pub struct JoinNode {
    /// We use MastNodeHandle<dyn MastNode> here so that we can immediately access
    /// the data of those children without having to do a BTreeMap lookup, or index into a Vec,
    /// we already have a pointer directly to the data
    children: [MastNodeHandle<dyn MastNode>; 2],
    digest: RpoDigest,
    before_enter: DecoratorList,
    after_exit: DecoratorList,
}

/// The `MastNode` impl is pretty straightforward
impl MastNode for JoinNode {
    fn to_pretty_print<'a>(&self, forest: &'a MastForest) -> impl PrettyPrint + 'a { todo!() }
    fn domain(&self) -> Felt { Self::DOMAIN }
    fn digest(&self) -> RpoDigest { self.digest }
    fn before_enter(&self) -> &DecoratorList { &self.before_enter }
    fn after_exit(&self) -> &DecoratorList { &self.after_exit }
}

/// And makes it pretty easy to define node-specific behaviors
impl ExecutableMastNode for JoinNode {
    fn execute<H: Host>(&self, program: &MastForest, process: &mut Process<H>) -> Result<(), ExecutionError> {
        process.start_join_node(self, program)?;
        // `first()`/`second()` are assumed accessors for `children[0]`/`children[1]`
        process.execute_mast_node(self.first(), program)?;
        process.execute_mast_node(self.second(), program)?;
        process.end_join_node(self, program)
    }
}

That was a fairly involved example, but hopefully it makes the benefits obvious. I haven't yet gotten into what the actual arena implementation would look like, because it could take multiple forms. My basic recommendation would be:
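Not as a specific recommendation, but purely to illustrate one of the forms it could take (all names below are hypothetical), the decorator side of such an arena could be as simple as a few flat vectors owned by the forest:

// Hypothetical sketch: all decorator storage lives in flat Vecs owned by the
// arena, so dropping the whole forest frees a handful of allocations instead
// of two Vecs per node.
type DecoratorId = u32;

#[derive(Clone, Copy)]
pub struct DecoratorRange {
    offset: u32,
    len: u32,
}

#[derive(Default)]
pub struct MastForestArena {
    decorator_ids: Vec<DecoratorId>,
}

impl MastForestArena {
    /// Copy a list of decorator ids into the arena, returning a compact handle.
    pub fn alloc_decorators(&mut self, ids: &[DecoratorId]) -> DecoratorRange {
        let offset = self.decorator_ids.len() as u32;
        self.decorator_ids.extend_from_slice(ids);
        DecoratorRange { offset, len: ids.len() as u32 }
    }

    /// Resolve a handle back into a slice borrowed from the arena.
    pub fn decorators(&self, range: DecoratorRange) -> &[DecoratorId] {
        let start = range.offset as usize;
        &self.decorator_ids[start..start + range.len as usize]
    }
}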


This turned into a combination bug report/enhancement proposal, but hopefully the bug report part of it provides some context for why such a proposal is worth considering at this juncture.

While I've gotten into the details in my proposal, I obviously haven't considered everything, so I expect we'll need to examine how this change would impact existing APIs and workflows. I'm fairly confident we can keep the fact that arenas are used at all mostly an internal implementation detail, but internally it would certainly impact a number of places in the code (mostly in the construction of MastForests and in their execution). Some edge cases, such as merging MastForests, would require close attention, but would mostly continue to work as they do now.

I suspect the most annoying effect of this change would be that MastNodeHandle cannot implement Deref or AsRef, because borrows of the underlying data must be dynamically checked and guarded: you call .borrow() or .borrow_mut() to perform the check and obtain the guard, which does implement those traits, but that is obviously an extra step. In many cases you can do this in just one place and then use a plain reference for the lifetime of the borrow guard, so the guard type itself doesn't need to appear in any APIs, but it does impose some constraints.
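Concretely, the extra step might look something like this (hypothetical usage of the handle type sketched above):

// One dynamically-checked borrow up front, then plain references for the
// lifetime of the guard; the guard type never needs to appear in any API.
fn node_digest<A: Allocator>(handle: &MastNodeHandle<JoinNode, A>) -> RpoDigest {
    let guard = handle.borrow();   // runtime borrow check, returns a guard
    let node: &JoinNode = &guard;  // deref to a plain reference
    node.digest()
}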

bobbinth commented 3 days ago

This is a great write-up - thank you! I really like the idea of representing nodes as dynamic objects. This does look like a fairly significant refactoring though, and I do wonder whether we should attempt it at this moment. Maybe the approach could be:

bitwalker commented 3 days ago

@bobbinth Agreed, I was imagining this would supersede #1490, while also addressing the more general problem of MastForest drop glue. It's possible that the DecoratorSpan implementation will address the issue with dropping decorators specifically, at least to the point where it gives us time to approach the larger refactoring proposed here, so I think it's a perfectly valid approach to start there and work our way up to something like what I've described above as/if needed.