0xPolygonMiden / miden-vm

STARK-based virtual machine
MIT License
618 stars 156 forks source link

Rewrite MAST to use structure-of-arrays/table-based representation #1217

Closed bitwalker closed 3 months ago

bitwalker commented 7 months ago

This is the first of a small set of issues I'm opening to try and set the stage for some of the things we need on the compiler/packaging side of things, but which also bring a number of benefits to the VM as well. I'm planning to work on implementing these things once there is consensus, but want to make sure everyone feels good about the proposals and has an opportunity to provide feedback on implementation details.

Proposal

This proposal is to reimplement the in-memory representation of MAST to use a more efficient, in both space and time, structure-of-arrays/table-based representation, rather than the tree structure currently used today. Interestingly, it seems like the seeds of this approach were planted with CodeBlockTable, but didn't take the idea to its logical conclusion. This proposal does just that. I'm guessing most/all of you have a pretty good feel for the reasons why this would be an improvement just in terms of memory and raw execution speed, but to recap:

Rather than representing MAST as a literal tree in-memory, we can use a far more compact representation that is also cache-friendly, and significantly easier to query/visit. In essence, you go from a structure like this:

pub enum Tree<T> {
    Leaf(T),
    Branch(Box<Tree>, Box<Tree>),
}

To something like this:

pub struct Tree<T> {
    nodes: Vec<Node<T>>,
}

pub enum Node<T> {
    Leaf(T),
    Branch(NodeId, NodeId),
}

pub type NodeId = usize;

In essence, you have a table of nodes which can reference each other using their index in that table. Not only does this have nice perf characteristics, but it makes nodes directly accessible regardless of their depth in the tree, allows nodes to be trivially compacted if they are equivalent (e.g. MAST blocks with the same hash could be stored just once). It also becomes very easy to extend the tree with additional metadata without it being awkward to access. For example, something like so:

pub struct Tree<T> {
    nodes: Vec<Node<T>>,
    // For dense metadata; vector is the same length as `nodes`
    dense: Vec<Option<MetadataNode>>,
    // For sparse metadata
    sparse: BTreeMap<NodeId, MetadataNode>,
}

The compiler uses a similar type of structure in a few places, most importantly for representing the dataflow graph of a function. But perhaps a more notable implementation of this idea is in WebAssembly, which stores each type of "entity" in a module-scoped table, and all references to a given entity use its index in the table, rather than the entity itself.

Rationale

This change would pave the way for some other proposals, which I'll be submitting over the next day or two. The key thing this specific proposal will enable, aside from the other benefits mentioned, is straightforward serialization of the MAST representation of a program (or its component parts) in a compact format more suitable for packaging and distribution. The idea would then be to ship this representation rather than Miden Assembly sources when publishing a Miden contract/program for consumption. There are some other important properties gained from this representation, but more on that in my next issue.

bobbinth commented 7 months ago

Thank you for the great write up! This makes a lot of sense to me (and something like this should probably have been implemented right from the start). A couple of clarifying questions:

pub struct Tree<T> {
    nodes: Vec<Node<T>>,
}

pub enum Node<T> {
    Leaf(T),
    Branch(NodeId, NodeId),
}

pub type NodeId = usize;

How would we use this structure to store node types for branch node (e.g., to figure out if it it is a Join or Split node)? Would it be done via lookups into the metadata fields? If so, is there a downside to storing the metadata directly with node? For example (I'm going to use our current struct names):

pub struct Program {
    nodes: Vec<Node>,
}

pub enum Node {
    Span(Span),
    Join(Join),
    ...
}

pub struct Join {
    body: [NodeId; 2],
    hash: Digest,
}

pub type NodeId = usize;

In either case, how would we handle resolving "local" vs. "external" references. By "local", I mean a node which is present in the tree, and by "external" I mean something that lives outside of the tree (e.g., a MAST root of some well-known program).

A potential alternative approach is to use digests as node IDs:

pub struct Program {
    nodes: BTreeMap<NodeId, Node>,
}

pub enum Node {
    Span(Span),
    Join(Join),
    ...
}

pub struct Join {
    body: [NodeId; 2],
    hash: NodeId,
}

pub type NodeId = Digest;

We do lose the nice cache properties and compact memory representation here as we need to do a lookups into the BTreeMap to traverse the tree, but:

  1. I'm not sure how significant is the impact because, assuming I understood the original proposal correctly, we would still need to do lookups into the metadata tables.
  2. The overall structure is potentially simpler as it handles internal and external nodes in the same way. That is, if the node for a given NodeId cannot be found in nodes, it must be an external reference.
bobbinth commented 7 months ago

I guess one way to still use vectors could be like so:

pub struct Program {
    nodes: Vec<Node>,
}

pub enum Node {
    Span(Span),
    Join(Join),
    ...
    Ref(Ref), // reference to an external node
}

pub struct Join {
    body: [NodeId; 2],
    hash: Digest,
}

pub struct Ref {
    hash: Digest,
}

pub type NodeId = usize;

I think this should work. One of the potential downsides is that the integrity of the structure is not "self-enforced" (e.g., due to a bug, we could end up with multiple nodes having the same hash located in different indexes - which would not be possible with the BTreeMap approach) - but maybe the efficiency gains are worth it.

bitwalker commented 7 months ago

I think having a dedicated node type for external references is likely to be more ergonomic (and efficient), but you've got the right idea in terms of how to handle the various node types. The core idea is pretty simple/flexible, but the basic principle is that a Node (using your example) is only stored once (in the vector), and always referenced by an opaque handle type (i.e. NodeId).

One of the potential downsides is that the integrity of the structure is not "self-enforced" (e.g., due to a bug, we could end up with multiple nodes having the same hash located in different indexes - which would not be possible with the BTreeMap approach) - but maybe the efficiency gains are worth it.

It definitely requires a stricter API in terms of the ways in which you can let people manipulate it, but there are a number of useful patterns in Rust for dealing with this sort of thing while retaining ergonomics. In particular, the use of the newtype pattern (you'd never use pub type NodeId = usize in the actual implementation), and the "chain of responsibility" pattern (typically used for things like builders/mutators which extend/modify the core structure while enforcing all of its invariants).

The actual NodeId type would be more like:

// u32 because we can't construct a MAST with more than that number of nodes anyway
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct NodeId(u32);
impl Default for NodeId {
    fn default() -> Self {
        Self::INVALID
    }
}
impl NodeId {
    /// A sentinel value for invalid node ids can occasionally be useful
    const INVALID: Self = Self(u32::MAX);

    pub const fn as_usize(&self) -> u32 {
        self.0 as u32
    }

    pub unsafe fn from_usize(index: usize) -> Self {
        let id = Self(index.try_into().expect("invalid node id"));
        assert_ne!(id, Self::INVALID, "invalid node id");
        id
    }
}
impl std::ops::Index<NodeId> for Vec<Node> {
    type Output = Node;

    fn index(&self, id: NodeId) -> &Self::Output {
        self.index(id.as_usize())
    }
}

You can make the level of safety as rigorous as you want, but there is a balance between preventing bugs and ergonomics to consider. In particular, the next level of rigor is using generational indices, so that a NodeId is only good for as long as the original structure or node remains unmodified. A step further would involve storing a weak reference to the allocating structure in the handle as well (or some kind of unique key), that is checked against the address (or key) of the structure to which the NodeId is used, panicking if the NodeId is not valid. I don't think either of those are really necessary for our use case though, but we can certainly keep it in the back of our mind.

In general, the risk of bugs of this type is low here, because the structure is immutable once built, and is only being used to represent a single MAST. Where this pattern can run into issues in terms of bugs is when you have some kind of structure that you hand out a NodeId-like handle for, but the tree is being mutated while those handles are floating around. If an index gets reused for a different node between the time the handle is given out, and when it is used, things get weird. That's where things like generational indices come into play, but there are always edge cases.

In any case, while the efficiency gains are certainly noticeable, I don't think that's the primary benefit here anyway (just a nice to have). The main benefit of this structure is that it corresponds to a much better serializable format, and will also make it easier to introduce on-the-fly code loading of blocks which are not part of the initial tree. That latter bit is the main topic of my next proposal, but it got delayed yesterday, I'll have it up today for sure.

bobbinth commented 7 months ago

All makes sense. We will probably need to build a map Digest |-> NodeId during the compilation into MAST - but that can be discarded once the Program is built.

btw, flashing it out a bit more, I think Program could look like so:

pub struct Program {
    nodes: Vec<Node>,
    /// Node at which the program execution starts.
    entrypoint: NodeId,
    /// A list of nodes IDs which are referenced more than once.
    roots: Vec<NodeId>,
}

The roots field could be very useful for efficient serialization.

bitwalker commented 7 months ago

All makes sense. We will probably need to build a map Digest |-> NodeId during the compilation into MAST - but that can be discarded once the Program is built.

It might be useful to keep it around, since it would allow dynexec/dyncall to determine whether or not the MAST it needs is in the current tree, or needs to be fetched from the cache/object store. But we can probably strip the mapping down to just those nodes which have their hash "captured" via procref, since those are the only ones that would ever need to be looked up by digest.

The roots field could be very useful for efficient serialization.

Could make that even more compact using a dense bitset, i.e. a bit for every index in nodes, using something off the shelf, or building one on top of Vec<u64> where each u64 represents up to 64 nodes, since we're talking only a single bit of information per node.

I have some thoughts on the binary format for MAST in general, should I make that part of this proposal, or open a separate issue for that?

bobbinth commented 7 months ago
pub struct Program {
    nodes: Vec<Node>,
    /// Node at which the program execution starts.
    entrypoint: NodeId,
    /// A list of nodes IDs which are referenced more than once.
    roots: Vec<NodeId>,
}

Thinking a bit more about this, I defined roots incorrectly in the above. Specifically, these should be indexes of nodes which are roots of subtrees where all descendants have at most one parent. Or to say it another way: if we keep reference counter for each node, then roots contains nodes where reference count is greater than $1$ but all of the descendants have reference count of exactly $1$.

The main benefit I see in tracking this info (or maybe building it on the fly during serialization) is that it would allow us to serialize such sub-trees super efficiently (e.g., spending something like $4$ bits per non-leaf node).

Could make that even more compact using a dense bitset, i.e. a bit for every index in nodes, using something off the shelf, or building one on top of Vec<u64> where each u64 represents up to 64 nodes, since we're talking only a single bit of information per node.

I'm not expecting the number of such nodes to be great (probably much smaller than the total number of nodes) - so, we may not need to do anything fancy. But I guess we'll need to test out this hypothesis.

I have some thoughts on the binary format for MAST in general, should I make that part of this proposal, or open a separate issue for that?

I think a separate issue would be better. In my mind, there are actually 2 binary formats:

  1. "Fast" format which would be optimized for fast serialization/deserialization.
  2. "Compact" format which would be optimized for smallest possible representation at the expense of slower serialization/deserialization.
bobbinth commented 3 months ago

Closed by #1349.