Closed bitwalker closed 3 months ago
Thank you for the great write up! This makes a lot of sense to me (and something like this should probably have been implemented right from the start). A couple of clarifying questions:
```rust
pub struct Tree<T> {
    nodes: Vec<Node<T>>,
}

pub enum Node<T> {
    Leaf(T),
    Branch(NodeId, NodeId),
}

pub type NodeId = usize;
```
How would we use this structure to store node types for branch nodes (e.g., to figure out whether a node is a `Join` or a `Split`)? Would it be done via lookups into the metadata fields? If so, is there a downside to storing the metadata directly with the node? For example (I'm going to use our current struct names):
```rust
pub struct Program {
    nodes: Vec<Node>,
}

pub enum Node {
    Span(Span),
    Join(Join),
    ...
}

pub struct Join {
    body: [NodeId; 2],
    hash: Digest,
}

pub type NodeId = usize;
```
In either case, how would we handle resolving "local" vs. "external" references? By "local", I mean a node which is present in the tree, and by "external" I mean something that lives outside of the tree (e.g., a MAST root of some well-known program).
A potential alternative approach is to use digests as node IDs:
```rust
pub struct Program {
    nodes: BTreeMap<NodeId, Node>,
}

pub enum Node {
    Span(Span),
    Join(Join),
    ...
}

pub struct Join {
    body: [NodeId; 2],
    hash: NodeId,
}

pub type NodeId = Digest;
```
We do lose the nice cache properties and compact memory representation here, as we need to do lookups into the `BTreeMap` to traverse the tree, but: if a `NodeId` cannot be found in `nodes`, it must be an external reference.

I guess one way to still use vectors could be like so:
```rust
pub struct Program {
    nodes: Vec<Node>,
}

pub enum Node {
    Span(Span),
    Join(Join),
    ...
    Ref(Ref), // reference to an external node
}

pub struct Join {
    body: [NodeId; 2],
    hash: Digest,
}

pub struct Ref {
    hash: Digest,
}

pub type NodeId = usize;
```
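For what it's worth, a rough sketch of how traversal could resolve the two kinds of references under this scheme. The external store and the "external root is node 0" convention are illustrative assumptions, not anything from the actual codebase:

```rust
use std::collections::BTreeMap;

// Simplified stand-ins for the real types discussed above.
type Digest = [u8; 32];
type NodeId = usize;

enum Node {
    Span(String), // leaf (payload simplified to a label)
    Join { body: [NodeId; 2] },
    Ref { hash: Digest }, // reference to a node outside this tree
}

struct Program {
    nodes: Vec<Node>,
}

// Hypothetical cache/object store holding well-known programs by MAST root.
struct ExternalStore {
    programs: BTreeMap<Digest, Program>,
}

// Depth-first visit that collects leaf labels; a `Ref` node is resolved
// through the external store and traversal continues from that program's
// root (assumed here to be node 0).
fn visit(program: &Program, id: NodeId, store: &ExternalStore, out: &mut Vec<String>) {
    match &program.nodes[id] {
        Node::Span(label) => out.push(label.clone()),
        Node::Join { body } => {
            visit(program, body[0], store, out);
            visit(program, body[1], store, out);
        }
        Node::Ref { hash } => {
            let external = store.programs.get(hash).expect("unknown MAST root");
            visit(external, 0, store, out);
        }
    }
}
```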
I think this should work. One of the potential downsides is that the integrity of the structure is not "self-enforced" (e.g., due to a bug, we could end up with multiple nodes having the same hash located at different indexes - which would not be possible with the `BTreeMap` approach) - but maybe the efficiency gains are worth it.
I think having a dedicated node type for external references is likely to be more ergonomic (and efficient), but you've got the right idea in terms of how to handle the various node types. The core idea is pretty simple/flexible, but the basic principle is that a `Node` (using your example) is only stored once (in the vector), and always referenced by an opaque handle type (i.e. `NodeId`).
> One of the potential downsides is that the integrity of the structure is not "self-enforced" (e.g., due to a bug, we could end up with multiple nodes having the same hash located at different indexes - which would not be possible with the `BTreeMap` approach) - but maybe the efficiency gains are worth it.
It definitely requires a stricter API in terms of the ways in which you can let people manipulate it, but there are a number of useful patterns in Rust for dealing with this sort of thing while retaining ergonomics. In particular, the use of the newtype pattern (you'd never use `pub type NodeId = usize` in the actual implementation), and the "chain of responsibility" pattern (typically used for things like builders/mutators which extend/modify the core structure while enforcing all of its invariants).
The actual `NodeId` type would be more like:
```rust
// u32 because we can't construct a MAST with more than that number of nodes anyway
#[derive(Debug, Copy, Clone, PartialEq, Eq, PartialOrd, Ord, Hash)]
pub struct NodeId(u32);

impl Default for NodeId {
    fn default() -> Self {
        Self::INVALID
    }
}

impl NodeId {
    /// A sentinel value for invalid node ids can occasionally be useful
    const INVALID: Self = Self(u32::MAX);

    pub const fn as_usize(&self) -> usize {
        self.0 as usize
    }

    pub fn from_usize(index: usize) -> Self {
        let id = Self(u32::try_from(index).expect("invalid node id"));
        assert_ne!(id, Self::INVALID, "invalid node id");
        id
    }
}

impl std::ops::Index<NodeId> for Vec<Node> {
    type Output = Node;

    fn index(&self, id: NodeId) -> &Self::Output {
        &self[id.as_usize()]
    }
}
```
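To illustrate the newtype handle combined with a builder-style API, here is a minimal self-contained sketch (all names are illustrative): the only way to obtain a `NodeId` is from the tree's own insertion method, so raw indexes never escape to callers.

```rust
// Opaque handle: callers cannot construct one from a raw usize.
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub struct NodeId(u32);

impl NodeId {
    pub const fn as_usize(self) -> usize {
        self.0 as usize
    }
}

pub enum Node {
    Leaf(&'static str),
    Branch(NodeId, NodeId),
}

#[derive(Default)]
pub struct Tree {
    nodes: Vec<Node>,
}

impl Tree {
    /// Insert a node, returning the opaque handle for it.
    pub fn push(&mut self, node: Node) -> NodeId {
        let id = NodeId(u32::try_from(self.nodes.len()).expect("too many nodes"));
        self.nodes.push(node);
        id
    }

    pub fn get(&self, id: NodeId) -> &Node {
        &self.nodes[id.as_usize()]
    }
}
```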
You can make the level of safety as rigorous as you want, but there is a balance to consider between preventing bugs and ergonomics. In particular, the next level of rigor is using generational indices, so that a `NodeId` is only good for as long as the original structure or node remains unmodified. A step further would involve storing a weak reference to the allocating structure in the handle as well (or some kind of unique key), which is checked against the address (or key) of the structure with which the `NodeId` is used, panicking if the `NodeId` is not valid. I don't think either of those is really necessary for our use case, but we can certainly keep it in the back of our minds.

In general, the risk of bugs of this type is low here, because the structure is immutable once built, and is only being used to represent a single MAST. Where this pattern can run into issues is when you have some kind of structure that hands out a `NodeId`-like handle, but the tree is being mutated while those handles are floating around. If an index gets reused for a different node between the time the handle is given out and when it is used, things get weird. That's where things like generational indices come into play, but there are always edge cases.
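To make the generational-index idea concrete, here is a minimal sketch (the `Arena`/`GenNodeId` names are illustrative): each handle carries the generation of the slot it points at, and is rejected if the slot has since been reused.

```rust
#[derive(Debug, Copy, Clone, PartialEq, Eq)]
pub struct GenNodeId {
    index: u32,
    generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

pub struct Arena<T> {
    slots: Vec<Slot<T>>,
}

impl<T> Arena<T> {
    pub fn new() -> Self {
        Self { slots: Vec::new() }
    }

    /// Insert a value, reusing the first free slot (with a bumped
    /// generation) if one exists.
    pub fn insert(&mut self, value: T) -> GenNodeId {
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if slot.value.is_none() {
                slot.generation += 1;
                slot.value = Some(value);
                return GenNodeId { index: i as u32, generation: slot.generation };
            }
        }
        self.slots.push(Slot { generation: 0, value: Some(value) });
        GenNodeId { index: (self.slots.len() - 1) as u32, generation: 0 }
    }

    /// Remove the value; returns None if the handle is stale.
    pub fn remove(&mut self, id: GenNodeId) -> Option<T> {
        let slot = self.slots.get_mut(id.index as usize)?;
        if slot.generation != id.generation {
            return None;
        }
        slot.value.take()
    }

    /// Returns None if the handle is stale (the slot was reused).
    pub fn get(&self, id: GenNodeId) -> Option<&T> {
        let slot = self.slots.get(id.index as usize)?;
        if slot.generation != id.generation {
            return None;
        }
        slot.value.as_ref()
    }
}
```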
In any case, while the efficiency gains are certainly noticeable, I don't think that's the primary benefit here anyway (just a nice to have). The main benefit of this structure is that it corresponds to a much better serializable format, and will also make it easier to introduce on-the-fly code loading of blocks which are not part of the initial tree. That latter bit is the main topic of my next proposal, but it got delayed yesterday, I'll have it up today for sure.
All makes sense. We will probably need to build a map `Digest |-> NodeId` during the compilation into MAST - but that can be discarded once the `Program` is built.
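A sketch of what that interning step could look like during compilation. The builder and its method names are illustrative, and the digest computation is a placeholder (the real implementation would use the VM's hash function over the node's contents):

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

type Digest = [u8; 32];
type NodeId = usize;

#[derive(Hash)]
enum Node {
    Span(String),
    Join([NodeId; 2]),
}

/// Builder that interns nodes by digest while the MAST is compiled;
/// the map is dropped once the final node table is frozen.
struct ProgramBuilder {
    nodes: Vec<Node>,
    by_digest: BTreeMap<Digest, NodeId>,
}

impl ProgramBuilder {
    fn new() -> Self {
        Self { nodes: Vec::new(), by_digest: BTreeMap::new() }
    }

    // Placeholder digest: stand-in for the real cryptographic hash.
    fn digest_of(node: &Node) -> Digest {
        let mut hasher = DefaultHasher::new();
        node.hash(&mut hasher);
        let mut digest = [0u8; 32];
        digest[..8].copy_from_slice(&hasher.finish().to_le_bytes());
        digest
    }

    /// Returns the existing NodeId if an identical node was already
    /// inserted, otherwise appends the node to the table.
    fn intern(&mut self, node: Node) -> NodeId {
        let digest = Self::digest_of(&node);
        if let Some(&id) = self.by_digest.get(&digest) {
            return id;
        }
        let id = self.nodes.len();
        self.nodes.push(node);
        self.by_digest.insert(digest, id);
        id
    }

    /// Discard the digest map, keeping only the node table.
    fn finish(self) -> Vec<Node> {
        self.nodes
    }
}
```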
btw, fleshing it out a bit more, I think `Program` could look like so:
```rust
pub struct Program {
    nodes: Vec<Node>,
    /// Node at which the program execution starts.
    entrypoint: NodeId,
    /// A list of node IDs which are referenced more than once.
    roots: Vec<NodeId>,
}
```
The `roots` field could be very useful for efficient serialization.
> All makes sense. We will probably need to build a map `Digest |-> NodeId` during the compilation into MAST - but that can be discarded once the `Program` is built.
It might be useful to keep it around, since it would allow dynexec/dyncall to determine whether or not the MAST it needs is in the current tree, or needs to be fetched from the cache/object store. But we can probably strip the mapping down to just those nodes which have their hash "captured" via `procref`, since those are the only ones that would ever need to be looked up by digest.
> The `roots` field could be very useful for efficient serialization.
Could make that even more compact using a dense bitset, i.e. a bit for every index in `nodes`, using something off the shelf, or building one on top of `Vec<u64>` where each `u64` represents up to 64 nodes, since we're talking only a single bit of information per node.
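A minimal sketch of such a bitset built on `Vec<u64>` (the `NodeBitSet` name and API are illustrative):

```rust
/// Dense bitset over Vec<u64>: one bit per node index.
pub struct NodeBitSet {
    words: Vec<u64>,
}

impl NodeBitSet {
    /// Create a bitset able to hold `len` node flags, all unset.
    pub fn new(len: usize) -> Self {
        // Round up to a whole number of 64-bit words.
        Self { words: vec![0u64; (len + 63) / 64] }
    }

    /// Flag the node at `index` (e.g. as a root).
    pub fn set(&mut self, index: usize) {
        self.words[index / 64] |= 1u64 << (index % 64);
    }

    pub fn contains(&self, index: usize) -> bool {
        (self.words[index / 64] >> (index % 64)) & 1 == 1
    }

    /// Number of set bits, i.e. how many nodes are flagged.
    pub fn count(&self) -> usize {
        self.words.iter().map(|w| w.count_ones() as usize).sum()
    }
}
```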
I have some thoughts on the binary format for MAST in general, should I make that part of this proposal, or open a separate issue for that?
> ```rust
> pub struct Program {
>     nodes: Vec<Node>,
>     /// Node at which the program execution starts.
>     entrypoint: NodeId,
>     /// A list of node IDs which are referenced more than once.
>     roots: Vec<NodeId>,
> }
> ```
Thinking a bit more about this, I defined `roots` incorrectly in the above. Specifically, these should be indexes of nodes which are roots of subtrees where all descendants have at most one parent. Or to say it another way: if we keep a reference counter for each node, then `roots` contains nodes whose reference count is greater than $1$ but all of whose descendants have a reference count of exactly $1$.

The main benefit I see in tracking this info (or maybe building it on the fly during serialization) is that it would allow us to serialize such subtrees super efficiently (e.g., spending something like $4$ bits per non-leaf node).
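A sketch of computing that corrected `roots` set on the fly, under simplified types (the function names are illustrative): count how often each node is referenced, then keep nodes referenced more than once whose descendants are all referenced exactly once.

```rust
enum Node {
    Span,                // leaf (payload elided)
    Join([usize; 2]),    // branch referencing children by index
}

/// How many times each node is referenced as a child.
fn ref_counts(nodes: &[Node]) -> Vec<u32> {
    let mut counts = vec![0u32; nodes.len()];
    for node in nodes {
        if let Node::Join(children) = node {
            for &c in children {
                counts[c] += 1;
            }
        }
    }
    counts
}

/// True if every descendant of `id` has a reference count of exactly 1.
fn descendants_single_ref(nodes: &[Node], counts: &[u32], id: usize) -> bool {
    match &nodes[id] {
        Node::Span => true,
        Node::Join(children) => children
            .iter()
            .all(|&c| counts[c] == 1 && descendants_single_ref(nodes, counts, c)),
    }
}

/// Nodes referenced more than once whose subtrees are otherwise unshared.
fn roots(nodes: &[Node]) -> Vec<usize> {
    let counts = ref_counts(nodes);
    (0..nodes.len())
        .filter(|&i| counts[i] > 1 && descendants_single_ref(nodes, &counts, i))
        .collect()
}
```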
> Could make that even more compact using a dense bitset, i.e. a bit for every index in `nodes`, using something off the shelf, or building one on top of `Vec<u64>` where each `u64` represents up to 64 nodes, since we're talking only a single bit of information per node.
I'm not expecting the number of such nodes to be large (probably much smaller than the total number of nodes) - so we may not need to do anything fancy. But I guess we'll need to test this hypothesis.
> I have some thoughts on the binary format for MAST in general, should I make that part of this proposal, or open a separate issue for that?
I think a separate issue would be better. In my mind, there are actually 2 binary formats:
Closed by #1349.
This is the first of a small set of issues I'm opening to try and set the stage for some of the things we need on the compiler/packaging side of things, but which also bring a number of benefits to the VM as well. I'm planning to work on implementing these things once there is consensus, but want to make sure everyone feels good about the proposals and has an opportunity to provide feedback on implementation details.
## Proposal
This proposal is to reimplement the in-memory representation of MAST to use a more efficient (in both space and time) structure-of-arrays/table-based representation, rather than the tree structure currently used today. Interestingly, it seems like the seeds of this approach were planted with `CodeBlockTable`, but it didn't take the idea to its logical conclusion. This proposal does just that. I'm guessing most/all of you have a pretty good feel for the reasons why this would be an improvement just in terms of memory and raw execution speed, but to recap: the `CodeBlock` enum-of-structs type wraps the set of various block types that compose the MAST, and which form the leaves and branches of the T in MAST. For the block types which are branches in the tree, namely `JoinBlock` and `SplitBlock`, their children are expressed as a boxed slice of `CodeBlock`. Thus, the tree is navigated by traversing from the root `CodeBlock` to the leaves in a depth-first fashion.

Rather than representing MAST as a literal tree in-memory, we can use a far more compact representation that is also cache-friendly, and significantly easier to query/visit. In essence, you go from a structure like this:
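Sketched with simplified stand-ins for the real `CodeBlock`/`JoinBlock`/`SplitBlock` types, the current shape is roughly:

```rust
// Literal-tree representation: branch nodes own their children behind
// heap allocations, so traversal chases pointers from the root downward.
pub enum CodeBlock {
    Span(Vec<u8>),             // leaf: a span of operations (payload simplified)
    Join(Box<[CodeBlock; 2]>), // branch: children owned inline
    Split(Box<[CodeBlock; 2]>),
}

/// Depth-first leaf count, following owned children.
pub fn leaf_count(block: &CodeBlock) -> usize {
    match block {
        CodeBlock::Span(_) => 1,
        CodeBlock::Join(children) | CodeBlock::Split(children) => {
            children.iter().map(leaf_count).sum()
        }
    }
}
```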
To something like this:
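Again as a simplified sketch (the field and type names are illustrative): one node table, with children referenced by index rather than owned inline.

```rust
pub type NodeId = usize;

pub enum Node {
    Span(Vec<u8>),
    Join([NodeId; 2]),
    Split([NodeId; 2]),
}

pub struct Mast {
    nodes: Vec<Node>,
    /// Index of the node where execution starts.
    root: NodeId,
}

impl Mast {
    /// Count leaves by chasing indexes instead of pointers; a shared
    /// node is stored once but can be visited via multiple parents.
    pub fn leaf_count(&self, id: NodeId) -> usize {
        match &self.nodes[id] {
            Node::Span(_) => 1,
            Node::Join(children) | Node::Split(children) => {
                children.iter().map(|&c| self.leaf_count(c)).sum()
            }
        }
    }
}
```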
In essence, you have a table of nodes which can reference each other using their index in that table. Not only does this have nice perf characteristics, but it makes nodes directly accessible regardless of their depth in the tree, and allows nodes to be trivially compacted if they are equivalent (e.g. MAST blocks with the same hash could be stored just once). It also becomes very easy to extend the tree with additional metadata without it being awkward to access. For example, something like so:
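One way such metadata side tables might look, as a sketch: the `hashes`/`source_locations` fields and the `SourceLocation` type are illustrative, not part of the current codebase. Each table is keyed by the same `NodeId`, so metadata can be added without touching the `Node` enum itself.

```rust
use std::collections::BTreeMap;

pub type NodeId = usize;
pub type Digest = [u8; 32];

pub enum Node {
    Span(Vec<u8>),
    Join([NodeId; 2]),
}

#[derive(Debug, Clone, PartialEq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

pub struct Mast {
    nodes: Vec<Node>,
    /// Dense metadata: one digest per node, parallel to `nodes`.
    hashes: Vec<Digest>,
    /// Sparse metadata: only nodes that have it pay for an entry.
    source_locations: BTreeMap<NodeId, SourceLocation>,
}

impl Mast {
    pub fn hash_of(&self, id: NodeId) -> &Digest {
        &self.hashes[id]
    }

    pub fn location_of(&self, id: NodeId) -> Option<&SourceLocation> {
        self.source_locations.get(&id)
    }
}
```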
The compiler uses a similar type of structure in a few places, most importantly for representing the dataflow graph of a function. But perhaps a more notable implementation of this idea is in WebAssembly, which stores each type of "entity" in a module-scoped table, and all references to a given entity use its index in the table, rather than the entity itself.
## Rationale
This change would pave the way for some other proposals, which I'll be submitting over the next day or two. The key thing this specific proposal will enable, aside from the other benefits mentioned, is straightforward serialization of the MAST representation of a program (or its component parts) in a compact format more suitable for packaging and distribution. The idea would then be to ship this representation rather than Miden Assembly sources when publishing a Miden contract/program for consumption. There are some other important properties gained from this representation, but more on that in my next issue.