0xPolygonMiden / miden-vm

STARK-based virtual machine
MIT License
632 stars 161 forks source link

Support ability to specify advice data via MASM #1547

Open bobbinth opened 3 weeks ago

bobbinth commented 3 weeks ago

In some situations it maybe desirable to specify some data which a given program assumes to be available in the advice provider. One example of this is read-only data output by the compiler, but there could be may other examples. Currently, such data needs to be loaded separately into the VM which introduces extra complexities.

One way around this is to allow users to define data which is to be loaded into the advice provider before a given program starts executing. The syntax for this in MASM could look like so:

advent.FOO.0x9dfb1fc9f2d5625a5a304b9012b4de14df5cf6e0155cdd63a27c25360562587a
    642174d286a4f38e4d2e09a79d048fe7c89dec9a03fce29cbe10d32aa18a1dc4
    bb48645fa4ffe141f9a139aef4fa98aec50bded67d45a29e545e386b79d8cefe
    0f87d6b3c174fad0099f7296ded3abfef1a282567c4182b925abd69b0ed487c3
    c251ce5e4e2da760658f29f6a8c54788d52ae749afd1aef6531bf1457b8ea5fb
end

Here, advent specifies that we want to add an entry to the advice map. The key for the entry would be the word defined by 0x9dfb1fc9f2d5625a... value. The data of the entry would be the list of field elements defined by the hex encoded string. We also provide a way to specify a label FOO by which they key can be referred to from the code. For example:

being
    push.FOO
end

Would push the key 0x9dfb1fc9f2d5625a... onto the stack.

Upon assembly, this data would be added to the MastForest. For this, we'd need to add a single AdviceMap property to the MastForest struct - e.g., something like this:

pub struct MastForest {
    /// All of the nodes local to the trees comprising the MAST forest.
    nodes: Vec<MastNode>,

    /// Roots of procedures defined within this MAST forest.
    roots: Vec<MastNodeId>,

    /// All the decorators included in the MAST forest.
    decorators: Vec<Decorator>,

    /// Advice map to be loaded into the VM prior to executing procedures from this MAST forest.
    advice_map: AdviceMap,
}

Then, when the VM starts executing a given MAST forest, it'll copy the contents of the advice map into its advice provider (we can also use a slightly more sophisticated strategy to make sure that the map is copied only once).

Open questions

While the above approach should work, there are a few things we need to clarify before implementing it:

greenhat commented 1 week ago

...

This change is beneficial to #1544 since I was thinking of a way to convey the notion that MastForest(code) requires the rodata loaded into the advice provider before it can be executed.

The MASM-facing part (syntax, parsing, etc.) of the implementation would take me quite a lot of time since I'm not familiar with the code, but the VM-facing part I believe I can do in a reasonable amount of time. If @bitwalker is ok with it, I can take a stab at it.

Upon assembly, this data would be added to the MastForest. For this, we'd need to add a single AdviceMap property to the MastForest struct - e.g., something like this:

pub struct MastForest {
    /// All of the nodes local to the trees comprising the MAST forest.
    nodes: Vec<MastNode>,

    /// Roots of procedures defined within this MAST forest.
    roots: Vec<MastNodeId>,

    /// All the decorators included in the MAST forest.
    decorators: Vec<Decorator>,

    /// Advice map to be loaded into the VM prior to executing procedures from this MAST forest.
    advice_map: AdviceMap,
}

Then, when the VM starts executing a given MAST forest, it'll copy the contents of the advice map into its advice provider (we can also use a slightly more sophisticated strategy to make sure that the map is copied only once).

I've taken a look and here are my findings on what needs to be done to implement this:

Open questions

While the above approach should work, there are a few things we need to clarify before implementing it:

  • How should we handle conflicting keys during assembly and execution?

    • If we encounter two entries with the same key but different data during assembly, this should probably be an error.

Yes, I think it should be an error. From rodata perspective, the digest is a hash of the data itself, so if the data is different, the digest will be different as well. From the MASM perspective, this might mean key/digest re-use, which does not seem like something a user might want, so failing early is a good thing to do.

  • But what to do if we start executing a MAST forest which wants to load data into the advice provider but an entry with the same key but different data is already in the advice map? Should we error out? Silently replace the existing data with the new one? Anything else?

If the user code treats the advice provider as a some sort of dictionary, that's a valid use case. I'm not sure if it should be an error.

bobbinth commented 1 week ago

Handle the AdviceMap when merging MAST forests (join with other AdviceMaps?);

Yes, I think merging would work fine here. If there is a conflict (two entries with the same key by different data), we'd error out here as well.

Serialization/deserialization of the MastForest should handle the AdviceMap as well, but it'll break storing the rodata separately in the Package (roundtrip serialization would not work). We could put rodata in AdviceMap on the compiler side as well and not store it separately in the Package. @bitwalker is it ok?

Yeah - I think once we have this support for advice map entries in MastForest, there is no need to store rodata separately in the package.

greenhat commented 1 week ago

I'll wrap-up my current work (Component support in the compiler pipeline) and jump on the VM-faced part.

bitwalker commented 1 week ago

In the above example FOO refers to a full word. All our constants currently refer to single elements. Ideally, we should be able to tell by looking at the constant name whether it is for a full word or a single element. So, maybe we should come up with some simple scheme here to differentiate them?

The parser already knows how to parse various sizes of constants, including single words, or even arbitrarily large data (the size of the data itself indicates which type it is).

Should the key handle FOO be accessible outside of the module it was defined in? It seems like it would be a good idea, but then we need to be able to apply some kind of visibility modifiers to advent.

These would be effectively globally visible symbols, and while unlikely, you can have conflicting keys, so I think any attempt to make it seem like these can be scoped should be avoided.

How should we handle conflicting keys during assembly and execution?

I'm not sure how we handle this during execution today actually, presumably we just clobber the data if two things are loaded with the same key into the advice map?

During assembly I think it has to be an error. It might be possible to skip the error if the data is the same, I think it's still an open question whether or not you would want to know about the conflicting key regardless.


I'm questioning a bit whether it makes sense to define this stuff in Miden Assembly;

  1. A major reason why I made the point (back in my original packaging proposal) that the tooling should handle rodata internally, is because of how difficult is to get the encoding right for things that aren't a single element (or word if we're being generous). Encoding anything byte-oriented in terms of field elements is non-trivial at best, especially in MASM, as we handle two different endianness syntaxes in different places. The compiler does this automatically for the data it's dealing with, and I anticipated our packaging tooling would handle stuff like this. I think most people will want the assembler/compiler/whatever to handle the encoding to ensure it is correct.
  2. We can add stuff in the in-memory AST, or in MAST, without surfacing it in MASM syntax. Right now, the only real producer of this kind of data is the compiler, and we don't require support in the MASM syntax for this, just a way to supply the data when assembling to MAST.
  3. I think the theoretical use of this feature falls into two categories: a.) tool-generated rodata that is required by the program to execute, b.) user-defined key/value pairs that they want to refer to in their program, and know that at runtime those key/value pairs will be there. In the former, the tools don't require syntax support. In the latter, the user just needs to be able to specify that KEY is expected to be in the advice map at runtime, and they can refer to it in MASM as you've described (e.g. push.KEY) - the actual data (i.e. the value being referenced) could be supplied any number of ways, and in fact need not even be provided at assembly-time.
  4. Related to the above, I think this ties directly to the question of how we gather user inputs needed by Miden more generally. The only real novelty we need is the ability to provide those inputs with the program itself, rather than expect everything to be provided by the user. For those that want the conveniences of using KEY in their handwritten MASM, this gives them that ability without requiring that we figure out how to also encode the value in the syntax, while freeing us up to gather those values any number of ways.

Setting that aside for a moment:

bitwalker commented 1 week ago

I've taken a look and here are my findings on what needs to be done to implement this:

  • Move the AdviceMap type from processor to core;
  • Handle the AdviceMap when merging MAST forests (join with other AdviceMaps?);

We'll need to catch conflicting keys (different values for the same key, but fine if the keys overlap with the same value), but a straight merge of the two maps should be fine otherwise.

Serialization/deserialization of the MastForest should handle the AdviceMap as well, but it'll break storing the rodata separately in the Package (roundtrip serialization would not work). We could put rodata in AdviceMap on the compiler side as well and not store it separately in the Package. @bitwalker is it ok?

Once we can write our rodata to the MastForest directly, we won't need to do it in the Package anymore, so that sounds fine to me!

@greenhat For now, I would focus purely on the implementation around the MastForest/processor (what you've suggested AIUI), don't worry about the AST at all. That's all we need for the compiler anyway, while we figure out how to handle the frontend aspect in the meantime.