0xPolygonMiden / miden-vm

STARK-based virtual machine
MIT License
615 stars 152 forks source link

feat(masm): refactor assembler into a more compiler-like architecture #1277

Closed bitwalker closed 3 months ago

bitwalker commented 5 months ago

[!IMPORTANT] This is a large PR, with many changes interlinked. I have broken up the changes into many smaller pieces that introduce them piece-by-piece for review, but those pieces may refer to thing that come in later commits, or be incomplete in and of themselves. This is intentional, and you should use the commits to understand the structure of the changes being made, and then attack the review however you prefer from there.

Please read the commit messages, as they also further introduce the work done and what I'm trying to convey in each particular chunk.

This PR is a major refactor of the assembler crate, with various bits of new functionality in the miden-core and miden-assembly crates. There are some changes made in other crates, but these are largely tests, where the interaction with the assembler has either changed in some way, or the output being tested has changed. The following section lists the most significant changes to be aware of - keep in mind, that as I laid out in my note at the top, the changes are described in more detail in the commit messages, so read those for more info, but I'm happy to answer any questions you may have as well.

Summary

Here's a high-level overview of what is contained in this PR:

Changes to MASM Syntax

I'd like to devote a section just to this topic, as they are both a motivating reason for some of the changes, as well as a result of reimplementing compilation in a more principled fashion:

I think that largely covers the changes to the surface syntax, at least that I can recall off the top of my head here.

Changes to MAST/Assembler

One of the major things that falls out of the Assembler refactoring, is that we now always visit procedures in reverse topological order, i.e. callees before callers. As a result, we always know the MAST root for every procedure being called in the program. Instead of inlining the body of every procedure at every call site, the assembler now emits PROXY blocks for procedures called via exec (and CALL for call, but that's not new). This has the effect of making the emitted MAST drastically smaller. I added support in the processor for executing proxy blocks (it was an error previously) - if that is not the correct behavior, we may want to revisit this as part of the larger MAST refactoring. The only downside is that the textual MAST now has proxy.HASH where there used to be inlined code, but even that is something we could easily address in the formatter if we actually have a problem with this. I have added a test helper though that makes expressing tests that expect certain textual MAST output that contains calls much more readable, and less fragile to changes in the standard library, etc.

The other significant change to the way the assembler works is that the procedure cache has been rewritten to be tightly integrated with the module graph. This enables some nice optimizations, and in particular, allows us to use plain integer identifiers rather than having to hash procedure ids. The downside is that the procedure cache cannot be shared across assembler instances - whether that's a problem in practice I think remains to be seen. I do think that we'll want to further refactor the assembler in the future to enable a greater degree of sharing of the module graph and procedure cache, but there is a tradeoff in performance either way, so I erred on the side of single-threaded use cases for now.

Diagnostics

This PR introduces two dependencies to aid in creating error types and diagnostics, thiserror and miette respectively. These are using forks I made from the latest versions of both crates, so as to add support for #![no_std] environments until the error_in_core feature stabilizes. But let's take a brief look at how these are used:

Defining an error

Let's use the parser as an example here, the following is a subset of it's definition that demonstrates how thiserror can be used to define the type, while simultaneously implementing From for any errors it encapsulates, as well as the Display, and Error traits:

#[derive(Debug, Clone, thiserror::Error)]
pub enum ParsingError {
    #[error("invalid token")]
    InvalidToken {
        span: SourceSpan,
    },
    #[error("unrecognized token: expected {}", expected.as_slice().join(", or ")))]
    UnrecognizedToken {
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
    ...
}

That's it! It makes defining error types for each task convenient, and provides a very natural way to display the data associated with an error as part of it's Display implementation.

Now let's extend our ParsingError type for use in our diagnostics system:

#[derive(Debug, Clone, thiserror::Error, Diagnostic)]
pub enum ParsingError {
    #[error("invalid token")]
    #[diagnostic()]
    InvalidToken {
        #[label("occurs here")]
        span: SourceSpan,
    },
    #[error("unrecognized token")]
    #[diagnostic(help("expected {}", expected.as_slice().join(", or ")))]
    UnrecognizedToken {
        #[label("lexed a {token} here")]
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
    ...
}

As you can see, this feels like a natural extension of thiserror, with support for decorating things that are convertible to SourceSpan as "labels" in the diagnostic output. Here's what an instance of the UnrecognizedToken error above looks like when rendered with source code:

  x unrecognized token
         ,-[test1737:2:37]
       1 |
       2 |         use.dummy::math::u64->bigint->invalidname
         :                                     ^|
         :                                      `-- lexed a -> here
       3 |
         `----
        help: expected "begin", or "const", or "export", or "proc", or "use",
      or end of file, or doc comment

I actually lifted the example above from our test suite (the module_alias test). These pretty errors are enabled by associated source spans with specific errors, and then pairing them with a reference to the source code to which those spans are derived, and this can be done in a few different ways.

MASM Serialization

While this PR does not remove serialization of MASM, it does change it, as the AST has changed, and some of the restrictions have been loosened. I think we should plan on removing serialization entirely once the MAST refactoring is done, which should now be trivial with the changes contained in this PR having already set the stage for it.

TODO

There are three things that still need to be done, and they are small, but I want to call them out here:

TL;DR

This is an enormous PR, but it is more efficient to group these changes together like this than to have made them incrementally (which would have taken many weeks). That said, please take as much time as you need, I don't want this to be a nightmare to review. For that reason I spent an inordinate amount of time breaking this up into smaller commits that can be reviewed piecemeal, but stack in such a way that you can pretty much just start at the oldest commit and work forward and have it go fairly painlessly.

I'm also happy to walk through the changes or any questions you have on a call, in Discord, etc., if you feel it will be easier to have me on hand to answer questions you have. I'll literally pair up for the whole review if you want - I opened this monstrosity, so its only fair I go out of my way to make it easy on the rest of you. Just let me know!


I ran the benchmarks, and here's what I'm seeing compared to next:

❯ cargo bench -p miden-vm -- --save-baseline assembler-refactor
warning: src/github.com/0xpolygonmiden/miden-vm/miden/Cargo.toml: unused manifest key: bench.0.debug
   Compiling miden-vm v0.8.0 (src/github.com/0xpolygonmiden/miden-vm/miden)
    Finished bench [optimized] target(s) in 2.84s
     Running benches/program_compilation.rs (target/release/deps/program_compilation-f49d9b2efbd5811d)
Gnuplot not found, using plotters backend
program_compilation/sha256
                        time:   [5.1606 ms 5.1667 ms 5.1736 ms]
                        change: [-96.568% -96.560% -96.553%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

     Running benches/program_execution.rs (target/release/deps/program_execution-2647b9ea5dc36e7d)
Gnuplot not found, using plotters backend
program_execution/sha256
                        time:   [15.236 ms 15.283 ms 15.334 ms]
                        change: [-3.0510% -2.4781% -1.8854%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
hackaugusto commented 5 months ago

BTW, I was going over the docs, and I think after this PR this is no longer true, and should be udpated:

Procedures invoked via the exec instruction, are inlined at their call sites during compilation. Thus, from the standpoint of the final program, executing procedures this way is indistinguishable from manually including procedure code in place of the exec instruction. This also means that procedures invoked via the exec instruction are executed in the same context as the caller.

https://0xpolygonmiden.github.io/miden-vm/user_docs/assembly/execution_contexts.html#invoking-via-exec-instruction

bitwalker commented 4 months ago

FYI, now that #1287 is merged, I need to rebase on those changes. I'll get to that tomorrow probably, but until then it's going to show like there are a ton of conflicts, just ignore that for now.

bitwalker commented 4 months ago

I've rebased this on next, so it should be mergeable again shortly!

bitwalker commented 3 months ago

This is ready now and has been rebased on latest next, all review comments have been addressed, aside from the removal of the Assembler::*_in_context methods and the AssemblyContext, as that is a larger change that feels like it belongs in a separate PR once this is merged.

A number of improvements and other features/changes have been discussed here, but I believe everything we intended to tackle as part of this PR is more or less complete.