feat(masm): refactor assembler into a more compiler-like architecture

bitwalker commented 5 months ago

[!IMPORTANT] This is a large PR, with many interlinked changes. I have broken the changes up into many smaller commits that introduce them piece-by-piece for review, but those commits may refer to things that come in later commits, or be incomplete in and of themselves. This is intentional: use the commits to understand the structure of the changes being made, and then attack the review however you prefer from there.

Please read the commit messages, as they also further introduce the work done and what I'm trying to convey in each particular chunk.

This PR is a major refactor of the assembler crate, with various bits of new functionality in the miden-core and miden-assembly crates. There are some changes made in other crates, but these are largely tests, where the interaction with the assembler has either changed in some way, or the output being tested has changed. The following section lists the most significant changes to be aware of. Keep in mind that, as I laid out in my note at the top, the changes are described in more detail in the commit messages, so read those for more info. I'm happy to answer any questions you may have as well.

Summary

Here's a high-level overview of what is contained in this PR:

A completely new MASM frontend (i.e. parser/lexer/diagnostics), which makes use of a formal LALR(1) grammar, and corresponding generated parser. The grammar and parser generator used is lalrpop, and the lexer generator is logos - both are chosen because of my prior experience in compiler projects with them. They are solid, well-built crates, with a decent community around them, and both support no-std builds.
Rewritten source-location tracking, which also uses a representation that the Miden compiler uses, which will make later work to propagate source location information from the compiler frontend through Miden Assembly trivial. Most importantly though, this enables precise diagnostics, with source code snippets. See the tests for examples of what these diagnostics look like in practice.
A new way of expressing errors and diagnostics in the assembler and core crates (and other crates as well, when we want to start using them there). This makes use of the thiserror and miette crates (forked versions currently, for no-std support, until error_in_core is stabilized). These enable very ergonomic error types (via thiserror and #[derive(Error)]), and the ability to layer diagnostics on those same types (via miette and #[derive(Diagnostic)]). The diagnostics infrastructure is quite configurable, and I suspect we may want to take advantage of its JSON reporting backend for running the VM in the browser, so that we can extract error information in JavaScript and render pretty errors in the browser-based editor.
A source code formatter for Miden Assembly and MAST (in its textual form). This is based on the algorithm described by Philip Wadler, commonly called Prettier. You interact with the pretty printer by implementing the PrettyPrint trait, which is implemented in miden-core under core/src/prettier, by describing how you want to lay out the components of the syntax to be rendered. The prettier algorithm then takes that description and renders it using a language-agnostic algorithm. I have implemented pretty printers for both MASM and MAST, but feel free to modify their implementations as you see fit. One of the interesting things you can do with it is describe alternate layouts, which the algorithm will choose between depending on the available width of the output window (which defaults to 80 columns). You can see this in effect with span in the case of MAST, which will alternatively render single- vs multi-line depending on the size of the span.
Syntax tree visitors for the MASM abstract syntax tree, which allows you to concisely express analyses and rewrites over the tree without having to implement the traversals by hand each time. This is used in a few places in the refactored code.
A new semantic analyzer, which performs a variety of first-pass validation checks after parsing MASM source code.
A largely rewritten set of AST nodes. Many of the previous ones are still there in some form (especially Instruction, which hasn't changed much), others are expanded with new functionality or metadata (e.g. InvocationTarget), and others are entirely new, or rewritten so as to be essentially new. Virtually all of these nodes now implement a Spanned trait which returns the source span for that node, or a default one if there is no source code.
A significant rewrite of the Assembler internals. Some parts of this are virtually untouched (particularly the translation to MAST, with the main exception being how procedure calls are handled). The way in which the assembler is instantiated and used is slightly different, but much more powerful now. These changes are all driven off the new "module graph", which is used for global inter-procedural analysis during assembly.
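
To make the "alternate layouts" idea from the formatter item concrete, here is a minimal sketch of width-driven layout selection. It is not the PrettyPrint trait or the code under core/src/prettier: the `Doc` type, `render`, and the `span` helper are hypothetical names, the `span ... end` text is only an illustrative shape, and the fitting check is a greedy simplification of Wadler's algorithm (a group is printed flat only if its entire flat form fits in the remaining width).

```rust
/// A tiny layout description, loosely in the style of Wadler's printer.
#[derive(Clone, Debug)]
pub enum Doc {
    /// Literal text, assumed to contain no newlines.
    Text(String),
    /// A space when the enclosing group is flat, otherwise newline + indent.
    Line,
    /// Documents laid out one after another.
    Concat(Vec<Doc>),
    /// Increase indentation for line breaks inside the nested document.
    Nest(usize, Box<Doc>),
    /// Try to lay the contents out on a single line; break all lines otherwise.
    Group(Box<Doc>),
}

/// Width of a document if it were printed entirely on one line.
fn flat_width(doc: &Doc) -> usize {
    match doc {
        Doc::Text(s) => s.chars().count(),
        Doc::Line => 1,
        Doc::Concat(docs) => docs.iter().map(flat_width).sum(),
        Doc::Nest(_, inner) | Doc::Group(inner) => flat_width(inner),
    }
}

/// Render `doc` for an output window of `width` columns.
pub fn render(doc: &Doc, width: usize) -> String {
    let mut out = String::new();
    let mut col = 0;
    go(doc, width, 0, false, &mut col, &mut out);
    out
}

fn go(doc: &Doc, width: usize, indent: usize, flat: bool, col: &mut usize, out: &mut String) {
    match doc {
        Doc::Text(s) => {
            out.push_str(s);
            *col += s.chars().count();
        }
        Doc::Line if flat => {
            out.push(' ');
            *col += 1;
        }
        Doc::Line => {
            out.push('\n');
            out.push_str(&" ".repeat(indent));
            *col = indent;
        }
        Doc::Concat(docs) => {
            for d in docs {
                go(d, width, indent, flat, col, out);
            }
        }
        Doc::Nest(extra, inner) => go(inner, width, indent + extra, flat, col, out),
        Doc::Group(inner) => {
            // Choose the single-line layout only when all of it fits.
            let fits = flat || *col + flat_width(inner) <= width;
            go(inner, width, indent, fits, col, out);
        }
    }
}

/// Hypothetical helper: describe a block of operations once, for both layouts.
pub fn span(ops: &[&str]) -> Doc {
    let mut body = Vec::new();
    for op in ops {
        body.push(Doc::Line);
        body.push(Doc::Text(op.to_string()));
    }
    Doc::Group(Box::new(Doc::Concat(vec![
        Doc::Text("span".to_string()),
        Doc::Nest(4, Box::new(Doc::Concat(body))),
        Doc::Line,
        Doc::Text("end".to_string()),
    ])))
}
```

With this sketch, `render(&span(&["add", "mul"]), 80)` stays on one line, while a narrow width such as 10 switches the same description to the indented multi-line layout.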

Changes to MASM Syntax

I'd like to devote a section just to this topic, as the syntax changes are both a motivating reason for some of the other changes, and a result of reimplementing compilation in a more principled fashion:

You may now express arbitrarily-long identifiers in MASM, i.e. the 255 character limit is removed.
You may now specify identifiers that were previously illegal by quoting them in double-quotes, e.g. export."_RNvCskwGfYPst2Cb_3foo16example_function" (an identifier generated from Rust's mangling scheme). The default "bare" identifiers still have largely the same constraints as before (start with an alphabetic character, contain only alphanumerics and underscores, etc.).
You may now order procedures in a module any way you like, i.e. you do not need to define a procedure before it is referenced. This allows you to organize modules as you see fit to group functionality. The inter-procedural analysis done during compilation will ensure that no cycles in the call graph are introduced.
Similarly, you may introduce imports, constants, and other top-level items in any order, i.e. it is now perfectly acceptable to place constant definitions close to their usages. This means you can also reference definitions out of order, as long as when the names are resolved there are no cycles (which is also detected and raised as a diagnostic).
It is expected that kernel modules use export for their syscalls, and proc for their private internal helper functions. I actually can't recall now if that was the case before, but it certainly is now.
All instructions which accept immediates now accept constant identifiers as well as literals. I think the majority of instructions had support for constants, but it is now universal. There is only one exception to this rule, and it could easily be removed: the exp.uXX instruction, which currently only accepts a literal.
Imports are now purely syntactic sugar - they have no bearing on the compiled code at all, as the imports are resolved to their actual definitions during compilation (no matter how many re-exports have to be followed, though again, cycles are detected and not allowed). We could easily add support for importing constants as well, but I have not implemented that here.
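
As a rough illustration of "following re-exports until a real definition is found, rejecting cycles", here is a minimal sketch. It is not the resolver in this PR: `Item`, `ResolveError`, and `resolve` are hypothetical names, and fully-qualified names are modeled as plain strings.

```rust
use std::collections::{HashMap, HashSet};

/// What a fully-qualified name refers to in this toy model.
pub enum Item {
    /// An actual procedure definition.
    Definition,
    /// A re-export that points at another fully-qualified name.
    ReExport(String),
}

#[derive(Debug, PartialEq)]
pub enum ResolveError {
    /// The chain ended at a name that is not defined anywhere.
    Undefined(String),
    /// The chain of re-exports came back to a name it already visited.
    Cycle(String),
}

/// Follow re-exports from `name` until an actual definition is reached.
pub fn resolve(items: &HashMap<String, Item>, name: &str) -> Result<String, ResolveError> {
    let mut seen = HashSet::new();
    let mut current = name.to_string();
    loop {
        // `insert` returns false if the name was already visited: a cycle.
        if !seen.insert(current.clone()) {
            return Err(ResolveError::Cycle(current));
        }
        match items.get(&current) {
            None => return Err(ResolveError::Undefined(current)),
            Some(Item::Definition) => return Ok(current),
            Some(Item::ReExport(target)) => current = target.clone(),
        }
    }
}
```

Because every use site ends up pointing at the resolved definition, the intermediate names leave no trace in the output, which is the sense in which imports are "purely syntactic sugar" above.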

I think that largely covers the changes to the surface syntax, at least that I can recall off the top of my head here.

Changes to MAST/Assembler

One of the major things that falls out of the Assembler refactoring is that we now always visit procedures in reverse topological order, i.e. callees before callers. As a result, we always know the MAST root for every procedure being called in the program. Instead of inlining the body of every procedure at every call site, the assembler now emits PROXY blocks for procedures called via exec (and CALL for call, but that's not new). This has the effect of making the emitted MAST drastically smaller. I added support in the processor for executing proxy blocks (it was an error previously) - if that is not the correct behavior, we may want to revisit this as part of the larger MAST refactoring. The only downside is that the textual MAST now has proxy.HASH where there used to be inlined code, but even that is something we could easily address in the formatter if it turns out to be a problem. I have also added a test helper that makes tests which expect textual MAST output containing calls much more readable, and less fragile to changes in the standard library, etc.
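
For readers unfamiliar with the term, "reverse topological order" can be sketched as a depth-first post-order walk of the call graph. This is a minimal illustration, not the module graph implementation: procedures are modeled as indices into a hypothetical adjacency list `calls`, where `calls[p]` lists the procedures that `p` invokes (all indices are assumed to be in bounds).

```rust
#[derive(Clone, Copy, PartialEq)]
enum Mark {
    Unvisited,
    Visiting,
    Done,
}

/// Return procedure indices ordered so that every callee precedes its callers.
/// On a cyclic call graph, returns `Err` with a procedure that lies on a cycle.
pub fn callees_first(calls: &[Vec<usize>]) -> Result<Vec<usize>, usize> {
    fn visit(
        p: usize,
        calls: &[Vec<usize>],
        marks: &mut [Mark],
        order: &mut Vec<usize>,
    ) -> Result<(), usize> {
        match marks[p] {
            Mark::Done => return Ok(()),
            // Reaching a procedure that is still being visited means a cycle.
            Mark::Visiting => return Err(p),
            Mark::Unvisited => {}
        }
        marks[p] = Mark::Visiting;
        for &callee in &calls[p] {
            visit(callee, calls, marks, order)?;
        }
        marks[p] = Mark::Done;
        // Post-order: a procedure is emitted only after all of its callees.
        order.push(p);
        Ok(())
    }

    let mut marks = vec![Mark::Unvisited; calls.len()];
    let mut order = Vec::with_capacity(calls.len());
    for p in 0..calls.len() {
        visit(p, calls, &mut marks, &mut order)?;
    }
    Ok(order)
}
```

Processing procedures in the returned order means that by the time a caller is assembled, the root of each callee has already been computed, so a call site can refer to it by hash rather than by inlined body.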

The other significant change to the way the assembler works is that the procedure cache has been rewritten to be tightly integrated with the module graph. This enables some nice optimizations, and in particular, allows us to use plain integer identifiers rather than having to hash procedure ids. The downside is that the procedure cache cannot be shared across assembler instances - whether that's a problem in practice I think remains to be seen. I do think that we'll want to further refactor the assembler in the future to enable a greater degree of sharing of the module graph and procedure cache, but there is a tradeoff in performance either way, so I erred on the side of single-threaded use cases for now.
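
The tradeoff described above can be sketched as follows. This is only an illustration under assumptions, not the cache in this PR: `ProcIndex`, `ModuleGraph`, `ProcedureCache`, and the 32-byte `Digest` stand-in are hypothetical names chosen for the example.

```rust
/// Stand-in for a MAST root digest.
pub type Digest = [u8; 32];

/// A plain integer handle, meaningful only for the graph that issued it.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ProcIndex(pub usize);

/// Toy module graph that hands out dense indices as procedures are added.
#[derive(Default)]
pub struct ModuleGraph {
    names: Vec<String>,
}

impl ModuleGraph {
    pub fn add_procedure(&mut self, name: &str) -> ProcIndex {
        self.names.push(name.to_string());
        ProcIndex(self.names.len() - 1)
    }
}

/// Cache keyed by index: a lookup is a bounds-checked array access, no hashing.
#[derive(Default)]
pub struct ProcedureCache {
    roots: Vec<Option<Digest>>,
}

impl ProcedureCache {
    pub fn insert(&mut self, id: ProcIndex, root: Digest) {
        if self.roots.len() <= id.0 {
            self.roots.resize(id.0 + 1, None);
        }
        self.roots[id.0] = Some(root);
    }

    pub fn get(&self, id: ProcIndex) -> Option<&Digest> {
        self.roots.get(id.0)?.as_ref()
    }
}
```

In a design like this, index 0 in one graph and index 0 in another can name different procedures, which is why such a cache cannot be handed to a second assembler instance without also sharing the graph.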

Diagnostics

This PR introduces two dependencies to aid in creating error types and diagnostics, thiserror and miette respectively. These are using forks I made from the latest versions of both crates, so as to add support for #![no_std] environments until the error_in_core feature stabilizes. But let's take a brief look at how these are used:

Defining an error

Let's use the parser as an example here. The following is a subset of its definition that demonstrates how thiserror can be used to define the type, while simultaneously implementing From for any errors it encapsulates, as well as the Display and Error traits:

#[derive(Debug, Clone, thiserror::Error)]
pub enum ParsingError {
    #[error("invalid token")]
    InvalidToken {
        span: SourceSpan,
    },
    #[error("unrecognized token: expected {}", expected.as_slice().join(", or "))]
    UnrecognizedToken {
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
    ...
}

That's it! It makes defining error types for each task convenient, and provides a very natural way to display the data associated with an error as part of its Display implementation.
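
For comparison, here is roughly what the derive saves you from writing by hand. This is a std-only sketch of equivalent Display and Error implementations, not the literal macro expansion; the `SourceSpan` here is a simplified stand-in for the real type, and only the two variants shown above are included.

```rust
use std::error::Error;
use std::fmt;

/// Simplified stand-in for the real source span type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SourceSpan {
    pub start: u32,
    pub end: u32,
}

#[derive(Debug, Clone)]
pub enum ParsingError {
    InvalidToken {
        span: SourceSpan,
    },
    UnrecognizedToken {
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
}

// Hand-written counterpart of the `#[error("...")]` attributes.
impl fmt::Display for ParsingError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::InvalidToken { .. } => write!(f, "invalid token"),
            Self::UnrecognizedToken { expected, .. } => {
                write!(f, "unrecognized token: expected {}", expected.join(", or "))
            }
        }
    }
}

// Hand-written counterpart of `#[derive(thiserror::Error)]`; with no wrapped
// source error, the default method implementations are sufficient.
impl Error for ParsingError {}
```

Every new variant would need its own match arm in `fmt`, which is exactly the boilerplate the attribute form keeps next to the variant definition.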

Now let's extend our ParsingError type for use in our diagnostics system:

#[derive(Debug, Clone, thiserror::Error, Diagnostic)]
pub enum ParsingError {
    #[error("invalid token")]
    #[diagnostic()]
    InvalidToken {
        #[label("occurs here")]
        span: SourceSpan,
    },
    #[error("unrecognized token")]
    #[diagnostic(help("expected {}", expected.as_slice().join(", or ")))]
    UnrecognizedToken {
        #[label("lexed a {token} here")]
        span: SourceSpan,
        token: String,
        expected: Vec<String>,
    },
    ...
}

As you can see, this feels like a natural extension of thiserror, with support for decorating things that are convertible to SourceSpan as "labels" in the diagnostic output. Here's what an instance of the UnrecognizedToken error above looks like when rendered with source code:

  x unrecognized token
         ,-[test1737:2:37]
       1 |
       2 |         use.dummy::math::u64->bigint->invalidname
         :                                     ^|
         :                                      `-- lexed a -> here
       3 |
         `----
        help: expected "begin", or "const", or "export", or "proc", or "use",
      or end of file, or doc comment

I actually lifted the example above from our test suite (the module_alias test). These pretty errors are enabled by associating source spans with specific errors, and then pairing them with a reference to the source code from which those spans are derived. This pairing can be done in a few different ways.
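
To show the basic idea behind pairing a span with its source, here is a minimal sketch that underlines a span and attaches a label. It does not reproduce miette's renderer or its output format: `Span` and `render_label` are hypothetical names, spans are byte offsets, and the source is assumed to be ASCII so that byte offsets and columns coincide.

```rust
/// Byte range into the source text (hypothetical stand-in for a source span).
pub struct Span {
    pub start: usize,
    pub end: usize,
}

/// Render the line containing `span.start`, with carets under the span and a label.
/// Returns `None` if the span starts outside of `source`.
pub fn render_label(source: &str, span: &Span, label: &str) -> Option<String> {
    let mut line_start = 0;
    for (idx, line) in source.split('\n').enumerate() {
        let line_end = line_start + line.len();
        if span.start >= line_start && span.start <= line_end {
            let col = span.start - line_start;
            // Clamp the underline to the current line, and draw at least one caret.
            let width = span.end.min(line_end).saturating_sub(span.start).max(1);
            let gutter = format!("{:>3} | ", idx + 1);
            let pad = " ".repeat(gutter.len() + col);
            return Some(format!("{gutter}{line}\n{pad}{} {label}", "^".repeat(width)));
        }
        // Skip past the newline that `split` removed.
        line_start = line_end + 1;
    }
    None
}
```

The error value only has to carry the span and the label text; the source itself is supplied at render time, which is what allows the same error to be reported with or without a snippet.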

MASM Serialization

While this PR does not remove serialization of MASM, it does change it, as the AST has changed, and some of the restrictions have been loosened. I think we should plan on removing serialization entirely once the MAST refactoring is done, which should be trivial now that the changes in this PR have set the stage for it.

TODO

There are a few things that still need to be done, and they are small, but I want to call them out here:

[x] Need to rebase on the most recent changes in next, I'm ~12 commits behind, and there are certainly conflicts there, but I wanted to tackle that once I got this open first.
[x] There is a single failing test that I need to diagnose (in case one of you knows why it alone is failing!), which can be run with cargo test -p miden-stdlib -- crypto::sha256::sha256_hash_memory. It is failing with FailedAssertion { clk: <some number of cycles>, err_code: 0, err_msg: None }, but no additional info. I'm assuming it is due to my changes, but find it odd that all other tests pass and this one does not.
[ ] I would like to confirm that we actually want to support things like begin end or if.true end in the surface Miden Assembly syntax. I have chosen not to allow this in the parser, to keep the grammar simpler, and because it honestly doesn't make sense that anyone would write an empty block by hand, or what the purpose of doing so would be. This obviously has nothing to do with supporting empty blocks in MAST, only in MASM source code.
[ ] On a related note, I've also relaxed the behavior with regard to multiple imports of the same module with different aliases (as long as they are both used). This used to be an error; as of these changes, that is no longer the case. We can certainly make it an error easily enough, but I feel there is a valid reason to allow this, and that we would only want to raise a warning/error if an import is unused (e.g. while refactoring a large program, you might move stuff using the old name to a new module that is aliased, while simultaneously importing the new module with its "true" name, and have all new code use that import, thus both imports are valid and useful, despite being overlapping).

TL;DR

This is an enormous PR, but it is more efficient to group these changes together like this than to have made them incrementally (which would have taken many weeks). That said, please take as much time as you need, I don't want this to be a nightmare to review. For that reason I spent an inordinate amount of time breaking this up into smaller commits that can be reviewed piecemeal, but stack in such a way that you can pretty much just start at the oldest commit and work forward and have it go fairly painlessly.

I'm also happy to walk through the changes or any questions you have on a call, in Discord, etc., if you feel it will be easier to have me on hand to answer questions you have. I'll literally pair up for the whole review if you want - I opened this monstrosity, so it's only fair I go out of my way to make it easy on the rest of you. Just let me know!

I ran the benchmarks, and here's what I'm seeing compared to next:

❯ cargo bench -p miden-vm -- --save-baseline assembler-refactor
warning: src/github.com/0xpolygonmiden/miden-vm/miden/Cargo.toml: unused manifest key: bench.0.debug
   Compiling miden-vm v0.8.0 (src/github.com/0xpolygonmiden/miden-vm/miden)
    Finished bench [optimized] target(s) in 2.84s
     Running benches/program_compilation.rs (target/release/deps/program_compilation-f49d9b2efbd5811d)
Gnuplot not found, using plotters backend
program_compilation/sha256
                        time:   [5.1606 ms 5.1667 ms 5.1736 ms]
                        change: [-96.568% -96.560% -96.553%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

     Running benches/program_execution.rs (target/release/deps/program_execution-2647b9ea5dc36e7d)
Gnuplot not found, using plotters backend
program_execution/sha256
                        time:   [15.236 ms 15.283 ms 15.334 ms]
                        change: [-3.0510% -2.4781% -1.8854%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

hackaugusto commented 5 months ago

BTW, I was going over the docs, and I think after this PR this is no longer true, and should be updated:

Procedures invoked via the exec instruction, are inlined at their call sites during compilation. Thus, from the standpoint of the final program, executing procedures this way is indistinguishable from manually including procedure code in place of the exec instruction. This also means that procedures invoked via the exec instruction are executed in the same context as the caller.

https://0xpolygonmiden.github.io/miden-vm/user_docs/assembly/execution_contexts.html#invoking-via-exec-instruction

bitwalker commented 4 months ago

FYI, now that #1287 is merged, I need to rebase on those changes. I'll get to that tomorrow probably, but until then it's going to show like there are a ton of conflicts, just ignore that for now.

bitwalker commented 4 months ago

I've rebased this on next, so it should be mergeable again shortly!

bitwalker commented 3 months ago

This is ready now and has been rebased on latest next, all review comments have been addressed, aside from the removal of the Assembler::*_in_context methods and the AssemblyContext, as that is a larger change that feels like it belongs in a separate PR once this is merged.

A number of improvements and other features/changes have been discussed here, but I believe everything we intended to tackle as part of this PR is more or less complete.
