0xPolygonMiden / compiler

Compiler from MidenIR to Miden Assembly
MIT License
63 stars 20 forks source link

Finalize codegen handling of data segments/rodata #120

Closed bitwalker closed 4 weeks ago

bitwalker commented 7 months ago

This is a tracking issue for one of the last major remaining functional pieces of the compiler that is currently missing. Support for data segments is not blocked on anything in the compiler specifically, but rather it is blocked on the fact that the Miden VM does not (as yet) provide a way to initialize the linear memory of a program prior to startup. Additionally, when a program context switches (e.g. due to call), there is no way (as yet) to ensure that the callee has their linear memory initialized.

There have been some initial discussions around introducing some changes/features that would provide a clean(er) solution for this, but as of right now, our only option is the following:

This combined solution is really gross for a variety of reasons, but could suffice as a temporary workaround until something proper can be put into place. I'm opening this issue to keep this on our radar, but I will probably defer implementation of any solution here until we know for sure where things stand on the VM side of things.

bobbinth commented 7 months ago

In the program entrypoint, before doing anything else, emit code to write the read-only data into linear memory at the necessary offsets, and only then begin execution of the program.

For this case, I don't think we'll be able to do much better than the approach you described. We can probably create some syntactic sugar in MASM to hide the underlying implementation details, but under the hood, we'll still probably end up using adv_pipe instructions to write the read-only data to memory.

Using these instructions, we can initialize memory relatively efficiently. Specifically, we can do it at a rate of about 1 word per VM cycle. Assuming a single word contains 16 bytes (i.e., 32 bits per element), this means initializing 1KB of read-only data would take roughly 64 VM cycles. Not great, not terrible.

In the prologue of any function that is call-able, emit code to initialize linear memory as described above before resuming execution of the function. This is to ensure that any references to read-only data remain valid even across context switches.

This is the part where I think we'll need to come up with something new in the VM. I had some initial thoughts about potential approaches. One key question is whether we can cleanly (and consistently) distinguish accesses to read-only segment vs. accesses to regular memory.

bitwalker commented 7 months ago

For this case, I don't think we'll be able to do much better than the approach you described. We can probably create some syntactic sugar in MASM to hide the underlying implementation details, but under the hood, we'll still probably end up using adv_pipe instructions to write the read-only data to memory.

I think one of the big things that feels like we're missing here is a straightforward/canonical way to ship a package that, in addition to the MAST, contains a manifest that, amongst other things, specifies the (optional) rodata segment(s) that the program represented by that MAST expects to have written into linear memory at program startup. The VM could either 1.) handle setting things up (using the advice provider presumably) so that a program can assume that everything it needs is in place when it is run (i.e. it can just start adv_pipeing memory straight away). Or 2.) would manage the gritty details of initializing the program state with that raw data (I suppose by performing the equivalent operations necessary to write that data to memory where it is expected), such that when the program starts, it can assume linear memory is initialized appropriately.

Still, it seems like program startup and context switches due to call are two sides of the same coin, and I can't help but feel like there should be some path to handling the two uniformly. The situation around program startup is less of a concern (and headache) to me than that of call handling.

This is the part where I think we'll need to come up with something new in the VM. I had some initial thoughts about potential approaches. .

I'd be interested to hear more about what you've got in mind when you get a chance to write up your thoughts there, I think it'll probably help me to have a better sense of what the actual solution space is for this.

I do think this is where expanding on the concept of a "context" so that it encompasses some of these details could make a lot of sense, and would interact well with some of the other potential features we've been mulling over. Setting aside the feasibility of various details here, if we assume that each context is associated with a unique linear memory instance, then initialization of that linear memory is conceptually part of initializing the context it is associated with. If it is possible to allow the initial state for a context to be configurable (for example, based on some metadata associated with the callee), then all the necessary pieces are there to provide built-in support for something very similar to how ELF executables initialize program memory at startup.

Specifically, if we assume that something like that is feasible, then you could use something like the following to describe the behavior of any given Miden program:

  1. The root context derives its linear memory initialization scheme based on the entrypoint
  2. The callee context for any call derives its initialization scheme based on the callee
  3. The scheme is derived by: a. Initializing memory with all zeroes b. If the given MAST root has an explicit set of rodata segments associated with it, then they are applied in the order they appear. Overlapping segments are not allowed.
  4. An rodata segment set can be associated with a given MAST root in one of three ways: a. The manifest associated with the MAST specifies a default segment set for all roots local to that MAST b. The manifest associated with the MAST specifies a segment set for each root explicitly c. Some combination of the two

I'm assuming of course that the choice of initializing to all zeroes is essentially arbitrary (or more likely it is convenient for various reasons), but the bottom line is that I'm assuming that this isn't a fundamental limitation.

I'm guessing the big issue here has to do with constraints though, and that's where I lose my intuition for what is actually viable vs what is not. I would assume that the initial state of linear memory is just one of the parameters used to "seed" the initial VM state, but if it only works because memory is assumed to be all zeroes, then I'm not sure what options are available to us. Perhaps there are some other zkVM implementations with solutions to this that might be applicable to Miden?

One key question is whether we can cleanly (and consistently) distinguish accesses to read-only segment vs. accesses to regular memory

In general, there are probably many situations where it is distinct, but the primary issue is that an address taken doesn't have to be used right away, it might be passed as an argument to a function that reads through the pointer as if it could be referencing anything on the heap. Since the actual load happens elsewhere, we won't be able to guarantee that a given load is always from read-only memory.

bobbinth commented 7 months ago

I will write up a more detailed reply in the next day or two, but here is a brief description of some fundamental constraints we have when it comes to memory initialization:

There are two ways to initialize memory that I'm aware of (there may be some newer techniques and maybe @Al-Kindi-0 can chime in here if I'm missing something):

  1. Initializing memory by non-deterministically inverting hash of the memory contents. This is basically what our adv_pipe instruction is meant to help with. The core idea is that the hash of the memory contents (i.e., a word) is provided to the VM as an input, and the VM executes a small program which gets the actual values from the advice provider, writes these values to memory, and then makes sure that the hash of these values matches the hash provided via inputs.
  2. Initializing the memory via boundary constraints against auxiliary trace. This is a sort of a "public table" similar to what we've been discussing in AirScript for variable-length public inputs (and it is also similar to the public memory approach described in the Cairo paper starting at the end of page 18). In this approach, the verifier knows the initial state of the memory, and thus, can compute boundary constraints for it.

Each of these approaches has their tradeoffs. The main downside of approach 1 is that we need to "waste" VM cycles to initialize the memory. The main downside of approach 2 is that it places extra burden on the verifier. Practically this means that we can't initialize "too much" memory using approach 2. What is "too much" is somewhat arbitrary, but in our context, it is probably not more than a few KB of memory.

So, if we need to initialize, say, more than 4KB, then approach 1 is the only practical option and we need to bite the bullet on the extra cycle count.

The way to implement it for executable programs is pretty straight forward:

  1. At compile time we compute the hash of the rodata.
  2. This hash is then hard-coded into the memory memory initialization program which is executed before the actual program. Both these programs are a part of the same MAST.
  3. As a part of the package, in addition to MAST, the compiler includes the actual rodata in a separate section.
  4. When the VM starts executing a specific package, the rodata would be loaded into the advice provider, and from that point on, everything else works as is now.

The changes needed to support this in the VM are pretty small - the main thing here is to align on the package format and this could be done as a part of the refactoring of MAST we are discussing in the VM repo.

As mentioned above, this approach works well for executable files, but for context switches via call instruction it is not ideal. The main reason for this is not that we have to use the memory initialization program for switching contexts (we have to do this in any scenario), but that we have to use it multiple times.

Specifically, the VM right now does not have the ability to enter the same context twice. So, in effect, every call creates a new context. Thus, when we execute the same procedure twice using call instruction`, we'd get two different memory contexts and we'd need to run memory initialization program twice.

In some cases, this is not a problem. For example, a note script context is entered only once, and thus, for executing note scripts, using the same approach as for executable programs would work fine.

In the cases of account code, however, we run into issues. Assuming the entire account module shares the same rodata, any time a procedure is called on the account interface, we'd need to run memory initialization program for the entire rodata segment. Even right now, with a few simple programs we've built for the rollup, there are cases when multiple procedures are called on the account interface during a transaction. And I'm imagining that as programs become more sophisticated, this would become more and more of an issue.

So, the main case I'm trying to solve for is: how can we ensure that memory initialization program is run only once for a given context.

bobbinth commented 7 months ago

Also cc @plafer as he may be interested in this discussion as well.

bobbinth commented 7 months ago

Before getting to potential solutions, I'd like to describe the problem a bit better. I'm specifically going to focus on accounts in the context of Miden rollup because this is the main use case I am thinking of. But I'd imagine that if we find a solution here, it would be applicable more broadly.

So, let's say we have some account similar to our current basic wallet. The code module for this account essentially consists of 3 procedures and looks something like this:

export.receive_asset
    ...
end

export.send_asset
    ...
end

export.auth_tx
    ...
end

Each of these procedures is a separate program with its own MAST root.

All 3 may be called in a course of a single transaction, and the call tree may look something like this:

main (start in context 0)
├── prologue
├── notes
│   └── dyncall (executes note script in context A)
│       ├── call.receive_asset (executes in context B)
│       └── call.send_asset (executes in context C)
├── tx_script
│   └── call.tx_script (executes in context D)
│       └── call.auth_tx (executes in context E)
└── epilogue

Here, I used letters A, B, C etc. for non-root contexts, but in reality these are 32-bit identifiers.

As you can see, the 3 calls to the account interface procedures currently create 3 different contexts. So, if we put a rodata initialization program in the beginning of each of these procedures, we would be executing this code 3 times.

There is also a separate, but related issue. In some cases, we may end up calling just one of the 3 procedures (or we could have an account with 100 procedures but only one gets called in a given transaction). But if rodata is common across all 3, we'll end up initializing all rodata even though a small part of it is needed for a given procedure.

I think the ideal solution would have the following properties (I'm putting aside if this is viable from VM or compiler side):

  1. Each account procedure has its own rodata which is required for its execution.
  2. Rodata for a procedure is initialized only once. This is done on the first call to the procedure. All subsequent calls have access to the initialized data.
  3. It may be useful to extend this beyond call instructions. That is, if we say exec.foo and foo has some associated rodata, then on the first call to foo this rodata should be initialized, and on all subsequent calls it would have access to this rodata.

Again, not sure if the above is achievable or what changes this may entail - but wanted to write this down as I think this would be the ideal solution.

bitwalker commented 7 months ago

Initializing the memory via boundary constraints against auxiliary trace. This is a sort of a "public table" similar to what we've been discussing in AirScript for https://github.com/0xPolygonMiden/air-script/issues/143 (and it is also similar to the public memory approach described in the Cairo paper starting at the end of page 18). In this approach, the verifier knows the initial state of the memory, and thus, can compute boundary constraints for it.

This is probably a lot more general solution, but the amount of rodata (specifically the number of words which are non-zero) is not something I have good data on currently. Could you elaborate on how you estimated 4KB as the upper limit for this? I could actually see benefits to supporting both approaches - one for larger amounts of rodata (i.e. the fallback solution), and one for rodata that can adhere to the tighter constraints. The package manifest would then specify which of the two is expected to be used by the VM at initialization time.

The way to implement it for executable programs is pretty straight forward: ... The changes needed to support this in the VM are pretty small - the main thing here is to align on the package format and this could be done as a part of the refactoring of MAST we are discussing in the VM repo.

This seems fine to me when this approach is used vs Option 2. I'll of course make sure this is addressed in the package format and the MAST refactoring proposals (where applicable).

As mentioned above, this approach works well for executable files, but for context switches via call instruction it is not ideal ...

I think the ideal solution would have the following properties (I'm putting aside if this is viable from VM or compiler side):

I elaborate in quite a bit of detail some of the overall issues we're facing not only in the compiler itself, but when figuring out things like packaging and cross-language compatibility in my recent comments on #119. Many of the points I raise there overlap with the ones you've described here, and I do think there are some foundational things in the VM we will want to try and implement if we can, that I suspect would pave the way for a solution not only to those problems, but this one as well. Let's perhaps continue the conversation in a new issue/discussion pertaining to the overall topic of context switching, the various function call instructions, and the interactions with things like rodata, ABI, and the expectations of the Wasm component model (and/or existing languages in general). I believe all of these things are intertwined at a fundamental level, and finding a set of primitives we can provide in the VM which solve these in a cohesive fashion is going to be tricky, but well worth working out now.

Each account procedure has its own rodata which is required for its execution.

This won't be possible in virtually any existing compiler/language that I know of, due to how compilers in general try to handle non-primitive constant/read-only data. They are always de-duplicated and co-located globally, so trying to trace which bits of rodata are specific to a given function is not practical without some sophisticated static analysis and rewriting of the generated code.

This would of course be possible in a new language of our own design specifically for Miden. Another approach that technically would work for languages like Rust or C (to a limited extent) is breaking things up into multiple translation units (e.g. Rust crates). So each account function would be compiled into its own object, with just the rodata for that function and its local dependencies. That said, that is a pretty horrible solution for a variety of reasons, and would run into issues pretty quick with functions that need to call other functions on the same account (you'd need to merge both functions into a single crate, but now both functions pay the initialization overhead for their collective rodata).

Rodata for a procedure is initialized only once. This is done on the first call to the procedure. All subsequent calls have access to the initialized data.

This point specifically is actually how I imagine linear memory/contexts working with the changes I propose in #119. Contexts would only need to be instantiated under specific conditions, and once instantiated could be kept alive for the remainder of the program's lifetime. This extends to more than just rodata, but also to things like lazy_static! and OnceCell in Rust, and more importantly, "persistent" contexts would provide the basis for resources (as-in Wasm component model resources), which I think is ultimately the kind of primitive we'll need to implement secure authentication/authorization in things like the transaction kernel in a less fragile way.

I'd like to join these two threads of conversation (i.e. the stuff about contexts and the component model in #119 and the relationship and semantics of memory and rodata from here), rather than try to carry on the same underlying conversation in multiple places. Should we set up a canonical thread for this somewhere and copy over some of the "meatier" comments (or reference them in an index of sorts at the top)? Do you have a preference where that happens? It feels like the VM repo might be the best place for it. In fact the Component Model proposal discussion I started may well be a good place to continue discussing these things, since they directly pertain to multiple aspects of that proposal, and the compiler frontend is already building around the Wasm component model (so its semantics affect what solutions are possible for the Rust toolchain). I'm becoming increasingly convinced that the intersection of all of these pieces is of critical importance, and should probably be prioritized above all else, since the effects propagate to just about every corner of Miden, but I know there are a lot of plates spinning at the moment.

bobbinth commented 7 months ago

I'd like to join these two threads of conversation (i.e. the stuff about contexts and the component model in #119 and the relationship and semantics of memory and rodata from here), rather than try to carry on the same underlying conversation in multiple places. Should we set up a canonical thread for this somewhere and copy over some of the "meatier" comments (or reference them in an index of sorts at the top)? Do you have a preference where that happens? It feels like the VM repo might be the best place for it.

Agreed! I'll post my replies in https://github.com/0xPolygonMiden/miden-vm/discussions/1171.

bobbinth commented 7 months ago

One question to clarify about rodata specifically (so, posting here rather than in the more general discussion): how much control do we have over defining the offset for the rodata segment generated by the compiler?

If, for example, we could say that rodata segment always starts at offset $3 \cdot 2^{30}$, then we could detect this in the VM and treat handling of rodata there differently. Specifically:

  1. The VM would fail if a program tried to write to any address greater than $3 \cdot 2^{30}$.
  2. When reading data from an address greater than $3 \cdot 2^{30}$, the VM would perform an implicit translation to read the data from the root memory context.

We still need to solve a couple of other problems (e.g., handling rodata offsets across multiple contexts), but I think if we can do the above - this would move us much closer to an acceptable solution.

bitwalker commented 7 months ago

@bobbinth I did some digging, and we can control the starting address where the rodata section is placed in memory. It requires us to set a linker flag when compiling the module, but we can control that from the Cargo extension, but would need to detect when someone gives us a module that was clearly compiled without that flag set and treat it as a compile error (or implement support for the "slow" path as a fallback, but I'd prefer to treat it as an error in the near term).

The main issue I can see here though is three-fold:

  1. How do we initialize the memory if it is write-protected?
  2. In the rollup, the root context is associated with the transaction kernel, not the code compiled from Rust
  3. We can't assume that the context of the callee (for call) was compiled with the same rodata as the caller anyway, so even if the two issues above are solved, it will not be the case that the rodata for every callee is the same (in fact, it almost certainly won't be).

We still lack a way to associate specific rodata contents with each context. I do like the write-protection mechanism, and would like to take it even a step further (even if only in a limited debug mode or something) to aid in debugging misbehaving programs - but I don't think it solves the core underlying issues which has more to do with initialization and associating specific rodata with specific MAST roots.

bobbinth commented 7 months ago

I'm trying to break the problem into two parts and see if we can solve them separately. The parts are:

  1. Assuming the rodata section is initialized somehow, how can a given context read from it.
  2. How do we initialize the rodata section.

Given that we are able to control the offset for rodata, I can see a couple of potential solutions for the first problem. Here is one example:

  1. We add another register to the VM - e.g., rdo (rodata offset). This would be similar in spirit to our fmp register which is used to identify the memory offset of procedure locals. rdo value would be context-specific (we could also make it procedure-specific, but for now let's not go that far). That is, whenever a new context is entered, the value of rdo is set somehow, and when a context is exited, the value of rdo is set to the value of the parent context.
  2. The value in rdo would be a relative offset in the root context in relation to the start of the overall rodata section (e.g., $3 \cdot 2^{30}$).
  3. Then, when the VM detects that we are trying to read from an address greater than $3 \cdot 2^{30}$, it would perform automatic address translation to the appropriate address in the root context. Specifically, assuming $addr$ is the address we are trying to read, the actual address read from the root context would be $rdo+addr$.

To summarize the above, the VM would have a small address translation unit which would work as follows:

Given a tuple (ctx, addr) where ctx is the context ID and address of the memory read operation, the actual address from which we'd read would be determined as follows:

The main overhead here comes from:

The main remaining open question is how do we actually initialize rodata segments and map them to appropriate offsets. Basically, we need a function which looks something like this:

init_rodata(rodata_hash) -> rodata_offset

As a side effect, this function would "unhash" the rodata committed to by roadata_hash in the root context starting at rodata_offset, or if this rodata has already been initialized, would just return the relevant offset.

One possible solution to this is to do this via a syscall. In this case, the kernel would be the one responsible for managing rodata offsets for all subsequent contexts. Then, at the start of every context we would do two things:

  1. Make a syscall to init_rodata procedure. This would return the value for rdo. a. Here, we'd also need to make an exception to allow root context to write into memory beyond $3 \cdot 2^{30}$.
  2. Set the rdo register to this value. We'll need a new instruction for this.

I think this will work, but it would be nice if initi_rodata was an atomic instruction provided by the VM itself. But I haven't figured out yet how to do this.

bitwalker commented 7 months ago

The address translation part seems solid, that's similar to how virtual memory works on a typical OS, so it shouldn't present any issues for the compiler.

The main remaining open question is how do we actually initialize rodata segments and map them to appropriate offsets. ... One possible solution to this is to do this via a syscall. In this case, the kernel would be the one responsible for managing rodata offsets for all subsequent contexts

So this actually segues nicely into the issue I think I have with one aspect of the design here. Specifically, I think we should avoid using the memory of the root context for this purpose. Additionally, having the kernel code be responsible for implementing the bulk of this functionality feels like it breaks some important properties that the Miden abstract machine currently has (in my opinion):

  1. Aside from some aspects of the syscall instruction (i.e. when it is valid to use), code written/compiled for Miden can treat it as a single target; whether or not the code runs in the root context doesn't matter, the semantics are the same, and enforced by the VM and it's instruction set.
  2. Code running in the root context is not given a higher level of privilege than other contexts - a kernel cannot access the memory of contexts running under it, and the boundary between the kernel and other contexts is strict and well-defined, same as between any two contexts. The root context does get some special benefits (i.e. it can export functions that can be syscalled), but those benefits do not violate the shared-nothing semantics of contexts.
  3. Anyone can plug in their own kernel and build a system on top of it, just like we are doing with the rollup. The rollup kernel isn't special in some way.

While I think all of these are important, I'm especially concerned about 2, as that seems like a significant change in semantics to me. One of the things we can say currently is that you don't need to trust the kernel - it can't read or write your memory, and the only way to interact with it is in the form of syscalls with well-defined boundaries, where you can "trust but verify". In this new model though, the kernel would not only be able to read your read-only memory, it could modify it at any time (both during initialization, and at any point afterwards). I think that's dangerous, even if there isn't a way to exploit that now, it might present such an opportunity at some point in the future to an attentive adversary.

More to the point though, I don't think it is actually essential to the proposed design that the root context be special in this way. It seems to me that the VM itself could have a special context it uses to maintain this read-only memory, and where it can execute code to do things like initialization of an rodata segment - but this would be purely an internal detail. From the perspective of code targeting Miden (including kernels), it appears as if this is all built-in VM functionality. An instruction could be added for the purpose of initializing memory (it might not even need to be an instruction, maybe it is just an implicit part of the context lifecycle), which internally delegates to executing the implementation of init_rodata in the "special" VM context responsible for memory protection.

Thoughts? One thing I'd really like to avoid is further bifurcating the types of artifacts we produce and how they can be combined, and having to treat code in the root context differently than other contexts would make the two mutually incompatible, introducing some unintended complexity I think.

bobbinth commented 6 months ago

One of the things we can say currently is that you don't need to trust the kernel

This is not entirely correct. For example, in the context of the transaction kernel for the Miden rollup, developers of account code and note scripts do need to trust the kernel with quite a few things. Management of account storage is one such thing. For example, it is essential that the kernel does not modify account storage outside of the procedures which are designated to do so.

It is true that the kernel cannot interfere with programs executed in user contexts and change the semantics of any of the syscall-ed procedures (because these are invoked by their MAST roots). But the program executed in the root context (e.g., prologue, epilogue etc. of transaction kernel) do need to be trusted as there is no way to enforce correct execution of these from user contexts.

So, in many ways, the kernel is just as trusted as the VM itself.

I don't think it is actually essential to the proposed design that the root context be special in this way. It seems to me that the VM itself could have a special context it uses to maintain this read-only memory, and where it can execute code to do things like initialization of an rodata segment - but this would be purely an internal detail.

Agreed, and as I mentioned in my comment, I would like to find a way to make initialization of rodata an atomic VM operation. But I also think that this is a much bigger change which requires additional research. For example, we may be able to add a new chiplet to the VM specifically for handling rodata, or we could treat rodata as a special context within the existing memory chiplet.

So, my thinking is to get to the end state incrementally as follows:

bitwalker commented 6 months ago

So, in many ways, the kernel is just as trusted as the VM itself.

I think it might be more fair to state that certain, limited aspects, of the kernel are just as trusted, but the current state of things is that those parts are narrowly scoped and thus easier to review/control, i.e. you can focus auditing efforts on just those couple areas where trust of the kernel is required, and how you interact with them.

However, it the kernel controls all of your read-only memory, it can subtly change things in ways that are non-obvious from the userspace code. You have to assume that any read from read-only memory is a potential attack vector (if you are being overly cautious/paranoid).

In any case, I don't think we have to necessarily deal with this issue right now, just that whatever work we do in this direction is almost certainly a dead-end (i.e. at some point, probably sooner rather than later, we will have to solve this in a way that requires less trust).

..we could treat rodata as a special context within the existing memory chiplet.

I think this is more or less what I was getting at with regards to the root context not being special. Instead of putting control over that memory in the hands of the kernel, we put it in the hands of the VM, and essentially require a proof that the read-only memory initialized by the VM matches that expected by the program. I have some ideas on how I think this could be represented in terms of the execution trace, but not sure how difficult it will be to add the VM instructions necessary to make it usable from MIden Assembly.

So, my thinking is to get to the end state incrementally as follows:

I think your approach here makes sense, and is probably how we'll have to do it. That said, if Step 1 is a significant change, I can't help but wonder how much of a difference in effort we're talking about compared to jumping straight to implementing native support in the VM for this.

Either way, the compiler is almost at the point where we require some solution to proceed, so on the one hand that means I can dedicate time and effort to helping implement either solution to keep things moving along, if this is largely a matter of engineering resources. On the other hand, in order to get an initial MVP in the hands of our users ASAP, whatever hacky kludge we come up with at least gives them something, even if it later requires some pain to transition to a better one.

bitwalker commented 4 weeks ago

This is completed and working for the time being. There is more work to be done, primarily around call, but the fundamental framework is implemented and being used when compiling executable Miden programs to initialize memory as expected.