Develop a way to create and a package format for smart contracts

0xPolygonMiden / miden-base

Core components of the Polygon Miden rollup

MIT License

68 stars 42 forks source link

Develop a way to create and a package format for smart contracts #550

Open Dominik1999 opened 6 months ago

Dominik1999 commented 6 months ago

Feature description

Background

Currently, the Miden client only allows users to create accounts with standardized code (basic wallet and faucets) and storage. We want users to be able to create accounts with custom code and storage and deploy them on Miden using the Miden client. This is the equivalent of deploying Ethereum smart contracts, and it will widen the space for use cases as users can define their account interfaces.

We need to devise a way for users to write their own smart contracts and initialize their custom storage, which results in a .mac file that can be injected into the Miden client. See https://github.com/0xPolygonMiden/miden-client/issues/204.

We could come up with a simple Rust tool at the beginning.

We can mostly follow how we create accounts in here. To create an account, we need the following:

pub struct Account {
    id: AccountId,
    vault: AssetVault,
    storage: AccountStorage,
    code: AccountCode,
    nonce: Felt,
}

Account id

The account ID should be ground at account creation. We should notify the user that this might take some time.

Account vault and nonce

It should be clear that the vault is empty and the nonce is 0. Users can always populate the account's vault in the first transaction.

Account storage

That might be the most complex part. To initialize the AccountStorage, we need to know the type (simple Value, Map, Array) and the content of each slot. We have 255 slots. In technical terms, we need to construct a Vec<SlotItem> whereas a SlotItem consists of (u8, (StorageSlotType, Word)).

One problem is that certain slots are reserved. The basic authentication smart contract assumes that the private key is always at slot 0. So, our tool or package needs to warn the user if he occupies the slot with something else. We should also clearly document this.

Another problem is that we must ensure that the user provides the necessary data if the StorageSlotType is Map or Array.

We should also think of conflict resolution. The tool can notify a user if he wants to occupy the same slot twice.

Account code

After the user has written his MASM or Rust account code, we need to parse the code, pick the right assembler and create the AccountCode. I think this should not be an issue here.

Why is this feature needed?

We want generalized smart contracts. And we want users to write them.

bobbinth commented 6 months ago

A couple of preliminary comments:

Account code

We are moving toward using MAST as the format for serializing Miden VM programs. So, the code included in the package will probably be serialized MAST. This means two things:

The compiler will need to output the account package and MAST code would be a part of this. This is already planned.
We'll need a tool that can do the same but for hand-written MASM code.

Account storage

This is indeed the trickiest part. I actually think what we are looking for is a way to define an "account template" which would consist of MASM/MAST code + storage layout that is required to support this code. This account template can then be instantiated into a specific account (at this time, account ID would also be generated), and then this account can be serialized as .mac file.

The code part should probably in the package format discussed in https://github.com/0xPolygonMiden/compiler/pull/129 (cc @bitwalker, @greenhat).

The storage part could be some metadata file describing which storage slots should be initialized and how. Just to illustrate and using JSON format, for basic wallet the storage template could look like so:

{
  "slots": [
    {
      "index": 0,
      "name": "pub_key",
      "value:" {
        "arity": 0,
        "value": "{{auth::rpo_falcon512::pub_key}}"
      }
    }
  ]
}

This basically specifies that when an account with this template is instantiated, we'll need to get a RpoFalcon512 public key from the user and put it into the first slot of account storage (the expected type of the storage slot would be a simple value with arity 0 - i.e., just a single word).

Of course, we'll actually need to define all the details of this format (what I have above is just a very basic illustration) - but I don't think that's going to be too difficult.

A more difficult question to resolve is what to do when a user combines different components into a single contract (e.g., basic wallet + something else). The main issue is potential conflicts in how different components try to access storage. The simplest solution is to just panic on conflicts. A more sophisticated one would be to try to resolve conflicts by dynamically modifying the code - but this probably best left to the compiler.

bitwalker commented 6 months ago

A more difficult question to resolve is what to do when a user combines different components into a single contract (e.g., basic wallet + something else).

I think to some degree this will tie back to the fundamentals of packaging and distribution of libraries/programs. A package that contains a library which implements an account/contract built out of other, more foundational, libraries/components, is going to be doing so either by referencing code from other packages, or by generating code that integrates behavior provided by another package (e.g. basically along the lines of the difference between calling an extern function in Rust, and calling a generic function in Rust, where the generic function is monomorphized in the caller's crate). I would imagine this would have to extend to data (including config) as well.

I would recommend that we define the details of storage layout abstractly, but deterministically, much like how Rust defines how it lays out types in memory. So if you know the offset at which your storage slots start, you can compute the offset of specific storage keys by adding a zero-based index to that offset. This is akin to how one can compute the memory address of a specific struct field in a complex nested data structure, all you need to know is the offset of the data structure, and from there the rules for deriving the address of a specific field are entirely deterministic.

This would make it possible to combine together storage requirements from multiple "parent" contracts/components in a new contract, as long as they can all fit in 255 slots - each need only know (or be provided) the offset at which their storage begins. Because the derived contract knows all of the components that it is mixing together, it can compute the offset for each of those components statically, so accessing specific slots doesn't necessarily even require any runtime calculations.

Admittedly, I have no idea where we're at with regard to account storage in general (i.e. how it is specified, how it gets initialized, etc.), so maybe some of those details have already been pinned down. It'll certainly be something @greenhat will need to get deeper into pretty soon (right now we've largely been focused on surfacing the low-level primitives, but coming up with a natural idiomatic Rust way of expressing storage is something we haven't really done yet).

We probably should make accommodations in the package format for surfacing additional arbitrary custom package metadata (e.g. Rollup-specific details like whether a package provides an account or note script, what its storage requirements are, etc.). Tooling specific to the Rollup will want to be able to easily access that information, but I don't think we want to bake in Rollup-specific details in the package format, nor do I think we would want to have multiple package formats. Cargo is a good example of this, custom tools can describe metadata in Cargo.toml that is not used by Cargo itself, but this metadata is still accessible to tools querying info about a specific Cargo package via Cargo, or via crates.io, or by parsing Cargo.toml manually. I'd like to use a similar model for our packaging.

With that in mind, we'd then only really need some lightweight tooling, built on top of the more general packaging infra, that can validate how specific packages are used together (i.e. ensuring the package types are compatible, validating that there are no storage conflicts, etc.).

In any case, definitely interested to understand more of the gritty detail of account storage and how some of the interactions will work - definitely a weak spot in my knowledge of the overall system right now.

bobbinth commented 6 months ago

We probably should make accommodations in the package format for surfacing additional arbitrary custom package metadata (e.g. Rollup-specific details like whether a package provides an account or note script, what its storage requirements are, etc.).

Yes - totally agreed: we should have a single package format for the VM programs - but allow users to add custom metadata to these packages (i.e., for rollup purposes).

I would recommend that we define the details of storage layout abstractly, but deterministically, much like how Rust defines how it lays out types in memory. So if you know the offset at which your storage slots start, you can compute the offset of specific storage keys by adding a zero-based index to that offset.

This could work, but has one potential downside: if the offsets are part of the code, then MAST roots of public procedures would change depending on what other components are added to the same contract. In some cases this may be fine, but in other cases, it may break things (e.g., notes would not have a stable MAST root to call on an account interface).

It would be ideal if we could somehow provide these offsets at runtime in a way that doesn't change MAST roots - but I'm not sure it is possible.

There are other solutions to this (e.g., using only storage maps and salting the keys) - but these come with their own set of pros and cons.

bitwalker commented 6 months ago

This could work, but has one potential downside: if the offsets are part of the code, then MAST roots of public procedures would change depending on what other components are added to the same contract.

I'm not sure I follow how this happens - wouldn't the slot indices be fixed when publishing the contract anyway? In other words, if you are "mixing in" elements from other contracts acting as templates, you end up with code that is either unique to the derived contract (and so published with it), or code which is generic across all derivations (and so external to the derived contract and published with the base contract). In the latter case, presumably the base contract would expect to be provided with the offset at which its storage slots begin, at runtime, so that contracts which derive its functionality, but that have storage offset at potentially different slot indices, can all share the same code. In the former case, the slot indices are computed/known statically during compilation, and the code is published with those indices baked in (or, possibly also compiled to be relative to some runtime-provided base offset, if it is intended to itself be derivable).

I'm not sure if what I'm saying is applicable here or not, since I only have a rough understanding of how account storage is supposed to work, but since contracts are going to have to be compiled to MAST, whatever mechanism we come up with here, has to be something that is either known at the time the MAST is compiled (and so baked in), or is provided at runtime. Either way, the code never changes (and can't change anyway), regardless of how the contract is used downstream.

It kind of sounds like the mechanism would need to be based on runtime-provided offsets, but I'm definitely way out of my depth at this point, so I hesitate to opine too much on what makes sense here. The problem statement stood out to me as very similar to how the concept of ABIs, type layouts in memory, and how linkers allocate segments of memory to different parts of a program that are potentially linked in dynamically at runtime; so I do think there is prior art on how to approach that kind of thing when interop is in play - but it might be that there are other elements here that make it a fundamentally unique problem.

bobbinth commented 6 months ago

I'm not sure I follow how this happens - wouldn't the slot indices be fixed when publishing the contract anyway?

This is correct - but let me explain what I mean on a concrete example.

Let's use our BasicWallet as an example. This is not the best example because procedures in this module do not touch the storage at all - but for the purposes of the example, let's pretend that the do.

So, we have two procedures defined as a part of BasicWallet: receive_asset and send_asset. There are standardized notes which assume that these procedures have stable MAST roots (P2ID note is a good example). That is, when we create a P2ID note, we include the MAST root of receive_asset into the script of P2ID note. So, if the MAST root of receive_asset were to change somehow, we would not be able to execute standard P2ID notes against it.

Now, let's say a user wants to create an account which combines functionality of BasicWallet with some other components. Let's assume that these components need to access storage as well. When we compile them together, may need to change storage offsets for receive_asset procedure. While this results in an internally consistent account code (i.e., there are no conflicts between different components and the provided functionality is the same as we'd expect), this could change the MAST root of receive_assets. This would render all P2ID notes sent to this account un-spendable. Or, alternatively, anyone creating P2ID notes would now need to know the specifics of this account to send notes to it (this will probably be unsustainable at scale).

So, basically, in the ideal scenario we want to achieve 2 goals:

Be able to mix and match any number of components to create a single account without creating conflicts between them.
Keep MAST root of exported procedures the same across different accounts - even if different accounts may use other combinations of components.

There are many ways to achieve the above - I think the tricky part is finding a way which minimizes performance overhead. Here is one approach:

By convention, we designate storage slot 0 to store offsets for all components deployed in the account. The type of this slot would be storage map (basically a key-value map with 256-bit keys and 256-bit values). Each component would be identified by some unique 256-bit ID and so, when a component is added to the account, we'd insert a key-value pair for this component into storage slot zero which would look like (component_id, storage_offset).

Then, when a component needs to access storage, it would look up its offset in this map, and then use it accordingly. For example, one of the first steps of receive_asset could be to get the storage offset and save it into memory. From that point on, any time it needs to access storage, it would look up the offset in memory and add it to the storage slot it wants to access.

This does create quite a bit of overhead though. Specifically:

There is an overhead in writing this code correctly. Though, if this is handled by the compiler - it's probably not too bad.
There is an overhead of reading the offset from the storage map and then applying for every storage access. Maybe this is also tolerable.

Also, there is a question of how to we guarantee that the 256-bit component IDs are indeed unique (or whether we can even guarantee this at all). So, I do wonder if there is a better alternative.

bobbinth commented 6 months ago

Another approach, which is simpler in many ways but also less flexible in others:

A component which doesn't want to have any potential conflicts with other components would always use a storage map in some designated slot (let's say storage slot 0 again). This slot would always type storage map - and so, could hold practically unlimited amount of data. To avoid key collisions before inserting values into the storage map, we'd compute the new key a:

new_key = hash(key || component_id)

As long as component ID can be easily identified, this key mapping can be handled by the kernel.

bitwalker commented 6 months ago

Now, let's say a user wants to create an account which combines functionality of BasicWallet with some other components. Let's assume that these components need to access storage as well. When we compile them together, may need to change storage offsets for receive_asset procedure.

I think I see where we're diverging here. My expectation would be that we would not be compiling the code of BasicWallet again, i.e. the MAST root of receive_assets remains the same (it is an external dependency). Or more precisely, compiling the components "together" here actually means compiling the source code for the derived contract, while linking against the procedures provided by the contracts it derives functionality from (e.g. BasicWallet::receive_asset). To be clear, what you are stating absolutely would have been the case where MASM is the artifact we ship, but I'm assuming two things here:

We're shipping BasicWallet (all contracts really) as a package containing MAST
BasicWallet is a contract that was published (presumably on-chain), and is designed to provide core low-level functionality used by higher-level contracts that in turn provide additional, richer functionality.

With those assumptions in mind, the only actual requirement imposed on BasicWallet here, is that it be written in such a way that its storage accesses are expressed in terms of relative offsets, not absolute offsets, and that the "base" offset that the others are relative to, is provided at runtime somehow. So any new contract that derives from BasicWallet, only need provide the offset at which the BasicWallet storage begins, when it calls the BasicWallet APIs (assuming there is no way to provide this configuration in some more global fashion).

If instead, BasicWallet is a library that you sort of vendor into your own project as source code (MASM in this case, but in theory it could be some other language), then yeah, each time it gets compiled, the MAST root could change (though that is true even if nothing else about the source code has changed at all, i.e. a new version of the assembler or the compiler could result in changes to the way the source code is lowered to MAST). That said, it seems to me that in this model, you literally could not bake in any assumptions about the MAST root of any of the procedures even if you wanted to (since you can't feasibly know what they are before the code is compiled). And so things like P2ID wouldn't be portable across different derivations of BasicWallet. For that reason, I'm assuming that my first interpretation is more what we're talking about here, but wanted to at least draw the comparison.

That said, assuming I'm still missing the key thing here that breaks my mental model as laid out above, what you've described as a possible solution seems sound to me. We can likely lean on the compiler to generate some of the more repetitive/annoying code as you mentioned, I don't see any reason why that wouldn't be possible as long as some pretty minimal requirements are met (i.e. that we know the storage layout somehow, that we know all of the call sites where storage is being accessed, and that at each call site we know what "component" it belongs to), and I'm pretty sure that will be the case in general.

bitwalker commented 6 months ago

It does occur to me that one of the things I'm obviously forgetting to account for here are the boundaries of the system with regard to storage. When we talk about BasicWallet as a distinct contract, that kind of implicitly sounds like its APIs are going to be called, but my assumption is that storage doesn't get shared across call boundaries - and so I'm realizing that I'm definitely missing a piece of the puzzle here.

Do we have an up-to-date description/sketch of how we have been expecting storage to work up until now, and how it relates to Miden execution contexts (to the degree there is a relationship there)? I think that would probably sort me out.

bobbinth commented 6 months ago

Do we have an up-to-date description/sketch of how we have been expecting storage to work up until now, and how it relates to Miden execution contexts (to the degree there is a relationship there)?

Unfortunately, there isn't a single comprehensive description of this yet (@Dominik1999 let's create an issue to add this to the docs). The basic design of the storage is described in https://github.com/0xPolygonMiden/miden-base/issues/239#issuecomment-1790102962 and following comments. To summarize:

Account storage consists of 255 slots each of each can have one of 3 types: (1) simple value (a word), (2) an addressable array of word, or (3) a map where both keys and values are words.
- The types of each slot (collectively called "storage layout") are set at account creation and cannot be changed afterwards. There is an issue for relaxing this restriction in the future.
We have kernel procedures which allow account code to read/write from storage. Currently, we have implemented only the procedures for reading/writing simple values (see here and here). Also, in #521 we have implemented support for storage maps, but this hasn't been merged yet.
- These procedures can be invoked only from one of account's procedures (this is enforced by the kernel). So, for example, a note's script cannot directly modify an account's storage.

With those assumptions in mind, the only actual requirement imposed on BasicWallet here, is that it be written in such a way that its storage accesses are expressed in terms of relative offsets, not absolute offsets, and that the "base" offset that the others are relative to, is provided at runtime somehow.

Yes, agreed. I think the tricky part here is how exactly to provide these at runtime in a way which introduces minimal overhead both in terms of performance and complexity.

There is another thing we may want to consider: not all account procedures added to the account have the same requirements. Specifically, I think we have 4 cases:

A procedure does not touch account storage. This is for example the case for send_asset and receive_asset procedures of BasicWallet. This is the simplest case to consider because there is nothing extra we need to do about them (though, it may be beneficial to mark such procedures somehow - e.g., via attributes).
A procedure does need to touch account storage but the MAST of such a procedure does not need to be stable because it is expected to be invoked only via transaction scripts (rather than note scripts). For example, this is the case the basic authentication procedure. This procedure does access storage, but in theory, we can pick any storage slot for it and it wouldn't really matter for anything else. Offsets for such procedures could be defined statically (though, how to do it in the context of the MAST-based package format, is not clear).
A procedure needs to touch account storage and MAST for such procedure must be stable. This is the case for providing offsets at runtime.
A procedure needs to touch the storage and it needs to do it statically (i.e., it always must touch the same storage slot) while maintaining a stable MAST root. In this case, we the only thing we can do is detect if two procedures need to use the same storage slot, and raise an error if the user is trying to add both procedures their account.

So far, we've been focusing on use case 3 (and use case 1 is already handled), but I think we should figure out if we want to (and are able to) support use cases 2 and 4.

Dominik1999 commented 6 months ago

Interesting discussion, indeed.

I will create an issue to explain how account storage works in the docs. I wanted to wait until we merge StorageMaps.

However, aren't we overthinking this here a bit?

So, the problem, as I understand it, lies in the combination of different contracts or procedures that have conflicting storage assumptions. And please correct me, if I am oversimplifying it.

A more difficult question to resolve is what to do when a user combines different components into a single contract (e.g., basic wallet + something else). The main issue is potential conflicts in how different components try to access storage. The simplest solution is to just panic on conflicts. A more sophisticated one would be to try to resolve conflicts by dynamically modifying the code - but this probably best left to the compiler.

So, if we can identify conflicts at account creation/compilation, then we should panic. Let's do more sophisticated architecture later if needed. That shouldn't be too difficult to implement. We could use decorators for storage assumptions and analyze them. Or, we could add metadata to our contract packages or procedures (we only have 3 at the moment) and get the info like this (I think this is what @bitwalker suggested in his first answer).
About adjusting dynamically, as I said, I would do that later. At the moment we have 3 contracts and only 3 storage assumptions (0 - authentication, 1 - faucet assumes token metadata, 255 - storage layout). But in the future, we could say, that all non miden-lib contracts, so user-defined contracts, get only one slot per contract and they must use StorageMaps. So, there is a hardcap for contract mixing / combination of 253 contracts. We can then sort all importer contracts according to their MAST root and map StorageSlots dynamically.

bobbinth commented 5 months ago

Revisiting this again, I think the solution is actually much simpler than what I originally thought.

First, for every storage access we'll implicitly apply an offset. So, for example, when the user executes account::get_item or account::set_item procedures, the kernel will add an offset to the specified slot index. The offset will come from the proc_root |-> offset map that the kernel maintains in its memory. Here, proc_root is the MAST root of the procedure exposed via the account's public interface and it can be looked up via caller instruction at the time when get_item or set_item instructions are called.

The process of performing the offset lookup would be conceptually similar to what is happening in the current authenticate_procedure procedure. In fact, we may want to create a separate authenticate_storage_access procedure which will look something like this:

#! Verifies that the procedure root is part of the account code Merkle tree and applies the offset
#! associated with this procedure to the provided slot index. Panics if the procedure root is not
#! part of the account code Merkle tree.
#!
#! Stack: [PROC_ROOT, slot_index]
#! Output: [PROC_ROOT, offset_slot_index]
#!
#! - PROC_ROOT is the hash of the procedure to authenticate.
export.authenticate_storage_access

One of the implications of this is that both storage reads and storage writes would need to be authenticated (currently, we authenticate writes but not reads) - but I think that's a good idea anyway.

To support the above, we'll need to modify how account code tree is built. Specifically, right now this is a Merkle tree where each leaf is a root of a procedure in the public interface of the account. We'll need to change this to (proc_root, storage_offset) tuples instead.

Separately, it may be a good idea to move away from using a Merkle tree here in favor of a sequential hash which would be "unhashed" into the kernel memory during the transaction prologue. This will substantially reduce the cost of accessing storage and authenticating procedures in general.

The overall mechanism would be that at account instantiation time (or at upgrade time) we'll set storage offsets for all procedures in the account's public interface. These offsets could be computed by the compiler as it combines different components into a single contract, or could be done manually for simpler contracts.

Overall, I think with this we achieve all desired properties:

Each component can be written independently without the need to consider how it will be combined with other components inside a given account.
MAST roots of the components remain stable (i.e., they are not affected by the presence of other components in the account).
The introduced overhead is minimal (probably leas than 10 extra VM cycles per storage access).

One other implication is that we should make the public account procedures accessible only via the call instruction. For this we'll need to have support for attributes implemented in the assembler - but we are going there anyway.

One the "account template" side (mentioned in https://github.com/0xPolygonMiden/miden-base/issues/550#issuecomment-2023854130), the additional storage metadata associated with account code may need to look something like this:

{
  "slots": [
    {
      "index": 0,
      "name": "pub_key",
      "value:" {
        "arity": 0,
        "value": "{{auth::rpo_falcon512::pub_key}}"
      }
    }
  ],
  "offsets": {
    "0x12345...": 0,
    "0x23456...": 0,
    "0x34567...": 5,
  }
}

Where procedures with MAST roots 0x12345... and 0x23456... have the same storage offset 0, while procedure with MAST root 0x34567 has storage offset 5.

bobbinth commented 3 weeks ago

With implementation of #667, we now have the ability to associate storage offsets with every account procedure. This will be enhanced a little with #866 - but overall, this aspect is near its final form. We still need to have a way to specify this info via MASM, but this will be addressed with procedure annotations in Miden Assembly (https://github.com/0xPolygonMiden/miden-vm/issues/1434).

Taking a step back, for account package format, I think we need to have the following pieces:

A way to define account code: this is covered by the current MastForest struct which we derive from MASM.
A way to attach storage attributes to account procedures: this will be covered by procedure annotations, which I assume will also be added to the MastForest struct.
A way to define account ABI for higher-level languages (primarily to be used by the compiler): this could be done either by adding type signatures to procedures as discussed in 0xPolygonMiden/miden-vm#1436 (comment) or by providing WIT metadata for accounts, or both.
A way to specify how account storage is to be initialized whenever a new account is instantiated.

In my mind, for all of these, except for the last item, we have a pretty good idea of what needs to be done. So, it probably is a good idea to start brainstorming what we want to do with the last item as well.

Perhaps, a good way to start is to take a concrete example. Let's say we want to instantiate a fungible faucet account. For this, the storage could look roughly like this:

- slot 0 (value): reserved for internal faucet usage, cannot be set by the user.
- slot 1 (value): authentication data - e.g., public key for RpoFalcon512 signature scheme.
- slot 2 (value): faucet metadata which includes `max_supply`, `symbol`, and `decimals`.

We could described the above with something like this:

[
    {
        "type": "value",
        "value": ["0", "0", "0", "0"]
    },
    {
        "type": "value",
        "name": "pub_key",
        "value": "{{auth::rpo_falcon512::pub_key}}"
    },
    {
        "type": "value",
        "name": "metadata",
        "value": [
            "{{decimals}}",
            "{{symbol}}",
            "{{max_supply}}",
            "0"
        ]
    }
]

The main idea is that values in {{}} would need to be provided by the user - but there are some open questions about these:

If we are going to present these to the user for input, we should probably add much more information for each such element. For example, we need to define what are the valid values (e.g., max_supply must be a value between $0$ and $2^{63}$), provide some basic description of what the field would be used for etc. We probably need to define something like a user_input element which can describe these attributes in more detail.
What should we do about {{auth::rpo_falcon512::pub_key}}. There are a couple of things to consider here: a. Should we expect the user to enter their key directly? If so, we should probably define acceptable format for the encoding these. b. Do we want to make this extensible beyond a very narrow use case of authentication?

I think the way we want to go will probably depend on the answer to this last question. For example, one approach could be that with every account we need to provide an "initialization function" (maybe represented by a WASM program). This function would take some set of user inputs and output the AccountStorage struct (or something like the above JSON but with all {{}} replaced with actual values).