OP Stack Integration

colin-axner commented 11 months ago

Huge thank you to @hamdiallam for walking me through the OP Stack, @AdityaSripal for providing feedback and documenting his thoughts on integration and to @PolymerDAO for discussing their POC with us and providing a path forward to integration!

Context

I will do my best to outline my understanding of how the OP Stack functions. My knowledge is still limited, so please correct me if I have any wrong assumptions.

Flow

At a very high level, an optimistic chain will use its op-node process to generate a blockchain. The batcher makes an RPC request for the transactions in this unsafe blockchain and sends these batched L2 transactions to the L1 as an ordered list of transactions for L2 block execution. The execution engine (typically op-geth) has its own mempool which can accept user transactions for inclusion in a block.

The op-node will listen for transaction submissions by the batcher in the BatchInbox on the L1. It will derive these compressed transactions into a list of transactions which are then executed by the execution layer (commonly op-geth).

The execution layer will execute these transactions and when requested by the op-node, it will commit a block.

In a separate process, the op-proposer will query an L2 rpc (the sequencer rpc) for finalized blocks which are available to be submitted back to the L1 as a block hash.

The op-batcher, op-proposer, op-node and execution engine are all separate services.

Terms:

unsafe head: this refers to the head of the rollup chain which contains blocks which cannot be derived from the L1.
safe head: this refers to the head of the rollup chain which can be derived from the L1 but whose blocks on the L1 have not been finalized.
finalized head: this refers to the head of the rollup chain which only contains blocks that can be derived from finalized blocks of the L1.
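To make the three heads concrete, here is a minimal sketch in Python. It assumes each L2 block records the L1 block that contains its batch (or None if it has not been posted yet) and that we know the highest finalized L1 block. The L2Block type, the l1_batch_block field and the heads function are hypothetical names for illustration, not OP Stack APIs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class L2Block:
    number: int
    # L1 block containing the batch this L2 block can be derived from;
    # None means nothing has been posted to the L1 for it yet.
    l1_batch_block: Optional[int]


def heads(chain: List[L2Block], l1_finalized: int) -> dict:
    """Return the block numbers of the unsafe, safe and finalized heads."""
    unsafe = chain[-1].number  # tip of everything built so far
    safe = finalized = 0       # 0 stands in for genesis in this sketch
    for block in chain:
        if block.l1_batch_block is None:
            break  # from here on, blocks cannot be derived from the L1
        safe = block.number
        if block.l1_batch_block <= l1_finalized:
            finalized = block.number
    return {"unsafe": unsafe, "safe": safe, "finalized": finalized}


chain = [L2Block(1, 100), L2Block(2, 100), L2Block(3, 101), L2Block(4, None)]
result = heads(chain, l1_finalized=100)
```

With L1 block 100 finalized, blocks 1 and 2 are finalized, block 3 is safe but not finalized, and block 4 only exists on the unsafe chain.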

op-batcher

The op-batcher batches and compresses transactions which it submits to the L1 via a regular eth send to a prespecified address (0 eth is sent with the compressed transaction as input data). It will pull transactions from the op-node's unsafe head.
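As a rough illustration of the batcher's job, the sketch below batches a list of L2 transactions, compresses them, and wraps them in a 0 eth send to the inbox address. The JSON encoding, the use of zlib, the placeholder address and both function names are assumptions made for the example; the real batcher uses its own encoding and framing.

```python
import json
import zlib
from typing import List

# Placeholder for the prespecified batch inbox address on the L1.
BATCH_INBOX = "0xBATCH_INBOX_PLACEHOLDER"


def build_batch_tx(l2_txs: List[str]) -> dict:
    """Batch and compress L2 transactions into a single L1 transaction."""
    raw = json.dumps(l2_txs).encode("utf-8")  # hypothetical encoding
    # 0 eth is sent; the compressed batch travels as input data.
    return {"to": BATCH_INBOX, "value": 0, "data": zlib.compress(raw)}


def read_batch_tx(l1_tx: dict) -> List[str]:
    """Recover the ordered L2 transactions, as a deriving node would."""
    return json.loads(zlib.decompress(l1_tx["data"]).decode("utf-8"))


batch_tx = build_batch_tx(["tx-a", "tx-b", "tx-c"])
```

Reading the batch back with read_batch_tx returns the transactions in the order they were submitted, which is what makes the L1 data usable as an ordered input for L2 block execution.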

op-proposer

The op-proposer will periodically query a RollupClient for blocks at a given height. The block returned, OutputResponse, is a type defined by Optimism which allows for easy construction of the L2 Output Commitment, that is, the output hash submitted to the L1.

The op-proposer can be configured to submit safe or finalized blocks.

op-node

The op-node contains code which will parse batches posted by the batcher on the L1, and it will invoke the execution engine to produce safe blocks.

derivation

Transactions may be submitted to an L2 via the batcher or directly to the L1 in the case of censorship. The derivation code will derive all transactions to be executed from the L1 block by filtering for deposit transactions and batch submissions by the batcher address.

Here is where deposits are derived and here is an example of a system transaction being derived and created based on the L1 block information.

The PreparePayloadAttributes will return a modified payload attribute type with extensions required for op-geth. Specifically, it will instruct geth to treat this payload as an "empty" block, but it includes a list of transactions which will be forced into the block. This is the modification required to swap out the geth mempool for the batcher.
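A simplified sketch of this derivation step is shown below. It assumes deposits can be recognised by a flag on the L1 transaction and that batch data is already decoded; in the real code deposits are read from the L1 block and batches must be decompressed first. The L1Tx type, the addresses, the derive function and the dictionary keys are hypothetical and only illustrate the filtering.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class L1Tx:
    sender: str
    to: str
    data: bytes
    is_deposit: bool = False  # in this sketch a flag marks deposits


# Placeholder addresses, not real deployment values.
BATCH_INBOX = "0xBATCH_INBOX_PLACEHOLDER"
BATCHER = "0xBATCHER_PLACEHOLDER"


def derive(l1_block: List[L1Tx]) -> dict:
    """Build payload attributes from the transactions of one L1 block."""
    deposits = [tx.data for tx in l1_block if tx.is_deposit]
    batches = [
        tx.data
        for tx in l1_block
        if tx.to == BATCH_INBOX and tx.sender == BATCHER
    ]
    return {
        # Transactions that are forced into the block, deposits first.
        "transactions": deposits + batches,
        # The engine should not add anything from its own mempool.
        "no_tx_pool": True,
    }


l1_block = [
    L1Tx("0xuser", "0xportal", b"deposit", is_deposit=True),
    L1Tx(BATCHER, BATCH_INBOX, b"batch"),
    L1Tx("0xother", BATCH_INBOX, b"spoof"),  # wrong sender, ignored
    L1Tx("0xuser", "0xsomeone", b"noise"),   # unrelated L1 traffic
]
attributes = derive(l1_block)
```

Only the deposit and the submission from the batcher address survive the filter; data sent to the inbox by any other address is ignored.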

invoking engine (execution layer)

The op-node will drive the execution engine by feeding it transactions, from which the engine produces safe (and eventually finalized) blocks that the proposer can query and submit.

When building a block, the op-node will call StartPayload (start block) which will instruct op-geth to create a block with the provided list of transactions. Note that this required a modification to geth. As noted earlier, geth is creating an "empty" block as far as it knows, but op-geth has modified it to allow for a pre-determined list of transactions to be included in the building process.

The transactions derived for an L2 block are provided to the engine via the engine.ForkchoiceUpdate API. This will return the PayloadID for the block being built.

Once the block has been built, the op-node will call CompleteBuildingBlock. This will incur two API calls to the engine: engine.GetPayload, followed sequentially by: engine.NewPayload.

From my understanding, NewPayload inserts a block into the eth chain without updating the safe head.

Thus, after calling NewPayload, the op-node will update its reference to the chain's head and ask the engine to finalize the update via another call to engine.ForkchoiceUpdate.
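The call sequence described in this section can be sketched against an in-memory stand-in for the engine. MockEngine and build_block are hypothetical; the method names mirror the engine API calls named above, but the arguments and return values are heavily simplified and are not the real engine API types.

```python
from typing import Dict, List, Optional


class MockEngine:
    """In-memory stand-in for the execution engine; records every call."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.head = "genesis"
        self.blocks: Dict[str, List[str]] = {"genesis": []}
        self._building: Dict[str, List[str]] = {}

    def forkchoice_update(self, head: str, attributes: Optional[dict] = None):
        self.calls.append("ForkchoiceUpdate")
        self.head = head
        if attributes is None:
            return None
        # Start building a block with the provided transactions.
        payload_id = f"payload-{len(self._building)}"
        self._building[payload_id] = list(attributes["transactions"])
        return payload_id

    def get_payload(self, payload_id: str) -> dict:
        self.calls.append("GetPayload")
        txs = self._building.pop(payload_id)
        return {"hash": f"block-{len(self.blocks)}", "transactions": txs}

    def new_payload(self, payload: dict) -> None:
        self.calls.append("NewPayload")
        # The block is inserted, but the head is not moved yet.
        self.blocks[payload["hash"]] = payload["transactions"]


def build_block(engine: MockEngine, transactions: List[str]) -> str:
    """Drive one block through the engine the way the op-node is described to."""
    payload_id = engine.forkchoice_update(engine.head, {"transactions": transactions})
    payload = engine.get_payload(payload_id)
    engine.new_payload(payload)
    engine.forkchoice_update(payload["hash"])  # make the new block the head
    return payload["hash"]


engine = MockEngine()
new_head = build_block(engine, ["deposit", "user-tx"])
```

The recorded calls show the order: one ForkchoiceUpdate to start the build, GetPayload and NewPayload to complete it, and a second ForkchoiceUpdate to move the head.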

Integration approaches

Based on my investigation, it appears there are three potential approaches to integrating ibc-go:

1. Embedding ibc-go into the existing execution engine (assumed to be op-geth)
2. Creating a side by side interaction with the existing engine
3. Executing ibc-go before calls to the execution engine and feeding system transaction inputs into the execution engine

Requirements

Based on the IBC specs, all provable state must be included in the output hash posted by the proposer.
All ibc-go execution must support fraud proofs (proofs of ibc-go execution).

At the time of this writing, I have not had time to look into how fraud proofs would work with ibc-go. Based on discussions with external parties, the current working understanding is that it should be possible to reuse the work Optimism has done to cross compile op-geth into MIPS, which can then be used with the Cannon contract to verify fraud proofs.

For now, it will be assumed that while potentially challenging, it should be possible to reuse the existing work.

Approach 1: Embedding ibc-go into the execution engine

In order to embed ibc-go into op-geth, two approaches could be taken:

1. Developing IBC in Solidity.
2. Modifying op-geth to hook into ibc-go if the transaction is an IBC transaction.

Option (1) allows for trivial support of the existing work to reuse fraud proofs.

Option (2) may allow for easy reuse of fraud proofs, but may require modifications to ibc-go to make it cross compilation compatible (unknown assumption).

The complexity of modifying op-geth to support hooks into ibc-go is currently unclear to me. I suspect this work to be non-trivial, as we would need to ensure the provable state used by ibc-go is included in the output hash. In addition, it is likely desirable to build in bi-directional communication between smart contracts and ibc-go.

Modifying op-geth to communicate with ibc-go is very similar to the following approaches. The primary trade-off it makes is being tightly coupled with op-geth. Based on my current understanding, I believe this may be technically difficult and likely a poor abstraction, but I can see that it may benefit adoption.

Approach 2: Side by side execution

The following quotes are from a write up by @AdityaSripal on the side by side execution approach:

The sidecar approach intends to delegate IBC logic to a “sidecar” process that can easily run ibc-go while the target state machine can be as unaffected as possible by core IBC semantics. This will allow target statemachines (e.g. EVM, rollups, external ecosystems) to quickly integrate IBC because they will not be required to directly implement into their own statemachine logic. Instead they will require a trusted process to do all of the IBC logic for them and rely on calls to and from this trusted process in order to interact with the rest of the IBC ecosystem.

The mental model here is to mount the IBC applications directly onto the statemachine, while keeping core IBC logic separated into a sidecar. This allows the developer ecosystem in the target statemachine to directly innovate on IBC while keeping IBC core logic implementation the responsibility of the ibc-go team.

Let us first think about a situation where the sidecar is an externally operated ledger run by a different validator set. Here the IBC protocol runs on a separate chain H on behalf of a chain A. Chain A might send an application packet by emitting events for the packet data signalling a desire to send a packet to chain B.

Chain H must have a light client of chain A that can verify events emitted. Chain H will then run IBC core logic on behalf of chain A (packet commitment). Chain B can now verify the packet using a light client of chain H.

When an acknowledgement returns from chain B, this gets processed by core IBC logic on chain H. Chain H must then send the acknowledgement to the IBC application interface on chain A. Chain A will authenticate using a light client of chain H.

If there is no way for chain A to directly verify the consensus then it must rely on a trusted relayer as a substitute for direct verification.

This approach is the same as outlined in @notbdu's writeup on Virtual IBC and IBC outposts. As Bo notes:

IBC outposts can be either local or remote to the chain itself. While local IBC outposts require an integration directly into a chain client, a remote IBC outpost can enable permissionless integrations of IBC without needing to modify the chain client.

2a: Polymer (delegated chain)

The Polymer approach enables rollups on Optimism to communicate using IBC by using vIBC on the Polymer rollup. Authentication of messages is backed by the underlying eth client on each rollup, which serves as a unified settlement layer.

If chains are unwilling to adopt an execution engine which places ibc-go and op-geth side by side, then they may opt to delegate this responsibility to Polymer. They also may opt to delegate to Polymer if the side by side execution approach is not available yet.

2b: Single rollup

The question is more complicated when we talk about the single rollup case. Here we would have to mount two execution engines onto the same rollup as well as have a way for these execution engines to send messages back and forth.

As noted above, in order for the target state machine and ibc-go to communicate, we will need to verify event emission between the two state machines to pass authenticated messages.

There are two approaches to this:

1. Proving that an event is emitted (async)
2. Generating relay messages when the event is emitted (sync)

In practice, chains which delegate IBC responsibilities will need to utilize async message passing while chains which run IBC in-process can synchronously validate relay messages before committing a block.

An important aspect to note is that most execution engines (geth, ABCI) will execute the full list of transactions once rather than continually accepting transactions to build a block. We need to take this into consideration because we likely need to atomically submit an ordered list of transactions (submitted by the batcher) for each execution engine. We cannot easily execute a single transaction at a time and relay the result to the other execution engine.

Let's look at some other aspects that will come into play for the single rollup approach.

Optimistic bi-directional communication

It is likely desirable to enable IBC applications to be written natively in the target state machine (op-geth). In this case, we need to ensure that smart contracts can send authenticated messages to ibc-go and ibc-go can send authenticated messages to smart contracts.

At a high level, here is an example where we would like bi-directional communication:

1. ibc-go receives a MsgRecvPacket relay message.
2. ibc-go emits an event to indicate a cross domain message should be sent to the related smart contract for the OnRecvPacket callback.
3. an op-geth transaction initiates a cross domain message relay to a smart contract for the OnRecvPacket callback.
4. the smart contract executes the callback, producing an emitted event.
5. ibc-go is fed the OnRecvPacket result via the event emitted by the smart contract.

In this flow, we can note a limitation. Each state machine executes all its transactions for a block once, but step (5) requires ibc-go to first execute steps 1 and 2 and op-geth to execute step 4. This would require 3 rollup blocks to be produced to perform the receive packet step (which normally executes entirely within a single transaction within ibc-go).
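The count of 3 blocks can be checked with a small scheduling sketch. It assumes that a step can run in the same block as its prerequisite only when both run in the same state machine, and that a result crossing from one state machine to the other only becomes usable in the next rollup block. The STEPS table and the schedule function are illustrative names, not existing code.

```python
from typing import Dict, List, Optional, Tuple

# (step number, state machine that executes it, step it depends on)
STEPS: List[Tuple[int, str, Optional[int]]] = [
    (1, "ibc-go", None),   # MsgRecvPacket is received
    (2, "ibc-go", 1),      # event for the cross domain message
    (3, "op-geth", 2),     # relay message to the smart contract
    (4, "op-geth", 3),     # callback executes and emits an event
    (5, "ibc-go", 4),      # OnRecvPacket result is fed back
]


def schedule(steps: List[Tuple[int, str, Optional[int]]]) -> Dict[int, int]:
    """Assign each step the earliest rollup block in which it can execute."""
    engine_of = {number: engine for number, engine, _ in steps}
    block_of: Dict[int, int] = {}
    for number, engine, dep in steps:
        if dep is None:
            block_of[number] = 1
        elif engine_of[dep] == engine:
            block_of[number] = block_of[dep]      # same state machine, same block
        else:
            block_of[number] = block_of[dep] + 1  # crossing costs one block
    return block_of


blocks = schedule(STEPS)
blocks_needed = max(blocks.values())
```

Steps 1 and 2 land in the first block, steps 3 and 4 in the second, and step 5 in the third, which is where the multi-block delay comes from.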

A potential solution to this multi-block delay is to optimistically accept transactions, such as step (5), whose validity will be proved in the final step of committing the block. In situations where the batcher is acting honestly, the IBC flow will have low latency transactions. In the situation where the batcher acts maliciously, the block is marked as invalid, the same result as the batcher withholding batch submissions.

Thank you Aditya for this ingenious idea.

Post Handler Verification

Once both state machines (op-geth and ibc-go) have executed their blocks, a post verification handler can be run on the ordered list of transactions for each respective block. If the list contains cross-domain relay messages, a check should be performed to ensure that the prerequisite event was emitted on the necessary state machine. This post verification only needs to be performed for the synchronous relay message approach (2).

Using the example in the optimistic bi-directional communication section, the following checks would need to be performed before the op-geth and ibc-go blocks can be finalized:

Cross-domain relay message (step 2) requires that ibc-go emitted an event for the OnRecvPacket callback.
ibc-go OnRecvPacket response (step 5) requires that the smart contract emitted an event for the result of the OnRecvPacket callback.

If any of the post verification steps fails, the blocks produced must be aborted and an empty invalid block must be produced in its place.
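A minimal sketch of such a post verification handler follows. It assumes each relay message names the event it depends on and that each executed block exposes the set of events it emitted; the Tx and ExecutedBlock types, the event names and both functions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class Tx:
    name: str
    # Event that must have been emitted by the other state machine, if any.
    requires_event: Optional[str] = None


@dataclass
class ExecutedBlock:
    txs: List[Tx] = field(default_factory=list)
    events: Set[str] = field(default_factory=set)


def relays_backed(block: ExecutedBlock, other: ExecutedBlock) -> bool:
    """True if every relay message in block has its event in the other block."""
    return all(
        tx.requires_event is None or tx.requires_event in other.events
        for tx in block.txs
    )


def finalize(geth: ExecutedBlock, ibc: ExecutedBlock) -> Tuple[ExecutedBlock, ExecutedBlock]:
    """Keep both blocks if all checks pass, otherwise replace them with empty ones."""
    if relays_backed(geth, ibc) and relays_backed(ibc, geth):
        return geth, ibc
    return ExecutedBlock(), ExecutedBlock()


geth = ExecutedBlock(
    txs=[Tx("relayMessage", requires_event="recv-callback")],
    events={"callback-result"},
)
ibc = ExecutedBlock(
    txs=[Tx("MsgRecvPacket"), Tx("OnRecvPacketResult", requires_event="callback-result")],
    events={"recv-callback"},
)
honest = finalize(geth, ibc)

# A batcher includes the relay message although ibc-go never emitted the event.
ibc_without_event = ExecutedBlock(txs=ibc.txs, events=set())
dishonest = finalize(geth, ibc_without_event)
```

In the honest case both blocks are kept unchanged. In the dishonest case the optimistically accepted relay message has no backing event, so both blocks are replaced by empty ones.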

Note: This behavior must be verified. I am unsure what op-stack currently does if the batcher submits an invalid batch. My assumption is that the first block produced in an epoch should only contain deposits and system transactions, that is non-batcher submitted transactions, ensuring that batcher misbehavior does not affect critical components of the rollup guarantees.

Note: events emitted in SDK/CometBFT chains are currently non-deterministic. We would need to ensure we are emitting provable events (i.e. consistent across all nodes/patch releases).

Cross Domain Messenger

Communication with smart contracts may occur by utilizing the cross-chain domain messenger. All sendMessage transactions which come from ibc-go must be validated by proving the corresponding event was emitted. Once the message is "sent", a user can submit a relayMessage which will pass the message to the appropriate contract.

Approach 3: Pre Engine Invocation Execution

Another approach is to execute ibc-go logic before the execution engine is invoked.

To achieve this, derivation code would be added to op-node/rollup/derive which parses an ibc-go transaction from the batcher-submitted batch. The ibc-go transaction would be passed to ibc-go via an RPC call. The response to this RPC call would be translated into an op-geth transaction which would use a predeploy IBC contract to write state into a provableStore.

This is very similar to the approaches listed above, except that it makes modifications to the derivation code. The primary benefit of this approach is that it mainly requires modifications to the derivation layer. One downside is that you would be performing ibc-go fraud proofs in the derivation code. In addition, it is not logically natural to perform execution at the derivation layer, whose primary purpose is decompressing transactions submitted by the batcher and reading transactions submitted via DA (the data availability layer). Another downside is that it requires a special ibc-go transaction for each execution engine (though for the moment op-geth is the primary execution engine).
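The shape of this approach can be sketched as a derivation pass that swaps each ibc-go transaction for an op-geth transaction carrying the ibc-go result. Everything here is an assumption for illustration: the dictionary layout of a transaction, the IBC_PREDEPLOY placeholder address, the derive_with_ibc function and the fake RPC that stands in for ibc-go.

```python
from typing import Callable, List

# Placeholder address for the hypothetical predeploy IBC contract.
IBC_PREDEPLOY = "0xIBC_PREDEPLOY_PLACEHOLDER"


def derive_with_ibc(batch: List[dict], ibc_rpc: Callable[[bytes], bytes]) -> List[dict]:
    """Replace each ibc-go transaction with an op-geth transaction holding its result."""
    derived = []
    for tx in batch:
        if tx["kind"] == "ibc":
            # Execute in ibc-go first; the response holds the state to write.
            state_writes = ibc_rpc(tx["data"])
            derived.append({"kind": "evm", "to": IBC_PREDEPLOY, "data": state_writes})
        else:
            derived.append(tx)  # regular transactions pass through untouched
    return derived


def fake_ibc_rpc(data: bytes) -> bytes:
    # Stand-in for ibc-go: echoes the input with a marker prefix.
    return b"writes:" + data


batch = [
    {"kind": "evm", "to": "0xuser", "data": b"transfer"},
    {"kind": "ibc", "data": b"MsgRecvPacket"},
]
derived = derive_with_ibc(batch, fake_ibc_rpc)
```

After this pass the execution engine only ever sees op-geth transactions, which is why the changes stay inside the derivation layer, and also why ibc-go execution ends up happening there.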

Thank you Hamdi for noting this clever approach.

Considerations

We should have high confidence in our ability to restructure our approach to OP Stack integration. It is better to have an insufficient solution that can be modified over time than to decide on an unmovable solution. We should be able to rework which components do what without fracturing ics20 liquidity. We should also be able to reuse existing work when updating to newer solutions.

As noted by Aditya:

There should be a natural upgrade path for migrating to core IBC being hosted on polymers rollup, to having it hosted on your own.

This will be partially achievable via channel upgradability which will allow changing of connections associated with a channel.

Conclusion

Given the above information, I believe approach (2), creating an environment where multiple execution engines can be supported within a single execution layer (and represented under a single output hash), makes for a sound technical abstraction that can likely be extended to other frameworks as well. This also happens to align with Polymer's conclusion as outlined in the blog post Exploring Virtual Blockchain Design written by @notbdu. Bo does an excellent job visualizing the application/server abstraction and analogizing it to web2 design, showing us that this is a design model which can easily extend into other ecosystems.

In addition, we will want to take inspiration from the (1) approach which would support implementation of IBC components in other execution engines.

To achieve this, we will need to define standards for facilitating bi-directional communication between execution engines, specifically how we expect asynchronous relay messages (messages sent in past rollup blocks) and synchronous relay messages (messages sent in the same rollup block) to be authenticated (proved). In addition, we should define the endpoints with which these relay messages interact to facilitate the next IBC processing step.

To create a multi-execution engine environment, we can start by implementing our own execution engine which embeds op-geth (taking care to handle the single output hash elegantly). I expect it to be possible to add additional abstractions to the op-node which would allow for a standard way to extend the execution layer with multiple execution engines.

norswap commented 11 months ago

A few corrections / additional info:

At a very high level, an optimistic chain will use the batcher as its mempool. Transactions for an L2 can be submitted to the batcher or directly to the L1. The batcher will batch L2 transactions and submit these to the L1 as an ordered list of transactions for L2 block execution.

The execution engine (aka op-geth in OP Labs' version, a fork of geth) holds the mempool. The batcher is a very simple component that queries the op-node (to retrieve the span of blocks not yet submitted, between the "unsafe head" and the "safe head") and the engine (to get the actual blocks), batches them, and submits them to the DA (at present: Ethereum calldata unless you fork the stack).

The mempool can be public or private (this is just normal geth stuff), it's private on Optimism and I think most big OP stack chains (Base etc).

In a separate process, the op-proposer will query an L2 rpc (the sequencer rpc) for blocks which are available to be submitted back to the L1.

You're describing the batcher. The proposer queries the "finalized head" from the op-node and posts its blockhash (and a few other roots derived from it).

The op-batcher, op-proposer, and op-node are all separate services.

And so is the execution engine (op-geth).

ibc-go receives a MsgRecvPacket relay message.

ibc-go emits an event to indicate a cross domain message should be sent to the related smart contract for the OnRecvPacket callback.

an op-geth transaction initiates a cross domain message relay to a smart contract for the OnRecvPacket callback.

the smart contract executes the callback producing an emitted event

ibc-go is fed the OnRecvPacket result via the emitted event by the smart contract

I'm very unfamiliar with IBC, but is it possible to decouple the networking layer from the light client execution?

If so, what I would imagine as a design is have an independent service responsible for network communication. Upon receiving a MsgRecvPacket, it wraps it into a transaction, which it sends to the op-node (alternatively, the op-node can poll the IBC network service, just like it polls L1).

This transaction can either be a new transaction type (to be implemented within op-geth — which already has custom transaction types) or some kind of regular transaction that hits a new pre-compile contract. In both cases, the goal is that during processing by op-geth, the implementation goes to ibc-go and performs the required verifications there, then the requested EVM call.

The IBC network service would then listen to blocks created by op-geth, in particular emitted events, and transmit sent messages to the appropriate chains.

This would require 3 rollup blocks to be produced to perform the receive packet step (normally executes entirely within a single transaction within ibc-go)

This might be the part I understood the least in that section, and maybe my proposal above is hogwash because of it. It seems to entail that receiving a network packet needs to be part of block execution, and furthermore in its own separate block.

Maybe it's due to confusion around events? Like step (2) above produces (in my understanding) a "go-style" event, whereas step (4) produces an EVM event.

colin-axner commented 8 months ago

Thank you so much @norswap for your participation and engagement! I apologize for the delayed reply.

I have updated the issue with your corrections, thank you! These were very useful context fixes for me.

I will reply to your questions and suggestions a bit indirectly in the post below. It would be great to have your input on the approach we took and if there's any design decisions you would have done otherwise.

colin-axner commented 8 months ago

Proof of concept

With tremendous efforts from @srdtrk and @DimitrisJim we spent a few weeks creating a proof of concept for the second approach in this issue. This work builds off the existing work Polymer has done with monomer and goes a bit further exploring the possibility of having two execution engines on the same rollup.

We managed to execute EVM and SDK transactions independently of each other and have the results be fed into a single block returned back to the op-node. Due to time constraints, we were unable to find solutions which allowed us to relay the results of execution in one execution engine to the other state machine.

Below is an architecture diagram which resulted from this proof of concept:

[Image: op-stack integration diagram]

The proof of concept is held up by three repositories:

ibc-interceptor: this is the main repository we worked on
optimism-fork: a fork was only necessary to run e2e's as quickly as possible
monomer-fork: a fork was only necessary to use our own custom simapp and add additional logging

Note: In reality, there is no "sequencing" and "derivation" mode. Both "sequencing" and "derivation" occur in the same process. I separated the two as I find it conceptually much easier to reason about since they have different code traces. Using my terms, in sequencing mode the execution engine is responsible for gathering transactions which are executed and included in a block based on a timer which is managed by op-node and communicated via the engine API. In derivation mode, op-node will provide a list of ordered transactions. My understanding is that sequencing and derivation mode are run in the same process in order to handle the unhappy path where the derivation logic needs to interrupt the sequencing logic due to a reorg or mismatch in block hashes.

Components

Optimism

As noted, this section remained unchanged. The fork we used was only to run e2e's and we did not spend time investigating if it'd be possible to do without a fork. Given that the op-stack is in production, it is marked entirely as green.

op-geth

This component remained unchanged, however we did run into difficulties inspecting the results of a block in active building which was necessary for trying to execute a series of transaction components in a single block (if a single transaction requires two components to be executed in different execution engines). This is noted below in the concerns section.

ibc-interceptor

The goal of the interceptor is to intercept rpc calls from op-node directed for the execution engine and to act as a middleware which enables a sidecar process to be run. This sidecar process can then have its results fed into the block which is returned to op-node.

One lesson from this process is that the base execution engine (op-geth) should be given a higher status than the sidecar process (SDK state machine), as it is required to handle additional rollup-specific logic which we do not require from the other execution engine.

In general, the interceptor will parse user transactions, forward them to the respective execution engine based on the runtime mode (sequencing or derivation), receive back the responses from execution, generate any additional transactions to be executed, compose the block results into a single block and relay signal instructions from op-node as necessary (start/stop block production).
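The routing and composition part of this can be sketched as follows. The sketch assumes each transaction arrives already tagged with the engine it belongs to and that an engine is just a function from an ordered list of transactions to a list of results. The intercept function and the two stand-in engines are hypothetical and leave out sequencing versus derivation mode, generated transactions and the start/stop signals from op-node.

```python
from typing import Callable, Dict, List, Tuple

# An engine takes an ordered list of transactions and returns their results.
Engine = Callable[[List[str]], List[str]]


def intercept(txs: List[Tuple[str, str]], engines: Dict[str, Engine]) -> Dict[str, List[str]]:
    """Route each (engine name, payload) pair and compose one block result."""
    routed: Dict[str, List[str]] = {name: [] for name in engines}
    for engine_name, payload in txs:
        routed[engine_name].append(payload)  # order within an engine is kept
    # Each engine executes its own ordered list once.
    return {name: engines[name](batch) for name, batch in routed.items()}


engines: Dict[str, Engine] = {
    "evm": lambda batch: [f"evm-ok:{tx}" for tx in batch],
    "sdk": lambda batch: [f"sdk-ok:{tx}" for tx in batch],
}
block = intercept([("evm", "a"), ("sdk", "b"), ("evm", "c")], engines)
```

The composed result keeps the per-engine ordering but loses the interleaving between engines, which hints at why relaying results from one engine to the other within a single block is the hard part.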

Conceptually what we worked on was not executing IBC transactions on a rollup, but allowing for multiple execution engines on a single rollup.

This section is marked as orange or red depending on the components which were implemented in the proof of concept.

Quick links to code:

monomer

As noted, there were minimal changes required for monomer. Primarily allowing us to use a custom simapp (which is an issue which will be addressed in that repo). Any other concerns are noted in the concerns section below.

contracts

We deployed two contracts (IBCStandardBridge and IBCCrossDomainMessenger). The basic idea was to have an escrow contract, that is, a contract which fulfills the bank keeper interface used by IBC transfer.

This section was not thoroughly explored and was used mostly as a placeholder.

The good

No fork of op stack needed. Wahoo!
No fork of op-geth needed (yet)
Using local rpc's for execution engine interception worked out great! We also would make rpc calls to the two execution engines making for a clean design.
Using local rpc's for mempool interception worked just as well!
Reusing monomer was very smooth and reduced the complexity of the interceptor code significantly!
No changes required in ibc-go (thus far)

Concerns

We ran into difficulties generating transactions based on op-geth results. When an EVM transaction is received and executed in geth, we would like to be able to generate a transaction for the SDK execution engine (if necessary) which can be executed in the same block during sequencing mode. Unfortunately, we were unable to find a way to inspect the result of this transaction execution using the API available within op-geth. This leads us to believe that we would need a fork to expose some functionality to inspect transaction results before a block is closed and returned to op-node. It is unclear to me whether using transaction results from a block being built is discouraged due to some side effect I am unaware of. We would like the second component to be executed in the same block, as the second component should only be executed if the first component succeeds; if they are executed in different blocks, it may be difficult for the interceptor to be aware of the first component's transaction result (as I write this, I wonder if the second component can be linked via a tx hash to the first component?).
Block composition (combining 2 blocks into one) was very hacky. There are many fields in the execution engine API representing the block. We modified the ParentHash, BlockHash and PayloadID to get the tests passing. This results in 3 hashes or identifiers for each (one for op-node, one for op-geth, one for monomer). Another approach might be to extend the existing block type with a new field dedicated to the information required by this additional execution engine to function properly.
Performance. We didn't explore this at all, but it would be a serious concern to ensure an additional execution engine does not affect the processing time of the rollup without this extra execution engine.
Gas consumption. There needs to be a mechanism to charge gas within the base execution engine for execution performed outside in another execution engine. This requires any additional execution engines to supply gas calculations and there to be some conversion contract which can charge gas for these out of band transactions.
Monomer needs to be updated to allow for "live transaction execution". That is, to execute transactions in the sequencing mode as they are received, rather than when it is told by op-node to close the block.

Unexplored

Gas calculation in additional execution engine resulting in gas consumption in the base execution engine.
Transaction formatting. Ideally an eth transaction is submitted which contains SDK transactions in its data field.
Derivation mode. We didn't manage to get tests passing for this. We did not return the transactions executed for the unsafe chain (combining SDK/EVM transactions into one list which is later derived).
Control logic on mempool. Our extremely simple mempool forwards transactions to the execution engines without delay. Ideally there would be some control logic which follows the control signal provided by op-node via engine API.
Light client code on the cosmos chain side.
The SDK is working on a state transition function which would likely reduce the unnecessary complexity of the additional state machine.
Channel creation flow. How is an escrow contract associated with a specific channel which refers to a certain counterparty chain?

Conclusion

While difficult, the proof of concept process was extremely fruitful and rewarding in terms of the information and insight it provided. We were unable to fully explore all aspects necessary in achieving our goal of IBC integration in the op-stack, but we were left with a concrete architecture diagram which acts as a starting point for this discussion. I believe the currently proposed architecture to be quite complex and will certainly change and adjust with additional information, inputs and perspectives.

crodriguezvega commented 5 months ago

Huge shout-out to @colin-axner, @srdtrk and @DimitrisJim for the excellent work researching, designing and implementing the PoC.

After thoughtful deliberation, we have decided to put on hold the work on this approach for integration with OP stack. Due to the high level of complexity of the solution that has been investigated, the unknowns that remain, and the difficulty of generalising this solution to other frameworks/ecosystems, we have decided to explore a different approach and see where that takes us.

cosmos / ibc-go

OP Stack Integration #5458
