0xPolygonMiden / miden-node

Reference implementation of the node for the Polygon Miden rollup

[block-producer]: implement a new skeleton #491

Open Mirko-von-Leipzig opened 1 week ago

Mirko-von-Leipzig commented 1 week ago

Following the redesign discussion, we've decided to try out an event-loop based design, as roughly outlined in this comment chain.

It's not 100% clear how well this will fit with our data and mental model, so as an initial prototype we'll implement a skeleton for review.

bobbinth commented 1 week ago

Assuming this works, this will supersede #185, #191, and #196.

bobbinth commented 1 week ago

In terms of events, here is my take on the list (just what's top of mind - probably missing a few things or could be wrong entirely):

```rust
enum Event {
    /// Adds a transaction to the pool. TransactionInputs (which should probably be
    /// renamed to something else) are retrieved from the store before the event gets
    /// added to the queue.
    ///
    /// This event gets added to the queue from RPC calls. We'll also need a way to
    /// communicate the result (if the transaction was added to the pool or not) back
    /// to the RPC.
    AddTransaction(ProvenTransaction, TransactionInputs),

    /// Selects the next set of transactions from the pool and sends them to the batch builder.
    ///
    /// This event gets added to the queue either periodically (if the pool is empty) or as
    /// batch builder capacity becomes available.
    SelectBatch,

    /// Adds the batch to the pool.
    ///
    /// This event is added to the queue once the batch builder completes construction/proving
    /// of a batch.
    BatchProduced(Batch),

    /// Releases all the transactions that were being batched into a batch with the specified ID.
    ///
    /// This event is added to the queue if batch production fails. We may also want to include
    /// some information about the reason so that we could take some action in the pool (e.g.,
    /// drop some problematic transactions).
    BatchFailed(BatchId),

    /// Selects the next set of batches from the pool and sends them to the block builder.
    ///
    /// This event gets added to the queue periodically, but only after the previous block has
    /// been built (i.e., no two blocks can be built at the same time).
    SelectBlock,

    /// Applies the block to the pool.
    ///
    /// This event is added to the queue once the block builder constructs/proves the block and
    /// saves it to the store.
    BlockProduced(Block),

    /// Releases all the batches that were being put into the block with the specified hash.
    ///
    /// This event is added to the queue if block production fails. We may also want to include
    /// some information about the reason so that we could take some action in the pool (e.g.,
    /// drop some problematic batches).
    BlockFailed(Digest),
}
```
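
For illustration, a minimal sketch of the loop that might drain such a queue, assuming a tokio mpsc channel; the pool methods mirroring the events are hypothetical placeholders (none of these names are settled), and a `block_inflight` flag enforces the single-block rule noted on SelectBlock:

```rust
use tokio::sync::mpsc;

// Hypothetical driver: a single consumer of the event queue, so all pool
// mutations are serialized without any locking.
async fn run_event_loop(mut events: mpsc::Receiver<Event>, mut pool: TransactionPool) {
    // Enforces "no two blocks can be built at the same time".
    let mut block_inflight = false;

    while let Some(event) = events.recv().await {
        match event {
            Event::AddTransaction(tx, inputs) => pool.add_transaction(tx, inputs),
            Event::SelectBatch => pool.select_batch(),
            Event::BatchProduced(batch) => pool.batch_produced(batch),
            Event::BatchFailed(id) => pool.batch_failed(id),
            Event::SelectBlock if block_inflight => {
                // A previously selected block is still being built;
                // skip (or raise an error) rather than select another.
                continue;
            }
            Event::SelectBlock => {
                block_inflight = true;
                pool.select_block();
            }
            Event::BlockProduced(block) => {
                block_inflight = false;
                pool.block_produced(block);
            }
            Event::BlockFailed(hash) => {
                block_inflight = false;
                pool.block_failed(hash);
            }
        }
    }
}
```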
Mirko-von-Leipzig commented 1 week ago

I've prototyped a few variations of this and I'm not really happy with any of them.

The primary reason is that I'm failing to encode constraints like only having a single block inflight in combination with block timer events. It's of course possible, but the code is unnecessarily complex, and making changes will also be painful. There is a simple model hidden somewhere, but given that we're still going to change the rules several times, I don't think it's worth pursuing.

I'm going to pivot to the Arc<Mutex<TransactionPool>> suggestion you made here.

A concern with it was transaction inputs getting undue priority. However, we can resolve this externally by limiting concurrency to N=1 (or whatever) using a channel or semaphore.

This does remove the ability to prioritize certain events over others internally, but so long as the lock is held only briefly, we can live with a fair mutex, provided transaction submissions are limited in their concurrent access.

The benefit is that we can more easily separate out parts of the puzzle instead of embedding them into a single god-object.

We can always revisit this or move things towards a different model once we better understand or run into limitations.
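
As a rough sketch of what that could look like, transaction submissions grab a semaphore permit and do their store lookup before briefly taking the lock. The `fetch_transaction_inputs` helper and the error type here are made up for illustration:

```rust
use std::sync::Arc;
use tokio::sync::{Mutex, Semaphore};

// Shared, lock-protected pool; each component holds a clone of the Arc.
struct BlockProducer {
    pool: Arc<Mutex<TransactionPool>>,
    // Caps how many AddTransaction calls can be in flight at once
    // (N = 1 here, as suggested above), so transaction submissions
    // cannot crowd out the batch/block builders.
    tx_permits: Arc<Semaphore>,
}

impl BlockProducer {
    async fn add_transaction(&self, tx: ProvenTransaction) -> Result<(), AddTransactionError> {
        // Acquire a permit before doing any work.
        let _permit = self.tx_permits.acquire().await.expect("semaphore closed");

        // Store access happens outside the lock: fetch the inputs first...
        // (hypothetical helper that queries the store)
        let inputs = fetch_transaction_inputs(&tx).await?;

        // ...then hold the (fair) mutex only briefly to mutate the pool.
        self.pool.lock().await.add_transaction(tx, inputs);
        Ok(())
    }
}
```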

bobbinth commented 1 week ago

The primary reason is that I'm failing to encode constraints like only having a single block inflight in combination with block timer events.

Do we need more here than raising an error if SelectBlock is called while block construction is in progress? (i.e., before BlockProduced or BlockFailed is called for a previously selected block).

The way I was thinking about it was that the block producer component would be responsible for putting SelectBlock events into the queue - and the strategy of how they get there would be up to the block producer (i.e., if there are timeouts involved or whatnot).
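
As one possible shape for that strategy, the block producer could own a timer task that simply enqueues SelectBlock on an interval, leaving the single-inflight enforcement to the event handler. The channel type and timing details here are assumptions:

```rust
use std::time::Duration;
use tokio::sync::mpsc;

// Hypothetical timer task owned by the block producer: it enqueues
// SelectBlock on a fixed period; the single-inflight rule is enforced
// wherever the event is handled (e.g., by erroring or skipping while a
// block is being built).
async fn block_timer(events: mpsc::Sender<Event>, period: Duration) {
    let mut ticker = tokio::time::interval(period);
    loop {
        ticker.tick().await;
        if events.send(Event::SelectBlock).await.is_err() {
            break; // event loop has shut down
        }
    }
}
```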

I'm going to pivot to the Arc<Mutex<TransactionPool>> suggestion you made here.

In this approach, would we still handle all store requests outside of the pool? (i.e., by the time we lock the pool, we already have all the data from the store needed to fulfill a given function call).

Mirko-von-Leipzig commented 6 days ago

The primary reason is that I'm failing to encode constraints like only having a single block inflight in combination with block timer events.

Do we need more here than raising an error if SelectBlock is called while block construction is in progress? (i.e., before BlockProduced or BlockFailed is called for a previously selected block).

The way I was thinking about it was that the block producer component would be responsible for putting SelectBlock events into the queue - and the strategy of how they get there would be up to the block producer (i.e., if there are timeouts involved or whatnot).

It's doable, but I no longer believe the queue will simplify matters. Now that transaction inputs are externalized, the pool can be a very simple thing, and then all the queue is doing is expressing the same methods in a weirder syntax.

I'm going to pivot to the Arc<Mutex<TransactionPool>> suggestion you made here.

In this approach, would we still handle all store requests outside of the pool? (i.e., by the time we lock the pool, we already have all the data from the store needed to fulfill a given function call).

Yes, the pool wouldn't interact with the store at all. It essentially only knows about inflight data; store access will be performed by the transaction, batch, and block producers.
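
In other words, the pool's surface could shrink to something like the following, where every method operates purely on in-memory inflight state and all store-derived data arrives pre-fetched. Method names and signatures are guesses, not a committed API:

```rust
// Hypothetical pool interface: note there is no store handle anywhere.
impl TransactionPool {
    /// `inputs` were already fetched from the store by the caller.
    fn add_transaction(&mut self, tx: ProvenTransaction, inputs: TransactionInputs) {
        todo!()
    }

    /// Hands the next set of transactions to the batch producer.
    fn select_batch(&mut self) -> Option<(BatchId, Vec<ProvenTransaction>)> {
        todo!()
    }

    fn batch_produced(&mut self, batch: Batch) { todo!() }
    fn batch_failed(&mut self, id: BatchId) { todo!() }

    /// Hands the next set of batches to the block builder.
    fn select_block(&mut self) -> Option<Vec<Batch>> {
        todo!()
    }

    fn block_produced(&mut self, block: Block) { todo!() }
    fn block_failed(&mut self, hash: Digest) { todo!() }
}
```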

Something I'm still unclear on is whether we want to actively cancel inflight batches, e.g. when a batch/block fails and invalidates some inflight batches. Since the pool is the one tracking the dependencies, it would need some way of signalling to the batch producer, and similarly to the block producer once we allow multiple inflight blocks.

In this case we would want the pool to push instead of being pulled from. But maybe for now we just let the batches complete and then ignore them as they return to the pool.

bobbinth commented 6 days ago

In this case we would want the pool to push instead of being pulled from. But maybe for now we just let the batches complete and then ignore them as they return to the pool.

Yep, letting them complete and then ignoring/logging upon return seems like a reasonable solution to me.
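
Concretely, the "ignore on return" path might look something like this inside the pool, assuming it tracks the IDs of batches it has handed out (the field, `batch.id()`, and `BatchId: Ord` are all illustrative assumptions):

```rust
use std::collections::BTreeSet;

struct TransactionPool {
    // IDs of batches currently being proven by the batch producer
    // (other fields omitted).
    inflight_batches: BTreeSet<BatchId>,
}

impl TransactionPool {
    /// Called when the batch producer returns a finished batch.
    fn batch_produced(&mut self, batch: Batch) {
        // If a failure already invalidated this batch, its ID is no longer
        // in the inflight set: log and drop the result instead of having
        // tried to cancel the proving job mid-flight.
        if !self.inflight_batches.remove(&batch.id()) {
            tracing::info!("ignoring stale batch returned to the pool");
            return;
        }
        // ...otherwise record the proven batch as eligible for a block.
    }
}
```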