hackaugusto opened 7 months ago
As mentioned above, batch proving will be implemented with a cluster of machines. Currently it is unclear if BatchBuilder is an abstraction over the cluster, or a single machine in said cluster.
IMO it makes sense to abstract over the strategy (local/distributed) instead of over a single prover instance, because the latter would need extra code (and maybe even another trait).
If that is the case, then IMO the return type should contain a `Future`, so that the distributed case can resolve it to the result and report errors. (Currently it seems this is implemented with a new task.)
Back pressure from the batch builder. Currently the tx queue pushes batches to the builder on regular intervals without back pressure.
If we agree on using the `Future` as mentioned above, back pressure can be implemented by wrapping the future in an option; the type would be:

```rust
async fn build_batch(&self, txs: Vec<SharedProvenTx>) -> Option<Future<Result<(), BuildBatchError>>>;
```
- `async fn`, because it allows the inner code to use the gRPC clients without much fuss.
- `Option`, to report back pressure: when `None` is returned, the queue should wait a while before trying again.
- `Future`, to represent work that is possibly done remotely.

The last thing is the dependency graph. At the moment the graph is:
```mermaid
graph
API -->|add_transaction| TransactionQueue;
TransactionQueue -->|verify_tx| StateView;
TransactionQueue -->|build_batch| BatchBuilder;
BatchBuilder -->|build_block| BlockBuilder;
BlockBuilder -->|apply_block| StateView;
```
IOW, roughly like a pipeline. (Note: The control flow is a bit more complicated ref ).
This pipeline-like design has a few consequences:
IMO it should instead be a design like the following:
```rust
#[async_trait]
pub trait BatchBuilder {
    async fn build_batch(
        &self,
        txs: Vec<SharedProvenTx>,
    ) -> Option<Future<Result<TransactionBatch, BuildBatchError>>>;
}

#[async_trait]
pub trait BlockBuilder {
    async fn build_block(
        &self,
        batches: Vec<TransactionBatch>,
    ) -> Option<Future<Result<Block, BuildBlockError>>>;
}
```
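As written, `Option<Future<...>>` is not directly expressible, since `Future` is a trait; in practice the returned future would need to be boxed. Here is a hypothetical std-only sketch of how the shape could be spelled, with stub domain types and the outer `async fn` rendered as a plain method returning the boxed future (this is not the actual miden-node code):

```rust
use std::future::Future;
use std::pin::Pin;

// Stub domain types, for illustration only.
pub struct SharedProvenTx(pub u64);
pub struct TransactionBatch(pub Vec<u64>);
pub struct BuildBatchError;

// The `Future` from the proposal, made nameable by boxing.
pub type BatchFuture =
    Pin<Box<dyn Future<Output = Result<TransactionBatch, BuildBatchError>> + Send>>;

pub trait BatchBuilder {
    /// `None` signals back pressure; `Some(fut)` resolves to the proven batch.
    fn build_batch(&self, txs: Vec<SharedProvenTx>) -> Option<BatchFuture>;
}

pub struct LocalBatchBuilder {
    pub at_capacity: bool,
}

impl BatchBuilder for LocalBatchBuilder {
    fn build_batch(&self, txs: Vec<SharedProvenTx>) -> Option<BatchFuture> {
        if self.at_capacity {
            // Back pressure: the caller should retry later.
            return None;
        }
        // Stand-in for local or remote proving work.
        Some(Box::pin(async move {
            Ok(TransactionBatch(txs.into_iter().map(|t| t.0).collect()))
        }))
    }
}
```

The `Send` bound on the boxed future matters for the distributed case, where the future is driven from a different task than the one that created it.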
That allows the `TransactionQueue` to take over the responsibility of the `StateView`, and hopefully reduces the number of tasks (not sure about this yet, but reducing synchronization and test flakiness is one of my goals). The simplified control flow would then be:
```mermaid
sequenceDiagram
    actor Network
    participant TxQueue
    participant BatchBuilder
    participant BlockBuilder
    participant Store

    Note over Network,Store: New tx
    Network->>TxQueue: add_transaction
    activate TxQueue
    TxQueue->>TxQueue: verify_transaction
    alt is_valid
        TxQueue->>TxQueue: add_tx
    end
    alt has_txs_for_batch
        TxQueue--)BatchBuilder: build_batch
    end
    deactivate TxQueue

    Note over Network,Store: New batch
    BatchBuilder--)TxQueue: batch_done
    activate TxQueue
    TxQueue->>TxQueue: add_batch
    alt has_batches_for_block
        TxQueue--)BlockBuilder: build_block
    end
    deactivate TxQueue

    Note over Network,Store: New block
    BlockBuilder--)TxQueue: block_done
    activate TxQueue
    TxQueue->>Store: apply_block
    TxQueue->>TxQueue: update_inflight
    deactivate TxQueue

    Note over Network,Store: Periodic
    activate TxQueue
    TxQueue->>TxQueue: tick
    loop for each tx group
        TxQueue--)BatchBuilder: build_batch
    end
    alt has_batches
        TxQueue--)BlockBuilder: build_block
    end
    deactivate TxQueue
```
Note: the workflow above, where `build_block` is called as soon as there are enough batches for a block, means that block times are not regular. The trade-off here is higher throughput and lower latency in exchange for an irregular block time. I think it is a good trade-off, and we can use the block's timestamp as the basis for a time signature (however, this has implications on some other parts of the protocol, like P2IDR, which should be re-evaluated).
Note 2: the workflow above is simplified, as it does not account for back pressure handling, but it gives the overall flow of data in the system; back pressure should be pretty easy to implement on top of it.
The trait has a single method:
```rust
async fn build_block(&self, batches: Vec<SharedTxBatch>) -> Result<(), BuildBlockError>;
```

Which is called by `TransactionQueue`.
This is not accurate (transaction queue pushes transaction batches into the batch builder, and the batch builder pushes batches into the block producer), but since the diagram in https://github.com/0xPolygonMiden/miden-node/issues/191#issuecomment-1911855544 is correct, I'm assuming this is a typo?
IMO it should instead be a design like the following:
```rust
async fn build_block(
    &self,
    batches: Vec<SharedTxBatch>,
) -> Option<Future<Result<TransactionBatch, BuildBlockError>>>;
```
I'm not quite following this part:

- Why are we returning a `TransactionBatch` from `build_block()`?
- Assuming the above is a typo (i.e., `TransactionBatch` should be `Block`), why do we return a `Future` within an `Option`?

Basically, when we pass a set of batches into `build_block()`, unless there is an error, we should get a block back, and the `Future` part is handled organically by the `async` function.
That allows the `TransactionQueue` to take the responsibility of the `StateView`
This approach could work and we actually considered it initially (basically to have a single component which manages the whole state with several tasks interacting with it). The current approach tries to follow an "actor model" where there are multiple components each responsible for maintaining their own state. I am not against re-evaluating this though, especially if it leads to a simpler design.
One other general note: we should keep in mind that we are building towards a fully decentralized system (even if it takes us some time to get there). This means that we shouldn't assume that there is a single block builder or even that the block builder will be known ahead of time.
This doesn't have a ton of implications for now, but there are a few. The main one relevant here is that block production times should be relatively regular. Currently, we are aiming for 3-second block times (though this can change).
This is not accurate (transaction queue pushes transaction batches into the batch builder, and the batch builder pushes batches into the block producer), but since the diagram in https://github.com/0xPolygonMiden/miden-node/issues/191#issuecomment-1911855544 is correct, I'm assuming this is a typo?
Ah, yes, I meant to write the issue about the batch builder, as in the issue title. Not sure why I pasted the links to the block producer over there.
Why are we returning a TransactionBatch from build_block()?
Because of the typo, it should have been the batch builder.
Assuming the above is a typo (i.e., TransactionBatch should be Block), why do we return a Future within an Option?
It is an `Option` of a `Future`.

The `Option` would signal that the batch builder has enough work, and the transaction queue should wait before pushing extra work. This implements back pressure.

The `Future` works like a `Promise` from the JS world: basically, an object that will be resolved once there is a value. It represents the fact that batch proving is asynchronous. Returning a `Future` should remove the need for spawning a task (maybe), and it signals to the caller that it will receive the results back.
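A hypothetical std-only sketch of this caller-side contract, with an `mpsc::Receiver` standing in for the returned `Future`/`Promise` and all names illustrative rather than the real API:

```rust
use std::sync::mpsc::{self, Receiver};

type Batch = Vec<u64>;

/// `None` = back pressure (the builder is saturated; retry later).
/// `Some(rx)` = a handle that will eventually yield the proven batch.
fn build_batch(at_capacity: bool, txs: Vec<u64>) -> Option<Receiver<Result<Batch, String>>> {
    if at_capacity {
        return None;
    }
    let (done, rx) = mpsc::channel();
    // A real implementation would prove the batch remotely; here the
    // "promise" is resolved immediately for illustration.
    done.send(Ok(txs)).expect("receiver alive");
    Some(rx)
}
```

The caller either waits on the handle for the result, or on `None` backs off and retries later; errors travel back through the same handle.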
This approach could work and we actually considered it initially (basically to have a single component which manages the whole state with several tasks interacting with it). The current approach tries to follow an "actor model" where there are multiple components each responsible for maintaining their own state. I am not against re-evaluating this though, especially if it leads to a simpler design.
I don't know all the theoretical baggage of the actor model framework. But my main takeaway is the lack of locking primitives for synchronization, which I'm not proposing to use here, so in that regard it should have no downside (as in, this also doesn't deadlock).
Are there any other benefits we are looking for?
The `Option` would signal the batch builder has enough work, and the transaction queue should wait before pushing extra work. This implements back-pressure.

Could we not accomplish this by adding something like a `BuildBatchError::Busy` variant? Basically, instead of returning `None` when the batch builder is busy, we'd return a `Busy` error.
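For comparison, a minimal sketch of the error-variant alternative (stub types; `Busy` and `Failed` are illustrative names, not the real variants):

```rust
#[derive(Debug, PartialEq)]
enum BuildBatchError {
    /// Builder is at capacity: the queue should back off and retry later.
    Busy,
    /// Proving failed: the transactions should be requeued and reported.
    Failed,
}

fn build_batch(at_capacity: bool, txs: Vec<u64>) -> Result<Vec<u64>, BuildBatchError> {
    if at_capacity {
        return Err(BuildBatchError::Busy);
    }
    Ok(txs)
}
```

One consequence of this shape is that the caller must distinguish `Busy` (retry the same transactions) from real failures (requeue and report), whereas the `Option` keeps the two signals in separate channels.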
A couple of general comments about Tx queue and batch builder:
The way they are set up currently is very primitive - this was basically implemented to make them work in the simplest possible way. So, we should view these as skeletons of what needs to be there. Here is a rough sketch of the flow we'd like to achieve eventually:
As transactions are submitted, they are added to the transaction queue. The only check we make here is that transactions are valid. I think this part works fine.
Assuming the batch builder is not busy, we want to submit transactions to it for batching (if it is busy, ideally, we won't even send a request to batch at all). Here we have 2 considerations:
The batch selection process right now is very primitive: periodically, we grab transactions from the queue, split them up into equal groups, and send them off to the batch builder. We should encapsulate this logic so that it is easy to update without affecting the rest of the system. For example, we could have something like this on the transaction queue:
```rust
/// Extracts groups of transactions from the current transaction queue. `ProposedBatch` could be
/// as simple as `Vec<ProvenTransaction>`.
///
/// `num_batches` could come from the batch producer indicating how much capacity it has.
pub fn select_batches(&self, num_batches: usize) -> Vec<ProposedBatch>
```
The current implementation of this function would look something like this (note the function must be `async` since it awaits the queue lock):

```rust
pub async fn select_batches(&self, num_batches: usize) -> Vec<Vec<ProvenTransaction>> {
    let txs: Vec<ProvenTransaction> = {
        let mut locked_ready_queue = self.ready_queue.write().await;
        if locked_ready_queue.is_empty() {
            return Vec::new();
        }
        locked_ready_queue.drain(..).collect()
    };
    txs.chunks(self.options.batch_size).map(|txs| txs.to_vec()).collect()
}
```
The implementation of `try_build_batches()` could then look something like this:
```rust
async fn try_build_batches(&self) {
    let num_batches = self.batch_builder.get_capacity();
    if num_batches == 0 {
        return;
    }

    let proposed_batches = self.select_batches(num_batches).await;
    for proposed_batch in proposed_batches {
        let ready_queue = self.ready_queue.clone();
        let batch_builder = self.batch_builder.clone();
        tokio::spawn(
            async move {
                match batch_builder.build_batch(proposed_batch).await {
                    Ok(_) => {
                        // batch was successfully built, do nothing
                    },
                    Err(e) => {
                        // batch building failed, add txs back at the end of the queue
                        ready_queue.write().await.append(&mut e.into_transactions());
                    },
                }
            }
            .instrument(info_span!(target: COMPONENT, "batch_builder")),
        );
    }
}
```
Another point is that if we do want to have one central place to maintain global state (e.g., something like a `TransactionPool`), we probably need to change how the batch and block builders work. Currently, they build batches/blocks and also process the results (e.g., `build_batch` and `build_block` do not return anything). But with this approach, they would need to return the result to be processed accordingly. For example, the `try_build_batches` function would change like so:
```rust
async fn try_build_batches(&self) {
    let num_batches = self.batch_builder.get_capacity();
    if num_batches == 0 {
        return;
    }

    let proposed_batches = self.select_batches(num_batches).await;
    for proposed_batch in proposed_batches {
        let tx_queue = self.tx_queue.clone();
        let batch_queue = self.batch_queue.clone();
        let batch_builder = self.batch_builder.clone();
        tokio::spawn(
            async move {
                match batch_builder.build_batch(proposed_batch).await {
                    Ok(batch) => {
                        // batch was successfully built, add it to the batch queue
                        batch_queue.write().await.push(batch);
                    },
                    Err(e) => {
                        // batch building failed, add txs back at the end of the queue
                        tx_queue.write().await.append(&mut e.into_transactions());
                    },
                }
            }
            .instrument(info_span!(target: COMPONENT, "batch_builder")),
        );
    }
}
```
Could we not accomplish this by adding something like a BuildBatchError::Busy variant? Basically, instead of returning None when the batch builder is busy, we'd return a Busy error.
It does work. But IMO this is a more complicated API; for example, it adds the need for the `get_capacity` you mentioned above. My original idea was to isolate things that can be done synchronously from things that must be done async.
It does work. But IMO this is a more complicated API; for example, it adds the need for the `get_capacity` you mentioned above. My original idea was to isolate things that can be done synchronously from things that must be done async.
I probably still don't fully understand your proposal. Let's say we have the following method on the `BatchBuilder`:
```rust
async fn build_batch(
    &self,
    transactions: Vec<ProvenTransaction>,
) -> Option<Future<Result<TransactionBatch, BuildBatchError>>>
```
How would it be called from the transaction queue/pool (i.e., what would `try_build_batches()` look like)?
I probably still don't fully understand your proposal. How would it be called from the transaction queue/pool?
The proposal is basically:

- the `txqueue` should have handles to the in-flight batches (the `Future` above), and collects the results as each `Future` resolves.

I did a few experiments and I no longer think that returning a task/`Future` is a good idea. Here is why:

- The code would need a dynamic container of `Future`s, similar to a `Vec<Future>`, which waits for all the `Future`s concurrently. `FuturesUnordered` is the closest to a solution that I found; the problem with this structure is that it doesn't block when it is empty.
- The goal was to have all the `Future`s managed by a single task, so that it would be easy to propagate cancellations to the inner `Future`s. The problem with returning a `Future` is that `async fn`s are unnamed types, and the compiler also does `Sync` estimation. The first point means we have to use `Box<dyn Future>` to store said futures, which is a bit cumbersome. The second point means that the code would be brittle: it is easy to break `Sync` in an `async` task and end up in a situation where the code needs to be reworked.

Here is a concrete example. I think changing the type to `JoinHandle<Result<Batch, Error>>` like you suggested would be better indeed, at the cost of some races with spawning/back pressure handling. The important piece of code is `txqueue`.
The sample above shows how to collect all the handles, but I have not looked into cancellation. The code is just a sample: it runs and shows the concepts, but it is not prod code (so things like names, grouping of functions, or even the types were not really the focus).
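To make the "queue holds handles to in-flight work" idea concrete, here is a hypothetical std-only sketch, with `std::thread::JoinHandle` standing in for `tokio::task::JoinHandle` and stub types throughout (all names are illustrative):

```rust
use std::thread::{self, JoinHandle};

type Batch = Vec<u64>;

struct TxQueue {
    /// Handles to batches currently being proven.
    in_flight: Vec<JoinHandle<Result<Batch, String>>>,
    max_in_flight: usize,
}

impl TxQueue {
    fn new(max_in_flight: usize) -> Self {
        TxQueue { in_flight: Vec::new(), max_in_flight }
    }

    /// Returns false (back pressure) when too many batches are in flight.
    fn try_build_batch(&mut self, txs: Vec<u64>) -> bool {
        if self.in_flight.len() >= self.max_in_flight {
            return false;
        }
        // Stand-in for dispatching the batch to a (possibly remote) prover.
        self.in_flight.push(thread::spawn(move || Ok(txs)));
        true
    }

    /// Drains finished work; a real implementation would poll the handles
    /// concurrently (and handle cancellation) instead of joining in order.
    fn collect_batches(&mut self) -> Vec<Batch> {
        self.in_flight
            .drain(..)
            .filter_map(|handle| handle.join().ok().and_then(Result::ok))
            .collect()
    }
}
```

Because the handles live in one place, the queue can both enforce a cap on in-flight work and observe completions, which is the same role the `Option`/`Future` pair plays in the earlier proposal.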
The batch builder is currently two pieces:

- the `BatchBuilder` trait
- the `DefaultBatchBuilder` implementation

The trait has a single method, which is called by `TransactionQueue`. The current implementation pushes transaction batches on regular intervals.

Things to discuss and improve on the current design:

- whether `BatchBuilder` is an abstraction over the cluster, or a single machine in said cluster.