astriaorg / astria

A monorepo containing all the custom components of the Astria network, a decentralized system that replaces traditional sequencers, offering a shared, permissionless sequencer network.
https://www.astria.org/
Apache License 2.0
105 stars 69 forks source link

Minimize resubmission of same data by sequencer-relayer to Celestia #1200

Open Fraser999 opened 4 weeks ago

Fraser999 commented 4 weeks ago

Currently if the relayer process exits for any reason, or if the submission flow is broken after a BlobTx is broadcast to Celestia (e.g. via a gRPC timeout) then there is a high chance that the relayer will resubmit the same data in a new tx.

While this isn't problematic from an end-to-end flow perspective, we want to try and minimize the chances of this happening for efficiency and cost reasons.

To handle the restart case:

We should update the presubmit.json file once we have prepared a new BlobTx ready for broadcasting to Celestia with the TxHash of the BlobTx and a current timestamp.

In BlobSubmitter::run, on startup after initializing the client we can then read that optional value from submission_state. If it is Some, we can then use the client to confirm whether that tx exists on Celestia by polling GetTx as normal, but only for up to one minute after recorded timestamp.

If the tx isn't confirmed or there was no in-flight tx recorded in the presubmit file, the flow continues as per currently. If the tx is confirmed, we just skip blocks in the main select! loop until we reach the sequencer height specified in the presubmit file.

To handle the timeout case:

In submit_with_retry, we're already passing a last_error_receiver to the client try_submit method (so it can potentially get the required fee from the error message). Rather than passing the receiver to the client, we can instead retrieve the error inside tryhard::retry_fn and if the error indicates a broadcast timeout, we can use the same new confirm method on the client as used in the restart case (i.e. it only tries for up to one minute before assuming the broadcast was unsuccessful). If the tx is confirmed, flow continues as normal. If the tx is not confirmed, we fall into the retry flow and attempt to resubmit from scratch.

SuperFluffy commented 3 weeks ago

We should update the presubmit.json file

Is this only a change to the data we write to the file? Or does this mean that the presubmit.json is written more than once during a broadcast? I am asking because the split into presubmit- and postsubmit was also to guard against write errors to either file.

If the presubmit.json is to be written repeatedly, then I would change the code to a) write a new file, and b) override presubmit.json.

ready for broadcasting to Celestia with the TxHash of the BlobTx and a current timestamp.

In addition to the current timestamp, could this also take a <celestia-height at submission>? I am not a fan of timestamps because they are so sensitive to external conditions, whereas the celestia height would also provide a proxy for time

To handle the timeout case:

That sounds reasonable to me.

Fraser999 commented 3 weeks ago

Is this only a change to the data we write to the file? Or does this mean that the presubmit.json is written more than once during a broadcast?

I'm proposing to have one extra write per submission - just before we send the actual BroadcastTx.

If the presubmit.json is to be written repeatedly, then I would change the code to a) write a new file, and b) override presubmit.json.

Sounds fine to me - can do.

In addition to the current timestamp, could this also take a ? I am not a fan of timestamps because they are so sensitive to external conditions, whereas the celestia height would also provide a proxy for time

Well, I generally agree about favouring something like a block height over timestamps. But in this case I think the timestamp is a bit simpler since we're using that to calculate a timeout instant for stopping GetTx polling. If we were to use Celestia block height, we'd need to also be polling the Celestia app for current block height at the same time as polling GetTx. This is do-able of course, but then we'd also need to handle the case where polling for current block height fails - probably with a time-based timeout.

Fraser999 commented 3 weeks ago

I'd like to propose that we move to a single file solution for tracking submission state. On every update, we'd write the new state to a temp file and move it to the desired path. The Rust objects would be:

struct LastSubmission {
    celestia_height: u64,
    #[serde(with = "as_number")]
    sequencer_height: SequencerHeight,
}

enum State {
    Fresh,
    Started {
        #[serde(with = "as_number")]
        sequencer_height: SequencerHeight,
        last_submission: LastSubmission,
    },
    Prepared {
        #[serde(with = "as_number")]
        sequencer_height: SequencerHeight,
        last_submission: LastSubmission,
        tx_hash: String,
        #[serde(with = "humantime_serde")]
        at: SystemTime,
    },
}

I think this alone gives us everything needed to be able to continue submitting from where we left off when the process restarts.