Timeout-based message expiration

Hyodar commented 3 weeks ago

In both aggregator and operator we don't really have a timeout for processing messages.

For operators, it means when a message is added to the resend queue, it'll just stay there while it's not resent enough, even though it's a very old message. This is not interesting because (1) the interest in a state root update is immediate, so resending a really old message is essentially unnecessary info and can result in a lot of latency for the actual blocks to be processed when re-upping and (2) the resend queue just grows unbounded, which is never nice. Here, I propose we just check in the resend queue if a message is old enough to be dropped and drop it even if the aggregator connection is down. We could use a fancier queue with TTL (I'd prefer that) or just straight up manage this on tryResendFromQueue.

For the aggregator, there's still the latency consideration - should take quite a while to get to the 'good' messages and, depending on the situation, aggregating an old message can affect a past checkpoint info. This should ideally be considered in slashing terms as well, but this is not desirable behavior. In that case, I propose we set a submission timeout and just ignore messages that are older than that when receiving them. It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.

Hyodar commented 3 weeks ago

Btw, if you'll see testnet-0, this was already implemented in a naive way to avoid any load issues.

emlautarom1 commented 1 week ago

Some work was done in #259 but it seems like on master we already have somewhat of a mechanism for TTL in the form of timeout + retries for the operator's RPC client:

https://github.com/NethermindEth/near-sffl/blob/8f43707c78ba61820b3d0b18a27c2afe2e930672/operator/rpc_client.go#L20-L23

emlautarom1 commented 1 week ago

On the aggregator side, we handle 3 messages:

https://github.com/NethermindEth/near-sffl/blob/8f43707c78ba61820b3d0b18a27c2afe2e930672/aggregator/aggregator.go#L522

https://github.com/NethermindEth/near-sffl/blob/8f43707c78ba61820b3d0b18a27c2afe2e930672/aggregator/aggregator.go#L551

https://github.com/NethermindEth/near-sffl/blob/8f43707c78ba61820b3d0b18a27c2afe2e930672/aggregator/aggregator.go#L581

ProcessSignedCheckpointTaskResponse is the only one that does not handle internally a TTL, though I'm not sure if this is what is being discussed in the ticket.

emlautarom1 commented 1 day ago

I propose we set a submission timeout and just ignore messages that are older than that when receiving them

Would adding the following validation to ProcessSignedStateRootUpdateMessage and ProcessSignedOperatorSetUpdateMessage be enough?

// Error handling is ignored for brevity
func (agg *Aggregator) validateMessageTimestamp(ctx context.Context, messageTimestamp uint64) error {
    currentBlockNumber, _ := agg.httpClient.BlockNumber(ctx)
    currentBlock, _ := agg.httpClient.BlockByNumber(ctx, big.NewInt(0).SetUint64(currentBlockNumber))

    if messageTimestamp < (currentBlock.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT) {
        return errors.New("Message has expired")
    }

    return nil
}

It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.

Would changing the following

https://github.com/NethermindEth/near-sffl/blob/dfb1874b14eaee8de7d33ac074e71024f4c28f8f/aggregator/aggregator.go#L396

to

    toTimestamp := block.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT

be enough?

Hyodar commented 1 day ago

Let me get up to speed on this, need to gather past thoughts!

Hyodar commented 1 day ago

Okay, so my understanding on the aggregator is: We need to introduce MESSAGE_SUBMISSION_TIMEOUT - this timeout is purely for submission. We're not going to accept messages from timestamps lower than currentTimestamp() - MESSAGE_SUBMISSION_TIMEOUT - the message timestamp being defined as the timestamp in the message content. This means we need to filter these messages out when receiving (as pointed out by you, but then using the aggregator timestamp as reference as opposed to the current block). The idea for that is to avoid aggregating old messages, which skews the checkpoint content - whoever signs a checkpoint attests that the messages were passed around during that time. Then, whenever we're making a checkpoint, we should be making it as you linked there.

emlautarom1 commented 1 day ago

but then using the aggregator timestamp as reference as opposed to the current block

What is the aggregator timestamp? Is it the unix timestamp of the machine running the aggregator or is it a domain specific concept?

Hyodar commented 1 day ago

Unix timestamp, really. The point here is just establishing some baseline for when a message is old or not. Let me know if it makes sense, I may also be forgetting about something here.

NethermindEth / near-sffl

Timeout-based message expiration #212