Open Hyodar opened 3 weeks ago
Btw, if you'll see testnet-0
, this was already implemented in a naive way to avoid any load issues.
Some work was done in #259 but it seems like on master
we already have somewhat of a mechanism for TTL in the form of timeout + retries for the operator's RPC client:
On the aggregator side, we handle 3 messages:
ProcessSignedCheckpointTaskResponse
is the only one that does not handle internally a TTL, though I'm not sure if this is what is being discussed in the ticket.
I propose we set a submission timeout and just ignore messages that are older than that when receiving them
Would adding the following validation to ProcessSignedStateRootUpdateMessage
and ProcessSignedOperatorSetUpdateMessage
be enough?
// Error handling is ignored for brevity
func (agg *Aggregator) validateMessageTimestamp(ctx context.Context, messageTimestamp uint64) error {
currentBlockNumber, _ := agg.httpClient.BlockNumber(ctx)
currentBlock, _ := agg.httpClient.BlockByNumber(ctx, big.NewInt(0).SetUint64(currentBlockNumber))
if messageTimestamp < (currentBlock.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT) {
return errors.New("Message has expired")
}
return nil
}
It'd also be good to then adapt the checkpoint time range logic so it's from lastCheckpointTimestamp + 1 to currentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT.
Would changing the following
to
toTimestamp := block.Time() - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT
be enough?
Let me get up to speed on this, need to gather past thoughts!
Okay, so my understanding on the aggregator is:
We need to introduce MESSAGE_SUBMISSION_TIMEOUT
- this timeout is purely for submission. We're not going to accept messages from timestamps lower than currentTimestamp() - MESSAGE_SUBMISSION_TIMEOUT
- the message timestamp being defined as the timestamp in the message content. This means we need to filter these messages out when receiving (as pointed out by you, but then using the aggregator timestamp as reference as opposed to the current block). The idea for that is to avoid aggregating old messages, which skews the checkpoint content - whoever signs a checkpoint attests that the messages were passed around during that time.
Then, whenever we're making a checkpoint, we should be making it as you linked there.
but then using the aggregator timestamp as reference as opposed to the current block
What is the aggregator timestamp
? Is it the unix timestamp of the machine running the aggregator or is it a domain specific concept?
Unix timestamp, really. The point here is just establishing some baseline for when a message is old or not. Let me know if it makes sense, I may also be forgetting about something here.
In both aggregator and operator we don't really have a timeout for processing messages.
For operators, it means when a message is added to the resend queue, it'll just stay there while it's not resent enough, even though it's a very old message. This is not interesting because (1) the interest in a state root update is immediate, so resending a really old message is essentially unnecessary info and can result in a lot of latency for the actual blocks to be processed when re-upping and (2) the resend queue just grows unbounded, which is never nice. Here, I propose we just check in the resend queue if a message is old enough to be dropped and drop it even if the aggregator connection is down. We could use a fancier queue with TTL (I'd prefer that) or just straight up manage this on
tryResendFromQueue
.For the aggregator, there's still the latency consideration - should take quite a while to get to the 'good' messages and, depending on the situation, aggregating an old message can affect a past checkpoint info. This should ideally be considered in slashing terms as well, but this is not desirable behavior. In that case, I propose we set a submission timeout and just ignore messages that are older than that when receiving them. It'd also be good to then adapt the checkpoint time range logic so it's from
lastCheckpointTimestamp + 1
tocurrentBlockTime - MESSAGE_AGGREGATION_TIMEOUT - MESSAGE_SUBMISSION_TIMEOUT
.