AmplicaLabs / reconnection-service

A Gateway microservice to reconnect graphs in DSNP/Frequency
https://amplicalabs.github.io/reconnection-service/
Apache License 2.0

Plug holes in nonce allocation logic #85

Open JoeCap08055 opened 8 months ago

JoeCap08055 commented 8 months ago

~## Problem # 1 - Nonce "holes"~

~The nonce-incrementing logic currently relies on Redis auto-expiration of keys to allow for the nonce to be "reset" to the last known nonce from the chain. However, this still presents a possibility that when a new block is formed, we end up with a "hole" in the nonce, resulting in a future transaction that may or may not ever make it into a block.~

~## Example~

~Current block: 10~
~Last known nonce @ block 10: 30~
~Redis cache: [] (empty)~

1. ~Post 3 txs, resulting in Redis cache updating to [1,2,3] -> nonces 30, 31, 32 used~
2. ~Before the Redis cache expires, a new block forms. Now the chain last known nonce is 32.~
3. ~Post a new transaction. Redis cache contains [1,2,3], so the next index is 4 -> nonce 32 + 4 - 1 = 35 --> results in "Future" transaction. This transaction may or may not ever be included in a block; it is entirely dependent on whether another transaction comes along at some point to use nonces 33 & 34 before the mortality era of this transaction expires. However, it is very likely that we will violate our own timeout logic before that happens. The result:~
   - ~We could hit our timeout logic, which would fail the job--but the pending transaction could still ultimately be included in a block. Depending on how many times this job has been attempted, it could either be erroneously flagged as a permanently failed job, or be retried later (not so bad; the retry should be a no-op).~

~## Possible solutions~

~### Solution # 1: Block number caching~

~One approach might be to cache the block number at which the current nonce was retrieved, and check it against the current block. If the current block number has changed since our last nonce check, clear the Redis nonce cache so that we start again at the current nonce.~

~#### Risks~

~This approach does not take into account pending extrinsics that were not included in the most recent block, and therefore have already allocated the current nonce (and possibly more). However, since this scenario would result in an immediate RPC error "Priority too low", it may be preferable to having to worry about a "Future" transaction whose ultimate disposition we don't know.~

~### Solution # 2: Handle Future Extrinsics~

~Another approach would be to explicitly screen for "Future" status in the Extrinsic event stream Observable pipeline, and trigger some processing if it's encountered. The difficulty here would be to decide how, exactly, to handle this scenario, as the transaction may or may not still complete either before or after the timeout.~

~#### Risks~

~Complex solution, difficult to cover various cases/scenarios. I prefer solution # 1.~

## Problem # 2 - Redis cache timeout

A second problem is that the current default timeout for the Redis nonce keys is 2 seconds--much less than the current mainnet block time of 12 seconds. This means that it's much more likely to encounter a "Priority too low" RPC error due to re-used (but still pending) nonces, even if solution # 1 is implemented.

Example:

Current block: 10
Last known nonce @ block 10: 30
Redis cache: [] (empty)

## Proposed solution

If, instead of auto-expiring the keys, we implement Solution # 1 above (block-number caching), and only clear the keys when the block number changes, it should solve this issue.
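
A minimal sketch of how this block-number caching could look in the service (TypeScript, using `ioredis` and `@polkadot/api`). The key names, the `getNextNonce` signature, and the non-atomic check-and-reserve loop are assumptions for illustration, not the actual implementation:

```typescript
import Redis from 'ioredis';
import { ApiPromise } from '@polkadot/api';

// Hypothetical key names; the actual service's Redis layout may differ.
const NONCE_SET_KEY = 'nonce:reserved';
const NONCE_BLOCK_KEY = 'nonce:block';

// Reserve the next nonce for `address`, clearing the reservation set whenever a
// new block has been produced since the cache was last built, instead of relying
// on key auto-expiration.
async function getNextNonce(api: ApiPromise, redis: Redis, address: string): Promise<number> {
  const currentBlock = (await api.rpc.chain.getHeader()).number.toNumber();

  const cachedBlock = Number(await redis.get(NONCE_BLOCK_KEY));
  if (cachedBlock !== currentBlock) {
    // A new block has formed: drop local reservations and restart from the chain's view.
    await redis.del(NONCE_SET_KEY);
    await redis.set(NONCE_BLOCK_KEY, currentBlock.toString());
  }

  // The chain's next valid nonce, which already accounts for ready pool transactions.
  const chainNonce = (await api.rpc.system.accountNextIndex(address)).toNumber();

  // Reserve the first nonce at or above chainNonce that has not been handed out locally.
  // (In the real service this check-and-reserve would need to stay inside a single
  // atomic Lua script so concurrent workers cannot reserve the same nonce.)
  let candidate = chainNonce;
  while ((await redis.sadd(NONCE_SET_KEY, candidate.toString())) === 0) {
    candidate += 1;
  }
  return candidate;
}
```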

aramikm commented 8 months ago

I don't think example 1 can happen. Let's review the example based on my understanding of what we do:

Current block: 10
Next valid nonce @ block 10: 30
Redis cache: [] (empty)

1- Post 3 txs, resulting in Redis cache updating to [30,31,32] -> nonces 30, 31, 32 used
2- Before the Redis cache expires, a new block forms. Now the chain last known nonce is 32, so the next valid would be 33.
3- Post a new transaction. Blockchain next valid nonce is 33 and Redis cache contains [30,31,32], so the next nonce is 33, which is what we expected.

A couple of things to remember.

JoeCap08055 commented 8 months ago

> 3- Post a new transaction. Blockchain next valid nonce is 33 and Redis cache contains [30,31,32], so the next nonce is 33, which is what we expected.

Ah--I misread the Lua code. We store the nonces, but we return the index/offset; I was thinking we stored the index/offset.
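
For anyone else reading along, a tiny illustration of why caching the nonces themselves avoids the "hole" in the original example (hypothetical names, not the actual Lua script):

```typescript
// Illustrative only: with the nonces themselves cached, the next nonce is the
// first value at or above the chain's next valid nonce that is not already reserved.
function nextNonceFromCache(chainNextNonce: number, cachedNonces: number[]): number {
  const reserved = new Set(cachedNonces);
  let candidate = chainNextNonce;
  while (reserved.has(candidate)) {
    candidate += 1;
  }
  return candidate;
}

// Matching the walk-through above: cache [30, 31, 32], chain reports 33 after
// the new block, and the result is 33, so no gap is created.
console.log(nextNonceFromCache(33, [30, 31, 32])); // 33
```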

JoeCap08055 commented 8 months ago

So, then, Problem # 2 could still be an issue--if a nonce key expires before the next block is formed, we could attempt to re-use a nonce.

aramikm commented 8 months ago

Based on my current understanding, Problem # 2 could happen, but not in the way mentioned.

One thing to keep in mind is that the nonce increments during the 12 seconds between blocks, not just on block production. This is obvious, since if it did not you would not be able to have more than one extrinsic inside a block. Considering this, the only way PriorityTooLow can happen is if the time between fetching NextAccountNonce from the chain and caching a nonce is more than 2 seconds. This happened 1% of the time when tested in content publishing, but since we get immediate feedback (a PriorityTooLow error) when it happens, a simple retry fixes the issue.
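
A sketch of that retry approach; the `reserveAndSubmit` callback is a placeholder that re-reserves a fresh nonce on each attempt, and the exact error text is an assumption about how the node reports the condition:

```typescript
// Retry a submission when the node immediately rejects it with a
// "Priority is too low" error caused by a re-used (still pending) nonce.
async function submitWithRetry(
  reserveAndSubmit: () => Promise<void>, // placeholder: reserves a fresh nonce, then submits
  maxAttempts = 3,
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await reserveAndSubmit();
      return;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      // The rejection is immediate, so refreshing the nonce and retrying is safe.
      if (attempt < maxAttempts && message.includes('Priority is too low')) {
        continue;
      }
      throw err;
    }
  }
}
```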

aramikm commented 8 months ago

To have a bulletproof nonce service, it makes sense to periodically query something on chain to see if there are transactions in the Future state, and if that is the case we can burn the gapped nonces. This is complex, but possible if required.
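
A rough sketch of what "burning" a gap could look like, assuming a zero-byte `system.remark` is an acceptable filler extrinsic (that choice, and the helper name, are assumptions):

```typescript
import { ApiPromise } from '@polkadot/api';
import { KeyringPair } from '@polkadot/keyring/types';

// Fill a nonce gap by submitting cheap no-op extrinsics at each missing nonce,
// so that any transactions stuck in the Future state become ready.
async function burnNonceGap(
  api: ApiPromise,
  signer: KeyringPair,
  fromNonce: number, // first missing nonce (inclusive)
  toNonce: number,   // first nonce already claimed by a Future transaction (exclusive)
): Promise<void> {
  for (let nonce = fromNonce; nonce < toNonce; nonce++) {
    await api.tx.system.remark('0x').signAndSend(signer, { nonce });
  }
}
```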

JoeCap08055 commented 3 months ago

Note: now that the reconnection service has changed to the model where transactions are submitted to the chain and their inclusion in a block is tracked separately (including tracking mortality/expiration), this seems like much less of a problem.

saraswatpuneet commented 3 months ago

I think one way to do this, as suggested by Aramik, is to get the storage data for in-Future transactions: https://polkadot.js.org/docs/substrate/rpc/#getstoragekey-storagekey-at-blockhash-storagedata

and use that to plug the holes, if any. This is a complex one, but I'm trying it out.

However, does anyone know where to get the storage key?

wilwade commented 3 months ago

@saraswatpuneet I don't think the Future data works, because that requires that you always use the same node, which might not be true due to load balancers and such.

aramikm commented 3 months ago

While that is correct, the account nonce also relies on the data in the pool, so I don't think we can escape the mentioned issue anyway. We might be able to mitigate it with retries and ...

    /// Returns the next valid index (aka nonce) for given account.
    ///
    /// This method takes into consideration all pending transactions
    /// currently in the pool and if no transactions are found in the pool
    /// it fallbacks to query the index from the runtime (aka. state nonce).
    #[method(name = "system_accountNextIndex", aliases = ["account_nextIndex"])]
    async fn nonce(&self, account: AccountId) -> RpcResult<Nonce>;
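
For reference, this is the RPC that polkadot.js surfaces as `api.rpc.system.accountNextIndex`; a minimal usage sketch (the endpoint is a placeholder):

```typescript
import { ApiPromise, WsProvider } from '@polkadot/api';
import { Keyring } from '@polkadot/keyring';

async function main(): Promise<void> {
  // Placeholder endpoint; point this at a Frequency node.
  const api = await ApiPromise.create({ provider: new WsProvider('wss://example-node') });
  const alice = new Keyring({ type: 'sr25519' }).addFromUri('//Alice');

  // Next valid nonce, including transactions still sitting in the node's pool.
  const nextNonce = await api.rpc.system.accountNextIndex(alice.address);
  console.log(`Next valid nonce: ${nextNonce.toNumber()}`);
}

main().catch(console.error);
```
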
wilwade commented 3 months ago

That's correct, but because we track the nonce locally it isn't an issue (except on Redis data deletion and restart while a tx is pending). We should only be calling this once at startup / when we don't have anything pending, right?

> While that is correct, the account nonce also relies on the data in the pool, so I don't think we can escape the mentioned issue anyway. We might be able to mitigate it with retries and ...
>
>     /// Returns the next valid index (aka nonce) for given account.
>     ///
>     /// This method takes into consideration all pending transactions
>     /// currently in the pool and if no transactions are found in the pool
>     /// it fallbacks to query the index from the runtime (aka. state nonce).
>     #[method(name = "system_accountNextIndex", aliases = ["account_nextIndex"])]
>     async fn nonce(&self, account: AccountId) -> RpcResult<Nonce>;
aramikm commented 3 months ago

Based on my understanding, it is not possible to track the nonce consistently and accurately locally for a service that needs high throughput, since we submit extrinsics asynchronously and have to continue before knowing whether each one got included in a block or not. That's why we always call account_nextIndex to check our local value against what the node thinks the next nonce is, to keep them in sync.
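
A small sketch of that reconciliation step (the `localNextNonce` argument stands in for whatever the Redis-backed tracker currently holds; the names are placeholders):

```typescript
import { ApiPromise } from '@polkadot/api';

// Keep the locally tracked nonce in sync with the node: ask the node for its
// next valid nonce (which includes ready pool transactions) and take the larger
// of the two values before reserving the next nonce.
async function reconcileNextNonce(
  api: ApiPromise,
  address: string,
  localNextNonce: number,
): Promise<number> {
  const chainNextNonce = (await api.rpc.system.accountNextIndex(address)).toNumber();
  return Math.max(localNextNonce, chainNextNonce);
}
```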

aramikm commented 3 months ago

I thought it would be hard, but looking into the code, if we decide to return already-submitted nonces we would just need a new RPC, and it looks not that hard to implement judging from how the next nonce is calculated:

    /// Adjust account nonce from state, so that tx with the nonce will be
    /// placed after all ready txpool transactions.
    fn adjust_nonce<P, AccountId, Nonce>(pool: &P, account: AccountId, nonce: Nonce) -> Nonce
    where
        P: TransactionPool,
        AccountId: Clone + std::fmt::Display + Encode,
        Nonce: Clone + std::fmt::Display + Encode + traits::AtLeast32Bit + 'static,
    {
        log::debug!(target: "rpc", "State nonce for {}: {}", account, nonce);
        // Now we need to query the transaction pool
        // and find transactions originating from the same sender.
        //
        // Since extrinsics are opaque to us, we look for them using
        // `provides` tag. And increment the nonce if we find a transaction
        // that matches the current one.
        let mut current_nonce = nonce.clone();
        let mut current_tag = (account.clone(), nonce).encode();
        for tx in pool.ready() {
            log::debug!(
                target: "rpc",
                "Current nonce to {}, checking {} vs {:?}",
                current_nonce,
                HexDisplay::from(&current_tag),
                tx.provides().iter().map(|x| format!("{}", HexDisplay::from(x))).collect::<Vec<_>>(),
            );
            // since transactions in `ready()` need to be ordered by nonce
            // it's fine to continue with current iterator.
            if tx.provides().get(0) == Some(&current_tag) {
                current_nonce += traits::One::one();
                current_tag = (account.clone(), current_nonce.clone()).encode();
            }
        }

        current_nonce
    }