Achieving constant sized state for unordered channels, by restricting the headers you are allowed to look at for timeouts. This then fixes the IBC packet ACK state growth problem, where currently state size for IBC is linear in history size!

Sorry for the hasty writeup, I will try to come back and improve this with better naming, but I wanted to get this written before the idea is lost!

Problem were trying to solve

Today if I do a packet send with a timeout from chain A -> B, chain B stores the packet acknowledgement for succesful sends forever. This makes chain B's state linear in the total number of IBC packets ever succesfully sent to it.

This is because the logic on chain A for checking timeouts is "check a chain B block header after timeout height/time. If chain B.Time > packet timeout time AND you show me an exclusion proof stating that this packet did not get into chain B by this height, chain A will timeout the packet." So its unsafe to delete this.

The new idea is to make parameters limiting how long packet ACKs are stored, and how long timeouts can be, which will guarantee that the chain A and chain B no longer have to store all old packets. Now the timeout logic will be "check a chain B block header in this [height/time] range". The range begins at the timeout time, and ends at some "max ACK storage period". I'll point out below edge case handling, but this will

Prior suggested solution

There has been suggestions that after chain B has succesfully received a packet, we can make another packet to chain A + B, that effectively communicates the succesful receipt from chain B to chain A, and then lets chain B delete the idea. This has been written up in issues in IBC-go, calls, and most recently this note: https://hackmd.io/xRlEU2WPQtmXhN00GQSSTQ?view#State-Pruning

It is unclear how to incentivize this final packet, without introducing some sort of economic incentive for fee rebates / state deletion rewards. We can't really make it chain code, due to effectively needing this oracle into another chain. The economic incentive feels ok, but has notable protocol and fee market complexity.

New idea

(NOTE: There are two options, maintain an Acknowledgement storage period and max timeout time delta, or just an Acknowledgement storage period. You can convert between both of them, by just changing the timeout parameters that are checked on each chain. I consider it easier for understanding to define this with two variables first)

We make chain B maintain a "Acknowledgement storage period", and a "Max future timeout time delta". I suggest we parameterize both periods at a default of say 1 week. The unordered channel maintains knowledge of both these parameters, and by construction every channel to chain B must have the channel settings <= chain B's settings.

When I send a packet from chain A to chain B on channel C, chain A and chain B both check that the timeout isn't too far in the future (obeys Max future timeout time delta relative to their local time).

When I want to timeout a packet P with timeout time T, I show any block header on chain B that has timestamp between [T, T+ max_acknowledge_storage_period]

On chain B, you store every packet's acknowledgement from when it is received, until the packets timeout time + max acknowledgement storage period. Since max packet timeout time is bounded, the total time a given ACK is stored before it can be pruned is bounded.

Clock sync concerns

If chain A misbehaves or has clock sync problems, you can have a packet in outbound queue on chain A with a timeout of 1 month, but it won't land on chain B until 3 weeks in (Due to the max timeout delta parameter. This is fine , since it depends on user fault and chain A fault. Also users / dapps should really set lower timeout times and have retry loops.

Offline chain handling

Note we have to deal with one edge case where there is no block on chain B between packet timeout time and packet timeout time + max acknowledgement storage period. There are a few solutions to this. Its straightforward to prove there was no block in this time range. You just give the block headers for two consecutive blocks N, N+1, and look at their timestamps. It suffices to check either block N, or block N+1 in this case for non-inclusion.

In the event we decide to use block N+1, we just have to amend the pruning logic to state that "there has been at least one block after packets timeout time + max acknowledgement storage period before we prune. (Very minor change)

Very old packet pruning (weak subjectivity concerns)

Currently IBC only really stores "recent" client versions. Weak subjectivity means that outside of the bonding period, we can't just rely on checking that the validator set signed something to know an old block is in the chain (unless the old blocks validator set has >= 2/3rds overlap with current). So if your not in the case that the validator set has >= 2/3rds overlap with the current trusted set, then you have to check that some trusted block hashes commits to the given old block. (e.g. its in the block header hashchain)

First thing we should do is implement checking that if the old block has >= 2/3rd overlap with current trusted validator set, then we don't need any additional proofs. (This is the happy path that will most likely be hit)

I see three options for when we are not in that case.

Have chain A store clients of chain B once per B's day. Then have old timeouts past weak subjectivity require showing the header hashchain from them to a trusted block. (Note that this is data intensive, but not compute intensive. We don't need to check intermediate headers, but instead just hash them)
Have chain A store clients of chain B once per B's unbonding period, and not prune these. Then have chain B store all of their own block hashes in the unbonding period, and then you do a state query and check for equality with the provided hash. (Note that the Cosmos SDK x/staking already does this)
Use an append-only commitment structures in live state that are append only for committing to all old headers. E.g. an append only merkle tree, where full nodes / validators don't store the state constituting the tree. This idea is actually the Merkle Mountain Range (MMR) idea popular in BTC land for this exact problem. Then you can verify every old block header against live state rather easily.

Relayers

So in the happy path, literally nothing changes for relayers. So software works as normal.

The most concerning unhappy path was that a relayer missed timing out a packet until after max_acknowledge_period elapsed. Then they have to query an archive node. However this is already the case for relayers today! So its hopefully a light change to just say that "sufficiently old timeouts need to query an archive".

Relayers can even skip the offline chain handling at version start, since its such a rare problem. (Chain down for 1+ week)

There is complexity in handling for the weak subjectivity problems, but the 2/3rd overlap condition significantly reduces this problem in production. When we are in 2/3rds overlap, ~no change is needed. Relayers would need to have new code for the case where you aren't in 2/3rds overlap, but that also seems like it could come later. The MMR style solution is easiest.

Pedagogy of idea

I think the pedagogy of this idea is fun, so noting it here. We spent awhile thinking about how to fix state growth of IBC, and never had satisfying solutions. E.g. https://github.com/cosmos/ibc-go/issues/2869

In the SNARK account working group, for enabling ZK-rollups, we had been concerned with how to handle replay protection for SNARK accounts while also:

having minimal Celestia side logic + state
Not enforcing ordered withdraw semantics due to DOS reasons
Allowing many independent users to submit withdraws for a single rollup at once

I had pretty quickly proposed Option #4 as written up here - (apologies, I didn't do a great job in that writeup, Didn't spend enough time on it)

@ebuchman had been doing a great job pointing out how all this SNARK account idea should really just be modelled as ICA's, and honing in on that abstraction. (E.g. we solve replay protection there) In chats he also pointed out how this feels very much like unordered txs that we designed in the SDK.

After that, we realized that this idea is really an improvement to unordered channels everywhere in IBC :)

cosmos / ibc

Achieving constant sized state bloat for unordered channels #1100