cosmos / ibc

Interchain Standards (ICS) for the Cosmos network & interchain ecosystem.
Other
908 stars 386 forks source link

Achieving constant sized state bloat for unordered channels #1100

Open ValarDragon opened 2 months ago

ValarDragon commented 2 months ago

Achieving constant sized state for unordered channels, by restricting the headers you are allowed to look at for timeouts. This then fixes the IBC packet ACK state growth problem, where currently state size for IBC is linear in history size!

Sorry for the hasty writeup, I will try to come back and improve this with better naming, but I wanted to get this written before the idea is lost!

Problem were trying to solve

Today if I do a packet send with a timeout from chain A -> B, chain B stores the packet acknowledgement for succesful sends forever. This makes chain B's state linear in the total number of IBC packets ever succesfully sent to it.

This is because the logic on chain A for checking timeouts is "check a chain B block header after timeout height/time. If chain B.Time > packet timeout time AND you show me an exclusion proof stating that this packet did not get into chain B by this height, chain A will timeout the packet." So its unsafe to delete this.

The new idea is to make parameters limiting how long packet ACKs are stored, and how long timeouts can be, which will guarantee that the chain A and chain B no longer have to store all old packets. Now the timeout logic will be "check a chain B block header in this [height/time] range". The range begins at the timeout time, and ends at some "max ACK storage period". I'll point out below edge case handling, but this will

Prior suggested solution

There has been suggestions that after chain B has succesfully received a packet, we can make another packet to chain A + B, that effectively communicates the succesful receipt from chain B to chain A, and then lets chain B delete the idea. This has been written up in issues in IBC-go, calls, and most recently this note: https://hackmd.io/xRlEU2WPQtmXhN00GQSSTQ?view#State-Pruning

It is unclear how to incentivize this final packet, without introducing some sort of economic incentive for fee rebates / state deletion rewards. We can't really make it chain code, due to effectively needing this oracle into another chain. The economic incentive feels ok, but has notable protocol and fee market complexity.

New idea

(NOTE: There are two options, maintain an Acknowledgement storage period and max timeout time delta, or just an Acknowledgement storage period. You can convert between both of them, by just changing the timeout parameters that are checked on each chain. I consider it easier for understanding to define this with two variables first)

We make chain B maintain a "Acknowledgement storage period", and a "Max future timeout time delta". I suggest we parameterize both periods at a default of say 1 week. The unordered channel maintains knowledge of both these parameters, and by construction every channel to chain B must have the channel settings <= chain B's settings.

When I send a packet from chain A to chain B on channel C, chain A and chain B both check that the timeout isn't too far in the future (obeys Max future timeout time delta relative to their local time).

When I want to timeout a packet P with timeout time T, I show any block header on chain B that has timestamp between [T, T+ max_acknowledge_storage_period]

On chain B, you store every packet's acknowledgement from when it is received, until the packets timeout time + max acknowledgement storage period. Since max packet timeout time is bounded, the total time a given ACK is stored before it can be pruned is bounded.

Clock sync concerns

If chain A misbehaves or has clock sync problems, you can have a packet in outbound queue on chain A with a timeout of 1 month, but it won't land on chain B until 3 weeks in (Due to the max timeout delta parameter. This is fine , since it depends on user fault and chain A fault. Also users / dapps should really set lower timeout times and have retry loops.

Offline chain handling

Note we have to deal with one edge case where there is no block on chain B between packet timeout time and packet timeout time + max acknowledgement storage period. There are a few solutions to this. Its straightforward to prove there was no block in this time range. You just give the block headers for two consecutive blocks N, N+1, and look at their timestamps. It suffices to check either block N, or block N+1 in this case for non-inclusion.

In the event we decide to use block N+1, we just have to amend the pruning logic to state that "there has been at least one block after packets timeout time + max acknowledgement storage period before we prune. (Very minor change)

Very old packet pruning (weak subjectivity concerns)

Currently IBC only really stores "recent" client versions. Weak subjectivity means that outside of the bonding period, we can't just rely on checking that the validator set signed something to know an old block is in the chain (unless the old blocks validator set has >= 2/3rds overlap with current). So if your not in the case that the validator set has >= 2/3rds overlap with the current trusted set, then you have to check that some trusted block hashes commits to the given old block. (e.g. its in the block header hashchain)

First thing we should do is implement checking that if the old block has >= 2/3rd overlap with current trusted validator set, then we don't need any additional proofs. (This is the happy path that will most likely be hit)

I see three options for when we are not in that case.

Relayers

So in the happy path, literally nothing changes for relayers. So software works as normal.

The most concerning unhappy path was that a relayer missed timing out a packet until after max_acknowledge_period elapsed. Then they have to query an archive node. However this is already the case for relayers today! So its hopefully a light change to just say that "sufficiently old timeouts need to query an archive".

Relayers can even skip the offline chain handling at version start, since its such a rare problem. (Chain down for 1+ week)

There is complexity in handling for the weak subjectivity problems, but the 2/3rd overlap condition significantly reduces this problem in production. When we are in 2/3rds overlap, ~no change is needed. Relayers would need to have new code for the case where you aren't in 2/3rds overlap, but that also seems like it could come later. The MMR style solution is easiest.

Pedagogy of idea

I think the pedagogy of this idea is fun, so noting it here. We spent awhile thinking about how to fix state growth of IBC, and never had satisfying solutions. E.g. https://github.com/cosmos/ibc-go/issues/2869

In the SNARK account working group, for enabling ZK-rollups, we had been concerned with how to handle replay protection for SNARK accounts while also:

I had pretty quickly proposed Option #4 as written up here - (apologies, I didn't do a great job in that writeup, Didn't spend enough time on it)

@ebuchman had been doing a great job pointing out how all this SNARK account idea should really just be modelled as ICA's, and honing in on that abstraction. (E.g. we solve replay protection there) In chats he also pointed out how this feels very much like unordered txs that we designed in the SDK.

After that, we realized that this idea is really an improvement to unordered channels everywhere in IBC :)

AdityaSripal commented 2 months ago

Thanks @ValarDragon for the detailed writeup!! It seems to me that this proposal would change some IBC guarantees.

This proposal feels very clean in the happy case. I'm a bit more concerned about the workarounds necessary to address the edge cases.

So currently IBC can timeout a packet, or complete its acknowledgement no longer how much time has passed in between communication between the channels (modulo the clients being kept alive). With the current IBC logic, every step of the process can take an arbitrarily long amount of time; so long as the clients are brought back to the latest state, you can always use the latest state to progress an in-flight packet to a valid completion state.

With this new proposal, the acknowledgement will be pruned after the Acknowledgement Storage Period (I presume this means both packet receipt and the acknowledgement.

Thus, the third step of the process, (i.e. a Timeout/Acknowledgement) must be communicated back to the sender chain using a particular window of blocks of the receiver chain.

As you mention above, its possible that the receiver chain is offline and never makes a block in the specified time window.

It's also possible that blocks are being created on the receiver chain, but because of the sender chain being offline, or the relayers being offline, they aren't being added to the sender chain client until after those blocks are already expired and unusable for proof verification.

In the current IBC case, this would be resolved by simply recovering the client and continuing. But with this proposal we are introducing a new requirement of the proof logic.

Not only are we proving things about the latest view of the counterparty state, but we must prove something about the counterparty state in a particular window of time that may already be in the distant past.

Doing this should theoretically be possible... in the worst case, we have to deal with some way of looking at past state that is no longer trusted and building some trusted root back to it as you sketch in the weak subjectivity case.


I would be very in favor of this proposal if it was only the happy case. Given this edge case handling, should we reconsider the additional message option? Perhaps incentivization or just having validators do this isn't that bad?

Open to either, but want to make sure we weigh pros and cons before committing to one path