ICS28: Timeout behaviour

mpoke commented 2 years ago

What should be the CCV behaviour in the case the assumption of a Correct Relayer is violated and a packet does time out? In the current version if IBC, since the CCV channel is ordered, a packet timeout results in the CCV channel being closed.

Closing the CCV channel has the following implications.

On the consumer chain,
- the validator set can no longer be updated (i.e., no VSCPackets);
- the validator set cannot be slashed (i.e., no SlashPackets).
Thus, the validator set can no longer be trusted and the consumer chain should shut down.
Note that the provider chain will close its own endpoint either on timeoutOnClose or in the worst case, once its own sent packets timeout.
On the provider chain,
- all the initiated unbonding operations (e.g., undelegations) that still wait for the UnbondingPeriod to elapse on the consumer chain will never complete.
Consequently, some tokens will be permanently locked on the provider chain.

On the one hand, locking these tokens would incentivise validators to relay packets in order to avoid timeouts (i.e., ensures that the Correct Relayer assumption is practical). On the other hand, these tokens are usually staked by delegators. This means that validators could e.g., opt not to relay packets in order to stop delegators from reducing their voting power. In other words, locking these tokens would punish delegators and not validators.

mpoke commented 2 years ago

What's the attack scenario if the provider chain will no longer wait for a consumer chain that timed out, i.e., just consider all unbonding operatios to have matured on that consumer chain?

Possible scenario: A validator misbehave on the consumer, but it doesn't want to be slashed for it. It unbonds all its stake on the provider (Can it actually do that? Can a validator jail itself?). It stops relaying packets and hopes that nobody else does. Eventually, the validator relays that the packet timed out and the channel closes.

I find this very unlikely to happen, since it entails there is nobody to relay (i.e., not a single correct validator).

I think that it's more likely that due to some non-malicious event (e.g., the network is overloaded), a packet may timeout. I don't think we should punish the delegators for that.

mpoke commented 2 years ago

1/3 of validators could timeout a channel through an eclipse attack. However, CCV clearly works under of the assumption that less than 1/3 are Byzantine.

josef-widder commented 2 years ago

Thanks for opening this issue! I think this is a central question in IBC: Who is responsible for relaying?

I think the timeout is currently set to something close to the unbonding period. To my understanding, a timeout (not relaying a packet for two weeks) in the interchain is thus not due to overloads or disconnections etc. If the system is overloaded for two weeks, we have more severe issues. The reason for a timeout is that no-one cares to relay a packet or there is one who doesn't understand that it is in their interest to relay the packet.

We should make explicit that whoever has interest in an validator set change (e.g., validator, delegator) in the limit is responsible to make the operation complete, that is, get the packets relayed. (In my view, validator operators should run relayers). To my understanding IBC is designed in this open way on purpose: anyone who wants a packet delivered can relay it.

On the on the other hand, returning stake when chains got disconnected due to timeout is definitely unsafe.

So I think we should keep stake locked when a channel closes and prepare provisions within CCV for social consensus (governance) in case of a timeout to

returned locked stake
reconnect chains

josef-widder commented 2 years ago

Also we should keep in mind that in an ordered channel, a timeout does not only affect one packet A. All packet's that were send after A also do not make it, and everyone who is interested in relaying a packet B (that was sent after A), also is interested in relaying A.

mpoke commented 2 years ago

On the on the other hand, returning stake when chains got disconnected due to timeout is definitely unsafe.

Why do you say that? Do you have a possible attack in mind?

Not returning the stake I see it as a method to punish (i.e., disincentivize) certain behaviour. The questions are what behaviour are we punishing and whom are we punishing.

What behaviour are we punishing? IMO, that's hard to tell. Unlike other misbehaviours (e.g., double signing), one validator not relaying will not affect the system; you need all validators to stop relaying for a timeout to happen. Thus, I'd rather go in the direction of incentivising validators to relay. What are validators incentivised by... money 🤑, i.e., rewards which are proportional to the voting power (see below).

Whom are we punishing? I believe that by not returning the stake we are not punishing the validator operators, but rather the delegators. We shouldn't expect delegators to run relayers.

Also we should keep in mind that in an ordered channel, a timeout does not only affect one packet A. All packet's that were send after A also do not make it, and everyone who is interested in relaying a packet B (that was sent after A), also is interested in relaying A.

I agree. That's why I don't think it's feasible to intentionally timeout a packet (as long as the 1/3 assumption holds). The CCV channel is bidirectional - the provider sends VSCPackets and the consumer sends SlashPackets and MaturedVSCPackets.

AVSCPacket timing out entails that for the timeout period (~ 3-4 weeks) the consumer receives no VSCPacket. However, validator operators are directly incentivised to relay those VSCPacket's with increases in their voting power (i.e., that result from a Delegate or Redelegate message). It's very unlikely that no such messages occur during the timeout period.
The problem with SlashPackets and MaturedVSCPackets is that there is no financial incentive for validator operators to relay these packets (besides being "good" validators). MaturedVSCPackets reduce their voting power, while SlashPackets slash their stake, which reduce their voting power.

If we were to add also DistributionPackets with the distribution of consumer chain rewards to the provider chain validators, then clearly validator operators would be incentivised to relay those. Note that at the moment the consumer chain periodically sends tokens over a separate transfer channel; however, there were discussion to also send a packet with the weights for all validators to account for proposing blocks (i.e., what portion of the rewards is received by each validator).

In general though, I think validator operators would be incentivised to relay both ways by the consumer chain rewards. If the channel times out, no more consumer chain, and thus, no more rewards for any of the operators.

mpoke commented 2 years ago

@jtremback @okwme Any views on this issue from a business perspective?

josef-widder commented 2 years ago

Why do you say that? Do you have a possible attack in mind?

I guess we would like to ensure a property somewhat like:

liveness: If the consumer chain sends evidence to the provider chain, then a slashing event should eventually happen on the provider chain
safety: If stake S is paid back on the provider chain, then no evidence should be recorded (before) on the consumer chain that would have reduced S.

As far as I understand, both properties cannot be achieved if we unlock stake on timeout.

mpoke commented 2 years ago

Why do you say that? Do you have a possible attack in mind?

I guess we would like to ensure a property somewhat like:

liveness: If the consumer chain sends evidence to the provider chain, then a slashing event should eventually happen on the provider chain

safety: If stake S is paid back on the provider chain, then no evidence should be recorded (before) on the consumer chain that would have reduced S.

As far as I understand, both properties cannot be achieved if we unlock stake on timeout.

Yeah, but some of the CCV properties (including the ones mentioned by you) rely on the Correct Relayer assumption. The spec is written from the perspective that this assumption holds. IMO handling timeouts is out of the scope of the specification, at least out of the scope of the properties. The question is: What can we do to ensure (at least with a high probability) that the Correct Relayer assumption holds.

josef-widder commented 2 years ago

I think CCV will be used if someone wants to earn money from running a second chain. Thus the incentives are aligned to ensure the correct relayer assumption: you need to keep the channel open, e.g.,

to transfer the fees/gains back to the provider chain
to receive the money back after delegating

In my view, in the unlikely case of channel closing, we should still ensure safety. I would guess that these cases are so rare that we can rely on governance to eventually figure out what to do and postpone liveness to governance intervention.

jtremback commented 2 years ago

The question of whether someone bothers to relay a packet is a distraction, IMO. We should assume that someone attempts to relay all packets*. The real question is whether a validator set censors packets.

The following scenarios are based on a model where there may be differences in the provider and consumer chain validator sets. They do not make as much sense for the 100% overlap v1.

A. Here is an attack that is possible if tokens are automatically released on channel timeout:

The consumer chain validators censor any incoming packets on the channel, and/or do not produce any more acks.
They wait for the channel to time out.
Now they control the consumer chain and they have gotten their tokens back on the provider chain, and can act with impunity, including double signing** or other slashable offenses.

B. Here is an attack that is possible if tokens are NOT automatically released on channel timeout:

The consumer chain validators censor any incoming packets on the channel (for instance a packet which reduces their power).
They wait for the channel to time out.
Now they continue to earn staking rewards on the provider chain, even thought they cannot unbond their packets. They are still slashable for double signing, but unjustly control the consumer chain.

C. The most "correct" way to handle this is to slash 100% of the consumer chain validator's packets if the channel times out. Of course, this would result in destruction of the provider chain in v1.

I think that the safest way for us to handle this right now is keeping tokens locked if the channel times out, like @josef-widder suggests. However, I suspect that given the 100% overlap of validator sets in v1, it should be possible to relax this and maybe allow tokens to be automatically unlocked. However, I would like to see a more rigorous analysis of this***.

* I wonder if the heavy packet load we are intending for v1 (once a block) makes this less of a safe assumption than it was before. Can we waive the gas fee for ccv packets?

** In my analysis of double signing here, I am assuming that we would have code that allowed the provider chain to independently verify double signing evidence.

*** I think maybe we need a more rigorous analysis of v1's 100% validator set overlap across the board.

mpoke commented 2 years ago

The question of whether someone bothers to relay a packet is a distraction, IMO. We should assume that someone attempts to relay all packets*.

We already do that through the Correct Relayer assumption.

I wonder if the heavy packet load we are intending for v1 (once a block) makes this less of a safe assumption than it was before.

Each consumer chain entails two extra transactions per block (on each side) except slashing which should be rare. The consumer receives in each block a VSCPacket and an ACK for a VSCMaturedPacket . The provider receives in each block an ACK for a VSCPacket and a VSCMaturedPacket. (I assume that in every block there is a change in the val set).

Can we waive the gas fee for ccv packets?

I don't think so, since it would enable DOS attacks.

The real question is whether a validator set censors packets.

It needs at least 1/3 of the voting power. If that happens, the light clients cannot trust headers. I think we need to assume < 1/3 Byzantine voting power.

The following scenarios are based on a model where there may be differences in the provider and consumer chain validator sets. They do not make as much sense for the 100% overlap v1.

Let's focus on the next versions afterwards. Once we move away from V1, many things change. I find it difficult to discuss possible attacks for a system that I don't yet know how it will look like or what properties it will have.

I think that the safest way for us to handle this right now is keeping tokens locked if the channel times out, like @josef-widder suggests.

That will be the safest indeed. Not yet convinced that it's necessary, but if nobody complains (e.g., Cosmos Hub validators), then we can go with it.

However, I suspect that given the 100% overlap of validator sets in v1, it should be possible to relax this and maybe allow tokens to be automatically unlocked. However, I would like to see a more rigorous analysis of this. I think maybe we need a more rigorous analysis of v1's 100% validator set overlap across the board.

What do you mean by a more rigorous analysis of the validator set overlap? Do you have something specific in mind?

mpoke commented 2 years ago

In my view, validator operators should run relayers

@josef-widder If every operator relays, we'll get 300 IBC packets per block, and only 2 are actually needed. And this is per consumer chain.

To my understanding IBC is designed in this open way on purpose: anyone who wants a packet delivered can relay it.

Then we need to shift the discussion on who wants what relayed.

jtremback commented 2 years ago

Just discussed this with @mpoke. The main issue is that our protocol expects altruistic relaying, and it currently generates 2-3 packets per block, per consumer chain. This high(?) traffic volume, combined with the lack of relayer incentivization makes it more likely that the correct relayer assumption will be broken.

@AdityaSripal has previously pointed out that compared to the altruistic relaying load currently, the additional load from CCV should be minor. If the altruistic relaying problem for transfers is fixed, however, CCV will remain a vestige of the problem. Some information that would inform this discussion is- How much will it cost to relay a CCV packet? Do you have any guesses @AdityaSripal?

Possible courses of action:

Do nothing: We ship the protocol as is, with 2-3 packets per consumer chain per block. Some relayer must relay altruistically otherwise there will be problems. However, maybe this is not a big deal compared to other altruistic relaying that already is needed?
Reduce packet frequency: We make it so that CCV sends packets at a much lower frequency, maybe once an hour or once a day. The altruistic relaying problem still exists, but it is hundreds or thousands of times cheaper (in terms of gas fee) for the altruistic relayer.
Refund gas fee: Somehow we make it so that valid CCV packets do not cost any gas when they are relayed. This must be crafted so that CCV packets that have already been relayed do cost gas, otherwise we are vulnerable to DOS. This would reduce the gas cost of the altruistic relayer to 0, although there should be coordination so that multiple relayers do not try to relay the same packets, while still allowing for redundancy.
Relayer incentivization: Through some mechanism, we make it so that relayers of CCV packets are paid with inflated tokens from one or both of the chains. Depending on how much the payment is, we might end up with races, with hundreds of relayers competing to relay each CCV packet, risking the gas cost of a failed relay for the potential fee (if it is high). However, on further thought I think it is more likely that validators will do the relaying, but only when they are the proposer. Being the proposer, they can put their CCV packet in first and get the reward.

IMO, ultimately, something like 4 is probably the most comprehensive solution, but needs more work. 2 may be more feasible, but depending on what the real costs are, we may be able to launch with 1.

peterbourgon commented 2 years ago

I think that it's more likely that due to some non-malicious event (e.g., the network is overloaded), a packet may timeout. I don't think we should punish the delegators for that.

+1.

The conversation here seems focused on malicious and/or deliberate timeouts induced by bad actors in the network. But unintentional timeouts due to network, host, configuration, etc. errors are the (much) more common case. Network errors like timeouts are normal, and should be expected and accommodated by higher layers.

jtremback commented 2 years ago

Thinking about it again, I think we should allow all tokens to unbond and clean up other consumer chain state when the channel times out. This does potentially allow some kind of attack where the validators intentionally censor packets to let the channel time out, but given that we are talking about a 100% validator set overlap between provider and consumer, it's hard to imagine what this attack would be.

The option suggested by @josef-widder, where the bonded tokens stay locked until some active governance action is taken, is definitely the safer option, but would result in a much longer effective unbonding period (governance period + channel timeout) for some people and I think we need to have a really good reason to do this.

mpoke commented 2 years ago

The current consensus is to support both options: Introduce a CCV parameter (i.e., lockUnbondingOnTimeout) that indicates whether the funds corresponding to the initiated unbonding operations are released in case of a timeout. By default, this parameter would be set such that the funds are releases, i.e., lockUnbondingOnTimeout = false. In case the provider chain would like higher security for a certain consumer chain, it can lockUnbondingOnTimeout to true for that chain (i.e., lockUnbondingOnTimeout is per consumer chain).

This would be combined with a mechanism that allows the provider chain to remove a consumer chain through governance proposals. In case lockUnbondingOnTimeout == true, a governance proposal to remove the timed out consumer chain would results in the funds being released. For more details, see https://github.com/cosmos/ibc/issues/651.

@josef-widder @jovankomatovic @jtremback @AdityaSripal What do you think about this approach?

josef-widder commented 2 years ago

Sounds good. I would just set the default to lockUnbondingOnTimeout = true. By this we make sure that if one goes for the less safe option they need to do it actively, and know what they are doing.

jtremback commented 2 years ago

Sorry I didn't stay up on this issue, but it seems to me that making this a variable has just kicked the can down the road. We don't actually have consensus on which one is the "safer" option, and which should be the default. Also, we have now have to support more complicated code that handles both.

I personally think that lockUnbondingOnTimeout = false should be the default, and maybe only supported option. This provides the safeguard that a halted consumer chain (the most common failure case) can not mess up the provider too badly.

I think the alternative, that the a malicious 1/3+ of the validator set stops packets being sent from the consumer, but also doesn't use that same power to halt or censor the provider as well, is much more of an edge case.

jtremback commented 2 years ago

I think we should probably only have the lockUnbondingOnTimeout = false behavior, and I will probably remove lockUnbondingOnTimeout = true from the implementation until this can be more clearly resolved.

mpoke commented 2 years ago

it seems to me that making this a variable has just kicked the can down the road. We don't actually have consensus on which one is the "safer" option, and which should be the default.

IMO, lockUnbondingOnTimeout = true is the safer option, and the default should be lockUnbondingOnTimeout = false. When a chain has lockUnbondingOnTimeout = true, after a timeout, a StopConsumerChainProposal gov proposal will unlock all the pending unbonding ops.

Also, we have now have to support more complicated code that handles both.

I disagree, see https://github.com/cosmos/interchain-security/issues/261#issuecomment-1211800343

I personally think that lockUnbondingOnTimeout = false should be the default, and maybe only supported option. This provides the safeguard that a halted consumer chain (the most common failure case) can not mess up the provider too badly.

lockUnbondingOnTimeout only cover the case when the consumer chain shuts down as a result of a failure. If the consumer shuts down due to a failure, the state on the consumer is not cleaned and the pending unbondings will be locked until a StopConsumerChainProposal will pass governance (or the consumer restarts).

mpoke commented 2 years ago

@jtremback Do we still need this issue opened? Are you in agreement re. a solution?

cosmos / ibc

ICS28: Timeout behaviour #669