cosmos / interchain-security

Interchain Security is an open sourced IBC application which allows cosmos blockchains to lease their proof-of-stake security to one another.
https://cosmos.github.io/interchain-security/

Security feedback: Consider implementing module isolation and modular shutdown #627

Open andrey-kuprianov opened 1 year ago

andrey-kuprianov commented 1 year ago

Surfaced from @informalsystems security feedback for Interchain Security, at commit 57b47af1

Problem

Presently, Interchain Security (ICS) takes drastic measures on any encountered error: e.g., it panics in BeginBlocker or EndBlocker (thus halting the chain, provider or consumer), or closes the CCV channel (thus effectively killing a particular consumer chain). A more commensurate response to errors seems possible: a modular shutdown of parts of ICS functionality, with the opportunity of re-enabling them later once the errors are corrected.

Closing criteria

Problem details

There are a number of scenarios in which error handling in ICS takes more drastic measures than necessary, e.g. by panicking in BeginBlocker or EndBlocker on errors that might be transient in nature (see also the related issue ICS#621). This kind of exaggerated response might amplify the damage disproportionately, or convert transient failures into permanent ones. We outline two examples below.

Halting provider chain on the failure to start consumer chain

In x/ccv/provider/keeper/relay.go the following code is present:

// EndBlockCCR contains the EndBlock logic needed for
// the Consumer Chain Removal sub-protocol
func (k Keeper) EndBlockCCR(ctx sdk.Context) {
    currentTime := ctx.BlockTime()
    currentTimeUint64 := uint64(currentTime.UnixNano())

    for _, initTimeoutTimestamp := range k.GetAllInitTimeoutTimestamps(ctx) {
        if currentTimeUint64 > initTimeoutTimestamp.Timestamp {
            // initTimeout expired
            // stop the consumer chain and unlock the unbonding.
            // Note that the CCV channel was not established,
            // thus closeChan is irrelevant
            err := k.StopConsumerChain(ctx, initTimeoutTimestamp.ChainId, false)
            if err != nil {
                panic(fmt.Errorf("consumer chain failed to stop: %w", err))
            }
        }
    }

The above function is called from EndBlock(); thus the provider (e.g. the Cosmos Hub) would halt on the above panic. At the same time, the function StopConsumerChain() starts with the following code:

func (k Keeper) StopConsumerChain(ctx sdk.Context, chainID string, closeChan bool) (err error) {
    // check that a client for chainID exists
    if _, found := k.GetConsumerClientId(ctx, chainID); !found {
        return sdkerrors.Wrap(ccv.ErrConsumerChainNotFound,
            fmt.Sprintf("cannot stop non-existent consumer chain: %s", chainID))
    }

Thus, if the client for the given chainID does not exist (e.g. because the consumer chain never launched), StopConsumerChain() returns an error, and EndBlockCCR() panics. As we can see from the above analysis, halting the provider (Cosmos Hub) is a disproportionately severe response wrt. the event causing it (the failure to start a consumer chain). There is no justification to halt the Cosmos Hub when a particular consumer chain fails to start.
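A more proportionate handling in EndBlockCCR could, for example, log the error and skip the affected consumer instead of panicking. The following is a minimal sketch of that idea, reusing the keeper methods shown above; it only illustrates the suggested direction and is not a proposed patch:

// Sketch: handle a per-consumer failure in EndBlockCCR without halting the provider.
// This reuses the keeper methods shown above; only the error handling differs.
func (k Keeper) EndBlockCCR(ctx sdk.Context) {
    currentTime := ctx.BlockTime()
    currentTimeUint64 := uint64(currentTime.UnixNano())

    for _, initTimeoutTimestamp := range k.GetAllInitTimeoutTimestamps(ctx) {
        if currentTimeUint64 > initTimeoutTimestamp.Timestamp {
            // initTimeout expired: try to stop the consumer chain, but treat a
            // failure as local to this consumer rather than halting the provider.
            err := k.StopConsumerChain(ctx, initTimeoutTimestamp.ChainId, false)
            if err != nil {
                ctx.Logger().Error("failed to stop consumer chain; skipping",
                    "chainID", initTimeoutTimestamp.ChainId, "error", err)
                continue
            }
        }
    }
}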

Halting consumer chain on closing of IBC channel to provider

We see the following code in x/ccv/consumer/module.go:

func (am AppModule) BeginBlock(ctx sdk.Context, req abci.RequestBeginBlock) {
    channelID, found := am.keeper.GetProviderChannel(ctx)
    if found && am.keeper.IsChannelClosed(ctx, channelID) {
        // the CCV channel was established, but it was then closed;
        // the consumer chain is no longer safe

        channelClosedMsg := fmt.Sprintf("CCV channel %q was closed - shutdown consumer chain since it is not secured anymore", channelID)
        ctx.Logger().Error(channelClosedMsg)
        panic(channelClosedMsg)
    }

    blockHeight := uint64(ctx.BlockHeight())
    vID := am.keeper.GetHeightValsetUpdateID(ctx, blockHeight)
    am.keeper.SetHeightValsetUpdateID(ctx, blockHeight+1, vID)

    am.keeper.TrackHistoricalInfo(ctx)
}

In the above example we see again that the response (halting of the consumer chain) is disproportionately severe wrt. the cause (the closing of the IBC channel to the provider).
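If halting is considered too severe here, a softer response could record the closure and let the chain keep running on its current validator set. A minimal sketch of that alternative follows; SetStandaloneMode is a hypothetical keeper flag introduced only for illustration:

// Sketch: record the channel closure instead of panicking, so the consumer
// keeps producing blocks with its current validator set while operators or
// governance decide how to proceed. SetStandaloneMode is a hypothetical flag.
func (am AppModule) BeginBlock(ctx sdk.Context, req abci.RequestBeginBlock) {
    channelID, found := am.keeper.GetProviderChannel(ctx)
    if found && am.keeper.IsChannelClosed(ctx, channelID) {
        ctx.Logger().Error("CCV channel closed; consumer continues on last known validator set",
            "channelID", channelID)
        am.keeper.SetStandaloneMode(ctx, true) // hypothetical keeper method
    }

    blockHeight := uint64(ctx.BlockHeight())
    vID := am.keeper.GetHeightValsetUpdateID(ctx, blockHeight)
    am.keeper.SetHeightValsetUpdateID(ctx, blockHeight+1, vID)

    am.keeper.TrackHistoricalInfo(ctx)
}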

Recommendations

We propose to decompose the ICS protocols and codebase into modules, such that each module can function relatively independently of the others. A reasonable decomposition seems to be:

Note that in the above, the provider should handle many relatively independent tasks for each chain. We can formulate the following desirable property wrt. the interaction of the provider with consumers:

Consumer fault isolation: any faults occurring wrt. one of the consumer chains should not influence the operation of either the provider or the other consumer chains.

Examples of faults to which the above property applies include:

  1. communication failures wrt. one consumer chain (clients expiring, channels closing, error acknowledgments, etc.)
  2. receiving of malformed data from a consumer chain
  3. receiving of large amounts of packets, or of very large packets from a consumer chain
  4. receiving of Slash or VSC maturity packets about events that didn't happen (e.g. downtime, double signing, maturity)
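On the provider side, one way to approach this property is to process each consumer chain inside its own cached context, so that an error or panic triggered by one consumer is logged and discarded without affecting the provider or the other consumers. The helper below is a minimal sketch under that assumption; the helper name and its wiring into EndBlock are not part of the current codebase:

// Sketch: run per-consumer logic in isolation. A failure (error or panic)
// while processing one consumer chain only discards that chain's state
// writes and is logged; the provider and the other consumers are unaffected.
func (k Keeper) forEachConsumerIsolated(
    ctx sdk.Context,
    chainIDs []string,
    process func(sdk.Context, string) error,
) {
    for _, chainID := range chainIDs {
        // Cache the context so partial writes can be dropped on failure.
        cacheCtx, writeCache := ctx.CacheContext()

        err := func() (err error) {
            defer func() {
                if r := recover(); r != nil {
                    err = fmt.Errorf("panic while processing consumer %s: %v", chainID, r)
                }
            }()
            return process(cacheCtx, chainID)
        }()

        if err != nil {
            ctx.Logger().Error("skipping consumer chain due to error", "chainID", chainID, "error", err)
            continue
        }
        // Commit this consumer's state changes only on success.
        writeCache()
    }
}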

ICS provider code is much more complex than ICS consumer code: by size alone, it is ~21 KLOC of Go for the provider vs. ~7 KLOC of Go for the consumer. Moreover, the provider essentially operates in a concurrent setting, communicating with multiple consumers; thus the provider code is more likely to contain bugs. It is therefore important that consumer chains do not immediately take harmful actions (like halting) if they can still operate normally for some time (e.g. by employing the current validator set). Thus, another desirable property seems to be:

Provider fault isolation: any faults occurring on the provider should not influence the operation of consumer chains unless absolutely necessary; in that case the damage to consumer chains should be minimized and, preferably, reversible.

Examples of faults that may originate from the provider:

  1. communication failures (clients expiring, channels closing, error acknowledgments, etc.)
  2. receiving of malformed data from the provider chain
  3. receiving of large amounts of packets, or of very large packets from the provider chain
  4. receiving of VSC packets with validator set changes that didn't happen
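On the consumer side, this property suggests that when the CCV module detects a provider-side fault it should stop applying validator-set updates but keep the chain running. A minimal sketch, assuming the same hypothetical IsStandaloneMode flag as above and a placeholder for the existing consumer EndBlock logic:

// Sketch: keep the consumer running on its current validator set while the
// CCV module is degraded. Returning no validator updates from EndBlock leaves
// the existing set unchanged. IsStandaloneMode is the hypothetical flag from
// the BeginBlock sketch; applyProviderValidatorUpdates stands in for the
// existing consumer EndBlock logic.
func (am AppModule) EndBlock(ctx sdk.Context, req abci.RequestEndBlock) []abci.ValidatorUpdate {
    if am.keeper.IsStandaloneMode(ctx) {
        ctx.Logger().Info("CCV degraded: retaining current validator set")
        return []abci.ValidatorUpdate{}
    }
    return am.applyProviderValidatorUpdates(ctx)
}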

Note that for both of the above properties, examples 1-3 can be handled locally by the provider or the consumer, respectively; the correct approach for handling them is defensive programming, which is outside the scope of this issue and will be discussed separately. On the other hand, example 4 in both lists cannot be (fully) alleviated locally, because the present setup assumes complete trust between the provider and the consumers. The topic of evidence handling is also outside the scope of this issue and will be discussed later.

Coming back to modules and their isolation, it is highly desirable that, for each of the above modules, its boundaries and contracts with other modules are thoroughly understood and specified. For each ICS module, the conditions need to be formulated under which:

Whenever some new condition from the last three items of the above list is met, the following always needs to be done:

Additionally:

Special measures allowing a module to be re-enabled after a permanent shutdown may consist of:
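For example, one building block for such measures could be a persistent per-module enabled flag that a governance proposal (or a fix of the underlying problem) can flip back on. A minimal sketch, using hypothetical store keys and helper names:

// Sketch: per-module enabled flag stored in the module's KV store. Modules
// check the flag in BeginBlock/EndBlock and skip their logic while disabled;
// a governance handler could call SetModuleEnabled(ctx, name, true) to
// re-enable the module once the underlying problem is resolved.
func moduleEnabledKey(moduleName string) []byte {
    return []byte("moduleEnabled/" + moduleName) // hypothetical key layout
}

func (k Keeper) SetModuleEnabled(ctx sdk.Context, moduleName string, enabled bool) {
    store := ctx.KVStore(k.storeKey)
    v := byte(0)
    if enabled {
        v = 1
    }
    store.Set(moduleEnabledKey(moduleName), []byte{v})
}

func (k Keeper) IsModuleEnabled(ctx sdk.Context, moduleName string) bool {
    store := ctx.KVStore(k.storeKey)
    bz := store.Get(moduleEnabledKey(moduleName))
    // Default to enabled if the flag has never been set.
    return bz == nil || bz[0] == 1
}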

TODOs

mpoke commented 1 year ago

Halting consumer chain on closing of IBC channel to provider

@andrey-kuprianov Once the IBC channel to the provider is closed, the validator set on the consumer no longer has stake in the game, i.e., the PoS mechanism no longer works. That means that the consumer is no longer safe. As a result, we halt the consumer chain.