cosmos / interchain-security

Interchain Security is an open sourced IBC application which allows cosmos blockchains to lease their proof-of-stake security to one another.
https://cosmos.github.io/interchain-security/

Security feedback: Consider implementing module isolation and modular shutdown #627

Open andrey-kuprianov opened 1 year ago

andrey-kuprianov commented 1 year ago

Surfaced from @informalsystems security feedback for Interchain Security, at commit 57b47af1

Problem

Presently, Interchain Security (ICS) takes drastic measures on any encountered error: e.g., it panics in BeginBlocker or EndBlocker (thus halting the chain, provider or consumer), or closes the CCV channel (thus effectively killing a particular consumer chain). A more commensurate response to errors seems possible: a modular shutdown of parts of ICS functionality, with the opportunity of re-enabling them later once the errors are corrected.

Closing criteria

Problem details

There are a number of scenarios in which error handling in ICS takes more drastic measures than necessary, e.g. by panicking in BeginBlocker or EndBlocker on errors that might be transient in nature (see also the related issue ICS#621). This kind of exaggerated response might amplify the damage disproportionately, or convert transient failures into permanent ones. We outline two examples below.

Halting provider chain on the failure to start consumer chain

In x/ccv/provider/keeper/relay.go the following code is present:

// EndBlockCCR contains the EndBlock logic needed for
// the Consumer Chain Removal sub-protocol
func (k Keeper) EndBlockCCR(ctx sdk.Context) {
    currentTime := ctx.BlockTime()
    currentTimeUint64 := uint64(currentTime.UnixNano())

    for _, initTimeoutTimestamp := range k.GetAllInitTimeoutTimestamps(ctx) {
        if currentTimeUint64 > initTimeoutTimestamp.Timestamp {
            // initTimeout expired
            // stop the consumer chain and unlock the unbonding.
            // Note that the CCV channel was not established,
            // thus closeChan is irrelevant
            err := k.StopConsumerChain(ctx, initTimeoutTimestamp.ChainId, false)
            if err != nil {
                panic(fmt.Errorf("consumer chain failed to stop: %w", err))
            }
        }
    }

The above function is called from EndBlock(); thus the provider (e.g. the Cosmos Hub) would halt on the above panic. At the same time, the function StopConsumerChain() starts with the following code:

func (k Keeper) StopConsumerChain(ctx sdk.Context, chainID string, closeChan bool) (err error) {
    // check that a client for chainID exists
    if _, found := k.GetConsumerClientId(ctx, chainID); !found {
        return sdkerrors.Wrap(ccv.ErrConsumerChainNotFound,
            fmt.Sprintf("cannot stop non-existent consumer chain: %s", chainID))
    }

Thus, if the client for the given chainID does not exist (e.g. because the consumer chain never launched), StopConsumerChain() returns an error, and EndBlockCCR() panics. As we can see from the above analysis, halting the provider (Cosmos Hub) is a disproportionately severe response wrt. the event causing it (the failure to start a consumer chain). There is no justification to halt the Cosmos Hub when a particular consumer chain fails to start.
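A more proportionate handling in EndBlockCCR could, for example, log the error and skip the affected consumer instead of panicking. The following is a minimal sketch of that idea, reusing the keeper methods shown above; it only illustrates the suggested direction and is not a proposed patch:

// Sketch: handle a per-consumer failure in EndBlockCCR without halting the provider.
// This reuses the keeper methods shown above; only the error handling differs.
func (k Keeper) EndBlockCCR(ctx sdk.Context) {
    currentTime := ctx.BlockTime()
    currentTimeUint64 := uint64(currentTime.UnixNano())

    for _, initTimeoutTimestamp := range k.GetAllInitTimeoutTimestamps(ctx) {
        if currentTimeUint64 > initTimeoutTimestamp.Timestamp {
            // initTimeout expired: try to stop the consumer chain, but treat a
            // failure as local to this consumer rather than halting the provider.
            err := k.StopConsumerChain(ctx, initTimeoutTimestamp.ChainId, false)
            if err != nil {
                ctx.Logger().Error("failed to stop consumer chain; skipping",
                    "chainID", initTimeoutTimestamp.ChainId, "error", err)
                continue
            }
        }
    }
}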

Halting consumer chain on closing of IBC channel to provider

We see the following code in x/ccv/consumer/module.go:

func (am AppModule) BeginBlock(ctx sdk.Context, req abci.RequestBeginBlock) {
    channelID, found := am.keeper.GetProviderChannel(ctx)
    if found && am.keeper.IsChannelClosed(ctx, channelID) {
        // the CCV channel was established, but it was then closed;
        // the consumer chain is no longer safe

        channelClosedMsg := fmt.Sprintf("CCV channel %q was closed - shutdown consumer chain since it is not secured anymore", channelID)
        ctx.Logger().Error(channelClosedMsg)
        panic(channelClosedMsg)
    }

    blockHeight := uint64(ctx.BlockHeight())
    vID := am.keeper.GetHeightValsetUpdateID(ctx, blockHeight)
    am.keeper.SetHeightValsetUpdateID(ctx, blockHeight+1, vID)

    am.keeper.TrackHistoricalInfo(ctx)
}

In the above example we see again that the response (halting of the consumer chain) is disproportionately severe wrt. the cause (the closing of the IBC channel to the provider).
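If halting is considered too severe here, a softer response could record the closure and let the chain keep running on its current validator set. A minimal sketch of that alternative follows; SetStandaloneMode is a hypothetical keeper flag introduced only for illustration:

// Sketch: record the channel closure instead of panicking, so the consumer
// keeps producing blocks with its current validator set while operators or
// governance decide how to proceed. SetStandaloneMode is a hypothetical flag.
func (am AppModule) BeginBlock(ctx sdk.Context, req abci.RequestBeginBlock) {
    channelID, found := am.keeper.GetProviderChannel(ctx)
    if found && am.keeper.IsChannelClosed(ctx, channelID) {
        ctx.Logger().Error("CCV channel closed; consumer continues on last known validator set",
            "channelID", channelID)
        am.keeper.SetStandaloneMode(ctx, true) // hypothetical keeper method
    }

    blockHeight := uint64(ctx.BlockHeight())
    vID := am.keeper.GetHeightValsetUpdateID(ctx, blockHeight)
    am.keeper.SetHeightValsetUpdateID(ctx, blockHeight+1, vID)

    am.keeper.TrackHistoricalInfo(ctx)
}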

Recommendations

We propose to decompose the ICS protocols and codebase into modules, such that each module can function relatively independently of the others. A reasonable decomposition seems to be:

Note that in the above, the provider should handle many relatively independent tasks for each chain. We can formulate the following desirable property wrt. the interaction of the provider with consumers:

Consumer fault isolation: any faults occurring wrt. one of the consumer chains should not influence the operation of either the provider or the other consumer chains.

Examples of faults to which the above property applies include:

  1. communication failures wrt. one consumer chain (clients expiring, channels closing, error acknowledgments, etc.)
  2. receiving of malformed data from a consumer chain
  3. receiving of large amounts of packets, or of very large packets from a consumer chain
  4. receiving of Slash or VSC maturity packets about events that didn't happen (e.g. downtime, double signing, maturity)
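On the provider side, one way to approach this property is to process each consumer chain inside its own cached context, so that an error or panic triggered by one consumer is logged and discarded without affecting the provider or the other consumers. The helper below is a minimal sketch under that assumption; the helper name and its wiring into EndBlock are not part of the current codebase:

// Sketch: run per-consumer logic in isolation. A failure (error or panic)
// while processing one consumer chain only discards that chain's state
// writes and is logged; the provider and the other consumers are unaffected.
func (k Keeper) forEachConsumerIsolated(
    ctx sdk.Context,
    chainIDs []string,
    process func(sdk.Context, string) error,
) {
    for _, chainID := range chainIDs {
        // Cache the context so partial writes can be dropped on failure.
        cacheCtx, writeCache := ctx.CacheContext()

        err := func() (err error) {
            defer func() {
                if r := recover(); r != nil {
                    err = fmt.Errorf("panic while processing consumer %s: %v", chainID, r)
                }
            }()
            return process(cacheCtx, chainID)
        }()

        if err != nil {
            ctx.Logger().Error("skipping consumer chain due to error", "chainID", chainID, "error", err)
            continue
        }
        // Commit this consumer's state changes only on success.
        writeCache()
    }
}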

ICS provider code is much more complex than ICS consumer code: by size alone, it is ~21 KLOC of Go for the provider vs. ~7 KLOC of Go for the consumer. Moreover, the provider essentially operates in a concurrent setting, communicating with multiple consumers; thus the provider code is more likely to contain bugs. It is therefore important that consumer chains do not immediately take harmful actions (like halting) if they can still operate normally for some time (e.g. by employing the current validator set). Thus, another desirable property seems to be:

Provider fault isolation: any faults occurring on the provider should not influence the operation of consumer chains unless absolutely necessary; in that case the damage to consumer chains should be minimized and, preferably, reversible.

Examples of faults that may originate from the provider:

  1. communication failures (clients expiring, channels closing, error acknowledgments, etc.)
  2. receiving of malformed data from the provider chain
  3. receiving of large amounts of packets, or of very large packets from the provider chain
  4. receiving of VSC packets with validator set changes that didn't happen
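On the consumer side, this property suggests that when the CCV module detects a provider-side fault it should stop applying validator-set updates but keep the chain running. A minimal sketch, assuming the same hypothetical IsStandaloneMode flag as above and a placeholder for the existing consumer EndBlock logic:

// Sketch: keep the consumer running on its current validator set while the
// CCV module is degraded. Returning no validator updates from EndBlock leaves
// the existing set unchanged. IsStandaloneMode is the hypothetical flag from
// the BeginBlock sketch; applyProviderValidatorUpdates stands in for the
// existing consumer EndBlock logic.
func (am AppModule) EndBlock(ctx sdk.Context, req abci.RequestEndBlock) []abci.ValidatorUpdate {
    if am.keeper.IsStandaloneMode(ctx) {
        ctx.Logger().Info("CCV degraded: retaining current validator set")
        return []abci.ValidatorUpdate{}
    }
    return am.applyProviderValidatorUpdates(ctx)
}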

Note that for both of the above properties, examples 1-3 can be handled locally by the provider or the consumer, respectively; the correct approach for handling them is defensive programming, which is outside the scope of this issue and will be discussed separately. On the other hand, example 4 in both lists cannot be (fully) alleviated locally, because the present setup assumes complete trust between the provider and the consumers. The topic of evidence handling is also outside the scope of this issue and will be discussed later.

Coming back to modules and their isolation, it is highly desirable that, for each of the above modules, its boundaries and contracts with other modules are thoroughly understood and specified. For each ICS module, the conditions need to be formulated under which:

Whenever some new condition from the last three items of the above list is met, the following always needs to be done:

Additionally:

Special measures allowing a module to be re-enabled after a permanent shutdown may consist of:
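For example, one building block for such measures could be a persistent per-module enabled flag that a governance proposal (or a fix of the underlying problem) can flip back on. A minimal sketch, using hypothetical store keys and helper names:

// Sketch: per-module enabled flag stored in the module's KV store. Modules
// check the flag in BeginBlock/EndBlock and skip their logic while disabled;
// a governance handler could call SetModuleEnabled(ctx, name, true) to
// re-enable the module once the underlying problem is resolved.
func moduleEnabledKey(moduleName string) []byte {
    return []byte("moduleEnabled/" + moduleName) // hypothetical key layout
}

func (k Keeper) SetModuleEnabled(ctx sdk.Context, moduleName string, enabled bool) {
    store := ctx.KVStore(k.storeKey)
    v := byte(0)
    if enabled {
        v = 1
    }
    store.Set(moduleEnabledKey(moduleName), []byte{v})
}

func (k Keeper) IsModuleEnabled(ctx sdk.Context, moduleName string) bool {
    store := ctx.KVStore(k.storeKey)
    bz := store.Get(moduleEnabledKey(moduleName))
    // Default to enabled if the flag has never been set.
    return bz == nil || bz[0] == 1
}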

TODOs

mpoke commented 1 year ago

Halting consumer chain on closing of IBC channel to provider

@andrey-kuprianov Once the IBC channel to the provider is closed, the validator set on the consumer no longer has stake in the game, i.e., the PoS mechanism no longer works. That means that the consumer is no longer safe. As a result, we halt the consumer chain.