Open andrey-kuprianov opened 1 year ago
Halting consumer chain on closing of IBC channel to provider
@andrey-kuprianov Once the IBC channel to the provider is closed, the validator set on the consumer has no longer stake in the game, i.e., the PoS mechanism no longer works. That means that the consumer is no longer safe. As a result, we halt the consumer chain.
Surfaced from @informalsystems security feedback for Interchain Security, at commit 57b47af1
Problem
Presently Interchain Security (ICS) takes drastic measures on any encountered error: e.g., panics in
BeginBlocker
orEndBlocker
(thus halting the chain, provider or consumer), or closes the channel (thus effectively killing a particular consumer chain). A more commensurate response to errors seems to be possible, which involves modular shutdown of parts of ICS functionality, with the opportunity of re-enabling them later in case the errors are corrected.Closing criteria
Problem details
There are a number of scenarios when error handling in ICS takes more drastic measures than necessary, e.g. via panicking in
BeginBlocker
orEndBlocker
on encountering some errors that might have a transient nature (see also the related issue ICS#621). This kind of exaggerated response might amplify the damage disproportionally, or convert transient failures into permanent ones. We outline two examples below.Halting provider chain on the failure to start consumer chain
In x/ccv/provider/keeper/relay.go the following code is present:
The above function is called from EndBlock(), thus a provider (Cosmos Hub) would halt on the above panic. On the other hand, the function StopConsumerChain() starts with the following code:
As we can see from the above analysis, halting of the provider (Cosmos Hub) is a disproportionally severe response wrt. the event causing it (the failure to start a consumer chains). There is no justification to halt the Cosmos Hub when a particular consumer chain fails to start.
Halting consumer chain on closing of IBC channel to provider
We see the following code in x/ccv/consumer/module.go:
In the above example we see again that the response (halting of consumer chain) is disproportionately severe wrt. the cause (closing of IBC channel):
Recommendations
We propose to decompose the ICS protocols and codebase into modules, such that each module can function relatively independently from the others. A reasonable decomposition seems to be:
Note that in the above the provider should handle many relatively independent tasks for each chain. We can formulate the following desirable property wrt. interaction of provider with consumers:
Consumer fault isolation: any faults occurring wrt. one of consumer chains, should not influence the operation of either the provider or the other consumer chains.
Examples of faults to which the above property applies include:
ICS provider code is much more complex than ICS consumer code: from the size alone, it's ~21 KLOC of Go for provider vs ~7 KLOC of Go for consumer. Moreover, a provider is essentially operating under a concurrent setting, communicating with multiple consumers; thus the provider code is more likely to contain bugs. It is thus necessary for consumer chains not to take any harmful actions (like halting) immediately, if they still can operate normally for some time (e.g. employing the current validator set). Thus, another desirable property seems to be:
Provider fault isolation: any faults occurring on provider, should not influence the operation of consumer chains unless absolutely necessary, in which case the damage to consumer chains should be minimized, and, preferably, reversible.
Examples of faults that may originate from the provider:
Note that for both of the above properties, examples 1-3 can be handled locally by provider or consumer respectively; the correct approach for handling them is defensive programming, which is outside of the scope of this issue, and will be discussed separately. On the other hand, both examples 4 above can't be (fully) alleviated locally, because the present setup assumes complete trust between provider and consumers. The topic of evidence handling is also outside of the scope of this issue, and will be discussed later.
Coming back to modules and their isolation, it is highly desirable that for each of the above modules its boundaries and contracts with other modules are thoroughly understood and specified. For each ICS module the conditions need to be formulated under which:
Whenever some new condition from the last three items of the above list is met, the following needs to be done always:
Additionally:
Special measures allowing to reenable a module after a permanent shutdown may consist of:
TODOs