cosmos / interchain-security

Interchain Security is an open-source IBC application that allows Cosmos blockchains to lease their proof-of-stake security to one another.
https://cosmos.github.io/interchain-security/

Trustless downtime slashing #761

Open mpoke opened 1 year ago

mpoke commented 1 year ago

Problem

We would like the provider chain to be able to verify downtime evidence on its own, instead of trusting consumers or requiring the evidence to go through governance. For this, we can leverage the LastCommitInfo in the ClientUpdate messages sent to the provider.

Closing criteria

Downtime evidence for validators validating on the consumer chains can be verified on the provider chain.

Problem details

Currently, downtime on consumer chains works as follows: the consumer chain detects the downtime itself and sends a SlashPacket to the provider, which then jails the validator.

This approach assumes that the consumer chain is trusted, i.e., the provider doesn't verify the SlashPackets beyond basic validity checks. The reason is that the evidence for downtime is quite extensive: on Cosmos Hub, the Signed Blocks Window is 10000 and the Min Signed Per Window is 5%, so a validator is jailed only if it signed fewer than 500 of the last 10000 blocks, which means the evidence of downtime consists of at least 9500 headers in a window.

Suggestion

Use the ClientUpdate messages sent to the provider to update its client to the consumer. These messages contain consumer headers, which means they contain the LastCommitInfo. As a result, downtime on the consumer could be detected directly on the provider.

The major concern with this approach is that the client to the consumer doesn't need to be updated on every block. Thus, some consumer headers will be skipped. The detection protocol in HandleValidatorSignature could be adapted to punish validators that have missed too many of the known consumer blocks. This means, though, that relayers have an impact on the downtime detection protocol. For example, consider a validator that misses the occasional block, but not enough to be punished for downtime (when using the original protocol). A relayer could try to attack this validator by updating the client to the consumer only with headers of blocks missed by the validator. At first glance, this may not be a problem, as there can be many relayers (e.g., the validator could run its own relayer). However, it is worth analyzing this scenario in more detail.
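As a rough illustration of what the adapted detection logic could look like, here is a minimal Go sketch. This is not actual CCV code: the `SampledSigningInfo` type and its methods are invented for this example, and the threshold simply mirrors the `MinSignedPerWindow` parameter mentioned above. The point is only that the provider would count misses over the consumer blocks it actually knows about.

```go
package main

import "fmt"

// SampledSigningInfo tracks a validator's signatures over the consumer
// blocks the provider has seen via ClientUpdate messages (hypothetical type).
type SampledSigningInfo struct {
	KnownBlocks  int64 // consumer heights the provider has observed
	MissedBlocks int64 // of those, how many this validator failed to sign
}

// RecordCommit is called for every consumer header received in a ClientUpdate,
// using its LastCommitInfo to tell whether the validator signed that block.
func (s *SampledSigningInfo) RecordCommit(signed bool) {
	s.KnownBlocks++
	if !signed {
		s.MissedBlocks++
	}
}

// ShouldJail mirrors the HandleValidatorSignature window check, but over the
// sampled blocks only: jail once the validator missed more than
// (1 - minSignedPerWindow) of the blocks the provider actually knows about,
// and only once enough samples have accumulated.
func (s *SampledSigningInfo) ShouldJail(minSignedPerWindow float64, minSample int64) bool {
	if s.KnownBlocks < minSample {
		return false // not enough samples to decide
	}
	missedFraction := float64(s.MissedBlocks) / float64(s.KnownBlocks)
	return missedFraction > 1-minSignedPerWindow
}

func main() {
	info := SampledSigningInfo{}
	for i := 0; i < 100; i++ {
		info.RecordCommit(i%20 != 0) // the validator misses 5% of the sampled blocks
	}
	fmt.Println(info.ShouldJail(0.05, 50)) // false: 5% missed is far below the 95% missed threshold
}
```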

Task list

### Must have
- [ ] Check if SDK PostHandler can be used to get access to `ClientUpdate` messages
- [ ] Analyze the impact of "sampling" on the security of the downtime detection 
- [ ] Adapt the logic in `HandleValidatorSignature` and add to CCV module 
- [ ] E2E tests 
### Nice to have
jtremback commented 1 year ago

> The major concern with this approach is that the client to the consumer doesn't need to be updated on every block. Thus, some consumer headers will be skipped. The detection protocol in HandleValidatorSignature could be adapted to punish validators that have missed too many of the known consumer blocks. This means, though, that relayers have an impact on the downtime detection protocol. For example, consider a validator that misses the occasional block, but not enough to be punished for downtime (when using the original protocol). A relayer could try to attack this validator by updating the client to the consumer only with headers of blocks missed by the validator. At first glance, this may not be a problem, as there can be many relayers (e.g., the validator could run its own relayer). However, it is worth analyzing this scenario in more detail.

Just wanted to add another point on this. I'd like to consider several ways to handle this:

Boosting the downtime signal means that you overestimate the amount of downtime that a validator had. As a rough example, maybe if you are only receiving 1% of headers, you multiply any estimate of downtime you are getting from this 1% by 100x. That is, if it takes 30k missed blocks for a validator to be jailed, the system might consider them to be down after receiving only 300 headers, appropriately spread over 30,000 blocks. This is vulnerable to an attack where the attacker relays every header where a specific validator happens to be down, resulting in the downtime estimation effectively being 100x more stringent for that validator. This attack could be stopped by the validator or a good Samaritan actively relaying headers where it is up to counteract the "bad impression", but it's not clear that all validators have the resources to do this.
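A minimal sketch of the boosting arithmetic described above, with invented names; it only extrapolates observed misses to the full window and is not meant as real CCV code.

```go
package main

import "fmt"

// boostedMissedEstimate scales the misses observed in the sampled headers
// back up to the full signing window (hypothetical helper).
func boostedMissedEstimate(observedMissed, sampledHeaders, windowSize int64) int64 {
	if sampledHeaders == 0 {
		return 0
	}
	return observedMissed * windowSize / sampledHeaders
}

func main() {
	// 1% of a 30,000-block window relayed: 300 observed misses extrapolate
	// to the full 30,000-block jailing threshold.
	fmt.Println(boostedMissedEstimate(300, 300, 30000)) // 30000
}
```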

If you don't boost the downtime signal, it means that for a validator to be counted as down with the same stringency as they would normally be subject to, someone needs to relay every header where that validator was down. That means that if it takes 30k missed blocks for a validator to be jailed, someone needs to relay every single header where they were down. This has the opposite problem: who is going to do that in most cases? This system would probably result in validators not being jailed at all in most cases.

I can think of a few solutions, but I haven't thought them through.

jtremback commented 1 year ago

Something I'd like to see before we spend too much time working on the trustless downtime code is some analysis of how far just improving the throttling mechanism can get us.

Let's say we improve the throttle such that jailing packets are bounced from the provider chain when the slash meter is full instead of being queued (and then are queued on the consumer chain). Now, in the scenario where a consumer chain is attacking the provider by sending a large number of jailing packets, or has even simply unjustly jailed a few validators, it can quickly be stopped by the validators. Once it is stopped, no more jailings occur.
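For illustration only, here is a rough sketch (with invented types, not the actual throttling code) of the "bounce instead of queue" behaviour described above:

```go
package throttle

// SlashMeter tracks how much voting power the provider is still willing to
// jail in the current period; JailPacket carries the misbehaving validator's
// power. Both types are invented for this sketch.
type SlashMeter struct {
	Remaining int64
}

type JailPacket struct {
	ValidatorPower int64
}

// OnRecvJailPacket returns false (mapped to an error acknowledgement) when the
// meter is exhausted, so the packet is bounced and the consumer keeps it in
// its own queue to retry later, instead of the provider queuing it.
func OnRecvJailPacket(meter *SlashMeter, p JailPacket) bool {
	if meter.Remaining < p.ValidatorPower {
		return false // bounce: the consumer re-queues and retries later
	}
	meter.Remaining -= p.ValidatorPower
	// ... jail the validator on the provider ...
	return true
}
```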

This seems quite sufficient for our current stage, where each consumer chain already requires a lot of manual setup and monitoring by validators. It would be limiting for a system where there were a very large number of consumer chains and it needed to be possible to run them with very little oversight, but that's not where we are right now.

On the other hand, it would require the validator set to recognize that something was wrong and then coordinate to stop the chain. This is a laborious process just in terms of communication overhead, so it would be best avoided if we consider attacks and malfunctions that could cause unjust jailing to be likely.

shaspitz commented 1 year ago

Here are my thoughts:

The idea of boosting the downtime signal does not seem like the way to go imo. Relying on a subset of headers to make extrapolated decisions about downtime slashing seems unfair to validators. Not every validator has the resources to set up their own relayer, and further, a val wouldn't know it was being unfairly biased by certain relayers until after that val has been slashed.

That leaves two high level directions that I see as viable:

Improved throttling

If we want something that could be implemented quickly, and something that would suit our needs up to roughly 10+ consumers, improved throttling might be the way to go. Firstly, a lot of the logic around throttling is well tested and in prod (although the throttling code is likely not executed often, or at all, in prod). The only work required is changing how the jail packets are queued, i.e., move the queuing from the provider to the consumer. This design will only get annoying if the mentioned jailing attack actually occurs.

Receive knowledge of all consumer headers on provider

TLDR

Improved throttling or succinct proofs seem like the way to go imo

shaspitz commented 1 year ago

Re "Submit the last N LastCommitInfo in IBC packet metadata": maybe the LastCommitInfos could be included in the ack for VSC packets. But this solution is not the most elegant...
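To make the idea concrete, a hypothetical shape of such an ack could look like the following; these types and fields do not exist in the actual CCV definitions and are shown only as a sketch:

```go
package ccvsketch

// VSCPacketAck is a hypothetical acknowledgement for a VSCPacket, extended
// with recent commit info so the provider can see which validators signed
// without having to trust a SlashPacket.
type VSCPacketAck struct {
	// ... existing acknowledgement fields ...

	// LastCommitInfos for the last N consumer blocks.
	LastCommitInfos []CommitInfo
}

type CommitInfo struct {
	Height     int64
	Signatures []ValidatorSignature
}

type ValidatorSignature struct {
	ValidatorAddress []byte
	Signed           bool
}
```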

shaspitz commented 1 year ago

@jtremback brought up another idea which would be pretty neat. Every validator is already running a node for the provider, and every consumer chain. We could introduce an additional daemon process that each validator must run. This daemon process would monitor the validator's own consumer nodes for downtime information, via Comet directly. The process would look something like

```go
// pseudocode for the proposed daemon
for _, chain := range consumerChains {
	for block := range chain.Blocks() {
		// query the consumer's Comet node directly for gossiped validator downtime information
		// store the downtime information for this block
		// execute the same downtime logic as the staking module
		// alert the provider node if any validators are seemingly down
	}
}
```

Obviously there could be some parallelism introduced here

jtremback commented 1 year ago

To expand on the downtime daemon idea- I think it is an optimal solution, except for the fact that each validator will need to run a separate daemon process, which doesn't seem like a big deal. It only works for Replicated Security, but AFAICT, RS is the only variant where we need to jail on the provider for downtime at all.

@smarshall-spitzbart spelled out the logic on the daemon above, but the basic idea is that the daemon running on each validator queries its own consumer chain Comet processes for downtime information about other validators. The daemon then keeps track of this information and decides when a given validator has passed the threshold to be considered down.

At this point, the daemon sends a transaction to the provider chain, signed by a key unique to that validator. This is a vote. Once 2/3s of validators have voted that someone is down, they get jailed, just as if a jailing packet had been received in our current model.
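A minimal sketch of the provider-side tally this would require, with invented names and no dependency on the actual CCV module, just to make the 2/3 threshold concrete:

```go
package main

import "fmt"

// DowntimeVotes tallies daemon votes for jailing one target validator
// (hypothetical type, not part of the CCV module).
type DowntimeVotes struct {
	TotalPower int64            // total bonded voting power on the provider
	votedPower map[string]int64 // voter address -> voting power
}

func NewDowntimeVotes(totalPower int64) *DowntimeVotes {
	return &DowntimeVotes{TotalPower: totalPower, votedPower: map[string]int64{}}
}

// AddVote records a voter's power (idempotently per voter) and reports whether
// voters representing more than 2/3 of the total power now agree the target is down.
func (d *DowntimeVotes) AddVote(voter string, power int64) bool {
	d.votedPower[voter] = power
	var sum int64
	for _, p := range d.votedPower {
		sum += p
	}
	return 3*sum > 2*d.TotalPower
}

func main() {
	votes := NewDowntimeVotes(100)
	fmt.Println(votes.AddVote("val1", 40)) // false: 40 is not more than 2/3 of 100
	fmt.Println(votes.AddVote("val2", 30)) // true: 70 > 2/3 of 100, so the target gets jailed
}
```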

This is trustless, scalable (including in some hypothetical permissionless consumer chain deploy scenario), very cheap on gas, and doesn't involve any crazy zk stuff.

Gravity bridge, Sommelier, and others use this same method to relay events from Ethereum.

jtremback commented 1 year ago

So my overall opinion right now is that we should focus on throttle improvements and possibly look into a downtime daemon system when we have more than ~20 consumer chains.

JuanBQti commented 1 year ago

I like the daemon idea. I just wonder what the incentives are to run this process. Can validators free-ride and rely on others' daemon processes? Moreover, validators know that if they do not run it, nobody is punished. Can they collude?

I also agree that the other solutions involving relayers have the problem that the relayers have the power to manipulate the information the provider gets. Besides the attack mentioned above, a relayer can "hide" the blocks in which a particular validator is reported as missing, in exchange for a bribe (and there are probably many other potential attacks).

The relayers' strategic behavior should be less of a concern if there are many relayers. We could try to sample information from the different relayers at random to build the statistic, but I'm not sure. I just mention this in case we need to move beyond the throttle and daemon ideas.

shaspitz commented 1 year ago

Note: it was decided that the daemon idea is a good long-term approach to trustless downtime slashing; however, #713 will be my short-term focus. This issue will be pushed off to the future.

mpoke commented 1 year ago

@smarshall-spitzbart @jtremback

Jail throttling (even the improved version) doesn't stop a malicious consumer from incorrectly jailing a validator. It stops a malicious consumer from jailing a lot of validators at once.

> RS is the only variant where we need to jail on the provider for downtime at all.

Why is that? In the case of opt-in security, what's the deterrent against not validating on a consumer chain? A consumer "signs" a contract with a subset of the provider validators. Through what mechanism does the consumer enforce that contract? Also, in the case of mesh security, the consumer must have a way to slash the provider stake that contributed to the infraction.

Regarding the daemon idea:

> At this point, the daemon sends a transaction to the provider chain, signed by a key unique to that validator. This is a vote. Once 2/3s of validators have voted that someone is down, they get jailed, just as if a jailing packet had been received in our current model.

This 2/3 works only for Replicated Security. For opt-in, only a subset of validators are running the consumer node. For mesh, it may be that no validators on the consumer also run provider nodes.

Regarding the original idea: I still think it's worth analyzing the concern described in the suggested solution.

> The major concern with this approach is that the client to the consumer doesn't need to be updated on every block. Thus, some consumer headers will be skipped. The detection protocol in HandleValidatorSignature could be adapted to punish validators that have missed too many of the known consumer blocks. This means, though, that relayers have an impact on the downtime detection protocol. For example, consider a validator that misses the occasional block, but not enough to be punished for downtime (when using the original protocol). A relayer could try to attack this validator by updating the client to the consumer only with headers of blocks missed by the validator. At first glance, this may not be a problem, as there can be many relayers (e.g., the validator could run its own relayer). However, it is worth analyzing this scenario in more detail.

I do think that there are multiple incentives that would make such an attack difficult, e.g., once IBC fees are enabled, relayers would compete for the fees instead of trying to jail a validator for downtime (if such an attack succeeds, the validator will not be slashed and will only be jailed for 10 minutes).

shaspitz commented 1 year ago

> Jail throttling (even the improved version) doesn't stop a malicious consumer from incorrectly jailing a validator. It stops a malicious consumer from jailing a lot of validators at once.

Agreed, that's why this issue is still open; it'd be a more complete solution. Imo this issue becomes more relevant in a system where we have way more consumers.

Re sampling for downtime (the original idea) in the context of opt-in security: you'd still have to rely on the security of the subset of validators that are running the consumer node, right? Since the downtime info ultimately comes from Comet.

The described subset disadvantage of the daemon idea seems to also exist for the original idea, from my understanding.