cosmos / interchain-security

Interchain Security is an open-source IBC application that allows Cosmos blockchains to lease their proof-of-stake security to one another.
https://cosmos.github.io/interchain-security/

Trustless downtime slashing #761

Open mpoke opened 1 year ago

mpoke commented 1 year ago

Problem

We would like the provider chain to be able to verify downtime evidence on its own, instead of trusting consumers or requiring the evidence to go through governance. For this, we can leverage the LastCommitInfo in the ClientUpdate messages sent to the provider.

Closing criteria

Downtime evidence for validators validating on the consumer chains can be verified on the provider chain.

Problem details

Currently, downtime on consumer chains works as follows: the consumer chain detects the downtime itself and sends a SlashPacket to the provider, which then jails the validator.

This approach assumes that the consumer chain is trusted, i.e., the provider doesn't verify the SlashPackets beyond basic validity checks. The reason is that the evidence for downtime is quite extensive: on Cosmos Hub, the Signed Blocks Window is 10000 and the Min Signed Per Window is 5%, so a validator is jailed only if it signed fewer than 500 of the last 10000 blocks, which means the evidence of downtime consists of at least 9500 headers in a window.

Suggestion

Use the ClientUpdate messages sent to the provider to update its client to the consumer. These messages contain consumer headers, which means they contain the LastCommitInfo. As a result, downtime on the consumer could be detected directly on the provider.

The major concern with this approach is that the client to the consumer doesn't need to be updated on every block. Thus, some consumer headers will be skipped. The detection protocol in HandleValidatorSignature could be adapted to punish validators that have missed too many of the known consumer blocks. This means, though, that relayers have an impact on the downtime detection protocol. For example, consider a validator that misses the occasional block, but not enough to be punished for downtime (when using the original protocol). A relayer could try to attack this validator by updating the client to the consumer only with headers of blocks missed by the validator. At first glance, this may not be a problem, as there can be many relayers (e.g., the validator could run its own relayer). However, it is worth analyzing this scenario in more detail.
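As a rough illustration of what the adapted detection logic could look like, here is a minimal Go sketch. This is not actual CCV code: the `SampledSigningInfo` type and its methods are invented for this example, and the threshold simply mirrors the `MinSignedPerWindow` parameter mentioned above. The point is only that the provider would count misses over the consumer blocks it actually knows about.

```go
package main

import "fmt"

// SampledSigningInfo tracks a validator's signatures over the consumer
// blocks the provider has seen via ClientUpdate messages (hypothetical type).
type SampledSigningInfo struct {
	KnownBlocks  int64 // consumer heights the provider has observed
	MissedBlocks int64 // of those, how many this validator failed to sign
}

// RecordCommit is called for every consumer header received in a ClientUpdate,
// using its LastCommitInfo to tell whether the validator signed that block.
func (s *SampledSigningInfo) RecordCommit(signed bool) {
	s.KnownBlocks++
	if !signed {
		s.MissedBlocks++
	}
}

// ShouldJail mirrors the HandleValidatorSignature window check, but over the
// sampled blocks only: jail once the validator missed more than
// (1 - minSignedPerWindow) of the blocks the provider actually knows about,
// and only once enough samples have accumulated.
func (s *SampledSigningInfo) ShouldJail(minSignedPerWindow float64, minSample int64) bool {
	if s.KnownBlocks < minSample {
		return false // not enough samples to decide
	}
	missedFraction := float64(s.MissedBlocks) / float64(s.KnownBlocks)
	return missedFraction > 1-minSignedPerWindow
}

func main() {
	info := SampledSigningInfo{}
	for i := 0; i < 100; i++ {
		info.RecordCommit(i%20 != 0) // the validator misses 5% of the sampled blocks
	}
	fmt.Println(info.ShouldJail(0.05, 50)) // false: 5% missed is far below the 95% missed threshold
}
```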

Task list

### Must have
- [ ] Check if SDK PostHandler can be used to get access to `ClientUpdate` messages
- [ ] Analyze the impact of "sampling" on the security of the downtime detection 
- [ ] Adapt the logic in `HandleValidatorSignature` and add to CCV module 
- [ ] E2E tests 
### Nice to have
jtremback commented 1 year ago

> The major concern with this approach is that the client to the consumer doesn't need to be updated on every block. Thus, some consumer headers will be skipped. The detection protocol in HandleValidatorSignature could be adapted to punish validators that have missed too many of the known consumer blocks. This means, though, that relayers have an impact on the downtime detection protocol. For example, consider a validator that misses the occasional block, but not enough to be punished for downtime (when using the original protocol). A relayer could try to attack this validator by updating the client to the consumer only with headers of blocks missed by the validator. At first glance, this may not be a problem, as there can be many relayers (e.g., the validator could run its own relayer). However, it is worth analyzing this scenario in more detail.

Just wanted to add another point on this. I'd like to consider several ways to handle this:

Boosting the downtime signal means that you overestimate the amount of downtime that a validator had. As a rough example, maybe if you are only receiving 1% of headers, you multiply any estimate of downtime you are getting from this 1% by 100x. That is, if it takes 30k missed blocks for a validator to be jailed, the system might consider them to be down after receiving only 300 headers, appropriately spread over 30,000 blocks. This is vulnerable to an attack where the attacker relays every header where a specific validator happens to be down, resulting in the downtime estimation effectively being 100x more stringent for that validator. This attack could be stopped by the validator or a good Samaritan actively relaying headers where it is up to counteract the "bad impression", but it's not clear that all validators have the resources to do this.
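A minimal sketch of the boosting arithmetic described above, with invented names; it only extrapolates observed misses to the full window and is not meant as real CCV code.

```go
package main

import "fmt"

// boostedMissedEstimate scales the misses observed in the sampled headers
// back up to the full signing window (hypothetical helper).
func boostedMissedEstimate(observedMissed, sampledHeaders, windowSize int64) int64 {
	if sampledHeaders == 0 {
		return 0
	}
	return observedMissed * windowSize / sampledHeaders
}

func main() {
	// 1% of a 30,000-block window relayed: 300 observed misses extrapolate
	// to the full 30,000-block jailing threshold.
	fmt.Println(boostedMissedEstimate(300, 300, 30000)) // 30000
}
```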

If you don't boost the downtime signal, it means that for a validator to be counted as down with the same stringency as they would normally be subject to, someone needs to relay every header where that validator was down. That means that if it takes 30k missed blocks for a validator to be jailed, someone needs to relay every single header where they were down. This has the opposite problem: who is going to do that in most cases? This system would probably result in validators not being jailed at all in most cases.

I can think of a few solutions, but I haven't thought them through.

jtremback commented 1 year ago

Something I'd like to see before we spend too much time working on the trustless downtime code is some analysis of how far just improving the throttling mechanism can get us.

Let's say we improve the throttle such that jailing packets are bounced from the provider chain when the slash meter is full instead of being queued (and then are queued on the consumer chain). Now, in the scenario where a consumer chain is attacking the provider by sending a large number of jailing packets, or has even simply unjustly jailed a few validators, it can quickly be stopped by the validators. Once it is stopped, no more jailings occur.
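For illustration only, here is a rough sketch (with invented types, not the actual throttling code) of the "bounce instead of queue" behaviour described above:

```go
package throttle

// SlashMeter tracks how much voting power the provider is still willing to
// jail in the current period; JailPacket carries the misbehaving validator's
// power. Both types are invented for this sketch.
type SlashMeter struct {
	Remaining int64
}

type JailPacket struct {
	ValidatorPower int64
}

// OnRecvJailPacket returns false (mapped to an error acknowledgement) when the
// meter is exhausted, so the packet is bounced and the consumer keeps it in
// its own queue to retry later, instead of the provider queuing it.
func OnRecvJailPacket(meter *SlashMeter, p JailPacket) bool {
	if meter.Remaining < p.ValidatorPower {
		return false // bounce: the consumer re-queues and retries later
	}
	meter.Remaining -= p.ValidatorPower
	// ... jail the validator on the provider ...
	return true
}
```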

This seems quite sufficient for our current stage, where each consumer chain already requires a lot of manual setup and monitoring by validators. It would be limiting for a system where there were a very large number of consumer chains and it needed to be possible to run them with very little oversight, but that's not where we are right now.

On the other hand, it would require the validator set to recognize that something was wrong and then coordinate to stop the chain. This is a laborious process just in terms of communication overhead, so it would be best avoided if we consider attacks and malfunctions that could cause unjust jailing to be likely.

shaspitz commented 1 year ago

Here are my thoughts:

The idea of boosting the downtime signal does not seem like the way to go imo. Relying on a subset of headers to make extrapolated decisions about downtime slashing seems unfair to validators. Not every validator has the resources to set up their own relayer, and further, a val wouldn't know it was being unfairly biased by certain relayers until after that val has been slashed.

That leaves two high level directions that I see as viable:

Improved throttling

If we want something that could be implemented quickly, and something that would suit our needs up to roughly 10+ consumers, improved throttling might be the way to go. Firstly, a lot of the logic around throttling is well tested and in prod (although the throttling code is likely not executed often, or at all, in prod). The only work required is changing how the jail packets are queued, i.e., move the queuing from the provider to the consumer. This design will only get annoying if the mentioned jailing attack actually occurs.

Receive knowledge of all consumer headers on provider

TLDR

Improved throttling or succinct proofs seem like the way to go imo

shaspitz commented 1 year ago

Re "Submit the last N LastCommitInfo in IBC packet metadata": maybe the LastCommitInfos could be included in the ack for VSC packets. But this solution is not the most elegant...
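To make the idea concrete, a hypothetical shape of such an ack could look like the following; these types and fields do not exist in the actual CCV definitions and are shown only as a sketch:

```go
package ccvsketch

// VSCPacketAck is a hypothetical acknowledgement for a VSCPacket, extended
// with recent commit info so the provider can see which validators signed
// without having to trust a SlashPacket.
type VSCPacketAck struct {
	// ... existing acknowledgement fields ...

	// LastCommitInfos for the last N consumer blocks.
	LastCommitInfos []CommitInfo
}

type CommitInfo struct {
	Height     int64
	Signatures []ValidatorSignature
}

type ValidatorSignature struct {
	ValidatorAddress []byte
	Signed           bool
}
```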

shaspitz commented 1 year ago

@jtremback brought up another idea which would be pretty neat. Every validator is already running a node for the provider, and every consumer chain. We could introduce an additional daemon process that each validator must run. This daemon process would monitor the validator's own consumer nodes for downtime information, via Comet directly. The process would look something like

```go
// pseudocode for the proposed daemon
for _, chain := range consumerChains {
	for block := range chain.Blocks() {
		// query the consumer's Comet node directly for gossiped validator downtime information
		// store the downtime information for this block
		// execute the same downtime logic as the staking module
		// alert the provider node if any validators are seemingly down
	}
}
```

Obviously there could be some parallelism introduced here

jtremback commented 1 year ago

To expand on the downtime daemon idea- I think it is an optimal solution, except for the fact that each validator will need to run a separate daemon process, which doesn't seem like a big deal. It only works for Replicated Security, but AFAICT, RS is the only variant where we need to jail on the provider for downtime at all.

@smarshall-spitzbart spelled out the logic on the daemon above, but the basic idea is that the daemon running on each validator queries its own consumer chain Comet processes for downtime information about other validators. The daemon then keeps track of this information and decides when a given validator has passed the threshold to be considered down.

At this point, the daemon sends a transaction to the provider chain, signed by a key unique to that validator. This is a vote. Once 2/3s of validators have voted that someone is down, they get jailed, just as if a jailing packet had been received in our current model.
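A minimal sketch of the provider-side tally this would require, with invented names and no dependency on the actual CCV module, just to make the 2/3 threshold concrete:

```go
package main

import "fmt"

// DowntimeVotes tallies daemon votes for jailing one target validator
// (hypothetical type, not part of the CCV module).
type DowntimeVotes struct {
	TotalPower int64            // total bonded voting power on the provider
	votedPower map[string]int64 // voter address -> voting power
}

func NewDowntimeVotes(totalPower int64) *DowntimeVotes {
	return &DowntimeVotes{TotalPower: totalPower, votedPower: map[string]int64{}}
}

// AddVote records a voter's power (idempotently per voter) and reports whether
// voters representing more than 2/3 of the total power now agree the target is down.
func (d *DowntimeVotes) AddVote(voter string, power int64) bool {
	d.votedPower[voter] = power
	var sum int64
	for _, p := range d.votedPower {
		sum += p
	}
	return 3*sum > 2*d.TotalPower
}

func main() {
	votes := NewDowntimeVotes(100)
	fmt.Println(votes.AddVote("val1", 40)) // false: 40 is not more than 2/3 of 100
	fmt.Println(votes.AddVote("val2", 30)) // true: 70 > 2/3 of 100, so the target gets jailed
}
```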

This is trustless, scalable (including in some hypothetical permissionless consumer chain deploy scenario), very cheap on gas, and doesn't involve any crazy zk stuff.

Gravity bridge, Sommelier, and others use this same method to relay events from Ethereum.

jtremback commented 1 year ago

So my overall opinion right now is that we should focus on throttle improvements and possibly look into a downtime daemon system when we have more than ~20 consumer chains.

JuanBQti commented 1 year ago

I like the daemon idea. I just wonder what the incentives are to run this process. Can validators free-ride and rely on others' daemon processes? Moreover, validators know that if they do not run it, nobody is punished. Can they collude?

I also agree that the other solutions involving relayers have the problem that the relayers have the power to manipulate the information the provider gets. Besides the attack mentioned above, a relayer can "hide" the blocks in which a particular validator is reported as missing, in exchange for a bribe (and there are probably many other potential attacks).

The relayers' strategic behavior should be less of a concern if there are many relayers. We could try to sample information from the different relayers at random to build the statistic, but I'm not sure. I just mention this in case we need to move beyond the throttle and daemon ideas.

shaspitz commented 1 year ago

Note: it was decided that the daemon idea is a good long-term approach to trustless downtime slashing; however, #713 will be my short-term focus. This issue will be pushed off to the future.

mpoke commented 1 year ago

@smarshall-spitzbart @jtremback

Jail throttling (even the improved version) doesn't stop a malicious consumer from incorrectly jailing a validator. It stops a malicious consumer from jailing a lot of validators at once.

> RS is the only variant where we need to jail on the provider for downtime at all.

Why is that? In the case of opt-in security, what's the deterrent against not validating on a consumer chain? A consumer "signs" a contract with a subset of the provider validators. Through what mechanism does the consumer enforce that contract? Also, in the case of mesh security, the consumer must have a way to slash the provider stake that contributed to the infraction.

Regarding the daemon idea:

> At this point, the daemon sends a transaction to the provider chain, signed by a key unique to that validator. This is a vote. Once 2/3s of validators have voted that someone is down, they get jailed, just as if a jailing packet had been received in our current model.

This 2/3 works only for Replicated Security. For opt-in, only a subset of validators are running the consumer node. For mesh, it may be that no validators on the consumer also run provider nodes.

Regarding the original idea: I still think it's worth analyzing the concern described in the suggested solution.

> The major concern with this approach is that the client to the consumer doesn't need to be updated on every block. Thus, some consumer headers will be skipped. The detection protocol in HandleValidatorSignature could be adapted to punish validators that have missed too many of the known consumer blocks. This means, though, that relayers have an impact on the downtime detection protocol. For example, consider a validator that misses the occasional block, but not enough to be punished for downtime (when using the original protocol). A relayer could try to attack this validator by updating the client to the consumer only with headers of blocks missed by the validator. At first glance, this may not be a problem, as there can be many relayers (e.g., the validator could run its own relayer). However, it is worth analyzing this scenario in more detail.

I do think that there are multiple incentives that would make such an attack difficult, e.g., once IBC fees are enabled, relayers would compete for the fees instead of trying to jail a validator for downtime (if such an attack succeeds, the validator will not be slashed and will only be jailed for 10 minutes).

shaspitz commented 1 year ago

> Jail throttling (even the improved version) doesn't stop a malicious consumer from incorrectly jailing a validator. It stops a malicious consumer from jailing a lot of validators at once.

Agreed, that's why this issue is still open; it'd be a more complete solution. Imo this issue becomes more relevant in a system where we have way more consumers.

Re sampling for downtime (the original idea) in the context of opt-in security: you'd still have to rely on the security of the subset of validators that are running the consumer node, right? Since the downtime info ultimately comes from Comet.

The described subset disadvantage of the daemon idea seems to also exist for the original idea, from my understanding.