Consider the future of the VC/BN architecture

dankrad commented 4 years ago

At the moment, we have an architecture of a Beacon Node (BN) that does all networking, fork choice rule, state transition etc. work and a Validator Client (VC) whose very simple role is to keep the validator key and make sure that it doesn't sign any slashable messages. This is possible because slashable messages are very easy to determine, as they only consist of signing two blocks at the same height or violating the FFG attestation rules.

However, in the coming phases, we will add more slashing conditions:

Signing attestations with invalid custody (phase 1)
Signing shard blocks with invalid state transitions (phase 2)
Signing eth1 blocks with invalid state transitions (phase 1.5)
Signing beacon blocks with invalid state transitions (?)

In order to be able to provide slashing protection the VC needs to include all the state transition logic. To also be able to prevent slashing due to invalid beacon chain transition, we either need a stateless beacon chain (That I think is unlikely) or the VC needs to be able to hold the beacon chain state. This will add a lot of complexity to the VC and a large part of this is duplicated from the BN.

Currently I see the following possible alternatives for the BN/VC architecture split:

Fully maintain the current VC "no-slash" guarantees, meaning that the VC will have to validate all of the above properties
Change the VC/BN relationship to a trusted one; the VC is guaranteed to keep the key, but it cannot guarantee protection from slashing if the BN is bad
Decide that safe and easy staking is more valuable than the additional guarantees we get from fraud proofs.

terencechain commented 4 years ago

I think at the base, there's "complete" trust between BN and VC. Both processes can be viewed as one entity. It's like VC trusts BN the provide the most profitable attestations and beacon blocks to sign. From the base, we can further enhance its security, examples like use self-signed secure gRPC connect by default and connect validator client to a remote slashing protection service are generally what we have been recommending to users. Comes phase 1, the remote slashing protection service will need more work to support catching and preventing invalid custody attestations

mcdee commented 4 years ago

I think at the base, there's "complete" trust between BN and VC

I don't agree with that. In phase 0 there is no requirement for trust between a VC and BN to prevent slashing. A BN could feed a VC whatever lies it likes but the VC can still protect against being slashed.

In phase 0 the slashing constraint is basically "don't change your mind". This is a clear instruction that requires neither trust nor knowledge of the chain state to accomplish. An invalid block proposal (or attestation) comes with no slashing risk but does cost the proposer in terms of their lost income. Can not the same principle be applied to subsequent phases?

terencechain commented 4 years ago

In phase 0 the slashing constraint is basically "don't change your mind". This is a clear instruction that requires neither trust nor knowledge of the chain state to accomplish. An invalid block proposal (or attestation) comes with no slashing risk but does cost the proposer in terms of their lost income. Can not the same principle be applied to subsequent phases?

Which is why I'm not advocating the VC to be protected using a local DB ("don't change your mind") scheme. I'm advocating VC to be connected to a remote slashing protection service. This approach is superior and also prevents duplicated validator processes signing the same attestations. In our current scheme, local DB does not protect against that.

dankrad commented 4 years ago

Comes phase 1, the remote slashing protection service will need more work to support catching and preventing invalid custody attestations

This is not a viable alternative, because in order to prevent invalid custody attestations, the remote service will need to know your custody secret, which can get you slashed. So you might as well completely cede trust to the remote service.

Can not the same principle be applied to subsequent phases?

Well, that would be arguing for point 3 in my list. Many projects go this way, which basically means forgoing any protections from malicious supermajorities. Currently, the philosophy is different, and to minimize the potential of fraud for malicious supermajorities. This has the advantage that a much smaller stake can protect very valuable assets (as the value of the stake is not required to exceed the value of the protected assets).

terencechain commented 4 years ago

This is not a viable alternative, because in order to prevent invalid custody attestations, the remote service will need to know your custody secret, which can get you slashed. So you might as well completely cede trust to the remote service.

Makes sense. We'll have to think about this one...

mcdee commented 4 years ago

Well, that would be arguing for point 3 in my list. Many projects go this way, which basically means forgoing any protections from malicious supermajorities.

Does it lose all such protection? Surely regardless of the weight behind it, an invalid state transition (for example) would be rejected by all nodes.

So you might as well completely cede trust to the remote service.

Is this not ultimately what we expect to happen, though? "Remote service" could be an MPC configuration where no minority of the parties has enough power to subvert the result. Something is going to hold the power to create a signature, but that something can be distributed enough that trust of whatever required level (within reason) can be obtained.

That said, if we move the trust relationship to be between VC and signer rather than the BN and VC, that doesn't help significantly: in phase 0 the signer doesn't need to trust the VC so we're still creating additional requirements on the signing entity. The phase 0 slashing protection requirements are relatively straightforward and can operate without access to chain data. If this changes in phase 1 it could cause a significant reduction in the number and safety of stakers.

dankrad commented 4 years ago

Does it lose all such protection? Surely regardless of the weight behind it, an invalid state transition (for example) would be rejected by all nodes.

Well, an incorrect state transition on a shard could not be detected by most nodes, because most nodes don't follow all shards. In fact if all nodes had to follow all shards to detect invalid state transitions, sharding would not scale. Future shard blocks building on this invalid state transition would not have any evidence about a previous state transition being invalid.

For the beacon chain it's slightly more differentiated, because at least all full nodes would follow the beacon chain. So an invalid state transition would only concern beacon chain light clients. I'm still strongly in favour of protecting them from a dishonest majority of validators by having beacon chain fraud proofs.

Do we need to do slashings if we have fraud proofs? It turns out that we do. The only way to have fraud proof and not make them DOS vectors is that each fraud proof comes with a guarantee that someone gets slashed:

Either the creator of the fraud proof, if it's an incorrect fraud proof (there was no actual fraud)
Or the offender (if the fraud proof turns out correct)

There is a secondary effect of doing slashings via fraud proofs, which we explicitly use in the proof of custody: It strongly discourages signing anything that you haven't actually checked (the lazy validator problem).

If this changes in phase 1 it could cause a significant reduction in the number and safety of stakers.

I agree that this is a concern and that's why I mentioned option 3. We can potentially get a bit more decentralisation (by making secure staking more accessible), by leaving out all these extra fraud proofs. The disadvantage is that we lose a lot of protection from dishonest majorities that we were planning to have with Eth2. I reckon that the latter are more important and that it's ok to have a slightly higher bar for staking.

mcdee commented 4 years ago

Thank you for the detailed response; I now have a better understanding of the utility of fraud proofs.

I remain concerned about the difference in requirements between phase 0 and phase 1 for a signer to be able to provide slashing protection. In phase 0 all a signer needs to protect against proposer slashing (for example) is the BeaconBlockHeader it is meant to sign. This is a small data structure that does not need verification with respect to the state of the chain and allows slashing protection to be embedded relatively cheaply (inside hardware signers, for example).

If in phase 1 the signer needs to be aware of the current state of the chains (eth1 and shard), along with the ability to validate the state transitions, all while signing within the appropriate window, that seems like a big hike in terms of storage, bandwidth and computation, as well as the step change from the signer not needing to have connectivity to anything but the validator client to the signer needing full network connectivity. This feels like a large and significant change to the requirements for a signer rather than a slight increase.

(An alternative way of looking at this is that in phase 0 the validator client does not need to trust the beacon node, and the signer does not need to trust the validator, for them to operate: if a beacon node lies to a validator, for example, the validator can work this out after the fact and find another beacon node to work with. With phase 1 there will be a much higher trust requirement because of the extended slashing events, which increases the footprint of the trusted software and reduces the ability for smaller stakers to operate effectively.)

adiasg commented 4 years ago

@dankrad At the moment, we have an architecture of a Beacon Node (BN) that does all networking, fork choice rule, state transition etc. work and a Validator Client (VC) whose very simple role is to keep the validator key and make sure that it doesn't sign any slashable messages.

As @CarlBeek pointed out in the last spec call, the primary purpose of VC/BN separation is to isolate the module that handles validator keys from the one that is exposed to the network. It just so happens that the Phase 0 slashing rules do not need inputs about the state, and that slashing protection is possible on the VC.

I think you want to have the VC and BN as separate actors in the validator setup. Two points about this:

Retrofitting this requirement in the current architecture doesn't seem to have any good approaches. Option 1 is redundancy without reward -- the same level of trusting the BN would be required if the BN is something that only relays messages from the p2p network to the VC (to achieve network separation of the keys), and everything else is done in the VC.
What is the advantage of this over having (BN + VC) as a single actor? The ability to use an untrusted BN-as-a-Service could be one, but anyone running a VC would have to do the expensive task of verifying their BN's suggested state transitions anyway.

dankrad commented 4 years ago

This feels like a large and significant change to the requirements for a signer rather than a slight increase.

Absolutely, it's a huge increase, that's why I started this thread because I want to highlight that the current framework will not do.

What is the advantage of this over having (BN + VC) as a single actor? The ability to use an untrusted BN-as-a-Service could be one, but anyone running a VC would have to do the expensive task of verifying their BN's suggested state transitions anyway.

There is still a difference in needing to be a full participant in the P2P network vs. just validating ready made blocks that someone sends you together with parts of the history, where necessary. It would be interesting how much the difference actually is -- in terms of CPU usage, memory use and bandwidth. I would expect at least a factor of 2 in each, but likely much more. I'm hoping for some feedback from client teams on this.

adiasg commented 4 years ago

There is still a difference in needing to be a full participant in the P2P network vs. just validating ready made blocks that someone sends you together with parts of the history, where necessary. It would be interesting how much the difference actually is -- in terms of CPU usage, memory use and bandwidth.

Agreed!

However, the most performant architecture for validators is very much dependent on the use case:

Self-hosted validator
- Objective: Simplicity
- Use case: Hobbyist stakers running this on their home machine/cloud instance
- Architecture: Single, trusted (BN+VC) combination
Self-hosted secret-shared validator (SSV)
- Objective: Preventing validator failure
- Use case: More serious stakers running this across their machines/cloud instances
- Architecture: BNs as p2p relay, and SSVCs that do everything else
Staking platform
- Objective: Enabling commercialized validator services
- Use case: Untrusted BN-as-a-Service and running your own SSVs or combining SSVs from multiple custodial validator services
- Architecture: Option 1 from original post

unixpi commented 4 years ago

We can potentially get a bit more decentralisation (by making secure staking more accessible)

There seems to be a lot resting on this assumption. Why a bit, and not a lot?

terencechain commented 4 years ago

Self-hosted secret-shared validator (SSV)

Another benefit for approach 2 is network bandwidth saving. Having 1 p2p relay for N SSVCs versus N p2p relays (approach 1)

dankrad commented 4 years ago

There seems to be a lot resting on this assumption. Why a bit, and not a lot?

I am going to paste my answers from the secret shared validators chat here:

So I cannot quantify the exact difference this makes in decentralization. I can only say that I have always seen staking as having some minimal technical competence requirement -- that of setting up a Linux box, keeping it up to date, securely generating keys etc. This will essentially stay the same. What I however can emphasize is that it makes a huge difference to the actual security. Without these fraud proofs, one random shard assignment that puts in a dishonest majority in a shard and votes for an invalid state transition, can mess up the whole system forever -- it's an unrecoverable fault. It could create a trillion ETH. I think that is simply unacceptable. The second thing is that it means that without those fraud proofs, Ethereum simply cannot secure assets that are far beyond its own market cap. That is crazy. If the world's financial system will eventually run on Ethereum (not a definite but something I would like to be possible), then it would be insane to have as a first "requirement" for that that the ETH market cap is half of the value of that whole system. Simply not gonna happen. In the end, we are talking about the balance of making things easy for users or making things easy for validators. Because if things are easy for validators (no fraud proofs), they are hard for users: They will suddenly have to check huge amounts of data to be sure that no fraud has happened. I think that's the wrong tradeoff. I think when we encounter this kind of tradeoff, it should always be resolved in favour of users

One more thing to add, perhaps, is that super-easy staking -- in the form of a ready-made HSM that supposedly perfectly protects you from slashing -- may lead to more validators, but we now have essentially placed our trust in that HSM.

dankrad commented 4 years ago

Here is another suggestion to get the best of both worlds: Explicitly split Beacon Nodes into the part that verifies state transitions and all the rest. Allow compiling a "State Transition Verification" (STV) node as a particular target. This give the flexibility: 1) Somewhat more capable hardware can, in addition to running a VC, run and STV node. 2) The STV node can be explicitly audited on its own, and needs to change less often than the rest of the BN code.

adiasg commented 4 years ago

Expanding on the last comment -- the ideal validator architecture would be microservices-based. Each major validator service would be it's own module:

Networking & P2P
State Transition
Fork Choice & Chain Management
Slashing Prevention
Signer

This has 3 major advantages:

Allows for highly configurable clients
Allows for easily building clients that are resilient against any selection of SSV requirements
Enables a marketplace of validator microservices that is much more decentralized than simple "custodial staking" services

Of course, this is a long-term goal given the considerable design & implementation effort involved.

vbuterin commented 4 years ago

A quick recap of what I said during today's call:

The only consensus-layer decision is whether or not to have fraud proofs. If we have fraud proofs (which seem necessary especially since we're looking at the ethereum blockchain be used to secure assets much more valuable than ETH itself), then there's then the client-side decision of how to adapt to this.
There are two ways a client could adapt. First, it could assume the BN and VC are both trusted. Second, it could include into the VC a state transition verifier. The latter strategy would increase verification costs by 2x, but it would preserve all the invariants that VCs have today. The fact that it's a client-side choice is nice; we're not forcing either tradeoff on users.
If a client takes the "VC verifies state transitions" route, this actually serves a double purpose: the state transitions could be verified using a different implementation on the VC vs on the BN, adding redundancy and more graceful degradation in the event that one of the implementations has a consensus bug. This is actually a pretty major benefit of having the VC verify state transitions.
In response to @dankrad mentioning that slashing for fraud could exacerbate the verifier's dilemma, as it would make validators want to wait to see others' signatures to make sure they don't have a client bug, I suggested that we could always expand proof of custody to cover an execution trace.

dankrad commented 4 years ago

4\. In response to @dankrad mentioning that slashing for fraud could exacerbate the verifier's dilemma, as it would make validators want to wait to see others' signatures to make sure they don't have a client bug, I suggested that we could always expand proof of custody to cover an execution trace.

There's still an interesting think to consider with respect to this: The PoC can indeed ensure that you have to do your own verification, but you may still want to wait for other signatures to ensure that it's not slashable, to be extra safe. You may also ask a centralized service that executes using all different clients and tells you if any of them fail -- leading to a situation that if any client fails, a block can't get a signature. Probably not a huge problem, but something to consider.

arnetheduck commented 4 years ago

one design that we've been exploring, in addition to colocating BN and VC in a single process (being a single moving part has simplicity advantages - options add complexity) is that only an absolutely minimal API driven by the BN would be used for key handing - basically, it would have trivial, functional RPC calls that are given data to sign and return a signature and that's it (with full trust).

This is driven by a desire to support hardware with a minimal "plugin" interface so that anything underlying BLS implementation can be used to sign - in part it's because we want to encourage that the surface area of the "thing" that touches keys is minimized.

Notably, both VC and BN could use this "signing service".

The state transitions could be verified using a different implementation on the VC vs on the BN

who wins in case of disagreement? In terms of risk, this means being affected by failures in either of the implementations.

unixpi commented 4 years ago

The only consensus-layer decision is whether or not to have fraud proofs. If we have fraud proofs (which seem necessary especially since we're looking at the ethereum blockchain be used to secure assets much more valuable than ETH itself), then there's then the client-side decision of how to adapt to this.

I don't see how we can make a good decision on this until we have a better understanding of whether or not (beacon chain) fraud proofs will be too unwieldy for light clients to handle (the bottleneck may well be the data requirements rather than the verification costs).

To paste from a comment in this issue:

If ethereum truly aspires to be the base layer of the new financial system, then we can expect most users will interact with eth2 using light clients, and probably from areas with bad coverage (rural south america, parts of africa, etc) -- if this is true, then acquiring the data, rather than the actual crypto verification, may well be the bottleneck [to verifying a fraud proof]

To quote from @dankrad's response:

For the beacon chain, we haven't considered the problem that much in the past. There are some operations that require large parts of the state... The simplest way to make these fraud-proof-friendly would be to turn each of these operations into smaller steps, and commit to the state after every step. This guarantees fraud proofs of reasonable size exist, but actually constructing them will be quite complex. However, there are some other ideas that could achieve this in more elegant ways... Currently we don't know yet how difficult this will be for beacon blocks. Next step would be for someone to make a construction and see how much work it actually is :)

dankrad commented 4 years ago

I don't see how we can make a good decision on this until we have a better understanding of whether or not (beacon chain) fraud proofs will be too unwieldy for light clients to handle

Note that even if we don't have beacon chain fraud proofs we will still have shard chain fraud proofs. These are arguable way more necessary than beacon chain fraud proofs (because there are so many shard chains, and their individual security is lower than beacon chain security).

ethereum / consensus-specs

Consider the future of the VC/BN architecture #1969