MarkusTeufelberger opened this issue 6 years ago
I agree that automating the failover is a very useful feature.
How do you propose distinguishing

1. `A1` crashed and is offline
2. `A1` is up, but its validation is merely delayed
Until a validator actually responds, you can't distinguish 1 and 2 currently either. If `A1` hasn't sent anything by a certain point in time, it could simply be ignored for the current ledger or treated as offline.
So if you have heard from `A2`, but not `A1`, after some duration you will ignore `A1` and go with `A2`'s validation? I worry how best to choose that duration, given that `A1` and `A2` might disagree, but you do not know whether other validators (which you might trust) have received `A1`'s validation.
That's a problem in general - if these were independent validators, the same situation could arise. In the proposed case it would just be harder for an entity to be considered offline if they operate more than one validator and only one of them goes down.
I agree that this is a problem in general. However, for a single UNL with `A1` and `A2` acting this way, you can consider the safety of this scenario relative to your fault tolerance threshold. Your proposal seems like using conditional UNLs, in which you switch to a new UNL based on validator activity. Managing the fork-safety and overlap requirements in that setting seems combinatorially challenging.

Granted, `rippled` does not currently provide a way for managing UNL overlap complexity anyway, so I am very sympathetic to your argument that this doesn't really introduce a new challenge.
Consensus is relatively fork-safe anyway; the bigger threat is failing to make forward progress. Also, individual servers don't have any way of even querying the global state necessary to calculate these thresholds (#1751 recently celebrated its 2nd birthday with no reaction whatsoever). Even if there were such information, it would be relatively easy to feed someone false information designed to interrupt them.
Consensus doesn't really take the global state into account; I don't see how this proposal would change that or why it would require stronger guarantees. If anything, it would help with network stability.
@ChronusZ is this worth pursuing?
If not, let's close this issue.
I think Brad is concerned about a situation where `A1` submits a validation for some ledger `b1` and `A2` submits a validation for `b2`. Now suppose node `C1` lists the two A nodes as belonging to the same operator, whereas node `C2` lists the two A nodes as separate. Then from the perspective of `C1`, `A2` validates `b1`, whereas from the perspective of `C2`, `A2` validates `b2`. Thus `A2` exhibits the worst kind of Byzantine behavior even though no one is behaving incorrectly and there's no way to identify that anything went wrong.
Even if all nodes share the same UNL and agree on the grouping of validators in that UNL, there are semi-practical attacks that a Byzantine validator group could execute by taking advantage of this mechanism to send contradictory validations without accountability. Now the consensus algorithm remains safe without the assumption of Byzantine accountability as long as the number of Byzantine validators does not go above 20%, but with the current algorithm we have a nice soft safeguard: even with >20% Byzantine validators, forking the ledger requires extremely careful control over the p2p network to avoid being immediately identified as faulty.
From the perspective of `C1`, `A2`'s vote would just be ignored since `A1` sent a validation (but if validations were actually logged, `C1` would of course log `A2` as voting for `b2`). I don't see how `C1` would react any differently; the only difference is that currently `C1` would have to drop `A2` from its UNL completely to get the desired behavior of operator `A` having only one single vote, not two. With validator grouping, the `A` validators might have a higher availability from the perspective of `C1`.
With nUNLs I would actually expect `C1` to vote for putting `A2` on the poo-poo list in case it goes down, even if `A1` is constantly up.
My understanding of your suggestion was that all A-validators are treated as having validated whatever ledger was validated by the lowest-index A-validator from whom we received a validation. Is the logic you're actually suggesting as follows: after the timeout, let `b` be the ledger validated by the lowest-index A-validator from whom we received a validation. Then for each A-validator from whom we didn't receive a validation, pretend that validator actually validated `b`; and for each A-validator from whom we received a validation for a different ledger than `b`, pretend that validator didn't submit a validation.
If so, this scheme still has a similar exploit, although it's slightly harder to enact. `A2` just needs to submit their validation to `C2` such that it arrives just before the timeout. Then there won't be enough time for `C1` to see the validation for `b2`, so `C1` will treat `A2` as having validated `b1`. Again there's no identifiably faulty behavior going on here, so there's essentially no risk in attempting this attack.
That's just "normal" byzantine behavior though? From the perspective of C1
, A2
jsut was offline or very late, from the perspective of C2
it was still in time. I don't see how that would be any better or worse if validators were grouped.
The idea in general is that I would like to move from "a UNL contains a list (actually a set...) of validators" to "a UNL contains a list (actually a set) of node-operating entities with their actual validators as sub-lists/sub-sets", with some easy-to-understand rules for resolving eventual conflicts within the nodes of a single operator (e.g. "take the first one in the list", "take the first one that actually arrives at my node", "take the majority within that operator" or even "take a random one").
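Roughly the shape I have in mind, using "take the first one that actually arrives at my node" as the example rule (just a sketch, not tied to any existing `rippled` structure):

```python
from typing import List, Optional, Tuple

# A grouped UNL: one entry per operating entity, each with its validators.
GroupedUNL = List[List[str]]

UNL: GroupedUNL = [["A1", "A2", "A3"], ["B1", "B2"], ["C"]]

def first_to_arrive(group: List[str], arrivals: List[Tuple[str, str]]) -> Optional[str]:
    """The group's vote is whatever the first of its validators to reach my
    node validated; later (possibly conflicting) votes from the same operator
    are ignored."""
    for validator, ledger in arrivals:  # arrivals are in receive order
        if validator in group:
            return ledger
    return None  # nobody from this group was heard from

# Validations in the order they arrived at this node.
arrivals = [("B2", "b2"), ("A1", "b1"), ("B1", "b1"), ("C", "b1")]

# One vote per operator instead of one vote per validator.
print([first_to_arrive(group, arrivals) for group in UNL])
# -> ['b1', 'b2', 'b1']
```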
It's way worse with validator grouping. Without grouping, `C2` sees an `A2` validation for `b2` and `C1` sees no validation from `A2`. This is fine and presents no safety threat, only a potential temporary loss of forward progress. With grouping, `C2` still sees a validation for `b2` but now `C1` sees a validation for `b1`. This is a safety threat even though `A2` is behaving no differently from a laggy honest node, which is a serious issue.
`C2` would also see the validation for `b1` from `A1` though, so it would consider 2 opposing votes from the same operator, while a grouped UNL would only consider one. `C1` would have at least one validator fewer than `C2` in its UNL, btw; it would not count the `A1` vote twice.

I still fail to see how this would present any issue or be a safety threat; maybe the example is too simple, or the explanation of what happens is unclear?
With the validators from the original issue (Alice operates `A1`, `A2` and `A3`; Bob runs `B1` and `B2`; Charlie runs only `C`):

Now 2 validators or nodes `D1` and `D2` have 2 different UNLs: `D1` is grouped (3 operators: `[[A1, A2, A3], [B1, B2], C]`), `D2` is ungrouped (6 validators: `[A1, A2, A3, B1, B2, C]`). You are concerned that if `B2` is rather late from the perspective of `D1` and also conflicts with `B1`, this is somehow worse if `D1` does grouping?
Ok, then in that case the UNL overlap between `D1` and `D2` is effectively just `[A1,B1,C]`, which is only 50% of `D2`'s UNL, insufficient for guaranteeing safety even with 100% honest nodes.
Let's consider an example where there is a single UNL `[A1,A2,B1,B2,C1,C2,D,E]` and all nodes agree to use the obvious grouping structure. Suppose `[A1,B1,C1,D]` validate `b1` and `[A2,B2,C2,E]` validate `b2`. Now if a node receives the validations from `[A1,B1,C1,D]` (and possibly `E`) but the validations from `[A2,B2,C2]` are delayed past the timeout, then it will see 80% support for `b1` and fully validate. If instead a node receives the validations from `[A2,B2,C2,E]` (and possibly `D`) but the validations from `[A1,B1,C1]` are delayed past the timeout, then it will fully validate `b2`.
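Spelling the arithmetic out under the proposed rule and the 80% full-validation threshold used in the example above (everything else below is only illustrative):

```python
from typing import Dict, List, Optional

GROUPS = [["A1", "A2"], ["B1", "B2"], ["C1", "C2"], ["D"], ["E"]]
QUORUM = 0.8  # full-validation threshold assumed in the example

def group_vote(group: List[str], received: Dict[str, str]) -> Optional[str]:
    """Proposed rule: the whole group counts as validating whatever the
    lowest-index member we heard from validated; conflicting members are
    ignored and missing members are assumed to agree."""
    return next((received[v] for v in group if v in received), None)

def support(received: Dict[str, str], ledger: str) -> float:
    """Fraction of groups (each group counts once) backing `ledger`."""
    votes = [group_vote(g, received) for g in GROUPS]
    return sum(v == ledger for v in votes) / len(GROUPS)

# Node X hears A1, B1, C1, D (for b1) and E before the timeout; A2, B2, C2 are late.
node_x = {"A1": "b1", "B1": "b1", "C1": "b1", "D": "b1", "E": "b2"}
# Node Y hears A2, B2, C2, E (for b2) and D before the timeout; A1, B1, C1 are late.
node_y = {"A2": "b2", "B2": "b2", "C2": "b2", "E": "b2", "D": "b1"}

print(support(node_x, "b1") >= QUORUM)  # True -> X fully validates b1
print(support(node_y, "b2") >= QUORUM)  # True -> Y fully validates b2
```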
Thus with your proposal, even if all nodes agree on the UNL and its grouping structure, the network can fork in the event of (1) an extremely rare accident even with all nodes behaving honestly, (2) an adversary with strong control over the p2p network even with all validators on the UNL behaving honestly, or (3) an adversary controlling 60% of the UNL (namely the A, B, and C validators) with no significant control over the p2p network. In (2) and (3) the attack can be executed while maintaining plausible deniability for the attacker. Note that in the current algorithm, even an adversary controlling 100% of the validators cannot fork the network while maintaining plausible deniability.
Thanks, that example is much clearer to me.
Of course this can be pushed further towards case 1 with various methods, but that just makes it harder or slower to exploit, not impossible. :thinking:
One option might be to require all/most configured validators to actually cast a vote for something and then just drop some when calculating the outcome. One could also require validators from an operator to have a (simple?/super?) majority among themselves, with the current option of a single validator being the trivial case.
Still sounds a bit too hand-wavy for my liking, but I still think the problem is a relevant one unless there is a global agreement between validator operators to always operate at least a certain number of validators and add the same number per operator to recommended UNLs.
True, I guess if you count a validator group as unresponsive until you receive validations from a proper majority (i.e., strictly greater than 50%), then you at least avoid the issue of deniable Byzantine behavior when everyone agrees on the same UNL with the same grouping structure.
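A sketch of that counting rule, just to pin it down (illustrative only):

```python
from collections import Counter
from typing import Dict, List, Optional

def group_vote_majority(group: List[str], received: Dict[str, str]) -> Optional[str]:
    """A group counts as responsive only once a proper majority of its members
    (strictly more than half) validated the same ledger; otherwise it is
    treated as unresponsive for this ledger."""
    counts = Counter(received[v] for v in group if v in received)
    if not counts:
        return None
    ledger, votes = counts.most_common(1)[0]
    return ledger if 2 * votes > len(group) else None

group = ["A1", "A2", "A3"]
print(group_vote_majority(group, {"A1": "b1"}))                          # None (1 of 3)
print(group_vote_majority(group, {"A1": "b1", "A2": "b1"}))              # 'b1' (2 of 3)
print(group_vote_majority(group, {"A1": "b1", "A2": "b2", "A3": "b2"}))  # 'b2' (2 of 3)
```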
Actually this new functionality can be achieved without making any direct changes to the consensus mechanism. Say there is an entity `A` that operates the validator group `[A1,A2,...,An]`. Let `t` be the smallest integer such that `t > n/2`, i.e. `t = floor(n/2)+1`. Now `A` distributes a `(t,n)` threshold secret among these nodes. Then instead of adding the public keys of `A1,...,An` to the UNL, we just add the public key of the threshold secret to the UNL.
I guess we would still need to modify the p2p code to give nodes a way to combine the threshold signatures to produce a single validation for the group. But not all p2p nodes would need to have that amendment enabled; the validation shares would be passed around like ordinary validations until at least `t` of them arrive at some node with the amendment enabled, and then that node would produce the threshold validation, which would again be passed around like an ordinary validation.
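To illustrate just the `(t,n)` arithmetic with toy numbers, here is plain Shamir secret sharing (demonstration only; an actual deployment would use a proper threshold *signature* scheme rather than ever reconstructing a raw secret):

```python
import random

# Toy Shamir (t, n) secret sharing over a prime field.
PRIME = 2**127 - 1  # a Mersenne prime, fine for a toy example

def make_shares(secret: int, t: int, n: int):
    """Split `secret` into n shares; any t of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

# Operator A runs n = 5 validators; t = floor(n/2) + 1 = 3 of them must cooperate.
n = 5
t = n // 2 + 1
secret = 123456789
shares = make_shares(secret, t, n)

assert reconstruct(shares[:t]) == secret        # any t shares suffice
assert reconstruct(shares[2:2 + t]) == secret   # a different subset works too
```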
I'd like to propose the following feature:
Instead of giving each single validator on a UNL the same weight, I'd like to be able to give whole lists of validators the same weight.
Example:
- Alice operates 3 validators (`A1`, `A2` and `A3`)
- Bob runs 2 of them (`B1`, `B2`)
- Charlie runs only one (`C`)
Currently I can only add one of the A's, one of the B's and the C validator to a UNL, e.g. `[A1, B2, C]`.

I'd like to be able to have a UNL like this: `[[A1, A2, A3], [B1, B2], C]`
From each sub-list the first validator would be considered (in this example, even if `A2` and `A3` disagree with `A1`, as long as a validation from `A1` reaches my node, it would count). A different approach might be to weigh all sub-validators the same but in relation to the global UNL (all the `A` validators are weighted 1/9, the `B`s 1/6 and the `C` one 1/3). That would probably be closer to what might be expected, but might lead to more churn/work.

Anyways: grouping validators together for failover, or just because some entity might choose to run more than one, is a useful feature to have and would also be helpful for decentralization efforts.
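For the second (weighting) variant, the per-validator weights would fall out like this (sketch only):

```python
from fractions import Fraction
from typing import Dict, List

def validator_weights(grouped_unl: List[List[str]]) -> Dict[str, Fraction]:
    """Each operator gets equal weight; that weight is split evenly among
    the operator's validators."""
    per_operator = Fraction(1, len(grouped_unl))
    return {v: per_operator / len(group) for group in grouped_unl for v in group}

unl = [["A1", "A2", "A3"], ["B1", "B2"], ["C"]]
print(validator_weights(unl))
# {'A1': Fraction(1, 9), 'A2': Fraction(1, 9), 'A3': Fraction(1, 9),
#  'B1': Fraction(1, 6), 'B2': Fraction(1, 6), 'C': Fraction(1, 3)}
```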