drmingdrmer opened 5 years ago
Yes, I think that works. Thanks for the writeup.
Sorry for asking this, but how do you guarantee you have found the least (seq, replicaID)? If I understand correctly, commands don't have seq finalized until they are committed, which may happen in the order of their seq actually decreasing. So, if a replica observes part of a long (potentially infinite) SCC, with dependencies that are not yet committed, how can you be sure those dependencies won't have seq lower than whatever you have observed so far? For example:
A <------------- B <-------------- C <-- ...
| ^ | ^ |
\-> ab1 -> ab2 -/ \-> bc1 -> bc2 -/ \-> ...
I know epaxos technically uses optimized dependencies, but I'm trying to consider a more generic dependency graph, which seems possible to generate with 5 replicas (1 replica proposes A, B, C, ... in such a way that they always conflict pairwise, other replicas propose ab1, ab2, bc1, bc2, ... in such a way that it creates a loop for each pair).
In the above graph, after receiving commits for A, B, ab1 and ab2 (the first loop), is it really possible to know whether or not we have observed the least seq value? It seems like bc1 may end up with a smaller or larger seq than these 4 during phase 1.
Could you please explain why (1) would hold in such a case? (e.g. bc1 was proposed by the same replica that proposed ab1 or ab2, such that replicas(allDeps) ⊆ replicas(CMDs) holds, but there is no direct conflict with either, except through B)
I'm probably missing something obvious here, or maybe a graph like this is too generic/actually impossible?
As an aside, I would encourage you not to implement the version of the algorithm that requires "seq". Instead, rely on "deps" alone and ack a command to the client only after the command's dependency graph is finalized (or only after it has been executed). To order commands within an SCC use any order criterion as long as it is consistent across replicas. Without "seq", the algorithm has fewer corner cases and better consistency properties.
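The deps-only approach suggested here can be illustrated with a small sketch (my own illustration, not code from EPaxos): condense the committed dependency graph into SCCs with Tarjan's algorithm, execute SCCs dependencies-first, and break ties inside an SCC with any criterion that is identical on every replica, here simply the command id.

```python
# Illustrative sketch of deps-only execution ordering (not EPaxos code).
# deps maps command id -> set of command ids it depends on.
def execution_order(deps):
    index, low, on_stack, stack, sccs = {}, {}, set(), [], []
    counter = [0]

    def strongconnect(v):  # Tarjan's SCC algorithm
        index[v] = low[v] = counter[0]
        counter[0] += 1
        stack.append(v)
        on_stack.add(v)
        for w in deps.get(v, ()):
            if w not in index:
                strongconnect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:  # v is the root of an SCC
            scc = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                scc.append(w)
                if w == v:
                    break
            sccs.append(scc)

    for v in deps:
        if v not in index:
            strongconnect(v)

    order = []
    for scc in sccs:  # Tarjan emits SCCs dependencies-first
        # inside an SCC any replica-consistent tie breaker works;
        # here we simply sort by command id
        order.extend(sorted(scc))
    return order
```

For example, with deps {'A': {'B'}, 'B': {'A'}, 'C': {'A'}} the SCC {A, B} is executed before C, and A runs before B only because of the id tie breaker.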
I'm not sure it would help with solving livelock due to an infinitely growing SCC (without stopping new proposals, that is), as using seq as a tie breaker may be complicated, but that doesn't seem to be the real issue. The problem I'm trying to understand is: when we have an incomplete SCC and there are vertices that depend on that SCC, can we even be certain those vertices won't ever become part of it as we receive more vertices of the graph (assuming the worst case scenario where commit messages between replicas are severely delayed and arbitrarily reordered)? Execution order may reverse depending on it. I'm probably thinking about dependency graphs that are too general, though.
Edit: I actually discovered epaxos after reading a recent paper on bpaxos (there was no seq, but no mention of efficient solutions to the livelock problem either, which sparked my interest in finding one)
Well, that's why I called the observation about seq an aside, not a direct answer to your question. About your question, though. When you talk about ensuring liveness in a real system, solutions usually revolve around better engineering. A simplistic blunt instrument in this case would be to say that replicas stop pre-accepting new commands as long as the dependency graph for their potential dependencies is larger than a certain threshold size. Replicas would thus wait until all of these dependencies finish committing. If implemented poorly, this could significantly reduce your throughput and increase your latency. If implemented well, this would hopefully kick in only so rarely that you wouldn't see its effect in the 99%ile latency.
That said, I do believe that there could very well be ways to tackle this from a theoretical point of view as well. I haven't thought about this too much, but you could try to reason about which dependencies are really necessary, and which dependencies are superfluous. The critical thing in EPaxos is to guarantee that if two commands interfere with each other, one will be in the other's deps graph. They don't need to be in the same SCC, and as you point out, it's better if they are not. Thus, are there some rules that replicas can use to reason "A must be in deps(B), so I won't add B to deps(A)"? I think so.
Sorry for asking this, but how do you guarantee you have found the least (seq, replicaID)? If I understand correctly, commands don't have seq finalized until they are committed, which may happen in the order of their seq actually decreasing. So, if a replica observes part of a long (potentially infinite) SCC, with dependencies that are not yet committed, how can you be sure those dependencies won't have seq lower than whatever you have observed so far? For example:
Assume our discussion is only about interfering commands:
I think what confused you is this step:
Add the next least (seq, replicaID) command and its deps into CMDs.
Let me elaborate on it:
Assume the least (seq, replicaID) command the executing replica has seen is X. For an unseen command Y:

If X depends on Y, then Y and its deps are added into CMDs. Thus CMDs must contain the least seq command.

If X does not depend on Y, then Y must depend on X, thus Y must have a greater seq.
I'm not sure how to parse this. Are you talking about direct dependencies, or transitive dependencies as well? I assume direct, since in SCC everything depends on everything, in such case here's a counter example:
Z.100 (which should be read as a command writing to Z with seq == 100)

A is A := Z + D
B is B := A
C is C := B
D is D := C
The sequence of events is as follows (number before the colon is the replica number):
1: comes out of a network partition
1: A B C D proposed in rapid succession
initially A.0 <- B.1 <- C.2 <- D.3 (pairwise dependencies)
and also A.0 <- D.3 because of the read/write to D
1: sends PreAccept messages to 2, 3, 4 but they are delayed and reordered
2: receives PreAccept(B.1, [A]), sends PreAcceptOK(B.1, [A]), i.e. there is no A yet, no updates
3: receives PreAccept(B.1, [A]), sends PreAcceptOK(B.1, [A]), i.e. there is no A yet, no updates
1: receives above replies, have a fast quorum for B.1, deps=[A], will commit eventually
3: receives PreAccept(C.2, [B]), sends PreAcceptOK(C.2, [B]), i.e. there is no B yet, no updates
3: receives PreAccept(D.3, [A, C]), sends PreAcceptOK(D.3, [A, C]), i.e. conflict with C.2, no updates
4: receives PreAccept(D.3, [A, C]), sends PreAcceptOK(D.3, [A, C]), i.e. there is no C yet, no updates
1: receives above replies, have a fast quorum for D.3, deps=[A, C], will commit eventually
5: comes out of a network partition
5: receives PreAccept(A.0, []), sends PreAcceptOK(A.0, []), no updates
4: receives PreAccept(A.0, []), sends PreAcceptOK(A.101, [Z, D]), updated on conflicts with Z.100 and D.3
1: receives above replies, have a quorum for A.101, deps=[Z, D]
1: runs accept phase for A.101 via 4 and 5, will commit eventually
4: receives PreAccept(B.1, [A]), sends PreAcceptOK(B.102, [A]), but it's too late (B.1 already selected)
4: receives PreAccept(C.2, [B]), sends PreAcceptOK(C.103, [B]), updated because it has B.102 internally
1: receives above replies, have a quorum for C.103, deps=[B]
1: runs accept phase for C.103 via 3 and 4, will commit eventually
Here's a short reference of events over time on each replica:
1: ....ABCD
2: ZZZ B
3: ZZZ BCD
4: ZZZ DABC
5: ........A
Now that replica 5 is out of a network partition, it receives commits in some particular order:
5: receives and executes all Z commands
5: receives A.101, deps=[Z, D], new least command, but needs to wait for D
5: receives D.3, deps=[A, C], new least command, but needs to wait for C
5: receives C.103, deps=[B], not a new least command
Running your algorithm we may decide that D.3 is the least command that needs to be executed, however that is not the case. There's actually a B.1 command in the cycle with the lowest seq that we haven't seen yet. Am I missing something about how seq and your algorithm are supposed to work?
In epaxos the execution consistency only guarantees order about interfering commands, not unrelated commands:
Execution consistency: If two interfering commands γ and δ are successfully committed (by any replicas) they will be executed in the same order by every replica.
That's why I said:

Assume our discussion is only about interfering commands:

In your example D does not depend on B, thus the order for B and D is not deterministic.
I'm not sure how to parse this. Are you talking about direct dependencies, or transitive dependencies as well? I assume direct, since in SCC everything depends on everything, in such case here's a counter example:
Here what I mean is depending transitively, so that a potential minimal seq is always reachable from an interfering command.
In your example D does not depend on B, thus the order for B and D is not deterministic.
But it does depend on B (indirectly), and replicas would diverge if this is not taken into account. For example, assume all keys have an initial value of 0.
Case 1:
replica sees A, D, C first
executing Z.100: Z := 42
executing D.3: D := C // 0
executing A.101: A := Z + D // 42
new least command C would have to wait for B
executing B.1: B := A // 42
executing C.103: C := B // 0
replica has Z = 42, A = 42, B = 42, C = 0, D = 0
Case 2:
replica sees B
executing Z.100: Z := 42
executing B.1: B := A // 0
executing D.3: D := C // 0
executing A.101: A := Z + D // 42
executing C.103: C := B // 42
replica has Z = 42, A = 42, B = 0, C = 42, D = 0
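The divergence between the two cases can be checked mechanically. Here is a small simulation (my own sketch, names are mine): it executes the five commands in the two orders the replicas observe and compares the final states; the replicas end up disagreeing on B.

```python
# Simulation of the example above: run the five commands in the two
# per-replica orders and compare the resulting states.
def run(order):
    s = {'Z': 0, 'A': 0, 'B': 0, 'C': 0, 'D': 0}  # all keys start at 0
    def z(s): s['Z'] = 42               # Z.100: Z := 42
    def a(s): s['A'] = s['Z'] + s['D']  # A.101: A := Z + D
    def b(s): s['B'] = s['A']           # B.1:   B := A
    def c(s): s['C'] = s['B']           # C.103: C := B
    def d(s): s['D'] = s['C']           # D.3:   D := C
    effects = {'Z.100': z, 'A.101': a, 'B.1': b, 'C.103': c, 'D.3': d}
    for cmd in order:
        effects[cmd](s)
    return s

case1 = run(['Z.100', 'D.3', 'A.101', 'B.1', 'C.103'])  # sees A, D, C first
case2 = run(['Z.100', 'B.1', 'D.3', 'A.101', 'C.103'])  # sees B first
# the two replicas disagree on B (42 vs 0): the interleavings are not
# equivalent, so execution order must be made deterministic
```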
Here what I mean is transitively depending, so that a potential minimal seq is always reachable from an interfering command.
Then I'm not sure how it helps with livelocks. Waiting for all transitive dependencies would mean replica needs to see a complete SCC, which may grow indefinitely, hence a livelock.
Anyway, thank you for your answers. I think using some kind of monotonically increasing per-replica propose time as a tie breaker (used for breaking ties only, thus clock drift would only affect latency of long chains of conflicting commands and their dependencies, so not much of an issue) is more promising for solving this problem.
Update of my previous comment:
As you had the setup with D->A but without D->B, I assumed we were talking about the unoptimized deps; epaxos 4.5 describes the optimization of deps:
Instead of including all interfering instances, we include only N dependencies in each list: the instance number R.i with the highest i for which the current replica has seen an interfering command (not necessarily committed).
For unoptimized deps, my algorithm should work, because any two interfering commands have a direct depends-on relation.
For optimized deps, D->A is unnecessary in the setup.

In order to let this algorithm work with optimized deps, a minor change needs to be added to it:

Add the next least (seq, replicaID) command and its deps into CMDs.
Here we need to retrieve all of the unoptimized dependent commands. Back to your steps:
5: receives D.3, deps=[A, C], new least command, but needs to wait for C 5: receives C.103, deps=[B], not a new least command
The optimized deps of D.3 is A,C and needs to expand to A,B,C.
hi @snaury what about my latest comment?
I don't understand what you mean by any two interfering commands having a direct depends-on relationship. Commands B and D don't interfere, so don't have a relationship in my example. Do you propose expanding relationships during phase 1 or during execution? The former makes it harder to think of a counter example, but I'm not sure there isn't one (having more replicas allows for longer and more out-of-order chains, but it's harder to analyse); besides, you would need to prove certain properties for the final committed seq, but I'm not sure what those are. The latter I think would mean you always expand your least seq relationships (each newly discovered node brings more transitive dependencies) until you have the complete SCC.
I don't understand what you mean by any two interfering commands having a direct depends-on relationship.
For two committed interfering commands X and Y, there is at least one replica that has both of them. Thus X depends on Y, or Y depends on X, or both.
Commands B and D don't interfere, so don't have a relationship in my example.
Yes.
Do you propose expanding relationships during phase 1 or during execution?
During execution.
The former makes it harder to think of a counter example, but I'm not sure there isn't one (having more replicas allows for longer and more out-of-order chains, but it's harder to analyse); besides, you would need to prove certain properties for the final committed seq, but I'm not sure what those are.
I think epaxos gives complete proof about the consistency of the committed graph.
The latter I think would mean you always expand your least seq relationships (each new discovered node brings more transitive dependencies) until you have the complete SCC.
It does expand the least seq command, but this loop stops when the set of replicas of CMDs stops growing:
Repeat this step until we find a CMDs so that replicas(allDeps) ⊆ replicas(CMDs), then execute the least command in CMDs.
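As a concrete reading of this loop, here is a small sketch (the data layout and names are mine): commands are taken least-(seq, replicaID)-first into CMDs together with their deps, and the loop stops once every owner appearing in the accumulated deps already owns a command in CMDs.

```python
# Sketch of the CMDs-expansion loop quoted above (illustrative layout).
# commands: dict id -> {'seq': int, 'replica': owner id, 'deps': [ids]}
def find_least_executable(commands):
    def key(cid):
        c = commands[cid]
        return (c['seq'], c['replica'])

    cmds, all_deps = set(), set()
    for cid in sorted(commands, key=key):  # least (seq, replicaID) first
        cmds.add(cid)
        all_deps.update(commands[cid]['deps'])
        dep_owners = {commands[d]['replica'] for d in all_deps}
        cmd_owners = {commands[c]['replica'] for c in cmds}
        if dep_owners <= cmd_owners:   # replicas(allDeps) ⊆ replicas(CMDs)
            return min(cmds, key=key)  # execute the least command in CMDs
    return None  # condition never reached with the commands seen so far
```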
For two committed interfering commands X and Y, there is at least one replica that has both of them. Thus X depends on Y, or Y depends on X, or both.
I'm not sure how that helps. Here's another example, we are constructing a graph that looks like this:
A <-> B <-> C <-> D <-> E <-> F
↑ ⇵
U <-- V <-- W <-- X <-- Y <-- Z
There is a long chain of commands A, B, C, D, E, F conflicting pairwise. There is also a long chain of commands U, V, W, X, Y, Z where U conflicts with A and eventually Z conflicts with F. Here's an incomplete diagram of how proposes/pre-accepts and accepts are spread over time:
1: A u U C b B A ... E ...
2: B B a A ... D ... F ...
3: b v V B a A
4: u U v V
5: U U V V ... W ... X ... Y ... Z
Here's a textual description of how these commands are processed and what happens to their seq and deps:
propose A.0, B.0, U.0
pre-accept u.1 -> A
accept U.1 -> A
pre-accept b.0
propose C.0
propose V.2 -> U
pre-accept v.2 -> U
accept V.2 -> U
pre-accept b.1 -> A, C
accept B.1 -> A, C
pre-accept a.2 -> B
accept A.2 -> B
...
Here we assume that PreAccept messages are sent immediately after a command is proposed, but are somehow delayed in transit and may arrive later in time.
Notice we have already accepted these commands:
A.2 -> B
B.1 -> A, C
U.1 -> A
V.2 -> U
Notice also that replica 2 has never seen (and will never see) any of U, V. Eventually C will be accepted with e.g. seq=3 and replica 2 will see this:
A.2 -> B
B.1 -> A, C
C.3 -> B, D
Replica 2 is not aware of U, and its owner 5 is not in its dependencies (the command that would eventually link A back to U is not even on anyone's radar yet, but will be in some distant future). Its least (seq, replicaID) command order would thus be B.1, A.2, C.3, ..., and maybe eventually U.1, V.2, ...
However on a replica that is aware of U (or waits for a complete SCC, which is prone to livelock) the order would be B.1, U.1, A.2, V.2, etc. Notice how directly conflicting commands A and U are executed in different order on different replicas. This is inconsistent and unsafe.
Yes, there is a replica that sees both A and U, but it doesn't help in my example. There may also be a replica that only sees one branch of the loop.
Here's an even better diagram where both replicas 1 and 2 are never aware that U exists:
1: A C b B A ......... E .....
2: B B a A ... D ......... F
3: a u U b v V B A .................
4: u U v V .................
5: U U V V W ... X ... Y ... Z
propose A.0, B.0, U.0
pre-accept-ok a.0
pre-accept-ok u.1 -> A
accepted U.1 -> A
propose C.0
pre-accept-ok b.1 -> A
propose V.2 -> U
pre-accept-ok v.2 -> U
accepted V.2 -> U
pre-accept-ok b.1 -> A, C
accepted B.1 -> A, C
pre-accept-ok a.2 -> B
accepted A.2 -> B
Only a single replica 3 observes both branches of the eventual loop.
You are right. My algorithm works only with a set of commands in which every two of them interfere with each other.
Thanks for the detailed diagram to explain 😆😆
Although it is possible to generalize this algorithm to work, looking for the subgraph containing the least command cannot be done within a finite number of steps.
By the way, I did not use seq in my implementation either (as imoraru suggested :)). I just use deps as a vector clock to sort all commands, with some additional restrictions. It seems easier to prove and to program. When I finish it, it would be nice to get your opinions. :))
I think I have a better solution to livelock. Here's a sketch of an idea:
If two commands A and B conflict, then A.deps and B.deps are treated as bidirectional at commit time, i.e. they show conflict edges, not dependencies.

If there is an edge between A and B in the final graph, then A is executed before B if (A.seq, A.owner) < (B.seq, B.owner), and the edge becomes directional.

If A.seq <= B.seq then A will be in B.deps.

Replicas update deps during the accept phase with (possibly redundant) conflicting instances.

Proof sketch for conflicting nodes A and B:
Case 1: A is committed with a fast commit, and B is not in A.deps. That means the initial (A.seq, A.deps) was PreAccepted on a quorum of replicas. That means conflicting instance B will have B.seq >= A.seq + 1 and A will be in B.deps, as there will be a quorum intersection during phase 1 for B.

Case 2: A is committed using paxos accept, and B is not in A.deps. That means during the accept phase of some final (A.seq, A.deps) on a quorum of replicas B was not PreAccepted yet and was not yet known, and A will eventually commit without modifications. That means conflicting instance B will have B.seq >= A.seq + 1 and A will be in B.deps, as there will be a quorum intersection during phase 1 for B.

The same holds with A and B reversed. The key idea is that if B is not in A.deps then A.seq < B.seq, which means A will always be executed before B, and the missing edge is irrelevant.

Otherwise both A and B will have each other in deps, if replicas add all currently conflicting instances to deps (without changing seq) before sending AcceptOK with those resulting instances. The leader will then use a union of all deps and commit the final result. We don't know what the final A.seq and B.seq will be, but we know both instances will have each other in their deps, since the case when the phases don't intersect on replicas results in either of the first two cases.

During recovery we may commit deps that are slightly different from those the original leader (would have) chosen, since we may use a different quorum. Any of those deps are equally valid, in that there will be some redundant dependencies: when A.seq < B.seq we are guaranteed to have A in B.deps, so it's redundant and equally valid to have or not have B in A.deps, depending on the quorum we use for the decision.

The execution algorithm is then simple: for every instance A wait for commits of all A.deps. Since we see all important edges between A and its dependencies, start executing A after executing all dependencies that have a lower (seq, replicaID) tuple.
The graph has no loops, so there is no livelock.
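A minimal sketch of this execution rule (the structures are illustrative, not from any implementation): committed deps are treated as conflict edges, and the recursion only follows edges toward strictly smaller (seq, owner) pairs, so it terminates without ever needing a complete SCC.

```python
# Sketch of the proposed execution rule. All entries in 'commands' are
# assumed committed; 'deps' holds conflict edges, and only edges toward
# a smaller (seq, owner) pair are followed, so the recursion strictly
# descends and cannot loop.
def execute(cid, commands, executed, apply):
    if cid in executed:
        return
    me = commands[cid]
    for dep in me['deps']:
        d = commands[dep]
        if (d['seq'], d['owner']) < (me['seq'], me['owner']):
            execute(dep, commands, executed, apply)  # dependency runs first
        # otherwise the edge points the other way: just a conflict edge
    executed.add(cid)
    apply(cid)
```

For example, two mutually conflicting committed instances A (seq 2) and B (seq 1) execute as B then A on every replica, despite the edge in both directions.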
What do you think?
😲😲😲
Is the core idea about the final complete_deps like the following?
Compare the complete commit-time deps sets (complete_deps) of A and B. E.g. walk through the deps graph of A and add all instances into complete_deps.
Finally A.complete_deps and B.complete_deps have 3 relations:

A.complete_deps ⊂ B.complete_deps: B after A.

A.complete_deps ⊃ B.complete_deps: A after B.

A.complete_deps ⊄ B.complete_deps and A.complete_deps ⊅ B.complete_deps: in this case, determine the order by (seq, owner).
If my understanding of your idea above is correct, I believe we have almost the same solution.

If it is, in this way the seq is not necessary. Comparing complete_deps to determine an order is quite enough. E.g. find out the max instance-id of the n owners in a complete_deps: [max(R0), max(R1)...]. It is a vector clock about when an instance becomes stable, as I mentioned in my previous comment.
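If I read the max-per-owner idea right, it can be sketched like this (the representation is mine: instance ids are (owner, index) pairs and complete_deps is a set of such pairs):

```python
# Sketch of the max-per-owner comparison described above.
def stable_key(complete_deps, owners):
    # the vector [max(R0), max(R1), ...] over the instance's complete_deps
    vec = [max((idx for own, idx in complete_deps if own == r), default=0)
           for r in owners]
    return sum(vec)  # instances are then sorted by this value
```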
I'm not sure what you mean about comparing complete_deps like that. In my idea seq is necessary because it is stable and determines the final order. Determining order using set intersections doesn't seem safe, because then it could be sensitive to complete_deps being in flux (it's unstable and may change during recovery, different replicas may see different complete_deps for the same instance, is it possible to show execution order doesn't change in that case?).
What I show is that if A executes before B and B is not in A.complete_deps, then it's ok for a replica to not be aware of B, and it's still safe for this replica to execute A before it even becomes aware of B. Another replica may recover a slightly different A that has B in A.complete_deps, but it doesn't change the execution order, so it's safe.
The fact that you don't have a cycle in the complete dependency graph doesn't seem sufficient to avoid livelock. You also need the ability to start executing before the entire graph is fully committed.
Consider the following sequence of PreAccepts, with non-transitive interference relations C1 ~ C2 ~ C3 ~ ... ~ Cn ~ C0
Time | Replica1 | Replica2 | Replica3 |
---|---|---|---|
1 | C0 seq = 0 | C0 seq = 0 | |
2 | C1 seq = 0 | C2 seq = 0 | |
3 | C3 seq = 1 | C1 seq = 1 | |
4 | C2 seq = 2 | C4 seq = 2 | |
5 | C5 seq = 3 | C3 seq = 3 | |
6 | C4 seq = 4 | C6 seq = 4 | |
7 | C7 seq = 5 | C5 seq = 5 | |
8 | C6 seq = 6 | C8 seq = 6 | |
... | ... | ... | ... |
n | Cn seq = n-2 | Cn-2 seq = n-2 | |
... | ... | ... | ... |
Assume that C0 commits only after this entire sequence of events. It's clear that we can construct examples like this with arbitrarily long chains of dependencies. After we commit C1 and try to execute it, we must follow the deps graph edges until we eventually get to C0, which has no deps. This is livelock, because n can be arbitrarily high. At no point can we execute C1 without having finalized the entire deps graph, because we never know that C1 has the smallest seq of any command we might discover by following the deps edges to the end.
This is livelock, because n can be arbitrarily high.
I don't think it can be called a livelock, because each instance reaches commit state independently and takes a finite number of steps. In my scheme, as soon as C0 is committed it may or may not have a set of interfering nodes that would have to be observed, but otherwise its execution cannot be delayed indefinitely. The same goes for C1: when it eventually commits (it should take a finite time, as it only depends on communication with a subset of other nodes and available cpu time for processing replies) it would have accumulated some other instances that would have to be observed before it could be executed, but that set of potential dependencies is again finite. For each committed instance you only need to observe the set of its directly interfering instances (which may or may not end up as dependencies), frozen at commit time. The sequence number of an interfering command chain increases, except between some commands that overlap in time, so you would eventually reach some already executed stable state.
This is different from original epaxos, where SCC can grow indefinitely and can never be complete, so it would never be executed. I may be mistaken, but I assumed that was the livelock problem mentioned in the paper.
When committed instances have other instances in their interference set and those instances are not committed for a long time the recovery should kick in and either commit the original command or replace it with no-op. It's probably possible for several replicas to fight over who finishes recovery, but that's a separate issue for paxos in general.
You also need the ability to start executing before the entire graph is fully committed.
Also, that's exactly what I'm doing. Let's assume C1 is committed with seq = 1 and deps = C0, C2, C3, C4 (e.g. at time 5). It would only have to wait for its deps to commit, at which point we would drop C2, C3 and C4 from its deps (they would have a higher seq). The final result is that C1 only has to wait for C0, which would not be dropped (lower seq) and will be executed first, then C1. Even though n can be very large (all those instances would form an SCC in original epaxos), in my proposal C1 is executed as soon as 4 other instances are committed and C0 is executed (which would have no deps, assuming it uses fast commit).
Moreover, if C0 and C1 in your example don't interfere (at first I assumed they do), then C1 would not have C0 in its deps and would not have to wait for it before executing, and C1 will only end up with C2 in its deps. As soon as C2 is committed it would be obvious that C1 should be executed first (both have each other in deps, but only one direction is a dependency).
Yes, C0 has the lowest seq, but if C0 and C1 don't interfere they can (and would) be executed in arbitrary order. In my proposal seq only determines the order between a pair of instances, not an entire graph. Only when we reach Cn would ordering between C0 and Cn start to matter.
Note that in original epaxos dependencies in the final graph may end up looking like this:
C1 <-> C2 <-> C3 <-> C4 <-> ... <-> Cn <-> C0
This is an arbitrarily large SCC and we have to wait until it is complete. However in my proposal graph would end up looking like this:
C1 <- C2 <- C3 <- C4 <- ... <- Cn -> C0
There are no ambiguities in the graph, everything is linear (since your example is very tidy); it's obvious that C1 and C0 must be executed first, but it may happen in any order. If seq values are more randomized, then some arrows would be flipped accordingly, but you won't have to wait for the entire graph before deciding it can be executed; you only wait for a small subset of overlapping commands that you are not yet sure about. It doesn't matter what exact order is chosen between each pair of overlapping commands, it just needs to be consistent on each replica.
C1 and C0 must be executed first, but it may happen in any order
Ah, yes, that's true.
Why do you need to update deps in the Accept phase of a command? Can you give an example where that's necessary?
Why do you need to update deps in the Accept phase of a command? Can you give an example where that's necessary?
Here's a diagram:
t | 1 2 3 4 5 |
---+-----------+----------------------------------------
1 | C C C | C is committed with C.seq=100 (via some out of scope dependencies)
2 | A B | A and B are proposed, there are no initial conflicts, so A.seq = 0, B.seq = 0
3 | a a | A is pre-accepted on replicas 2 and 3, replica 3 updates A.seq=101, A.deps=[C]
4 | b b | B is pre-accepted on replicas 2 and 4, replica 2 updates B.seq=1, B.deps=[A]
5 | A A A | A is accepted via replicas 1, 2, 3, it sees pre-accepted B, replica 1 commits A.seq=101, A.deps=[C, B]
6 | B B B | B is accepted via replicas 2, 4, 5, no additional updates, replica 5 commits B.seq=1, B.deps=[A]
The final graph is C <- A -> B, i.e. A must not be executed before both B and C (which don't interfere). Note that if we didn't update A.deps in the accept phase, then replica 1 would not be aware it needs to wait for B (which ends up with a lower seq due to a particular choice of participating replicas) before executing A (even though B must be executed before A). Because we update A.deps with (possibly redundant) interfering commands, all executing replicas would be forced to wait for B to commit, at which point they would see that B must be executed before A.
Also note that if the accept phase of A did not intersect with B (e.g. because pre-accept of B on replica 2 happens after accept of A), then that would mean B did not finish pre-accept yet and would necessarily pick B.seq = 102, with the final graph changed to C <- A <- B. In that case it's ok for A to be committed without B in A.deps.
Oh, I suppose it's because you want to ensure that you see all the commands that interfere with your command, and which might still get a lower seq than your final seq. The fact that the Accept has the final seq for your command guarantees that all other commands (that you don't see) will get higher seqs.
Yeah, I think it works. If I were to describe this I wouldn't make the Accept phase add anything to the final committed deps. Instead I would say the extra deps are just an execution hint for the command leader. This way you don't have to complicate recovery.
Finally, I want to point out, as I did in the paper, that this seq-based algorithm doesn't ensure strict serializability when command interference is non-transitive. For example, say that a user writes two objects, A and B (initially A = 0, B = 0). First the user commits A = 1, then a time later, the user commits B = 1. While all this is happening, someone else is trying to read both A and B. Consider the following PreAccept order:
Time | Replica 1 | Replica 2 | Replica 3 |
---|---|---|---|
1 | C0, seq = 0 | ||
2 | Read A&B, seq = 1 | ||
3 | Write A, seq = 2 | ||
4 | Write A, seq = 2 | ||
5 | |||
6 | Write B, seq = 0 | ||
7 | Write B, seq = 0 | ||
8 | Read A&B, seq = 1 |
Read A&B is ordered before Write A, but after Write B. As a consequence it returns A = 0 & B = 1, which is not consistent with the time ordering of the operations.
Finally, I want to point out, as I did in the paper, that this seq-based algorithm doesn't ensure strict serializability when command interference is non-transitive.
Yes, thank you, I'm aware. :) Our database is currently serializable, but not strict serializable (by default), due to some optimizations that cause similar "time travel" issues. But then just like the paper states it's possible to wait for dependency chain to be committed before acknowledging commit to client, that would make sure any newly interfering command acquires a higher seq than what we depended on.
I'm not sure what you mean about comparing complete_deps like that. In my idea seq is necessary because it is stable and determines the final order. Determining order using set intersections doesn't seem safe, because then it could be sensitive to complete_deps being in flux (it's unstable and may change during recovery, different replicas may see different complete_deps for the same instance, is it possible to show execution order doesn't change in that case?).
A.complete_deps is the set of instances A can walk to, before A is committed. It only includes instances A has seen during the pre-accept phase. E.g. when handling pre-accept, on R1 there is a finite depends-on graph D1 rooted from A, on R2 there is a finite depends-on graph D2 rooted from A, ... A.complete_deps is the union of all nodes in D1, D2....
deps is a finite set of all interfering instances with A. complete_deps is a sub-graph of the final SCC that contains A. deps ⊆ complete_deps ⊆ SCC.

complete_deps does not change after commit. Sorry for the bad name. It is not complete. ˙––˙
If A did not see B during pre-accept (A in B.complete_deps but B not in A.complete_deps), execute A before B.
Let max(A, R[i]) be the max id of an instance that is in A.complete_deps and belongs to owner R[i]. Sort A and B by sum(max(A, R[0]), max(A, R[1])...) and sum(max(B, R[0]), max(B, R[1])...).
It guarantees execution consistency and does not need a seq.
If I were to describe this I wouldn't make the Accept phase add anything to the final committed deps. Instead I would say the extra deps are just an execution hint for the command leader. This way you don't have to complicate recovery.
Initially I thought replicas would persist new deps to later reply with the earliest possible result. But now I think you're right, it could just be a hint, if, when a replica receives `Accept(A)`, it would only include a conflicting instance `B` in the hint if `B.seq <= A.seq`:

- If the replica committed `B` in the past and `B.seq > A.seq`, then `B.seq` will not change and `B` will be executed after `A`, so no reason to include `B` in the hint.
- If the replica only pre-accepted `B` in the past and `B.seq > A.seq`, then if it's part of the pre-accept quorum the final `B.seq` will not be smaller, but the leader may choose a different pre-accept quorum with a possibly lower `B.seq`. However, since `Accept(A)` must also intersect with that different quorum, `B` will be included in the hint by a different replica if it happens to have `B.seq <= A.seq`, thus no reason to include `B` in the hint from this replica.
- If we choose `seq` not just based on direct conflicts, but on an accumulated `max_seq` that we increase during communication with other replicas, then the likelihood of newly proposed instances spuriously being included in the hint during recovery diminishes over time, so there's no need to persist it until we commit.

Btw, I realized that ordering by `(seq, replicaID)` may not be the best idea, as multiple conflicting instances from the same replica may end up with the same final `seq`. I think if each replica chooses `initial-seq` based on communication with other nodes (like a hybrid logical clock), and each instance proposed by the same replica has a distinct `initial-seq` value, then it may be better to use `(seq, initial-seq, replicaID)` for ordering conflicting nodes, giving priority to earlier proposals, not to replicas that happened to have a lower `replicaID`. It may behave nicer when the conflict rate is closer to 100%.
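The proposed tie-break can be illustrated with a tiny sketch (all names and values here are invented for illustration, not from the thread):

```python
# Sort conflicting instances by (seq, initial_seq, replicaID): ties on the
# final seq are broken by proposal time (initial_seq), not by which replica
# happens to have the lower id.
instances = [
    ("X", 5, 17, 1),   # (name, seq, initial_seq, replicaID)
    ("Y", 5, 12, 9),
]
order = sorted(instances, key=lambda i: (i[1], i[2], i[3]))
print([name for name, *_ in order])  # ['Y', 'X']: Y was proposed earlier
```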
Anyway, I'm not sure why people hate `seq` so much. I love `seq`, it gives very cool properties to the dependency graph. :)

> Anyway, I'm not sure why people hate seq so much. I love seq, it gives very cool properties to the dependency graph. :)
To me, `seq` is an extension to determine the order. The dependency graph could already contain enough information for ordering; `seq` adds more rules to the system in order for it to work correctly. The fewer rules there are, the easier it is for people to understand it and write correct code.
> If A did not see B during pre-accept (A in B.complete_deps but B not in A.complete_deps), execute A before B.
You need to be a little more formal about your statements. Here's an example:
| 1 2 3 |
+-------+
| A B C |
| c a b |
At time 1 three instances are proposed, one on each replica. All of them send pre-accept to the next replica and eventually commit the result. What you get is a perfect loop, where `A` depends on `B` depends on `C` depends on `A`... and you need to somehow disambiguate the order of such SCCs, getting you into a livelock bind. Using 5 replicas it's possible to engineer chain-linked perfect loops with 4 instances (what you need to do is engineer a sequence of propose and pre-accept events over time for 4 instances, where 1 instance does not exist at the first time slot and 1 instance is already complete at the last time slot, then you overlap several loops like that in time, sharing one instance); it's even harder to deal with, especially if you engineer a parallel backwards loop like I showed earlier with 2-instance loops.
If you start accumulating additional dependencies during the accept phase like I propose, you must be aware that different replicas (leader and recovery) may use different quorums (we must consider the case where each replica may only be able to communicate with a subset of other replicas; you may not always see the full picture), accumulate different dependencies, and end up with different dependency graphs on different replicas. In my case that's perfectly fine, since I show how using `seq` for determining direction causes it to never change. Using other arbitrary criteria may cause the order to change.
> If A did not see B during pre-accept (A in B.complete_deps but B not in A.complete_deps), execute A before B.

You need to be a little more formal about your statements. Here's an example:

```
| 1 2 3 |
+-------+
| A B C |
| c a b |
```
Sorry, it's my bad.

- If `A` is initiated after `B` becomes safe: `A` should execute after `B`.
- If `A` is initiated before `B` becomes safe: any consistent order for `A` and `B` is reasonable.

Safe means:

- `B` has collected enough pre-accept replies and can commit on the fast-path, or
- `B` has collected enough accept replies and can commit on the slow-path.

`B` being safe implies recovery will never choose a different value of `B`.
If `A after B being safe` is satisfied, then there is a quorum containing identical safe values of `B`.

∴ `A` will see the safe value of `B` at least once.
∴ `B.complete_deps` will be added into `A.complete_deps`.
∴ `B.complete_deps ⊆ A.complete_deps`.
∴ `sum(max(A, R[i])...) >= sum(max(B, R[i])...)`.
∴ It satisfies the execution linearizability guarantee: if two interfering commands γ and δ are serialized by clients (i.e., δ is proposed only after γ is committed by any replica), then every replica will execute γ before δ.
> If you start accumulating additional dependencies during the accept phase like I propose, you must be aware that different replicas (leader and recovery) may use different quorums (we must consider the case where each replica may only be able to communicate with a subset of other replicas, you may not always see the full picture), accumulate different dependencies, and you may have different dependency graphs on different replicas. In my case that's perfectly fine, since I show how using seq for determining direction causes it to never change. Using other arbitrary criteria may cause the order to change.
I do not know why this is a problem. The leader and the recovery process should always commit the same value if both of them committed, and only one of them could have committed if they chose different values. Finally, every replica has the same graph, so the execution algorithm has the same view on every replica. Thus it should not be a problem. What did I miss?
> The leader and recovery process should always commit the same value if both of them committed.
That does not magically happen, the reason the committed value will be the same on recovery is because when there is potential for disagreement (there is no fast quorum of pre-accept-eq replies) epaxos goes thru a separate accept phase. Only when accept phase acquires a quorum can you be sure that the exact value you accepted will be discovered during recovery. Conversely if recovery is in progress and there's a risk of choosing a different value the leader would not be able to finish accept for its value, since quorum will be blocked by a prepare phase of a replica which started recovery.
What I'm doing is acquiring additional hints during the accept phase, but I don't have a separate accept phase for those hints (only the command itself, `seq` and the original `deps` are fixed during the accept phase). That means all those additional hint dependencies are unstable: they depend on the exact quorum which is used for recovery, they are in flux and they may change on recovery. What's more, there may be a race between a commit broadcast from the original leader and a slightly different commit broadcast from another replica which started recovery (they would have the same `seq` and command, but not the same accumulated dependencies). In my case it doesn't compromise safety, because some of those hint dependencies are redundant. Only those hints that are present in all possible recovery quorums are actually necessary, which means if a replica receives multiple commit messages for the same instance with different hint dependencies, it can update its graph by using an intersection.

I'm not sure how your solution can deal with different commit messages that have a different `complete_deps` in them. I suppose you could try layering more and more accept phases on top of each other, until you reach some union that agrees completely, but that would hurt latency and make recovery very complicated.
> The leader and recovery process should always commit the same value if both of them committed.
>
> That does not magically happen, the reason the committed value will be the same on recovery is because when there is potential for disagreement (there is no fast quorum of pre-accept-eq replies) epaxos goes thru a separate accept phase. Only when accept phase acquires a quorum can you be sure that the exact value you accepted will be discovered during recovery. Conversely if recovery is in progress and there's a risk of choosing a different value the leader would not be able to finish accept for its value, since quorum will be blocked by a prepare phase of a replica which started recovery.
I do not understand. In my opinion epaxos recovery guarantees that:
And yes, original epaxos does not guarantee the hint would be committed with the same value by leader and recovery.
Thus you have the intersection solution.
My thought is to get rid of `deps` and, instead, to commit `complete_deps` (you call it `accumulated_deps`, which I think is more precise :))).

In every phase, use `complete_deps` as a replacement for `deps`; thus epaxos guarantees the value of `complete_deps` would be consistently chosen by leader or recovery.
> I'm not sure how your solution can deal with different commit messages that have different complete_deps in them. I suppose you could try layering more and more accept phases on top of each other, until you reach some union that agrees completely, but that would hurt latency and make recovery very complicated.
No more phase, just pre-accept and accept, then commit.
> In every phase, use complete_deps as a replacement of deps thus the epaxos guarantees the value of complete_deps would be consistently chosen by leader or recovery.
Can you show some examples of how you would decide on `complete_deps`? The way I understand it, you are using my idea of attaching extra dependencies in `Accept` replies, but I may just be assuming that because you proposed it after my comments on it. Here's an example without fast commit (adding extra commands that force an accept phase would just clutter the picture), assuming I understood correctly:
```
t | 1 2 3 4 5 |
--+-----------+----
1 | A a a     | replica 1 proposes A, sends PreAccept to 2, 3
2 |   B       | replica 2 proposes B, PreAccept messages are lost/stuck in queues
3 | A A     A | replica 1 receives PreAcceptOK replies, sends Accept(A, deps=[]) to 2, 5
4 | A         | replica 1 receives AcceptOK replies from 2 and 5, saves Commit(A, complete_deps=[B]) locally and dies
5 |   A A A   | replica 2 decides to recover A, successfully prepares on replicas 2, 3, 4
6 |   A A A   | replica 2 does recovery using Accept(A, deps=[]) on replicas 2, 3, 4
7 |   A       | replica 2 saves Commit(A, complete_deps=[]) and broadcasts
```
Now what is the right commit value? The original leader thinks `A` depends on `B` (because it received a hint about `B` during the accept phase). Replica 2 that did the recovery thinks `A` does not depend on `B`, and may have already executed it. As you can see, different replicas have different graphs, so I must not be understanding what you propose.

How do you accumulate dependencies so they are committed consistently? How do you solve chain-linked loops which grow indefinitely? How do you decide you are not missing some instances that will loop back on you later?
> In every phase, use complete_deps as a replacement of deps thus the epaxos guarantees the value of complete_deps would be consistently chosen by leader or recovery.

> Can you show some examples how you would decide on complete_deps?
`complete_deps` is decided the same way as `deps`, in the pre-accept phase only. The content of `complete_deps` is almost the same as your accumulated deps, but there is no need to update it in the accept phase.

Running with `complete_deps` is the same as with `deps`:

- The leader sends pre-accept with the initial `complete_deps`.
- Replicas receive the pre-accept request and update `complete_deps` by taking the union of it, the interfering instances on this replica, and also the other indirectly dependent instances. Finally they respond with the updated `complete_deps`.
- The leader receives the pre-accept replies. If the condition to commit on the fast-path is satisfied, commit. The conditions are also the same as epaxos specifies. The condition could be one of:
  - every updated `complete_deps` is the same as the initial `complete_deps`, or
  - all updated `complete_deps` are the same and all new deps are committed.
- Otherwise, the leader sends an accept request with the union of all responded `complete_deps`. Because enough hint info has been collected in the pre-accept phase, we do not need to do anything else in the accept phase.
- If ⌊n/2⌋ + 1 positive replies are received, commit.
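The replica-side pre-accept step above can be sketched roughly as follows. This is my own assumed data model; `interferes` and the shape of the local dependency map are hypothetical, not from the thread:

```python
# On pre-accept, a replica unions the proposed complete_deps with every
# local instance that interferes with the new one, plus everything those
# instances transitively depend on (the "indirect" dependencies).

def handle_preaccept(instance, proposed_deps, local_deps, interferes):
    updated = set(proposed_deps)
    stack = [x for x in local_deps if interferes(instance, x)]
    while stack:
        x = stack.pop()
        if x not in updated:
            updated.add(x)
            stack.extend(local_deps[x])  # walk indirect dependencies
    return updated

# The replica knows B -> C; pre-accepting A (which conflicts with B)
# yields complete_deps = {B, C}, including the indirect dependency C.
local = {"B": {"C"}, "C": set()}
print(sorted(handle_preaccept("A", set(), local, lambda a, b: b == "B")))  # ['B', 'C']
```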
The triangle circle example:
t | 1 2 3 |
----------+
1 | A B C | initiate A B C
2 | c a b | pre-accept.
3 | | Receive pre-accept-reply, A.complete_deps=[B], B.complete_deps=[C], C.complete_deps=[A]
4 | C A B | Can not commit on fast-path. send Accept requests
5 | | A, B, C commit with A.complete_deps=[B], B.complete_deps=[C], C.complete_deps=[A]
Finally, no `complete_deps` is a superset of any other. The order of A, B, C is decided by their rank: `rank(X) = sum(max(X, R[0]), max(X, R[1]), ...)`, where `max(X, R[i])` is the max instance id of owner `R[i]` in the set `{X} ∪ X.complete_deps`.

Here the definition is modified: add `X` itself into the set, so there won't be `rank(A) == rank(B)`.
rank(A) = sum(max(A, R0), max(A, R1), max(A, R2)) = sum(A, B, 0) = A+B
rank(B) = sum(max(B, R0), max(B, R1), max(B, R2)) = sum(0, B, C) = B+C
rank(C) = sum(max(C, R0), max(C, R1), max(C, R2)) = sum(A, 0, C) = C+A
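Plugging hypothetical concrete instance ids into the table above (ids 10, 11, 12 are my own example values; the thread keeps them symbolic):

```python
# rank(X) = sum over replicas of the max id owned by that replica in
# {X} | X.complete_deps.  A, B, C are each owned by a different replica.
def rank(x, deps, owner, iid):
    best = {}
    for y in {x} | deps[x]:
        best[owner[y]] = max(best.get(owner[y], 0), iid[y])
    return sum(best.values())

owner = {"A": 1, "B": 2, "C": 3}
iid = {"A": 10, "B": 11, "C": 12}            # hypothetical instance ids
deps = {"A": {"B"}, "B": {"C"}, "C": {"A"}}  # the triangle loop above

ranks = {x: rank(x, deps, owner, iid) for x in "ABC"}
print(ranks)  # {'A': 21, 'B': 23, 'C': 22}: rank(A)=A+B, rank(B)=B+C, rank(C)=C+A
```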
> The way I understand it is you are using my idea of attaching extra dependencies in Accept replies, but I may just be assuming because you proposed it after my comments on it.
I have been using the vector-clock way to solve the ordering problem in my project. But I found our ideas are quite similar: get a just-big-enough sub-graph of the SCC before committing. Thus I was trying to describe it with your terminology.
> Here's an example without fast commit (adding extra commands that force accept phase would just clatter the picture) assuming I understood correctly:
>
> ```
> t | 1 2 3 4 5 |
> --+-----------+----
> 1 | A a a     | replica 1 proposes A, sends PreAccept to 2, 3
> 2 |   B       | replica 2 proposes B, PreAccept messages are lost/stuck in queues
> 3 | A A     A | replica 1 receives PreAcceptOK replies, sends Accept(A, deps=[]) to 2, 5
> 4 | A         | replica 1 receives AcceptOK replies from 2 and 5, saves Commit(A, complete_deps=[B]) locally and dies
> 5 |   A A A   | replica 2 decides to recover A, successfully prepares on replicas 2, 3, 4
> 6 |   A A A   | replica 2 does recovery using Accept(A, deps=[]) on replicas 2, 3, 4
> 7 |   A       | replica 2 saves Commit(A, complete_deps=[]) and broadcasts
> ```
First, at t=1, A received 2 pre-accept replies; R1 could commit on the fast-path. I do not know why there would be an accept phase for `A`. At t=1 R1 should commit with `A.complete_deps=[]`.

If R1 insists on running an accept phase, it should be correct too. According to epaxos, there is no update to `A` when `R1` receives an accept reply. At t=4, R1 believes `A.complete_deps=[]` is safe to commit. `B` should not be added to `A.complete_deps`.

At t=5, if the recovery process sees any accept-phase `A`, it should respect the one with the highest ballot; thus the recovery process still chooses `A.complete_deps=[]`. This is quite the same as classic paxos specifies.
> Now what is the right commit value?

`A.complete_deps=[]`
> The original leader thinks A depends on B (because it received a hint about B during the accept phase).

The leader does not need to update `A` after receiving an accept reply.
> Replica 2 that did recovery thinks A does not depend on B and may have already executed it.

Yes, the leader and recovery should have chosen the same value.
> As you can see different replicas have different graphs, so I must not understand what you propose.
> How do you accumulate dependencies so they are committed consistently?

Accumulated dependencies (or `complete_deps`) are collected the same way `deps` is. There is no update after the accept phase, thus there is no inconsistency.
> How do you solve chain linked loops which grow indefinitely?

`complete_deps` is finite and is determined at commit time; it does not grow as other linked instances commit.
> How do you decide you are not missing some instances that will loop back on you later?

As I mentioned previously (you did not reject it, thus I assume it is true :D):

If the entire SCC is committed, executing instances by their rank `rank(X)` respects the execution linearizability guarantee, because we have: `A after B being safe` implies `rank(A) > rank(B)`.

Thus to execute `A`, we do not need to care about instances with a higher rank:

1. Wait for every instance in `A.complete_deps` to be committed.
2. Sort the set `{A} ∪ A.complete_deps` by `rank(X)`.
3. If `A` has the least rank, execute `A`.
4. Otherwise, repeat from step 1 with the least-rank instance.

There won't be a loop because `rank(X)` keeps going down.
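The four steps above can be sketched as a walk that follows the least-rank member of `{X} ∪ X.complete_deps`. For the sketch I assume everything relevant is already committed and `rank` is given as a lookup; both are simplifications:

```python
# Repeatedly jump to the least-rank instance among {x} | x.complete_deps;
# since rank strictly decreases at every jump, the walk terminates at the
# instance that must execute first.

def first_to_execute(a, complete_deps, rank):
    x = a
    while True:
        least = min({x} | complete_deps[x], key=rank)
        if least == x:
            return x
        x = least

deps = {"A": {"B"}, "B": {"C"}, "C": {"A"}}  # the triangle loop
rank = {"A": 21, "B": 23, "C": 22}           # hypothetical ranks
print(first_to_execute("B", deps, rank.get))  # A
```

Starting from `B`, the walk goes `B → C → A` and stops, because `A` has the least rank in its own set.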
> complete_deps is decided the same as deps, in pre-accept phase only.

Wouldn't it be just the normal `deps` from original epaxos then? What's the difference?
> The content of complete_deps is almost the same as your accumulated deps, But no need to update it in accept phase.

But if it's decided in the pre-accept phase, then it's not even close to being the same; my accumulated deps are accumulated during the accept phase, how can it be the same if yours happens in the pre-accept phase?
> Here it is modified: add X itself into the set thus there wont be rank(A) == rank(B).

I'll take your word for it, but I don't understand how you could ever guarantee that, especially in very complex loops. You even show a table with values for `rank(A)`, `rank(B)` and `rank(C)`, but `A`, `B` and `C` could have the same instance id (it's assigned at propose time and local to each replica), so wouldn't they all be equal to each other?
> First, at t=1, A received 2 pre-accept replies, R1 could commit on fast-path.

I omitted the fast path, because it's irrelevant to the issue (it's basically a cheat where you got lucky and something happened the same way on all replicas; being lucky is good for performance, but the algorithm must be safe when the fast path doesn't exist). Working around the fast path is always possible, but it makes the graph messy; it's easier to consider two instances instead of 3, 4, 5, etc.
> If the entire SCC is committed

The problem I'm trying to solve is how to start executing the graph when the entire SCC is not committed, and may never be committed, because it never stops growing. Think about what happens when the conflict rate is close to 100%: commands proposed on multiple replicas will almost always be looped to each other, with more and more dependencies that are not committed yet.
@drmingdrmer please look at the following diagram:
r | time -> |
--+-------------+
1 | A B C D |
2 | a b c d |
3 | b c d E |
4 | e a |
5 | e |
Here conflicts are pairwise, i.e. `A <-> B <-> C <-> D <-> E` and also `A <-> E`. After this sequence of events (where uppercase `A` is a propose and lowercase `a` is a pre-accept), the graph will eventually look like this:
```
A <- B <- C <- D <- E
|                   ↑
+-->---->--->--->---+
```
Now let's say replica 5 receives some (but not all) commit messages; what would be the first instance it executes? If you answer `B` (it seems to have the lowest rank, since it only depends on `A`, which would get an inflated rank due to `E`), let's say `E` no longer conflicts with `D` (same graph, but no edge between `D` and `E`): is the answer still `B`? Would replica 1 (if it did not receive the commit for `E`) execute `B` first? (After all `rank(B) = B < rank(A) = A+E`, if I understand correctly.) What would replica 5 execute first?
Ah, I think I'm starting to understand. What you propose is to accumulate the transitive conflicts of each conflicting instance you encounter during pre-accept; this way you can glimpse what could have been chosen during a commit. After conflicting instances `A` and `B` are committed, if `A`'s conflicts are a subset of what the pre-accept of `B` has seen for `A`, then `B` has a non-reversible dependency on `A`. Otherwise there is potential for overlap, making the dependency effectively bidirectional, with the direction decided based on rank.
That's an interesting idea, but I'm not sure it works in some edge cases, where dependencies don't match exactly because of different quorums.
| 1 2 3 |
+-------+
| C D | C and D are proposed
| A | A is proposed, initially A.complete_deps=[]
| c | PreAccept(C) reaches replica 1, reply PreAcceptOK(C, complete_deps=[A])
| a | PreAccept(A) reaches replica 2, reply PreAcceptOK(A, complete_deps=[C])
| C | PreAcceptOK(C, complete_deps=[A]) received, broadcast Accept(C, complete_deps=[A])
| C C | Accept and Commit(C, complete_deps=[A])
| a | PreAccept(A) reaches replica 3, reply PreAcceptOK(A, complete_deps=[D])
| B | B is proposed, initially B.complete_deps=[A, D]
| b | PreAccept(B) reaches replica 1, reply PreAcceptEQ(B, complete_deps=[A, D])
| B | Fast Commit(B, complete_deps=[A, D])
| A | PreAcceptOK(A, complete_deps=[C]) received, broadcast Accept(A, complete_deps=[C])
| A | Accept(A, complete_deps=[C]) received, reply AcceptOK(A, complete_deps=[C])
| A A | AcceptOK(A, complete_deps=[C]) received, Commit(A, complete_deps=[C])
| D D | Commit(D, complete_deps=[A, B, C])
Final graph will have the following dependencies and ranks:
A -> C
C -> A
B -> A, D
D -> A, B, C
```
A <-> C
^  \  ^
|   \ |
B <-> D
```
Notice how there are two loops, `A <-> C` and `B <-> D`, with a dependency from one loop to the other. When replica 2 receives commits `A` and `C` they will be executed (in some order), because it cannot see `B` and `D` yet. As far as I can understand your rules, `B` is not ordered after `A` because `C` is not a subset of `A, D`, so the order is determined by rank. When replica 3 receives all commits it may choose a different order, e.g. let's say the instance-ids are:

- A = 1000
- C = 1000
- D = 100
- B = 101

Then we have:

- rank(A) = A+C = 2000
- rank(C) = C+A = 2000
- rank(B) = A+B = 1101
- rank(D) = A+C+D = 2100

We would see `B` having the lowest rank; what's stopping us from executing it before `A` and `C`? (Even though dependency-wise `B` clearly goes after `A` and `C`.)
Unfortunately contradictions are still possible. Looking at subset/superset is a nice idea, but it cannot guarantee edges will be present on both ends of potentially intersecting commands.
> ```
> r | time ->     |
> --+-------------+
> 1 | A B C D     |
> 2 | a b c d     |
> 3 |   b c d E   |
> 4 | e       a   |
> 5 | e           |
> ```
> Now let's say replica 5 receives some (but not all) commit messages, what would be the first instance it executes? If you answer B (seems to have the lowest rank, since it only depends on A, which would get inflated rank due to E),
With this time flow the final ranks would be:

```
A: A+E
B: B
C: C
D: D
E: D+E
```

And it is a loop, thus my answer is `B`.
> let's say E no longer conflicts with D (same graph, but no edge between D and E), is the answer still B? Would replica 1 (if it did not receive commit for E) execute B first? (after all rank(B) = B < rank(A) = A+E, if I understand correctly) What would replica 5 execute first?
```
A: A+E
B: B
C: C
D: D
E:
```

`B` or `E` could each be executed first, on different replicas. Since `D` and `E` do not interfere, it does not matter. The possible order is `E BCD A` or `BCD E A`.
PS. I'm gonna answer your question in the second last comment right away.:DDD
> complete_deps is decided the same as deps, in pre-accept phase only.
>
> Wouldn't it be just normal deps from original epaxos then, what's the difference?
The difference is that `complete_deps` includes all indirectly dependent instances. E.g. with `A -> B, B -> C`, `C` is in `A.complete_deps` too, not just `B`.
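As a tiny sketch of that difference (graph values invented): `deps` holds only direct interferences, while `complete_deps` also pulls in everything reachable through them.

```python
# deps holds only direct interferences; complete_deps also includes every
# instance reachable through them (the transitive closure).
def complete(x, deps):
    out, stack = set(), list(deps[x])
    while stack:
        y = stack.pop()
        if y not in out:
            out.add(y)
            stack.extend(deps[y])
    return out

deps = {"A": {"B"}, "B": {"C"}, "C": set()}
print(sorted(complete("A", deps)))  # ['B', 'C']: C is included, not just B
```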
> The content of complete_deps is almost the same as your accumulated deps, But no need to update it in accept phase.
>
> But if it's decided in pre-accept phase, then it's not even close to being the same, my accumulated deps are accumulated during accept phase, how can it be the same if yours happens in pre-accept phase?
Except for the difference in the phase where they are formed, they play the same role in determining the order. Both are the just-big-enough sub-graph of an SCC for determining the order.
> Here it is modified: add X itself into the set thus there wont be rank(A) == rank(B).
>
> I'll take your word for it, but I don't understand how you could ever guarantee that, especially in very complex loops. You even show a table with values for rank(A), rank(B) and rank(C), but A, B and C could have the same instance id (it's assigned at propose time and local to each replica), so wouldn't they be all equal to each other?
Yes, my statement was not very strict there. Thanks for pointing it out. Adding `X` into the set only eliminates part of the problem. To distinguish every instance, we still need to sort by `(rank(X), instance_id(X))`, or something similar.
> First, at t=1, A received 2 pre-accept replies, R1 could commit on fast-path.
>
> I omitted fast path, because it's irrelevant to the issue (it's basically a cheat where you got lucky and something happened the same way on all replicas, being lucky is good for performance, but algorithm must be safe when fast path doesn't exist). Working around fast path is always possible, but makes graph messy, it's easier to consider two instances instead of 3, 4, 5, etc.
`complete_deps` is collected in the pre-accept phase. Both the slow-path and the fast-path go through the pre-accept phase, thus I think it won't be a problem.
> If the entire SCC is committed
>
> The problem I'm trying to solve is how to start executing the graph when entire SCC is not committed, and may never be committed, because it never stops growing. Think what happens when conflict rate is close to 100%, commands proposed on multiple replicas will almost always be looped to each other with more and more dependencies that are not committed yet.
I said `If the entire SCC is committed` to simplify the core of this idea. It is not a requirement for this algorithm to work. `complete_deps` contains enough information to find the first instance to execute.

I assume you got the gist of this from your next comments, thus I'm not going too deep into the explanation :). I'm gonna see what's in your next next comment :DDD.
> Ah, I think I'm starting to understand. What you propose is to accumulate transitive conflicts of each conflicting instance you encounter during pre-accept, this way you can glimpse what could have been chosen during a commit. After conflicting instances A and B are committed, if A conflicts are a subset of what pre-accept of B seen for A, then B has a non-reversible dependency on A. Otherwise there is potential for overlap, making dependency effectively bidirectional, with direction decided based on rank.
Yes, I had a little trouble finding the correct word for the transitive conflicts. :(((
> That's an interesting idea, but I'm not sure it works in some edge cases, where dependencies don't match exactly because of different quorums.
>
> ```
> | 1 2 3 |
> +-------+
> | C D | C and D are proposed
> | A | A is proposed, initially A.complete_deps=[]
> | c | PreAccept(C) reaches replica 1, reply PreAcceptOK(C, complete_deps=[A])
> | a | PreAccept(A) reaches replica 2, reply PreAcceptOK(A, complete_deps=[C])
> | C | PreAcceptOK(C, complete_deps=[A]) received, broadcast Accept(C, complete_deps=[A])
> | C C | Accept and Commit(C, complete_deps=[A])
> | a | PreAccept(A) reaches replica 3, reply PreAcceptOK(A, complete_deps=[D])
> | B | B is proposed, initially B.complete_deps=[A, D]
> | b | PreAccept(B) reaches replica 1, reply PreAcceptEQ(B, complete_deps=[A, D])
> | B | Fast Commit(B, complete_deps=[A, D])
> | A | PreAcceptOK(A, complete_deps=[C]) received, broadcast Accept(A, complete_deps=[C])
> | A | Accept(A, complete_deps=[C]) received, reply AcceptOK(A, complete_deps=[C])
> | A A | AcceptOK(A, complete_deps=[C]) received, Commit(A, complete_deps=[C])
> | D D | Commit(D, complete_deps=[A, B, C])
> ```
>
> Final graph will have the following dependencies and ranks:
>
> ```
> A -> C
> C -> A
> B -> A, D
> D -> A, B, C
>
> A <-> C
> ^  \  ^
> |   \ |
> B <-> D
> ```
> Notice how there are two loops A <-> C and B <-> D, with dependency from one loop to another. When replica 2 receives commits A and C they will be executed (in some order), because it cannot see B and D yet. As far as I can understand your rules B is not ordered after A because C is not a subset of A, D, so the order is determined by rank. When replica 3 receives all commits it may choose different order, e.g. let's say instance-ids are:
>
> - A = 1000
> - C = 1000
> - D = 100
> - B = 101
>
> Then we have:
>
> - rank(A) = A+C = 2000
> - rank(C) = C+A = 2000
> - rank(B) = A+B = 1101
> - rank(D) = A+C+D = 2100
>
> We would see B having the lowest rank, what's stopping us from executing it before A and C? (even though dependency-wise B clearly goes after A and C)
>
> Unfortunately contradictions are still possible. Looking at subset/superset is a nice idea, but it cannot guarantee edges will be present on both ends of potentially intersecting commands.
We allow `B` to be executed before `A` and `C`.

First, I concluded that the interfering relations are as below (you did not specify them, but it is obvious :DDD):
```
A -- ~ -- C
|  \      |
~   ` ~ . ~
|      \  |
B -- ~ -- D
```
This situation is a little bit complicated because it depends on which fast-commit condition the system uses. As epaxos specifies, there are 3 constraints for fast-commit; a system must choose one of them (for recovery to work):

1. The leader sends the pre-accept request to only the fast-quorum, no other replicas, and receives identical replies.

Or the leader sends the pre-accept request to an arbitrary number of replicas, but:

2. the updated `deps` (or `complete_deps`) must be the same as its initial value and constitute a fast-quorum, or
3. the updated `deps` (or `complete_deps`) must all be committed and constitute a fast-quorum.
In this scenario, it could be the second or the third constraint that the system has chosen (because `A` sends pre-accept to 2 other replicas). And `B` was not initiated after `A` became safe (`B` saw `A` on R1, and the value of `A` on `R1` is not the final value of `A`). Thus any consistent order for `A` and `B` is reasonable, and thus executing `B` before `A` is reasonable in this situation.
> depends on what fast-commit condition the system uses

In my example only `B` is using fast commit, but it's just a minor artifact. Fast commit complicates recovery, but otherwise it's perfectly fine to send pre-accept to all replicas and wait for a single reply. For a 3-replica system the fast quorum is 2 replicas.
> And B is not initiated after A became safe (B saw A on R1, the value of A on R1 is not the final value of A). Thus any consistent order for A and B is reasonable. Thus in this situation execute B before A is reasonable.
I'm not saying it's unreasonable, it's perfectly reasonable. I'm saying the execution order is inconsistent:

- `A`, `C`, `B`, `D` on one replica
- `B`, `A`, `C`, `D` on another

Don't you think that's an issue? My example is complicated, but it illustrates how it's possible to have `A` in `B.complete_deps` (but not vice versa), yet `rank(B) < rank(A)` (so `B` is executed first). Think of it this way: your `rank(X)` rule flips some dependencies in the graph, but then some instances don't see what they must depend on.
I think for it to be safe you must show one of the following:

- If `A` is in `B.complete_deps`, but `B` is not in `A.complete_deps`, then `A` always executes before `B`. That contradicts any example where there's a one-way loop of 3 or more instances, so it's necessary to flip at least one dependency in this case (i.e. you need to handle SCCs and wait for all transitive dependencies to be committed first). That's what original epaxos does.
- If `B` executes before `A` (the order flips due to some external condition, like rank), then you must guarantee `B` is always (somehow) discovered before or together with `A`; otherwise `A` not having a dependency on `B` may allow some replica to execute `A` before `B`, making the execution order inconsistent. That's what you seem to be trying to do, but I don't see how you can guarantee discoverability.
- If the order between `A` and `B` is in any way uncertain until both are committed, then both `A` and `B` must have each other in `complete_deps`. That's what I'm doing (the order between connected instances is always based on `seq`, i.e. the graph is undirected). I pay the price with some dependencies being unstable/spurious, but I show that unstable/spurious dependencies are always in the direction of lower `seq`, so the order is always consistent.

I would love for the second option to work somehow (it would then be possible to use it with bpaxos, where the dependency service is separate from command leaders), but I don't see how it can work in a single pass like you propose. :(
2020-05-21: The following solution is not proved correct. Refer to https://github.com/efficient/epaxos/issues/19 for a proved exec-algo without livelock.
I did not yet have an answer to your previous comment.
But I have another idea it seems more clear:|
:
It is easy to see that every loop contains at least one instance that entered the accept phase (slow path). This instance is the point at which to break the loop:

If `X` depends on `Y` but `Y.complete_deps ⊄ X.complete_deps`, it means `Y` changed after `X` saw it.
In this case, `X` does not need to execute after `Y`, since the condition for execution linearizability ("`X` executes after `Y` if `X` is proposed after `Y` is committed") does NOT hold. Removing this relation does not break the linearizability guarantee.
Thus we remove the relation `X -> Y` from the graph.

After removing all these unnecessary relations, every loop can be broken. The resulting graph has no loops (though there are some orphan vertices). We call the resulting graph `G'`.
1. Choose the instance `X` with the minimal id, and wait for all of its `complete_deps` to be committed.
2. Following the path in `G'`, go to a minimal sibling instance and repeat from step 1. When there are no more outgoing edges, execute the instance.
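The two steps above can be sketched as follows. This is my own illustration, not the author's code: `deps` and `complete_deps` are hypothetical dictionaries standing in for the per-instance metadata described in the text, and instances are plain integer ids.

```python
# Sketch: break loops by dropping each edge X -> Y whenever
# Y.complete_deps ⊄ X.complete_deps (Y changed after X saw it),
# then walk G' toward the minimal-id instance with no outgoing edges.

def build_g_prime(deps, complete_deps):
    """deps: {instance: set of instances it depends on}.
    complete_deps: {instance: set} as seen at commit time."""
    g = {}
    for x, ys in deps.items():
        # Keep X -> Y only if Y did not change after X saw it,
        # i.e. Y.complete_deps ⊆ X.complete_deps.
        g[x] = {y for y in ys if complete_deps[y] <= complete_deps[x]}
    return g

def next_to_execute(g, start):
    """Follow minimal-id outgoing edges until a vertex with no
    outgoing edges is reached; that instance is executed first."""
    x = start
    while g[x]:
        x = min(g[x])  # step to the minimal sibling instance id
    return x
```

For example, a 3-instance loop `1 -> 2 -> 3 -> 1` where only instance 2's and 3's `complete_deps` grew after being seen loses two edges, leaving an acyclic `G'`.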
This is not correct. I have a fix here: https://github.com/efficient/epaxos/issues/14#issuecomment-599973135
If two replicas have chosen different min-id instances in `G'`, then one of the replicas will find the other instance along the path.

What do you think, guys?
BTW, what is bpaxos? I did not find a paper on google. May you share me a link?
> BTW, what is bpaxos? I did not find a paper on google. May you share me a link?
It's kind of off topic, but here's where I initially read about it: https://arxiv.org/abs/2003.00331
2020-05-21: The following solution is not proved correct. Refer to https://github.com/efficient/epaxos/issues/19 for a proved exec-algo without livelock.
Fix: in the previous comment I said:

> every loop contains at least one instance that entered the accept phase (slow path).

That is not correct: `X` having seen an old version of `Y` does not mean `Y` entered the `accept` phase. But the condition to check it, `Y.complete_deps ⊄ X.complete_deps`, is still correct.
This should fix the consistency problem in my previous comment.
For the vertices in `G'` that have no outgoing edges, it is safe to assign them some total order, since no loop can be formed. We could simply sort these instances by instance-id.

Executing one instance is quite simple:

- Select one instance.
- Follow `after` edges to walk to a tip vertex (`T`), which has no more outgoing edges. E.g. in each step, choose the sibling with the minimal id.

An `after` edge means that `X` depends on `Y` and `Y` was not changed after `X` had seen it, as I previously explained.

For every instance `X` in `T.deps` (not `complete_deps`) without outgoing edges, there is an order between `X` and `T`: execute `T` or some `X` in instance-id order.
If two replicas (`R1`, `R2`) have chosen different tip instances `X1`, `X2` to execute:

- If `X1 ~ X2`, then they have a total order, so eventually they will choose the same one.
- Otherwise, executing either of `X1`, `X2` first is OK.
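My reading of the tip-vertex rule above, as a sketch. This is not the author's code; `after_edges` is a hypothetical dictionary holding only the surviving `after` edges (already acyclic), and `deps` stands in for each instance's `deps` list.

```python
# Sketch: walk "after" edges to a tip vertex T, then order T together
# with the no-outgoing-edge instances in T.deps by instance id, so that
# every replica picks the same command to execute first.

def find_tip(after_edges, start):
    """after_edges: {instance_id: set of ids it must execute after}."""
    x = start
    while after_edges[x]:
        x = min(after_edges[x])  # deterministic: minimal sibling id
    return x

def execution_order(after_edges, deps, start):
    t = find_tip(after_edges, start)
    # Instances in T.deps with no outgoing edges are ordered with T
    # by instance id; dependencies that still have outgoing edges are
    # not yet candidates.
    candidates = {x for x in deps.get(t, set()) if not after_edges.get(x)}
    candidates.add(t)
    return sorted(candidates)
```

With two replicas starting from different instances, both walks converge on the same tip region and the instance-id sort makes the final choice identical.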
2020-05-21: The following solution is not proved correct. Refer to https://github.com/efficient/epaxos/issues/19 for a proved exec-algo without livelock.
> depends on what fast-commit condition the system uses

In my example only `B` is using fast commit, but it's just a minor artifact. Fast commit complicates recovery, but otherwise it's perfectly fine to send pre-accept to all replicas and wait for a single reply. For a 3-replica system the fast quorum is 2 replicas.
Not like that. If we do not use fast-commit condition 1 (send pre-accept to exactly a fast quorum), we must choose one of the other two. With a 3-replica setup, sending 2 pre-accept requests confuses the recovery:

```
A ~ B ~ C; C ~ A

 1   2   3
----------
 A   B   C   | init A, B, C
     a   a   | A sends pre-accept to B, C; R2 has A->B, R3 has A->C
 A           | A received a pre-accept reply from B; it fast-commits with A->B.
     p   p   | Recovery sends prepare to B, C; then it can not tell which value
             | would have been committed, A->B or A->C.
             | Thus one of the 3 constraints must be introduced to solve this.
```

This is what I understand about epaxos. But it is a little out of our topic. :|
I'm not saying it's unreasonable, it's perfectly reasonable. I'm saying the execution order is inconsistent:

- Replica 2: `A`, `C`, `B`, `D`
- Replica 3: `B`, `A`, `C`, `D`
I have to say you must be my hero. :DDD
> - If `A` is in `B.complete_deps`, but `B` is not in `A.complete_deps`, then `A` always executes before `B`, that contradicts any example where there's a one-way loop of 3 or more instances, so it's necessary to flip at least one dependency in this case (i.e. you need to handle SCCs and wait for all transitive dependencies to be committed first). That's what original epaxos does.

This seems to be the simplest direction for solving the problem. I followed this way and got the idea in my previous comment: https://github.com/efficient/epaxos/issues/14#issuecomment-599629187

I think it works, but as always, I am waiting for your opinions. :DDD
> - If (same as above), and `B` executes before `A` (the order flips due to some external condition, like rank), then you must guarantee `B` is always (somehow) discovered before or together with `A`, otherwise `A` not having a dependency on `B` may allow some replica to execute `A` before `B`, making execution order inconsistent. That's what you seem to be trying to do, but I don't see how you can guarantee discoverability.

Discovering an instance that no present instance depends on does not seem to be a task that can be done in finite time.

A trick could be to periodically propose a Noop instance: empty instance slots would then be filled, so within a bounded duration we could discover whether such a `B` exists.

But it is ugly, because it tries to fix an algorithm-level problem at the engineering level.
Update 2020-03-07: this approach works only with a set of instances in which every two of them interfere with each other. See @snaury's comments below: https://github.com/efficient/epaxos/issues/14#issuecomment-595467171

### The livelock problem

Epaxos specifies that in an SCC, the lowest-`seq` commands should be executed first. If there is a stream of interfering commands, the execution thread needs to wait until it has walked through the entire SCC before it can execute any command. This is the livelock.

The solution provided by epaxos is to prioritize completing old commands over proposing new commands. This solution may add some latency, because we need to stop proposing new commands in order to break the command chain.

I've been thinking maybe there is some other way to solve the livelock problem. We assume that all commands mentioned below interfere with each other.

### Ordering

Assume we determine the command order in an SCC by the tuple `(seq, replicaID)`. Then the first command to execute is the one with the least `(seq, replicaID)` in an SCC.

### Find the first command safe to execute

We are provided with a committed, un-executed, interfering command list `CMDs = {Ra.A, Rb.B, Rc.C ..}` (`Ra.A` is a command owned by replica `Ra`).

Assume that `Ra.A` has the least `(seq, replicaID)`.

Define a function `replicas(cmds)` that extracts the list of command-owner replicas, e.g.: `replicas({Ra.A, Rb.B}) = {Ra, Rb}`.

Define `allDeps`, the union of all dependencies of `CMDs`: `allDeps = Ra.A.deps ∪ Rb.B.deps ∪ Rc.C.deps ...`

If `CMDs` contains only the least-`seq` commands (i.e., given two un-executed interfering commands `Ra.X` and `Ra.Y`: if `Ra.X.seq < Ra.Y.seq` and `Ra.Y` is in `CMDs`, then `Ra.X` must be in `CMDs`):

1. For an un-executed command X not in `CMDs` and owned by one of `replicas(CMDs)`, we have `X.seq > Ra.A.seq`. Thus X should never be executed before `Ra.A`.
2. For an un-executed command Y not in `CMDs` and NOT owned by any of `replicas(CMDs)`: if `replicas(allDeps) ⊆ replicas(CMDs)`, then Y depends on `CMDs`. Thus Y should never be executed before `Ra.A`.

With (1) and (2), we conclude that `Ra.A` is the command that should be executed first.

### The abstract algorithm

Initialize an empty `CMDs`. Add the next least-`(seq, replicaID)` command and its `deps` into `CMDs`. Repeat this step until we find a `CMDs` such that `replicas(allDeps) ⊆ replicas(CMDs)`, then execute the least command in `CMDs`.

Thus we could execute a command even before the entire SCC is committed. Because the number of replicas is finite and small, this process finishes quickly, even when there are a lot of interfering commands.

In this way searching for an SCC is no longer required, and there won't be a livelock problem anymore.
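The abstract algorithm above can be sketched as follows. This is my own illustration under simplifying assumptions (not the author's code): a `Cmd` record with hypothetical `replica`/`seq`/`deps` fields mirrors the notation in the text, and all listed commands are assumed committed, un-executed, and pairwise interfering.

```python
# Sketch: grow CMDs in (seq, replicaID) order until
# replicas(allDeps) ⊆ replicas(CMDs); the least command in CMDs
# is then safe to execute, without waiting for the whole SCC.

from collections import namedtuple

Cmd = namedtuple("Cmd", "replica seq deps")  # deps: set of (replica, seq)

def first_safe_command(committed):
    """committed: committed, un-executed, pairwise-interfering commands."""
    ordered = sorted(committed, key=lambda c: (c.seq, c.replica))
    cmds = []
    for c in ordered:
        cmds.append(c)
        owners = {x.replica for x in cmds}                    # replicas(CMDs)
        all_deps_replicas = {r for x in cmds for (r, _) in x.deps}
        if all_deps_replicas <= owners:                       # ⊆ check
            # Any command outside CMDs either has a higher seq (case 1)
            # or depends on CMDs (case 2), so the least one is safe.
            return min(cmds, key=lambda x: (x.seq, x.replica))
    return None  # need more commits before anything is safe
```

For two mutually dependent commands `Ra.A` and `Rb.B` with equal `seq`, the loop stops once both are in `CMDs`, and `Ra.A` (the least `(seq, replicaID)`) is returned.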