Closed hasethuraman closed 2 years ago
The explanation and steps look very tedious to follow. Instead, please use the following.
//1. Incorporate this code diff and run etcd-server https://github.com/etcd-io/bbolt/compare/v1.3.6...hasethuraman:bbolt:v1.3.6-test?expand=1
https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:release-3.5?expand=1
//2. Add a key-value pair, and let etcdserver acknowledge it and exit immediately (a simple sleep-and-exit simulates the scenario described above) $ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1 OK
//3. Remove this control flag file and restart the etcd server $ rm /tmp/exitnow
//4. Check if key present $ ./bin/etcdctl get /k --prefix $
// We can see no key-value
It seems that you raised a duplicate of https://github.com/etcd-io/etcd/issues/14364.
I have already answered the issue in https://github.com/etcd-io/etcd/issues/14364#issuecomment-1223391071 and https://github.com/etcd-io/etcd/issues/14364#issuecomment-1222024469
@ahrtr it is not about HA. I am saying there appears to be data loss. If we acknowledge OK to the user while there is still a possibility of data loss, shouldn't we synchronously write to the WAL before acknowledging?
@hasethuraman haven't tried to repro, but I agree with @ahrtr: you should try a 3-node cluster. The Raft protocol doesn't really work on 1 node. I bet you won't be able to reproduce this.
etcd allows launching 1-node clusters only to make it easy to experiment with the API, not for any production use.
Thanks @lavacat. I thought about this; but isn't the raft message RTT covering up this observation? I am posting a screenshot here that also shows what I think: basically, the WAL write can go first, before the database/raft apply, for strong consistency.
Hey @hasethuraman, can you please not use screenshots? It's much easier to just link the code: https://github.com/etcd-io/etcd/blob/4f0e92d94ceacdd0af26e7268764a5bce7b0a3eb/server/etcdserver/raft.go#L212-L239
I don't think you correctly identified the order of operations. Please remember that committing to the db is done after the HardState and WAL entries are added to the WAL. So the invocations you highlighted, `case r.applyc <- ap:` and `r.storage.Save`, are not saving the same entries.
EDIT: In a single-node cluster `rd.Entries` = `rd.CommittedEntries`, which is the exact problem that causes the durability issue.
I agree that it's a little confusing that etcd returns success/OK but the data is actually lost, even though there is only one member in the cluster.
In theory, we could make everything a synchronous call and not respond to the client until everything (including boltDB and the WAL) is successfully persisted. Obviously that would cause a huge reduction in performance, and such a design doesn't make sense at all.
To prevent data loss, etcd has the WAL, which is similar to the redo log of MySQL. To prevent one member going completely down for whatever reason (e.g. disk corruption), we recommend setting up a cluster with at least 3 members. There is no perfect solution, but this is a good solution for now.
But you intentionally failed both the WAL persistence and the boltDB commit, and with only one member. So it's expected behavior by design.
I'm just flabbergasted by the conclusion. This means that a single-node cluster doesn't provide a durability guarantee. The documentation about API guarantees does not state this anywhere: https://etcd.io/docs/v3.5/learning/api_guarantees/#durability
I will take care of this, either updating the doc or enhancing the existing etcdserver/raft workflow. Please note that it can only happen for a cluster with only one member.
Kube was designed with the assumption that PUT was durable and etcd was crash consistent for accepted writes, regardless of quorum size. Maybe @xiang90 or @philips could weigh in on whether this was intentional - it is certainly not expected and I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.
I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.
Agreed. The only discussion around relaxing single-node safety guarantees I was aware of was https://github.com/etcd-io/etcd/issues/11930, and that reinforced my expectation that even single-node etcd servers default to being crash-safe w.r.t. successfully persisting to disk prior to treating a write request as a success, and that overriding that default requires the admin explicitly opting into something unsafe (--unsafe-no-fsync).
I will deliver a PR to fix this for clusters with only one member.
@ahrtr Since I did this locally and it worked as expected, I thought of sharing it. Please let me know if the fix you are planning is going to be different.
I don't think the change is expected, because it would definitely cause a big reduction in performance.
I just delivered a PR https://github.com/etcd-io/etcd/pull/14394 to fix this issue.
Please let me know whether you can still reproduce the issue with the PR. cc @hasethuraman
The question is how we want to roll out this fix. It's not a breaking change, as it restores etcd durability, which matches user expectations. However, it is expected to come with a performance regression. For v3.6 we should make single etcd instances durable, but we should avoid the backport being too disruptive.
I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag, so users can return to the previous behavior. If the regression is very big, the backport would need to be default-off.
Minimal repro, by using pre-existing failpoints in etcdserver:
FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
# Wait for etcd to start
curl http://127.0.0.1:22381/go.etcd.io/etcd/server/etcdserver/raftBeforeSave -XPUT -d'panic'
./bin/etcdctl put a 1
# Expect put to return OK and etcd to crash
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
./bin/etcdctl get a
# Expect to get an empty response
I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag, so users can return to the previous behavior. If the regression is very big, the backport would need to be default-off.
Please see my comments https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229606410, https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229673875 and https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229676690. In short, I think we should backport the fix to 3.5, and probably 3.4.
FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
curl http://127.0.0.1:22381/go.etcd.io/etcd/server/etcdserver/raftBeforeSave -XPUT -d'panic'
./bin/etcdctl put a 1
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
./bin/etcdctl get a
These steps don't look correct to me. Although you can reproduce this issue easily with them, the repro isn't stable. We need to make sure that both the boltDB and the WAL fail to save the data, but with your steps the boltDB may save the data successfully, or the client might get an error response. The correct steps are:
go get go.etcd.io/gofail/runtime # execute this command for both server and etcdutl
FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
curl http://127.0.0.1:22381/etcdserver/raftBeforeLeaderSend -XPUT -d'sleep(100)'
curl http://127.0.0.1:22381/etcdserver/raftBeforeSave -XPUT -d'panic'
curl http://127.0.0.1:22381/backend/beforeCommit -XPUT -d'sleep(200)'
./etcdctl put k1 v1
# The client will get an "OK" response, and etcd crashes while running this step. Please start etcd again
./etcdctl get k1 ## no data because the data was lost
For anyone's reference: https://github.com/etcd-io/gofail/blob/master/doc/design.md
I think we should fix etcd's RAFT implementation rather than the etcd "apply" layer.
The RAFT protocol is pretty explicit about this:
It seems that etcd's RAFT implementation skips the rules needed for 'committing' an entry... I think the rule intends to say: "The leader may even commit an entry before it has been written to its own disk, if the followers that have written it to their disks form a majority of all members."
So in the case of a 1-member RAFT cluster, the leader needs to write and flush first, before considering the entry to be "committed".
Thanks @ptabor for the feedback. Section 10.2.1 is a performance optimization for multi-member clusters, and I don't think it is the root cause of this issue. The performance optimization is based on the key point that a follower must sync the WAL entries before responding to the leader. Anyway, it's for multi-member clusters.
For the case of a 1-member cluster, I believe the solution "the leader needs to write and flush first, before considering the entry to be committed" definitely works, and several people raised the same solution. But I mentioned the reasons why I did not adopt it in https://github.com/etcd-io/etcd/pull/14394#discussion_r956643329. Let me recap the points:
Anyway, I agree that we should enhance the raft protocol for a one-member cluster, but we'd better do it in the future instead of now, because we have higher-priority things to do. WDYT? @ptabor @serathius @spzala
Thank you @ahrtr. My intuition is that 1. and 2. are premature optimisations.
If the Raft change is an O(10-20) line change and does not show a (more significant) performance degradation, I would prefer that over introducing another 'case' (walNotifyc) on the etcd side. I think it will:
If an application wants to do 'hedging' (apply early but defer the bolt commit, when the probability of commit is high), such an optimisation can be added in the etcd layer on top of a proper raft protocol (although I don't think it's worth it).
The Raft change would need to be config-gated to keep the behavior stable, with the previous behavior as the default in 3.4 & 3.5 (overridden in etcd).
My intuition is that 1. and 2. are premature optimizations.
I just delivered another PR https://github.com/etcd-io/etcd/pull/14400, in which the implementation is greatly simplified. There is about a 2.7% performance downgrade as compared to https://github.com/etcd-io/etcd/pull/14394, so it's about an 8.2% performance downgrade as compared to the main branch.
In summary:

| | https://github.com/etcd-io/etcd/pull/14394 | https://github.com/etcd-io/etcd/pull/14400 |
|---|---|---|
| Good side | Better performance, but not by much (about 2.7% higher than #14400). The client also gets a quicker response when applying fails. | The implementation is much simpler. |
| Bad side | The implementation is a little complicated, but it only adds one more channel (walNotifyc) and about 13 lines of code in server.go, so it seems acceptable. | Performance downgrade, but not too much (about 2.7%). |
@serathius mentioned that he will work with the K8s Scalability folks to validate the change for K8s in https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229925525. So let's wait for the feedback from @serathius and the K8s Scalability folks, and we can make a decision afterwards. The key point is the comparison of the performance results.
It seems that etcd's RAFT implementation skips the rules needed for 'committing' an entry... I think the rule intends to say: "The leader may even commit an entry before it has been written to its own disk, if the followers that have written it to their disks form a majority of all members."
If the Raft change is an O(10-20) line change and does not show a (more significant) performance degradation, I would prefer that over introducing another 'case' (walNotifyc) on the etcd side.
The key point is that etcd's raft implementation follows a minimalistic design philosophy: it delegates both network transport and storage to its users. When raft considers an entry committed, it doesn't mean the entry has already been successfully saved to disk on a majority of members. But this isn't a problem for a multi-member cluster: when the leader broadcasts the commit index to followers in the next ready loop, it must wait for the previous loop to finish, so the WAL must have been successfully persisted on a majority of members' disks, including the leader's own, before any member starts applying the data. It is an issue for a one-member cluster, though, because raft sends identical unstable and committed entries to etcdserver, which saves the WAL entries concurrently with applying the committed entries.
I will think about how to formally fix this: either update the doc to guide users on how to correctly use the raft package (just like what https://github.com/etcd-io/etcd/pull/14394 or https://github.com/etcd-io/etcd/pull/14400 does), or enhance the raft package so that raft will not commit an entry until it's really persisted by etcdserver (WAL) in a one-member cluster, so that users (etcdserver in this case) do not need to worry about this at all. But I think we should only enhance the raft package/protocol in the main branch, and cherry-pick either https://github.com/etcd-io/etcd/pull/14394 or https://github.com/etcd-io/etcd/pull/14400 to 3.5 and probably 3.4.
What do you think? @ptabor @serathius @spzala
I have serious doubts that the issue is limited to single-instance etcd deployments:
it must wait for the previous loop to finish, so eventually the WAL must be successfully persisted on a majority of members' disks, including the leader's own, before any member starts applying the data.
My understanding is that the source of the issue is line 638:
https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L637-L640
The leader just appended entries to the unstable (in-memory) log queue, and notifies the progress tracker that the leader (r.id) instance has persisted (!!!) up to the 'unstable' entry li (last index). The progress tracker is the source of truth for whether a majority of members have persisted a given entry in the WAL. I think the leader's state is inaccurate in the voting process in the multi-instance RAFT case as well. I don't have a repro yet, but I don't see a reason why the following scenario is not possible:
I think #14400 properly works around the problem (in both the single- and multi-instance case). Trying in the background (still on vacation) to find a raft-level fix.
Playing with the RAFT level fix in: https://github.com/etcd-io/etcd/pull/14406/files#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452f
5. The leader crashes before persisting the WAL, but manages to apply or broadcast the hard state (so follower 1 applies it). It seems we can land in a corrupted multi-instance state.
The leader will never successfully apply or broadcast the hard state in this case, because the etcdserver is still in the process of persisting the WAL entries, so it will not help raft perform that task (apply or broadcast the hard state) before it finishes persisting the WAL.
Playing with the RAFT level fix in: https://github.com/etcd-io/etcd/pull/14406/files#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452f
I was thinking exactly the same solution on raft layer. The only concern is worse performance for one-member cluster. The impact on multi-member cluster should be negligible. Let me diff the performance today.
- The leader crashes before persisting the WAL, but manages to apply or broadcast the hard state (so follower 1 applies it). It seems we can land in a corrupted multi-instance state.
The leader will never successfully apply or broadcast the hard state in this case, because the etcdserver is still in the process of persisting the WAL entries, so it will not help raft perform that task (apply or broadcast the hard state) before it finishes persisting the WAL.
Which line prevents that situation ?
I claim that a single "Ready" call can have an already incremented hard state and overlapping Entries & CommittedEntries on multi-instance etcd. In such a case, the current code would:
Playing with the RAFT level fix in: https://github.com/etcd-io/etcd/pull/14406/files#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452f
I was thinking exactly the same solution on raft layer. The only concern is worse performance for one-member cluster. The impact on multi-member cluster should be negligible. Let me diff the performance today.
| | Good side | Bad side |
|---|---|---|
| Solution 1: https://github.com/etcd-io/etcd/pull/14394 | Better performance, about a 5% downgrade as compared to main. The client also gets a quicker response when applying fails. | The implementation is a little complicated, but it only adds one more channel (walNotifyc) and about 13 lines of code in server.go, so it seems acceptable. |
| Solution 2: https://github.com/etcd-io/etcd/pull/14400 | The implementation is much simpler. | Performance downgrade, about 6~7% as compared to main. |
| Solution 3: https://github.com/etcd-io/etcd/pull/14407 | Same as above, but it encapsulates the details in the raft layer, so it can benefit other users of the raft package. It is also a good supplement & enhancement to the raft package/lib for one-member clusters. | Same performance as above. Applications which depend on the raft package need to be updated following the guide. |
| Solution 4: https://github.com/etcd-io/etcd/pull/14411 (based on https://github.com/etcd-io/etcd/pull/14406) | It looks elegant, and is consistent with the raft paper/protocol. | Same performance as above. It updates the raft layer, so it might have some potential risk and needs more testing and review. |

Note that solutions 3 and 4 have the same performance result: both have about a 6~7% downgrade as compared to the main branch for a one-member cluster, and there is no performance impact for multi-member clusters.
###### main branch (baseline)
Summary:
Total: 55.6366 secs.
Slowest: 0.1451 secs.
Fastest: 0.0018 secs.
Average: 0.0555 secs.
Stddev: 0.0307 secs.
Requests/sec: 3594.7538
Response time histogram:
0.0018 [1] |
0.0161 [19974] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0305 [20153] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0448 [47003] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0591 [43090] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0734 [8496] |∎∎∎∎∎∎∎
0.0878 [16161] |∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1021 [25094] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1164 [16493] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1307 [3290] |∎∎
0.1451 [245] |
Latency distribution:
10% in 0.0161 secs.
25% in 0.0344 secs.
50% in 0.0471 secs.
75% in 0.0844 secs.
90% in 0.1021 secs.
95% in 0.1097 secs.
99% in 0.1194 secs.
99.9% in 0.1338 secs.
###### Solution 4
Summary:
Total: 59.9271 secs.
Slowest: 0.1403 secs.
Fastest: 0.0058 secs.
Average: 0.0598 secs.
Stddev: 0.0199 secs.
Requests/sec: 3337.3909
Response time histogram:
0.0058 [1] |
0.0192 [344] |
0.0327 [4603] |∎∎∎
0.0462 [59793] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0596 [50277] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0731 [30341] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0865 [25795] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1000 [25223] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1134 [2902] |∎
0.1269 [473] |
0.1403 [248] |
Latency distribution:
10% in 0.0382 secs.
25% in 0.0426 secs.
50% in 0.0547 secs.
75% in 0.0748 secs.
90% in 0.0895 secs.
95% in 0.0944 secs.
99% in 0.1049 secs.
99.9% in 0.1297 secs.
###### Solution 3
Summary:
Total: 59.3026 secs.
Slowest: 0.1479 secs.
Fastest: 0.0068 secs.
Average: 0.0592 secs.
Stddev: 0.0198 secs.
Requests/sec: 3372.5346
Response time histogram:
0.0068 [1] |
0.0209 [416] |
0.0350 [7403] |∎∎∎
0.0491 [77514] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0632 [39921] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0773 [28729] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.0914 [32529] |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
0.1056 [11067] |∎∎∎∎∎
0.1197 [2127] |∎
0.1338 [232] |
0.1479 [61] |
Latency distribution:
10% in 0.0382 secs.
25% in 0.0420 secs.
50% in 0.0545 secs.
75% in 0.0748 secs.
90% in 0.0890 secs.
95% in 0.0944 secs.
99% in 0.1069 secs.
99.9% in 0.1257 secs.
2. broadcast async: https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L223
This step can only run in the ready loop that follows step 3. Let's walk through a detailed workflow (based on the current main branch) below, assuming a 3-member cluster with one leader and two followers.
https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L638
and broadcast the data to followers,
https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L1076
and persist the WAL entries, https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L237
https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L1259
and tries to broadcast the new commitId to all followers,
https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L1263
Once etcdserver receives the new Ready, it means that the WAL persistence (step 3) must have already finished.
@ptabor Please let me know if you still need clarification.
Eventually, solution 3 and solution 4 are the two best candidates. To recap the points again:
@ptabor Please let me know if you still need clarification.
No, thank you. You are right. Broadcasting goes together with the WAL write from the leader, and it's a barrier for the next raft round.
Note that solutions 3 and 4 have the same performance result; both have about a 6~7% downgrade.
So 1, 3 & 4 have a 6-7% degradation and 2 has a 2.7% degradation? The table says "same as above" for 3.
The applications which depend on the raft package need to be updated following the guide.
If integrated into .Advance(), the application would not need to be changed. But for safety we should make it an opt-in behavior in 3.4 and 3.5 backports.
So 1, 3 & 4 have a 6-7% degradation and 2 has a 2.7% degradation? The table says "same as above" for 3.
Sorry for the confusion. The 2.7% mentioned in the table means that solution 2 has about a 2.7% performance downgrade as compared to solution 1.
I tested all the solutions against main today. Solution 1 has the best performance: about a 5% downgrade as compared to the main branch. Solutions 2/3/4 have the same performance; all of them have about a 6~7% downgrade as compared to the main branch.
If integrated into .Advance(), the application would not need to be changed. But for safety we should make it an opt-in behavior in 3.4 and 3.5 backports.
Personally I don't like the idea of integrating it into .Advance(); please see my comment https://github.com/ahrtr/etcd/commit/7ab0842a9df897878734070ebb5594baef5c56c1#r82690914
We need to make a decision between solution 3 and solution 4. Please note that solution 3 is complete and ready for review, while solution 4 is still in development.
For solution 4, we also need to decide which way to go: follow my commit or @ptabor's draft PR.
Please also see https://github.com/etcd-io/etcd/issues/14370#issuecomment-1232592188
@tbg @bdarnell - Could you please take a look? Your RAFT expertise would be critical here.
Outsider perspective: if you fix this locally for performance reasons, it'd be good to fix the RAFT library too.
Based on all the discussion so far, it seems solution 4 https://github.com/etcd-io/etcd/pull/14411 is almost the final solution. The good point is that it doesn't require any change to the applications which depend on etcd's RAFT implementation.
Although it looks like a big PR, it actually has only 10~20 lines of production code change; all the rest are test code/data changes. I am still working on the test data/code, but early feedback from anyone is appreciated. cc @ptabor @tbg @bdarnell @serathius @spzala thx
Thanks to the community for finally acknowledging that this is an issue. And thanks @hasethuraman for persisting with this and reporting the issue as a good open-source citizen.
Ultimately this problem is caused by ambiguity around what it means to correctly handle a raft.Ready. As it stands, CommittedEntries is populated under the assumption that the Entries field has already been handled, i.e. that the entries are durably in the WAL.
This is ultimately a performance optimization for a single-threaded implementation against raft: you only need to go through the raft handling loop once to get a command committed and applied. It doesn't apply to the multi-node case, since you go through the loop once to send the message to at least one follower and write locally to the WAL, but then you still need a second loop to realize the entry is now committed.
I can't blame the etcd community for getting this wrong in their use of the raft library. Ready is notoriously under-documented and leaves much to the imagination, despite being the crucial touchpoint between the raft library and the implementer.
I once had an update to the comments in progress, which pointed out the exact caveat that prompted this issue: https://github.com/etcd-io/etcd/pull/10861/files#diff-bd82b488417cfb0c169dbf02eb5c4355165028295855fc78bdb72a62eb338c98R124
I regret not pushing this over the finish line, but it's unclear whether it would've prevented the issue at hand, anyway.
It has been a while since I closely engaged with the raft library, but I agree with removing this optimization and, by default, simplifying the Ready contract. Entries and CommittedEntries should be independent, and there shouldn't be an implicit requirement to handle the former before the latter. Most advanced implementations will at some point add concurrency between log writes and entry application, and this issue is likely to be found in many of them.
I'll take a look at #14411 next week, and help find the right way to address this. I hope to also revive #10861.
I submitted a draft https://github.com/etcd-io/etcd/pull/14413 which I think will end up a little cleaner than https://github.com/etcd-io/etcd/pull/14411 but basically follows in its footsteps.
The draft has no chance of passing CI, since I need to fix up all sorts of tests. But it's the diff that most clearly demonstrates the change and its effect on a single-voter unit test.
Thanks @tbg for the feedback. I agree that https://github.com/etcd-io/etcd/pull/14413 is a little cleaner than https://github.com/etcd-io/etcd/pull/14411/commits/d1957fe2beea3c444c4294edc70231be0515155c (the first commit of https://github.com/etcd-io/etcd/pull/14411) and that it keeps everything in the real raft layer, but it seems there are some edge cases (e.g. an empty entry may be appended in some cases) to be resolved.
For anyone's reference, https://github.com/etcd-io/etcd/pull/14407 and https://github.com/etcd-io/etcd/pull/14400 are two simple workarounds.
As mentioned previously multiple times, I still suggest cherry-picking either https://github.com/etcd-io/etcd/pull/14407 or https://github.com/etcd-io/etcd/pull/14400 to release-3.5 and 3.4, and updating the raft layer only in the main branch, for safety. What do you think? cc @ptabor @serathius @spzala @tbg
After second thought, it seems that https://github.com/etcd-io/etcd/pull/14411 is clearer & better than https://github.com/etcd-io/etcd/pull/14413, points:
Since @ptabor @tbg @bdarnell have more expertise on the etcd RAFT implementation, I'd defer to all of you (including @serathius and @spzala as well) on how to move forward.
Again, I think we'd better do it in the main branch only, and cherry-pick https://github.com/etcd-io/etcd/pull/14407 or https://github.com/etcd-io/etcd/pull/14400 to release-3.5 and 3.4.
Please let me know if you have any comments or concerns. cc @dims @liggitt
Thank you @tbg for looking and providing the PR.
I wouldn't proceed with #14407 (Third solution), as it complicates the raft contract (introduces 'MustSaveEntriesBeforeApply' in the Ready state).
I'm on the edge with #14400. I think it's correct and safe. What I don't like is either of: a) keeping #14400 on the main branch, as it carries unnecessary complexity alongside the RAFT fix; b) diverging far between 3.4/3.5 and main by 'evaluating' 2 different fixes. So it comes down to the level of confidence in the RAFT fix. Tobias's opinion is crucial here.
Thank you @ahrtr on the tremendous work on all the approaches and tests.
Ok, let me try to summarize the current discussion. Hopefully it will be useful for external observers and make the decision easier.
The whole API durability issue is caused by a difference in raft library behavior between multi-node and single-node scenarios. In a multi-node cluster, raft processes entries in 2 steps (first append to the log, then apply to the database), whereas in single-node clusters those steps are combined. This leads to a hidden assumption (not documented, as #10861 was not merged) that appending entries should be done before applying the committed ones, at least when there is overlap between the entries to append and the entries to apply.
Etcd was unaware of this assumption and implemented an optimization to start the apply first (asynchronously) so users get results faster. This however causes a problem in single-node clusters, where entries are processed in one step (they are present in both the to-append and to-apply lists). If the apply executes before the append and etcd crashes, it will have reported success to the user without persisting the entries to the WAL.
There are two ways to fix the problem:
- On the etcd side, detect that there is overlap between the lists and postpone applying the entries. This is https://github.com/etcd-io/etcd/pull/14407
- On the raft side, separate the append and apply steps in single-node clusters. This is https://github.com/etcd-io/etcd/pull/14413
I'm currently on the side of fixing the raft library (https://github.com/etcd-io/etcd/pull/14413) for both main and the older branches (assuming that the performance regression is acceptable).
There are two ways to fix the problem:
- On the etcd side, detect that there is overlap between the lists and postpone applying the entries. This is https://github.com/etcd-io/etcd/pull/14407
- On the raft side, separate the append and apply steps in single-node clusters. This is https://github.com/etcd-io/etcd/pull/14413
I'm currently on the side of fixing raft library https://github.com/etcd-io/etcd/pull/14413 for both main and older branches (assuming that performance regression is acceptable).
This isn't accurate. Based on all the discussion so far, the action plan would be one of the following two approaches:
1. Fix both main and release-3.4/5, using either https://github.com/etcd-io/etcd/pull/14411 (already approved by one maintainer, @ptabor) or https://github.com/etcd-io/etcd/pull/14413. Please see my previous comment as well: https://github.com/etcd-io/etcd/issues/14370#issuecomment-1235091312.
2. Fix main only, with the same solutions as above, and fix release-3.4/5 using https://github.com/etcd-io/etcd/pull/14400 (already approved by one maintainer, @ptabor).
Let's wait for @tbg and @bdarnell's feedback, which is crucial here, just as @ptabor pointed out.
In case anyone wants a simple summary of this issue, FYI: https://github.com/ahrtr/etcd-issues/tree/master/issues/14370
@ahrtr responding to https://github.com/etcd-io/etcd/issues/14370#issuecomment-1235091312 above.
After second thought, it seems that https://github.com/etcd-io/etcd/pull/14411 is clearer & better than https://github.com/etcd-io/etcd/pull/14413, points:
#14413 fixes the problem at the level of (*raft).advance, which ensures that any implementation will get the right behavior. For example, CockroachDB uses *raft.RawNode, which wraps a *raft. #14411 fixes the problem for users of *node (which internally wraps a *RawNode), but this leaves direct users of *RawNode dangling.
I am obviously biased because of my affiliation with CockroachDB, but I consider *RawNode the true external interface of the library. *node is certainly helpful for some (it's *RawNode with a bit of goroutine wrapping), but we had better fix *RawNode if we can, and we definitely can, in the exact same way. That will allow us to have everything work the same way, and to document the desired semantics on raft.Ready without having to reference the ways (RawNode, node) that exist for consuming Ready.
The leader raft node updates each follower's pr.Match on receiving the pb.MsgAppResp
I can see that injecting an MsgAppResp automatically can be preferred over directly expanding to the tracker update (which I did in #14413 at the time of writing). That's a change I can make.
I don't follow the argument that seems to suggest that doing work in Advance
amounts to undesirable special casing.
There is always going to be some special casing for the leader node and that special casing is useful.
Imagine we completely rejected any special casing for the leader. The leader would have to emit an MsgApp to itself, which we would then step back into the leader. In the next cycle we would see the entry in Entries. The cycle after that would see an MsgAppResp emitted. The cycle after that would step it back into the leader, and in the subsequent cycle we'd finally see the CommittedEntries field populated.
This wouldn't be particularly performant, and it would also burden the user with new footguns, such as having to make sure that their transports allow self-directed messages, etc.
Explicitly asking the user to pass a parameter back in, etc, seems to be overdoing it. Advance()
signals to raft that everything it has asked for was done, which includes appending entries. So it strikes me as the idiomatic place to self-deliver an MsgAppResp
, freeing the user from having to do any kind of special tracking for local proposals, while also avoiding unnecessary work.
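The Advance-time self-acknowledgement described above can be sketched with a toy model. This is not the real etcd raft API — the entry, ready, node, step, and advance names below are simplified stand-ins invented for illustration — but it shows the shape of the idea: an entry only counts toward the leader's own match index (and hence toward commit) once the caller has confirmed, via advance, that the entry was durably appended.

```go
package main

import "fmt"

// Toy stand-ins for raft concepts; not the real library types.
type entry struct{ index uint64 }

type message struct {
	typ   string // e.g. "MsgAppResp"
	from  uint64
	index uint64
}

type ready struct {
	entries          []entry // must be appended to stable storage (WAL)
	committedEntries []entry // safe to apply
}

type node struct {
	id        uint64
	log       []entry
	match     map[uint64]uint64 // per-peer durably-acknowledged index
	committed uint64
	appended  uint64 // last index already handed out via ready.entries
}

func (n *node) propose() {
	n.log = append(n.log, entry{index: uint64(len(n.log) + 1)})
}

// readyView hands out unappended entries, and marks entries committed only
// once acknowledged (here: by the leader itself; in a real cluster, a quorum).
func (n *node) readyView() ready {
	var rd ready
	if last := uint64(len(n.log)); last > n.appended {
		rd.entries = n.log[n.appended:last]
	}
	if n.match[n.id] > n.committed {
		rd.committedEntries = n.log[n.committed:n.match[n.id]]
	}
	return rd
}

func (n *node) step(m message) {
	if m.typ == "MsgAppResp" && m.index > n.match[m.from] {
		n.match[m.from] = m.index
	}
}

// advance is the point of the proposal: only after the caller confirms the
// Ready was processed (entries durably appended) does the node step a
// self-addressed MsgAppResp, instead of assuming durability earlier.
func (n *node) advance(rd ready) {
	if len(rd.entries) > 0 {
		n.appended = rd.entries[len(rd.entries)-1].index
		n.step(message{typ: "MsgAppResp", from: n.id, index: n.appended})
	}
	if len(rd.committedEntries) > 0 {
		n.committed = rd.committedEntries[len(rd.committedEntries)-1].index
	}
}

func main() {
	n := &node{id: 1, match: map[uint64]uint64{}}
	n.propose()

	rd := n.readyView()
	fmt.Println("to append:", len(rd.entries), "committed:", len(rd.committedEntries))
	n.advance(rd) // caller confirms the entry hit stable storage; node self-acks

	rd = n.readyView()
	fmt.Println("to append:", len(rd.entries), "committed:", len(rd.committedEntries))
}
```

The design choice mirrored here is exactly the one argued for in the thread: the user never tracks local proposals specially, and no self-directed message ever crosses the transport.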
I think https://github.com/etcd-io/etcd/pull/14411 is safer and clearer, because it has the least impact on the raft protocol.
I'm not sure I understand; first of all, it seems like the exact same fix but at the level of *node vs *RawNode, and isn't the point to fix it in a way that lets all consumers of the library avoid the footgun? If that is not the goal, we could fix it in etcdserver, put a comment on raft.Ready, and leave raft completely as is.
Hey @tbg, thanks for the response. It's a good argument that both Node and RawNode are part of the raft public interface.
Do I conclude correctly that your suggestion would be to prefer https://github.com/etcd-io/etcd/pull/14413 over https://github.com/etcd-io/etcd/pull/14411 ?
What would be your suggestion about picking a fix for backport? I think the discussion is to either:
1. backport the raft-level fix to release-3.4 and release-3.5 as well, or
2. fix raft on main only, and use the etcdserver fix (e.g. https://github.com/etcd-io/etcd/pull/14400) for release-3.4 and release-3.5.

Do I conclude correctly that your suggestion would be to prefer https://github.com/etcd-io/etcd/pull/14413 over https://github.com/etcd-io/etcd/pull/14411 ?
Yes, I think #14413 with the .Step
approach is the right fix. (And then of course work out all of the test failures, etc). There is a bit in #14411 that I don't understand, such as special casing of the empty entry, but I think if that is really necessary it will carry over as well.
What would be your suggestion about picking a fix for backport? I think the discussion is to either:
I would merge solution 2 (#14400) immediately on all branches (main
, release-*
) since it's clearly correct, has a test, and solves the problem at hand immediately. This avoids having to rush in a raft-level fix.
Once a raft-level fix has landed on main, #14400 can then be reverted there.
I observed the possibility of data loss, and I would like the community to comment / correct me if I am wrong.
Before explaining that, I would like to walk through the happy path when a user does a PUT <key, value>. I have tried to include only the necessary steps to keep the focus on this issue, and I considered a single etcd instance.
====================================================================================
1. User calls etcdctl PUT k v. ---------- api thread ----------
2. It lands in v3_server.go::put with the message about k,v.
3. The call delegates through a series of functions and enters v3_server.go::processInternalRaftRequestOnce.
4. It registers for a signal with the wait utility against this key id.
5. The call delegates further through a series of functions and enters raft/node.go::stepWithWaitOption(..message..).
6. It wraps this message with a msgResult channel, updates its result channel, and then sends the message to the propc channel.
7. After sending, it waits on msgResult.channel. ---------- api thread waiting ----------
8. On seeing a message in the propc channel, raft/node.go::run() wakes up, and a sequence of calls adds message.Entries to raftLog.
9. It notifies msgResult.channel. ---------- api thread wakes ----------
10. In the next iteration of raft/node.go::run(), it gets the entry from raftLog.
11. It adds the entry to readyc.
12. etcdserver/raft.go::start wakes up on seeing this entry in readyc and adds the entry to the applyc channel.
13. It then synchronously writes to the WAL log. ---------------------> wal log
14. etcdserver/server.go wakes up on seeing the entry in the applyc channel (added in step 12).
15. From step 14, the call goes through a series of calls and lands in server.go::applyEntryNormal.
16. applyEntryNormal calls applyV3.apply.
17. applyV3.apply eventually puts the KV into the mvcc kvstore txn kvindex.
18. applyEntryNormal now sends the signal for this key, which wakes the api thread waiting in step 7. ---------- api thread wakes ----------
19. The batcher flushes the entries added to the kvstore txn kvindex to the database file (this can also happen before step 18, based on its timer).
Here, if the thread in step 13 is pre-empted and rescheduled by the underlying operating system only after step 18 completes, and there is a power failure at the end of step 18 (after the user has already been acknowledged with OK), then the kv is written neither to the WAL nor to the database file.
I think this is not seen today because the window is small: the server has to fail immediately after step 18, and immediately after step 12 the underlying OS must have pre-empted etcdserver/raft.go::start and moved it to the end of the runnable queue. Given these multiple conditions, it appears that we don't observe data loss in practice.
But it appears from the code that it is possible. To simulate it, I added a sleep after step 12 (and also an exit) and after step 19. I was able to see OK, but the data is in neither the WAL nor the db.
If I am not correct, my apologies; please also correct my understanding.
Before reproducing, please make these changes first:
2. Do the code changes in tx.go
Now follow the steps to repro:

//1. Start etcd server with changes
//2. Add a key value. Allow etcdserver to acknowledge and exit immediately (with just sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK
//3. Remove this control flag file and restart the etcd server
$ rm /tmp/exitnow
//4. Check if key present
$ ./bin/etcdctl get /k --prefix
$
// We can see no key-value