etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Durability API guarantee broken in single node cluster #14370

Closed hasethuraman closed 2 years ago

hasethuraman commented 2 years ago

I observed the possibility of data loss, and I would like the community to comment or correct me if I am wrong.

Before explaining that, I would like to walk through the happy path when a user does a PUT <key, value>. I have tried to include only the steps necessary to focus on this issue, and I have considered a single etcd instance.

==================================================================================== ----------api thread --------------

1. User calls etcdctl PUT k v.

2. It lands in v3_server.go::put with the message about k,v.

3. The call delegates through a series of function calls and enters v3_server.go::processInternalRaftRequestOnce.

4. It registers for a signal with the wait utility against this key id.

5. The call delegates further through a series of function calls and enters raft/node.go::stepWithWaitOption(..message..).

6. It wraps this message in a msgResult channel and updates its result channel; then it sends the message to the propc channel.

7. After sending, it waits on msgResult.channel ----------api thread waiting --------------

8. On seeing a message in the propc channel, raft/node.go::run() wakes up and a sequence of calls adds the message.Entries to raftLog.

9. It notifies the msgResult.channel ----------api thread wakes--------------

10. Upon seeing the notification on msgResult.channel, the api thread wakes, returns down the stack to v3_server.go::processInternalRaftRequestOnce, and waits for the signal that it registered at step 4 ----------api thread waiting --------------

11. In the next iteration of raft/node.go::run(), it gets the entry from raftLog and adds it to readyc.

12. etcdserver/raft.go::start wakes up on seeing this entry in readyc and adds the entry to the applyc channel.

13. It then synchronously writes the entry to the wal log ---------------------> wal log

14. etcdserver/server.go wakes up on seeing the entry in the applyc channel (added in step 12).

15. From step 14, the call goes through a series of calls and lands in server.go::applyEntryNormal.

16. applyEntryNormal calls applyV3.apply, which eventually puts the KV into the mvcc kvstore txn kvindex.

17. applyEntryNormal now sends the signal for this key, which basically wakes up the api thread that has been waiting since step 7 ----------api thread wakes--------------

18. The user thread wakes here and sends back the acknowledgement ----------user sees ok--------------

19. The batcher flushes the entries added to the kvstore txn kvindex to the database file. (This can also happen before step 18, based on its timer.)

Here, if the thread performing step 13 is pre-empted by the underlying operating system and only rescheduled after step 18 has completed, and there is a power failure right at the end of step 18, then even though the user has seen OK, the kv is written neither to the wal nor to the database file.

I think this is not seen today because the window is small: the server has to go down immediately after step 18, and immediately after step 12 the underlying OS must have pre-empted etcdserver/raft.go::start and moved it to the end of the runnable queue. Given these multiple conditions, it appears that we don't see data loss in practice.

But it appears from the code that it is possible. To simulate it, I added a sleep after step 12 (also added an exit) and at step 19. I was able to see OK, but the data is in neither the wal nor the db.

If I am not correct, my apologies; please correct my understanding.
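
To make the window concrete, below is a minimal, self-contained Go sketch of the race described above (it only models the scheduling, with hypothetical channel names; it is not etcd code): the apply-and-acknowledge path can finish before the WAL fsync, so a crash in between loses an acknowledged write.

package main

import (
	"fmt"
	"time"
)

func main() {
	ackc := make(chan struct{})    // "client got OK" (step 18)
	walDone := make(chan struct{}) // "entry fsynced to the WAL" (step 13)

	// Stand-in for the WAL-writing path (step 13): the sleep models the OS
	// pre-empting it before the fsync happens.
	go func() {
		time.Sleep(50 * time.Millisecond) // simulated pre-emption / injected sleep
		close(walDone)                    // the fsync would happen here
	}()

	// Stand-in for the apply path (steps 14-17): it applies the entry to the
	// in-memory store and signals the waiting API thread without waiting for
	// the WAL fsync.
	go func() {
		close(ackc)
	}()

	<-ackc
	fmt.Println("client saw OK")

	select {
	case <-walDone:
		fmt.Println("entry is durable")
	case <-time.After(10 * time.Millisecond):
		// A power failure in this window loses the acknowledged write.
		fmt.Println("crash here => acknowledged write is lost")
	}
}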

Before reproducing, please make the following changes:

  1. Apply the code changes in raft.go (screenshot).

  2. Apply the code changes in tx.go (screenshot).

  3. Rebuild the etcd server.

Now follow these steps to reproduce:

//1. Start the etcd server with the changes

//2. Add a key-value. Allow etcdserver to acknowledge and exit immediately (with just a sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK

//3. Remove the control-flag file and restart the etcd server
$ rm /tmp/exitnow

//4. Check if the key is present
$ ./bin/etcdctl get /k --prefix
$

// We can see no key-value

hasethuraman commented 2 years ago

It looks like the explanation and steps above are tedious to follow. Instead, please use the following.

//1. Incorporate this code diff and run etcd-server https://github.com/etcd-io/bbolt/compare/v1.3.6...hasethuraman:bbolt:v1.3.6-test?expand=1

https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:release-3.5?expand=1

//2. Add a key-value. Allow etcdserver to acknowledge and exit immediately (with just a sleep and exit to simulate the explanation)
$ touch /tmp/exitnow; ./bin/etcdctl put /k1 v1
OK

//3. Remove the control-flag file and restart the etcd server
$ rm /tmp/exitnow

//4. Check if the key is present
$ ./bin/etcdctl get /k --prefix
$

// We can see no key-value

ahrtr commented 2 years ago

It seems that you raised a duplicate of https://github.com/etcd-io/etcd/issues/14364.

I have already answered the issue in https://github.com/etcd-io/etcd/issues/14364#issuecomment-1223391071 and https://github.com/etcd-io/etcd/issues/14364#issuecomment-1222024469

hasethuraman commented 2 years ago

@ahrtr it is not about HA. I am saying there appears to be data loss. If we acknowledge OK to the user while there is still a possibility of data loss after acknowledging, shouldn't we synchronously write to the WAL before acknowledging?

lavacat commented 2 years ago

@hasethuraman I haven't tried to repro, but I agree with @ahrtr; you should try a 3-node cluster. The Raft protocol doesn't really work on 1 node. I bet you won't be able to reproduce this.

etcd allows launching 1-node clusters only to make it easy to experiment with the API, not for any production use.

hasethuraman commented 2 years ago

Thanks @lavacat. I thought about this, but isn't the raft message RTT just masking this observation? I am posting a screenshot here that also shows what I think: basically, the WAL write can go first, before the database/raft apply, for strong consistency.

(screenshot)

serathius commented 2 years ago

Hey @hasethuraman, can you please not use screenshots? It's much easier to just link the code: https://github.com/etcd-io/etcd/blob/4f0e92d94ceacdd0af26e7268764a5bce7b0a3eb/server/etcdserver/raft.go#L212-L239

serathius commented 2 years ago

Thanks @lavacat. I thought about this, but isn't the raft message RTT just masking this observation? I am posting a screenshot here that also shows what I think: basically, the WAL write can go first, before the database/raft apply, for strong consistency.

(screenshot)

I don't think you correctly identified the order of operations. Please remember that committing to the db is done after the HardState and the WAL entries are added to the WAL. So the invocations you highlighted (case r.applyc <- ap: and r.storage.Save) are not handling the same entries.

EDIT: In a single-node cluster rd.Entries == rd.CommittedEntries, which is exactly the problem that causes the durability issue.
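
For context, here is a hedged sketch of the canonical Ready-handling loop from the raft package documentation (module path as in release-3.5; the three callbacks are placeholders, not actual etcd identifiers). The comment marks where the single-voter overlap matters:

package example

import (
	"context"

	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// readyLoop is a simplified single-threaded Ready handler, adapted from the
// raft package documentation. saveToStorage, send and apply stand in for the
// application's WAL, transport and state machine.
func readyLoop(
	ctx context.Context,
	n raft.Node,
	saveToStorage func(hs raftpb.HardState, ents []raftpb.Entry, snap raftpb.Snapshot),
	send func(msgs []raftpb.Message),
	apply func(ent raftpb.Entry),
) {
	for {
		select {
		case rd := <-n.Ready():
			// In a multi-voter cluster an entry normally shows up in rd.Entries in
			// one Ready and only later, after a quorum has persisted it, in the
			// rd.CommittedEntries of a subsequent Ready. In a single-voter cluster
			// the same entry can appear in both fields of the same Ready, so the
			// ordering below (persist before apply) is what keeps acknowledged
			// writes durable.
			saveToStorage(rd.HardState, rd.Entries, rd.Snapshot) // must include an fsync
			send(rd.Messages)
			for _, ent := range rd.CommittedEntries {
				apply(ent)
			}
			n.Advance()
		case <-ctx.Done():
			return
		}
	}
}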

ahrtr commented 2 years ago

I agree that it's a little confusing that etcd returns success/OK but the data is actually lost, even though there is only one member in the cluster.

In theory, we could make everything a synchronous call and not respond to the client until everything (including boltDB and the WAL) is successfully persisted. Obviously that would cause a huge reduction in performance, and such a design doesn't make sense at all.

To prevent data loss, etcd has the WAL, which is similar to the redo log of MySQL. To protect against one member going completely down for whatever reason (e.g. disk corruption), we recommend setting up a cluster with at least 3 members. There is no perfect solution, but this is a good solution for now.

But you intentionally fail both the WAL persistence and the boltDB commit, and with only one member. So it's expected behavior by design.

serathius commented 2 years ago

I'm just flabbergasted by this conclusion. It means that a single-node cluster doesn't provide the durability guarantee. The documentation about API guarantees does not state that anywhere: https://etcd.io/docs/v3.5/learning/api_guarantees/#durability

ahrtr commented 2 years ago

I will take care of this, either updating the doc or enhancing the existing etcdserver/raft workflow. Please note that it can only happen in a cluster with only one member.

smarterclayton commented 2 years ago

Kube was designed with the assumption that PUT was durable and etcd was crash consistent for accepted writes, regardless of quorum size. Maybe @xiang90 or @philips could weigh in on whether this was intentional - it is certainly not expected and I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.

liggitt commented 2 years ago

I'd say my read of the ecosystem over the last 10 years has been that all accepted writes should be crash safe regardless of cluster size.

Agreed. The only discussion around relaxing single node safety guarantees I was aware of was https://github.com/etcd-io/etcd/issues/11930, and that reinforced my expectation that even single-node etcd servers default to being crash safe w.r.t. successfully persisting to disk prior to treating a write request as a success, and that overriding that default requires the admin explicitly opting into something unsafe (--unsafe-no-fsync).

ahrtr commented 2 years ago

I will deliver a PR to fix this for clusters with only one member.

hasethuraman commented 2 years ago

@ahrtr Since I did this locally and it worked as expected, I thought of sharing it. Please let me know if the fix you are planning is going to be different.

https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:etcd:14370?expand=1#diff-54bdb7a5ed2d92f64598fb472372562ff64f8417e63d7ac672eaa485704cea9f

ahrtr commented 2 years ago

https://github.com/etcd-io/etcd/compare/release-3.5...hasethuraman:etcd:14370?expand=1#diff-54bdb7a5ed2d92f64598fb472372562ff64f8417e63d7ac672eaa485704cea9f

I don't think this change is the expected one, because it will definitely cause a big reduction in performance.

ahrtr commented 2 years ago

I just delivered a PR https://github.com/etcd-io/etcd/pull/14394 to fix this issue.

Please let me know whether you can still reproduce the issue with the PR. cc @hasethuraman

serathius commented 2 years ago

The question is how we want to roll out this fix. It's not a breaking change, as it restores etcd durability, which matches user expectations. However, it is expected to come with a performance regression. For v3.6 we should make single etcd instances durable, but we should avoid the backport being too disruptive.

I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag so users can return to the previous behavior. If the regression is very big, the backport would need to be off by default.

serathius commented 2 years ago

Minimal repro, by using pre-existing failpoints in etcdserver:

FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd & 
# Wait for etcd to start
curl http://127.0.0.1:22381/go.etcd.io/etcd/server/etcdserver/raftBeforeSave -XPUT -d'panic'
./bin/etcdctl put a 1
# Expect put to return OK and etcd to crash
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd & 
./bin/etcdctl get a
# Expect to get empty response

ahrtr commented 2 years ago

I think it's crucial to do benchmarking to confirm how impactful the change is. If the impact is small, I would backport it as is. If it could disrupt larger clusters, I would want to leave an escape-hatch flag so users can return to the previous behavior. If the regression is very big, the backport would need to be off by default.

Please see my comments https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229606410, https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229673875 and https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229676690. In short, I think we should backport the fix to 3.5, and probably 3.4.

FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
curl http://127.0.0.1:22381/go.etcd.io/etcd/server/etcdserver/raftBeforeSave -XPUT -d'panic'
./bin/etcdctl put a 1
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &
./bin/etcdctl get a

These steps don't look correct to me. Although you can reproduce this issue easily using them, it isn't reliable: we need to make sure that both boltDB and the WAL fail to save the data, but with your steps boltDB may save the data successfully, or the client might get an error response. The correct steps are:

go get go.etcd.io/gofail/runtime   # execute this command for both server and etcdutl
FAILPOINTS=enable make
GOFAIL_HTTP="127.0.0.1:22381" ./bin/etcd &

curl http://127.0.0.1:22381/etcdserver/raftBeforeLeaderSend -XPUT -d'sleep(100)'
curl http://127.0.0.1:22381/etcdserver/raftBeforeSave -XPUT -d'panic'
curl http://127.0.0.1:22381/backend/beforeCommit -XPUT -d'sleep(200)'
./etcdctl  put k1 v1   

# The client will get an "OK" response, and the etcd crashes when running this step. please start the etcd again

./etcdctl  get k1   ## no data because the data was lost

For anyone's reference: https://github.com/etcd-io/gofail/blob/master/doc/design.md

ptabor commented 2 years ago

I think we should fix etcd's RAFT implementation rather than the etcd "apply" layer.

The RAFT protocol is pretty explicit about this (see section 10.2.1 of the paper).

It seems that etcd's RAFT implementation skips the rules needed for 'committing' an entry... I think the rule intends to say: "The leader may even commit an entry before it has been written to its own disk, if the followers that have written it to their disks form a majority of all members."

So in the case of 1-member RAFT, the leader needs to write and flush first - before considering the entry to be "committed".
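
As a tiny illustration of that rule (Go pseudocode, not the raft package's actual progress tracker; persistedIndex is a hypothetical map of the last durably written index per voter), the leader may only count itself for entries it has actually flushed to disk:

package example

// maybeCommit sketches the commit rule quoted above: an index may be treated
// as committed once a quorum of voters has the entry durably on disk. Counting
// an entry that is still only in the leader's in-memory unstable log - the
// single-voter shortcut - is exactly the hazard discussed in this issue.
func maybeCommit(persistedIndex map[uint64]uint64, voters []uint64, index uint64) bool {
	acked := 0
	for _, id := range voters {
		if persistedIndex[id] >= index {
			acked++
		}
	}
	return acked > len(voters)/2
}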

ahrtr commented 2 years ago

Thanks @ptabor for the feedback. Section 10.2.1 is a performance optimization for multi-member clusters, and I don't think it is the root cause of this issue. The optimization relies on the key point that a follower must sync the WAL entries before responding to the leader. Anyway, it's for multi-member clusters.

For the case of a 1-member cluster, I believe the solution "the leader needs to write and flush first - before considering the entry to be committed" definitely works, and several people have raised the same solution. But I explained why I did not adopt it in https://github.com/etcd-io/etcd/pull/14394#discussion_r956643329. Let me recap the points:

  1. Saving the WAL entries in parallel with committing & applying them gives some performance improvement, although it might not be a big one;
  2. When etcdserver fails to apply the entries for whatever reason, it can respond to the client as soon as possible, with no need to wait for the WAL sync at all in this case.
  3. The raft package is relatively stable compared to etcdserver, so we are always cautious about enhancing/refactoring the raft package. Raft handles the commit logic in the same way regardless of whether it's a one-member or a multi-member cluster; for a one-member cluster it just commits the proposal immediately because it doesn't need to wait for confirmation from itself. It might be OK for now.

Anyway, I agree that we should enhance the raft protocol for the one-member cluster case, but we'd better do it in the future instead of now, because we have higher-priority things to do. WDYT? @ptabor @serathius @spzala

ptabor commented 2 years ago

Thank you @ahrtr . My intuition is that 1. and 2. are premature optimisations.

  1. Committing/flushing bolt is the most expensive operation, and in both cases we are not doing that on the synchronous path. From the in-memory apply I would expect insignificant overhead.
  2. That's interesting. This would apply if etcd is used with many transactions with failing preconditions... this used to be more of a case before the introduction of leases... such a usage model is expensive due to the need to write (but not flush) the WAL log for all such (practically read-only) entries. I would not optimize for patterns that should rather be avoided.
  3. TBH I failed to find where the exception happens in the RAFT implementation:

If the Raft change is an O(10-20)-line change and it does not show the (more significant) performance degradation, I would prefer that over introducing another 'case' (walNotifyc) on the etcd side. I think it will:

If the application wants to do 'hedging' (apply early but do not commit bolt when the probability of commit is high), such an optimisation can be added in the etcd layer on top of a proper raft protocol (although I don't think it's worth it).

The Raft change would need to be config-gated to keep the behavior stable - with the previous behavior as the default in 3.4 & 3.5 (overridden in etcd).

ahrtr commented 2 years ago

My intuition is that 1. and 2. are premature optimizations.

I just delivered another PR, https://github.com/etcd-io/etcd/pull/14400, in which the implementation is greatly simplified. There is about a 2.7% performance downgrade compared to https://github.com/etcd-io/etcd/pull/14394, so it's about an 8.2% performance downgrade compared to the main branch.

In summary:

|  | https://github.com/etcd-io/etcd/pull/14394 | https://github.com/etcd-io/etcd/pull/14400 |
| --- | --- | --- |
| Good side | Better performance, but not by much (2.7% higher than 14400). The client can also get a quicker response when applying fails. | The implementation is much simpler. |
| Bad side | The implementation is a little complicated, but it only adds one more channel (walNotifyc) and about 13 lines of code in server.go, so it seems acceptable. | Performance downgrade, but not too much (2.7%). |

@serathius mentioned that he will work with the K8s Scalability folks to validate the change for K8s in https://github.com/etcd-io/etcd/pull/14394#issuecomment-1229925525. So let's wait for the feedback from @serathius and the K8s Scalability folks, and we can make a decision afterwards. The key point is the comparison of the performance results.


Seems that etcd's RAFT implementation skips the rules needed for 'committing' an entry... I think the rule intends to say: "The leader may even commit an entry before it has been written to its own disk, if the followers that have written it to their disks form a majority of all members."

If the Raft change is an O(10-20)-line change and it does not show the (more significant) performance degradation, I would prefer that over introducing another 'case' (walNotifyc) on the etcd side.

The key point is that etcd's raft implementation follows a minimalistic design philosophy, and it delegates both network transport and storage to its users. When the raft implementation considers an entry committed, it doesn't mean that the entry has already been successfully saved to disk on a majority of members. But that isn't a problem for a multi-member cluster: when the leader broadcasts the commitId to followers in the next Ready loop, it must wait for the previous loop to finish, so eventually the WAL must have been successfully persisted on a majority of members' disks, including the leader's own, before any member starts applying the data. But it is an issue for a one-member cluster, because raft sends identical unstable and committed entries to etcdserver, which saves the WAL entries concurrently with applying the committed entries.

I will think about how to formally fix this: either update the doc to guide users on how to correctly use the raft package (just like what https://github.com/etcd-io/etcd/pull/14394 or https://github.com/etcd-io/etcd/pull/14400 does), or enhance the raft package so that, for a one-member cluster, raft will not commit an entry until it has really been persisted (to the WAL) by etcdserver, so that users (etcdserver in this case) do not need to worry about this at all. But I think we should only enhance the raft package/protocol in the main branch, and cherry-pick either https://github.com/etcd-io/etcd/pull/14394 or https://github.com/etcd-io/etcd/pull/14400 to 3.5 and probably 3.4.

What do you think? @ptabor @serathius @spzala

ptabor commented 2 years ago

I have serious doubts that the issue is limited to single-instance etcd deployments:

it must wait for the previous loop to finish, so eventually the WAL must have been successfully persisted on a majority of members' disks, including the leader's own, before any member starts applying the data.

My understanding is that the source of the issue is line 638:

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L637-L640

The leader has just appended the entries to the unstable (in-memory) log queue, and it notifies the progress tracker that the leader (r.id) instance has persisted (!!!) up to the 'unstable' entry li (last index). The progress tracker is the source of truth for whether a majority of members has persisted a given entry in the WAL. I think that the leader's state is inaccurate in the voting process in the multi-instance RAFT case as well. I don't have a repro yet, but I don't see a reason why the following scenario is not possible:

  1. The client sends a PUT.
  2. The leader replicates (successfully) the put to 1 instance. The other instance fails to accept the replication message.
  3. The leader receives confirmation of the WAL write from the follower.
  4. The leader assumes the entry is majority-confirmed (leader + 1 instance) and moves the commit index.
  5. The leader crashes before persisting its WAL, but manages to apply or broadcast the hard state (so follower 1 applies). It seems we can land in a corrupted multi-instance state.

I think #14400 properly works around the problem (in both the single- and multi-instance case). I am trying in the background (still on vacation) to find a raft-level fix.

ptabor commented 2 years ago

Playing with the RAFT level fix in: https://github.com/etcd-io/etcd/pull/14406/files#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452f

ahrtr commented 2 years ago

5. The leader crashes before persisting its WAL, but manages to apply or broadcast the hard state (so follower 1 applies). It seems we can land in a corrupted multi-instance state.

The leader will never successfully apply or broadcast the hard state in this case, because etcdserver is still in the process of persisting the WAL entries, so it will not help raft to do that task (apply or broadcast the hard state) before it finishes the WAL persisting.

Playing with the RAFT level fix in: https://github.com/etcd-io/etcd/pull/14406/files#diff-b9adbc46e4a317ffbb3d11a66c38d6b9af41a09170d77d87efbd96d115da452f

I was thinking of exactly the same solution at the raft layer. The only concern is worse performance for a one-member cluster; the impact on a multi-member cluster should be negligible. Let me diff the performance today.

ptabor commented 2 years ago

  5. The leader crashes before persisting its WAL, but manages to apply or broadcast the hard state (so follower 1 applies). It seems we can land in a corrupted multi-instance state.

The leader will never successfully apply or broadcast the hard state in this case, because etcdserver is still in the process of persisting the WAL entries, so it will not help raft to do that task (apply or broadcast the hard state) before it finishes the WAL persisting.

Which line prevents that situation?

I claim that a single "Ready" call can have an already-incremented hard state and overlapping Entries & CommittedEntries on a multi-instance etcd. In such a case, the current code would:

  1. start applying asynchronously: https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L213
  2. broadcast async: https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L223
  3. write to WAL (sync - but after): https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L237


ahrtr commented 2 years ago
| Solution | Good side | Bad side |
| --- | --- | --- |
| Solution 1: https://github.com/etcd-io/etcd/pull/14394 | Better performance, about 5% downgrade as compared to main. The client can also get a quicker response when applying fails. | The implementation is a little complicated, but it only adds one more channel (walNotifyc) and about 13 lines of code in server.go, so it seems acceptable. |
| Solution 2: https://github.com/etcd-io/etcd/pull/14400 | The implementation is much simpler. | Performance downgrade, about 6~7% as compared to main. |
| Solution 3: https://github.com/etcd-io/etcd/pull/14407 | Same as above, but it encapsulates the details in the raft layer, so it can benefit other users of the raft package. It is also a good supplement & enhancement to the raft package/lib for one-member clusters. | Same performance as above. The applications which depend on the raft package need to be updated following the guide. |
| Solution 4: https://github.com/etcd-io/etcd/pull/14411 (based on https://github.com/etcd-io/etcd/pull/14406) | It looks elegant, and is consistent with the raft paper/protocol. | Same performance as above. It updates the raft layer, so it might have some potential risk and needs more testing and review. |

Note that solutions 3 and 4 have the same performance result: both have about a 6~7% downgrade compared to the main branch for a one-member cluster, and there is no performance impact at all for a multi-member cluster.

###### Solution main
Summary:
  Total:    55.6366 secs.
  Slowest:  0.1451 secs.
  Fastest:  0.0018 secs.
  Average:  0.0555 secs.
  Stddev:   0.0307 secs.
  Requests/sec: 3594.7538

Response time histogram:
  0.0018 [1]    |
  0.0161 [19974]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0305 [20153]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0448 [47003]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0591 [43090]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0734 [8496] |∎∎∎∎∎∎∎
  0.0878 [16161]    |∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1021 [25094]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1164 [16493]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1307 [3290] |∎∎
  0.1451 [245]  |

Latency distribution:
  10% in 0.0161 secs.
  25% in 0.0344 secs.
  50% in 0.0471 secs.
  75% in 0.0844 secs.
  90% in 0.1021 secs.
  95% in 0.1097 secs.
  99% in 0.1194 secs.
  99.9% in 0.1338 secs.

###### Solution 4 

Summary:
  Total:    59.9271 secs.
  Slowest:  0.1403 secs.
  Fastest:  0.0058 secs.
  Average:  0.0598 secs.
  Stddev:   0.0199 secs.
  Requests/sec: 3337.3909

Response time histogram:
  0.0058 [1]    |
  0.0192 [344]  |
  0.0327 [4603] |∎∎∎
  0.0462 [59793]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0596 [50277]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0731 [30341]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0865 [25795]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1000 [25223]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1134 [2902] |∎
  0.1269 [473]  |
  0.1403 [248]  |

Latency distribution:
  10% in 0.0382 secs.
  25% in 0.0426 secs.
  50% in 0.0547 secs.
  75% in 0.0748 secs.
  90% in 0.0895 secs.
  95% in 0.0944 secs.
  99% in 0.1049 secs.
  99.9% in 0.1297 secs.

###### Solution 3
Summary:
  Total:    59.3026 secs.
  Slowest:  0.1479 secs.
  Fastest:  0.0068 secs.
  Average:  0.0592 secs.
  Stddev:   0.0198 secs.
  Requests/sec: 3372.5346

Response time histogram:
  0.0068 [1]    |
  0.0209 [416]  |
  0.0350 [7403] |∎∎∎
  0.0491 [77514]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0632 [39921]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0773 [28729]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.0914 [32529]    |∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.1056 [11067]    |∎∎∎∎∎
  0.1197 [2127] |∎
  0.1338 [232]  |
  0.1479 [61]   |

Latency distribution:
  10% in 0.0382 secs.
  25% in 0.0420 secs.
  50% in 0.0545 secs.
  75% in 0.0748 secs.
  90% in 0.0890 secs.
  95% in 0.0944 secs.
  99% in 0.1069 secs.
  99.9% in 0.1257 secs.

ahrtr commented 2 years ago

2. broadcast async: https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L223

This step can only run in the Ready-loop iteration that follows step 3. Let's walk through a detailed workflow (based on the current main branch) below, assuming a 3-member cluster with one leader and two followers.

  1. The client puts a k/v;
  2. The leader receives the k/v pair, appends the data to its unstable list, and updates the progress for itself (see below),

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L638

and broadcasts the data to the followers,

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L1076

  3. etcdserver gets the notification from raft,

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L169

  4. etcdserver sends the data to all followers,

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L223

and persists the WAL entries: https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/server/etcdserver/raft.go#L237

  5. Assuming one follower is unreachable for whatever reason, the leader only gets a response from the other follower. Since there are two confirmations, including its own, the leader marks the entry as committed,

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L1259

and tries to broadcast the new commitId to all followers,

https://github.com/etcd-io/etcd/blob/cdd2b737f04812a919a5735380fdaa1f932346d0/raft/raft.go#L1263

  6. Assuming the etcdserver (leader) hasn't finished saving the WAL entries (received at step 3), then there is no chance for etcdserver to get the new Ready at raft.go#L169.

Once etcdserver receives the new Ready, it means that it must have already finished the WAL persistence (step 4).

@ptabor Please let me know if you still need clarification.

ahrtr commented 2 years ago

Eventually, solution 3 and solution 4 are the two best candidates. To recap the points again:

  1. They have the same performance;
  2. Both of them require applications which depend on the raft lib to be updated.
  3. Solution 3 looks safer for now, because solution 4 changes the raft layer. The raft implementation is really subtle!

ptabor commented 2 years ago

@ptabor Please let me know if you still need clarification.

No - thank you. You are right. Broadcasting goes together with the WAL write from the leader, and it's a barrier for the next raft round.

Note that solution 3 and 4 have the same performance result, both of them have about 6 ~7% downgrade.

So 1, 3 & 4 have 6-7% degradation and 2 has 2.7% degradation? The table says "same as above" for 3.

The applications which depend on the raft package need to be updated following the guide.

If it were integrated into .Advance(), the application would not need to be changed. But for safety we should make it an opt-in behavior in the 3.4 and 3.5 backports.

ahrtr commented 2 years ago

So 1, 3 & 4 have 6-7% degradation and 2 has 2.7% degradation? The table says "same as above" for 3.

Sorry for the confusion. The 2.7% mentioned in the table means that solution 2 has about a 2.7% performance downgrade as compared to solution 1.

I tested all the solutions against main today. Solution 1 has the best performance; it has about a 5% performance downgrade as compared to the main branch. Solutions 2/3/4 have the same performance; all of them have about a 6~7% performance downgrade as compared to the main branch.

If it were integrated into .Advance(), the application would not need to be changed. But for safety we should make it an opt-in behavior in the 3.4 and 3.5 backports.

Personally I don't like the idea of integrating it into .Advance(); please see my comment https://github.com/ahrtr/etcd/commit/7ab0842a9df897878734070ebb5594baef5c56c1#r82690914

ahrtr commented 2 years ago

We need to make a decision between solution 3 and solution 4. Please note that solution 3 is complete and ready for review, while solution 4 is still under development.

For solution 4, we also need to decide which way to go: follow my commit or @ptabor's draft PR.

Please also see https://github.com/etcd-io/etcd/issues/14370#issuecomment-1232592188

ptabor commented 2 years ago

@tbg @bdarnell - Could you please take a look? Your RAFT expertise would be critical here.

lavalamp commented 2 years ago

Outsider perspective: if you fix this locally for performance reasons, it'd be good to fix the RAFT library too.

ahrtr commented 2 years ago

Based on all the discussion so far, it seems solution 4 (https://github.com/etcd-io/etcd/pull/14411) is almost the final solution. The good point is that it doesn't require any change to the applications which depend on etcd's RAFT implementation.

Although it looks like a big PR, it actually has only 10~20 lines of production code change; all the rest are test code/data changes. I am still working on the test data/code, but early feedback from anyone is appreciated. cc @ptabor @tbg @bdarnell @serathius @spzala thx

ananduw commented 2 years ago

Thanks to the community for finally acknowledging that this is an issue. And thanks @hasethuraman for persisting with this and reporting the issue as a good open-source citizen.

tbg commented 2 years ago

Ultimately this problem is caused by ambiguity around what it means to correctly handle a raft.Ready. As it stands, CommittedEntries is populated under the assumption that the Entries field has already been handled, i.e. the entry is durably in the WAL.

This is ultimately a performance optimization for a single-threaded implementation against raft: you only need to go through the raft handling loop once to get a command committed and applied. It doesn't apply to the multi-node case, since you will go through the loop once to send the message to at least one follower + write locally to WAL, but then you still need the second loop to realize the entry is now committed.

I can't blame the etcd community for getting this wrong in their use of the raft library. Ready is notoriously under-documented and leaves much to the imagination, despite this being the crucial touchpoint between the raft library and the implementer.

I once had an update to the comments in progress, which pointed out the exact caveat that prompted this issue: https://github.com/etcd-io/etcd/pull/10861/files#diff-bd82b488417cfb0c169dbf02eb5c4355165028295855fc78bdb72a62eb338c98R124

I regret not pushing this over the finish line, but it's unclear whether it would've prevented the issue at hand, anyway.

It has been a while since I closely engaged with the raft library, but I agree with removing this optimization and, by default, simplifying the Ready contract. Entries and Committed should be independent, and there shouldn't be an implicit requirement to handle the former after the latter. Most advanced implementations will at some point add concurrency between log writes and entry application, and this issue is likely to be found in many of them.

I'll take a look at #14411 next week, and help find the right way to address this. I hope to also revive #10861.

tbg commented 2 years ago

I submitted a draft, https://github.com/etcd-io/etcd/pull/14413, which I think will end up a little cleaner than https://github.com/etcd-io/etcd/pull/14411 but basically follows in its footsteps.

The draft has no chance of passing CI, since I need to fix up all sorts of tests. But it's the diff that most clearly demonstrates the change and its effect on a single-voter unit test.

ahrtr commented 2 years ago

Thanks @tbg for the feedback. I agree that https://github.com/etcd-io/etcd/pull/14413 is a little cleaner than https://github.com/etcd-io/etcd/pull/14411/commits/d1957fe2beea3c444c4294edc70231be0515155c (the first commit of https://github.com/etcd-io/etcd/pull/14411) and that it keeps everything in the real raft layer, but it seems there are some edge cases (e.g. an empty entry may be appended in some cases) to be resolved.

ahrtr commented 2 years ago

For anyone's reference, https://github.com/etcd-io/etcd/pull/14407 and https://github.com/etcd-io/etcd/pull/14400 are two simple workarounds.

ahrtr commented 2 years ago

As mentioned previously multiple times, I still suggest cherry-picking either https://github.com/etcd-io/etcd/pull/14407 or https://github.com/etcd-io/etcd/pull/14400 to release-3.5 and 3.4, and updating the raft layer only in the main branch, for safety. What do you think? cc @ptabor @serathius @spzala @tbg

ahrtr commented 2 years ago

On second thought, it seems that https://github.com/etcd-io/etcd/pull/14411 is clearer & better than https://github.com/etcd-io/etcd/pull/14413, for the following reasons:

  1. https://github.com/etcd-io/etcd/pull/14413 is similar to @ptabor's original draft PR https://github.com/etcd-io/etcd/pull/14406, which is slightly worse than https://github.com/etcd-io/etcd/pull/14411 per the discussion.
  2. The leader raft node updates each follower's pr.Match on receiving the pb.MsgAppResp, and tries to update the commitId afterwards. So we just need to let users (e.g., the leader etcdserver) explicitly acknowledge persisting the Entries, and let raft process the user's acknowledgment in the same way, instead of updating the raft protocol to handle the single-instance cluster case separately. Since we already have the contract Advance() between raft and users (etcdserver in this case), we can encapsulate the logic in it so that users do not need to worry about it at all. Please also refer to https://github.com/ahrtr/etcd/commit/7ab0842a9df897878734070ebb5594baef5c56c1#r82690914.
  3. I think https://github.com/etcd-io/etcd/pull/14411 is safer and clearer, because it has the least impact on the raft protocol.

Since @ptabor, @tbg and @bdarnell have more expertise on the etcd RAFT implementation, I'd defer to all of you (including @serathius and @spzala as well) on how to move forward.

Again, I think we'd better do it in the main branch only, and cherry-pick https://github.com/etcd-io/etcd/pull/14407 or https://github.com/etcd-io/etcd/pull/14400 to release-3.5 and 3.4.

Please let me know if you have any comments or concerns. cc @dims @liggitt

ptabor commented 2 years ago

Thank you @tbg for looking and providing the PR.

I wouldn't proceed with #14407 (the third solution), as it complicates the raft contract (it introduces 'MustSaveEntriesBeforeApply' in the Ready state).

I'm on the fence about #14400. I think it's correct and safe. What I don't like is either of: a) keeping #14400 on the main branch, as it carries unnecessary complexity alongside the RAFT fix; or b) diverging far between 3.4 & 3.5 and 'evaluating' 2 different fixes. So it comes down to the level of confidence around the RAFT fix. Tobias's opinion is crucial here.

Thank you @ahrtr on the tremendous work on all the approaches and tests.

serathius commented 2 years ago

OK, let me try to summarize the current discussion. I hope it will be useful for external observers and will make the decision easier.

The whole API durability issue is caused by a difference in the raft library's behavior in the multi-node vs single-node scenario. In a multi-node cluster, raft processes entries in 2 steps (first append to the log, second apply to the database), which differs from single-node clusters, where those steps are combined. This leads to a hidden assumption (not documented, as #10861 was not merged) that appending entries should be done before committing, at least when there is overlap between the entries to append and the entries to commit.

etcd was unaware of this assumption and implemented an optimization to start the apply first (asynchronously) so users get results faster. This, however, causes a problem in single-node clusters, where entries are processed in one step (they are present in both the to-append and the to-apply lists). If the apply executes before the append and etcd crashes, it will report success to the user without persisting the entries to the WAL.
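
To make the hidden assumption concrete, here is a rough sketch of the invariant the proposed fixes enforce (hypothetical helper names, not the code in the PRs discussed here): when the entries to append overlap the committed entries to apply, the apply - and therefore the client acknowledgement - must wait until the WAL save has been fsynced.

package example

import (
	"go.etcd.io/etcd/raft/v3"
)

// waitIfOverlapping blocks the apply path whenever the current Ready contains
// entries that are both still to be appended and already marked committed,
// which is the single-voter case described above.
func waitIfOverlapping(rd raft.Ready, walSynced <-chan struct{}) {
	if len(rd.Entries) == 0 || len(rd.CommittedEntries) == 0 {
		return // no overlap possible
	}
	lastCommitted := rd.CommittedEntries[len(rd.CommittedEntries)-1].Index
	if lastCommitted >= rd.Entries[0].Index {
		<-walSynced // wait until the overlapping entries are durable in the WAL
	}
}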

There are two ways to fix the problem:

I'm currently on the side of fixing the raft library (https://github.com/etcd-io/etcd/pull/14413) for both main and the older branches (assuming that the performance regression is acceptable).

ahrtr commented 2 years ago

There are two ways to fix the problem:

I'm currently on the side of fixing raft library https://github.com/etcd-io/etcd/pull/14413 for both main and older branches (assuming that performance regression is acceptable).

This isn't accurate. Based on all the discussion so far, the action plan would be one of the following two approaches:

  1. Fix the etcd RAFT implementation for both main and release-3.4/3.5, using either https://github.com/etcd-io/etcd/pull/14411 (already approved by one maintainer, @ptabor) or https://github.com/etcd-io/etcd/pull/14413. Please see my previous comment as well: https://github.com/etcd-io/etcd/issues/14370#issuecomment-1235091312.
  2. Fix the etcd RAFT implementation in main only, with the same solutions as above, and fix release-3.4/3.5 using https://github.com/etcd-io/etcd/pull/14400 (already approved by one maintainer, @ptabor).

Let's wait for @tbg's and @bdarnell's feedback, which is crucial here, just as @ptabor pointed out.

ahrtr commented 2 years ago

In case anyone wants a simple summary of this issue, FYI: https://github.com/ahrtr/etcd-issues/tree/master/issues/14370

tbg commented 2 years ago

@ahrtr responding to https://github.com/etcd-io/etcd/issues/14370#issuecomment-1235091312 above.


After second thought, it seems that https://github.com/etcd-io/etcd/pull/14411 is clearer & better than https://github.com/etcd-io/etcd/pull/14413, points:

#14413 fixes the problem at the level of (*raft).advance, which ensures that any implementation will get the right behavior. For example, CockroachDB uses *raft.RawNode, which wraps a *raft. #14411 fixes the problem for users of *node (which internally wraps a *RawNode), but this leaves direct users of *RawNode dangling.

I am obviously biased because of my affiliation with CockroachDB, but I consider *RawNode the true external interface of the library. *node is certainly helpful for some (it's *RawNode with a bit of goroutine wrapping), but we had better fix *RawNode if we can, and we definitely can, in the exact same way. That will allow us to have everything work the same way, and to document the desired semantics on raft.Ready without having to reference the ways (RawNode, node) that exist for consuming Ready.


The leader raft node updates each follower's pr.Match on receiving the pb.MsgAppResp

I can see that injecting an MsgAppResp automatically can be preferred over directly expanding to the tracker update (which I did in #14413 at the time of writing). That's a change I can make.

I don't follow the argument that seems to suggest that doing work in Advance amounts to undesirable special casing.

There is always going to be some special casing for the leader node and that special casing is useful.

Imagine we completely rejected any special casing for the leader. The leader would have to emit an MsgApp to itself, which we would then step back into the leader. In the next cycle we would see the entry in Entries. The cycle after that would see an MsgAppResp emitted. The cycle after that would step it back into the leader, and in the subsequent cycle we'd finally see the CommittedEntries field populated.

This wouldn't be particularly performant, and it would also burden the user with new footguns, such as having to make sure that their transports allow self-directed messages, etc.

Explicitly asking the user to pass a parameter back in, etc, seems to be overdoing it. Advance() signals to raft that everything it has asked for was done, which includes appending entries. So it strikes me as the idiomatic place to self-deliver an MsgAppResp, freeing the user from having to do any kind of special tracking for local proposals, while also avoiding unnecessary work.
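
To illustrate the idea (a rough sketch only, not the actual change in #14413 or #14411; selfID and the Ready handling around it are assumed): once the application has durably appended rd.Entries, a self-addressed MsgAppResp for the last appended index could be stepped into the node, so a single voter only advances its commit index after the entries are actually on disk.

package example

import (
	"go.etcd.io/etcd/raft/v3"
	"go.etcd.io/etcd/raft/v3/raftpb"
)

// ackLocalAppend steps a self-addressed MsgAppResp acknowledging the last
// locally appended entry. Whether this belongs inside Advance() or stays the
// caller's responsibility is exactly the design question discussed above.
func ackLocalAppend(rn *raft.RawNode, selfID uint64, rd raft.Ready) error {
	if len(rd.Entries) == 0 {
		return nil
	}
	last := rd.Entries[len(rd.Entries)-1]
	return rn.Step(raftpb.Message{
		Type:  raftpb.MsgAppResp,
		From:  selfID,
		To:    selfID,
		Term:  last.Term,
		Index: last.Index,
	})
}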


I think https://github.com/etcd-io/etcd/pull/14411 is safer and clearer, because it has the least impact on the raft protocol.

I'm not sure I understand; first of all, it seems like the exact same fix but at the level of *node vs *RawNode, and isn't the point to fix it in a way that lets all consumers of the library avoid the footgun? If that is not the goal, we could fix it in etcdserver, put a comment on raft.Ready, and leave raft completely as is.

serathius commented 2 years ago

Hey @tbg, thanks for the response. It's a good argument that both Node and RawNode are part of raft's public interface.

Do I conclude correctly that your suggestion would be to prefer https://github.com/etcd-io/etcd/pull/14413 over https://github.com/etcd-io/etcd/pull/14411 ?

What would be your suggestion about picking a fix for backport? I think the discussion is to either:

tbg commented 2 years ago

Do I conclude correctly that your suggestion would be to prefer https://github.com/etcd-io/etcd/pull/14413 over https://github.com/etcd-io/etcd/pull/14411 ?

Yes, I think #14413 with the .Step approach is the right fix (and then of course working out all of the test failures, etc.). There is a bit in #14411 that I don't understand, such as the special casing of the empty entry, but I think that if it is really necessary it will carry over as well.

What would be your suggestion about picking a fix for backport? I think the discussion is to either:

I would merge solution 2 (#14400) immediately on all branches (main, release-*) since it's clearly correct, has a test, and solves the problem at hand immediately. This avoids having to rush in a raft-level fix.

Once a raft-level fix has landed on main, #14400 can then be reverted on master.