etcd-io / etcd

Distributed reliable key-value store for the most critical data of a distributed system
https://etcd.io
Apache License 2.0

Lost update and cyclic information flow on process crash #14890

Closed aphyr closed 1 month ago

aphyr commented 1 year ago

What happened?

In Jepsen tests of etcd 3.5.5 with process crashes, running in five-node clusters on Debian Bullseye, the wr workload (which performs transactions that read and write keys without any guards) appears to exhibit both lost update and cyclic information flow.
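For context on the request shape: each wr transaction is a single etcd TXN whose then-branch is a sequence of unguarded range (read) and put (write) operations. The test itself drives etcd through jetcd, the Java client; the Go clientv3 sketch below is only an illustration of what such an unguarded transaction looks like, with the endpoint and key chosen for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.122.101:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// An unguarded wr transaction: read key 277, then blindly write it.
	// There is no If() comparison, so nothing ties the write to the value
	// that was just read.
	resp, err := cli.Txn(ctx).Then(
		clientv3.OpGet("277"),
		clientv3.OpPut("277", "4", clientv3.WithPrevKV()),
	).Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("header revision:", resp.Header.Revision)
}
```

The lost-update pair below has exactly this read-then-blind-write shape on key 277.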

Lost Update

In this test run, we executed the following pair of transactions:

            :anomalies {:lost-update [{:key 277,
                                       :value 1,
                                       :txns [{:type :ok,
                                               :f :txn,
                                               :value [[:r 277 1]
                                                       [:w 277 4]],
                                               :time 26267084609,
                                               :process 50,
                                               :index 14140,
                                               :debug {:txn-res {:succeeded? true,
                                                                 :txns [],
                                                                 :header {:member-id -5790610604219245522,
                                                                          :revision 3927,
                                                                          :raft-term 5},
                                                                 :results [{:count 1,
                                                                            :kvs {277 {:value 1,
                                                                                       :version 1,
                                                                                       :create-revision 3925,
                                                                                       :mod-revision 3925}},
                                                                            :more? false,
                                                                            :header {:member-id 0,
                                                                                     :revision 3926,
                                                                                     :raft-term 0}}
                                                                           {:prev-kv [277
                                                                                      {:value 1,
                                                                                       :version 1,
                                                                                       :create-revision 3925,
                                                                                       :mod-revision 3925}],
                                                                            :prev-kv? true,
                                                                            :header {:member-id 0,
                                                                                     :revision 3927,
                                                                                     :raft-term 0}}]}}}
                                              {:type :ok,
                                               :f :txn,
                                               :value [[:r 277 1]
                                                       [:w 277 11]],
                                               :time 26511349152,
                                               :process 99,
                                               :index 14569,
                                               :debug {:txn-res {:succeeded? true,
                                                                 :txns [],
                                                                 :header {:member-id 5198333996179334402,
                                                                          :revision 4045,
                                                                          :raft-term 5},
                                                                 :results [{:count 1,
                                                                            :kvs {277 {:value 1,
                                                                                       :version 17,
                                                                                       :create-revision 3925,
                                                                                       :mod-revision 3985}},
                                                                            :more? false,
                                                                            :header {:member-id 0,
                                                                                     :revision 4044,
                                                                                     :raft-term 0}}
                                                                           {:prev-kv [277
                                                                                      {:value 1,
                                                                                       :version 17,
                                                                                       :create-revision 3925,
                                                                                       :mod-revision 3985}],
                                                                            :prev-kv? true,
                                                                            :header {:member-id 0,
                                                                                     :revision 4045,
                                                                                     :raft-term 0}}]}}}]}]},

Both of these transactions read key 277 = 1 at revision 3925 and wrote key 277: one set it to 4, the other to 11. This constitutes a lost update: under snapshot isolation (and serializability), at most one of these transactions ought to have committed.

G1c (cyclic information flow)

In this test run, we executed the following pair of transactions.

Screenshot from 2022-12-04 17-54-21

Process 73 performed a transaction which set key 1754 to 11. That write was read by process 106, which in its transaction set key 1756 to 27. That write of 27 was in turn visible to process 73's transaction. Per the etcd headers for the two transactions, process 73's transaction executed at revision 25932, and observed 1756 at version 15, mod revision 25580. Process 106's txn executed at revision 26052, and observed key 1754's value of 11 at version 19, mod revision 25932. This seems impossible--perhaps a sign of split-brain?

            :anomalies {:G1c [{:cycle [{:type :ok,
                                        :f :txn,
                                        :value [[:w 1756 27]
                                                [:w 1757 14]
                                                [:r 1754 11]],
                                        :time 83810717854,
                                        :process 106,
                                        :index 93021,
                                        :debug {:txn-res {:succeeded? true,
                                                          :txns [],
                                                          :header {:member-id -6773460376229475734,
                                                                   :revision 26052,
                                                                   :raft-term 9},
                                                          :results [{:prev-kv [1756
                                                                               {:value 30,
                                                                                :version 16,
                                                                                :create-revision 25561,
                                                                                :mod-revision 25975}],
                                                                     :prev-kv? true,
                                                                     :header {:member-id 0,
                                                                              :revision 26052,
                                                                              :raft-term 0}}
                                                                    {:prev-kv [1757
                                                                               {:value 15,
                                                                                :version 16,
                                                                                :create-revision 25576,
                                                                                :mod-revision 25932}],
                                                                     :prev-kv? true,
                                                                     :header {:member-id 0,
                                                                              :revision 26052,
                                                                              :raft-term 0}}
                                                                    {:count 1,
                                                                     :kvs {1754 {:value 11,
                                                                                 :version 19,
                                                                                 :create-revision 25544,
                                                                                 :mod-revision 25932}},
                                                                     :more? false,
                                                                     :header {:member-id 0,
                                                                              :revision 26052,
                                                                              :raft-term 0}}]}}}
                                       {:type :ok,
                                        :f :txn,
                                        :value [[:r 1756 27]
                                                [:w 1757 15]
                                                [:w 1754 11]],
                                        :time 83532976783,
                                        :process 73,
                                        :index 92575,
                                        :debug {:txn-res {:succeeded? true,
                                                          :txns [],
                                                          :header {:member-id 5579440767535971996,
                                                                   :revision 25932,
                                                                   :raft-term 9},
                                                          :results [{:count 1,
                                                                     :kvs {1756 {:value 27,
                                                                                 :version 15,
                                                                                 :create-revision 25561,
                                                                                 :mod-revision 25580}},
                                                                     :more? false,
                                                                     :header {:member-id 0,
                                                                              :revision 25931,
                                                                              :raft-term 0}}
                                                                    {:prev-kv [1757
                                                                               {:value 17,
                                                                                :version 15,
                                                                                :create-revision 25576,
                                                                                :mod-revision 25739}],
                                                                     :prev-kv? true,
                                                                     :header {:member-id 0,
                                                                              :revision 25932,
                                                                              :raft-term 0}}
                                                                    {:prev-kv [1754
                                                                               {:value 13,
                                                                                :version 18,
                                                                                :create-revision 25544,
                                                                                :mod-revision 25794}],
                                                                     :prev-kv? true,
                                                                     :header {:member-id 0,
                                                                              :revision 25932,
                                                                              :raft-term 0}}]}}}

What did you expect to happen?

Neither of these phenomena should be possible under serializability.

How can we reproduce it (as minimally and precisely as possible)?

Check out jepsen.etcd e273986b2248ea00085e398a3bb0a6991d701d20 and run

lein run test-all --workload wr --time-limit 200 --rate 10000 --concurrency 10n --test-count 30 --nemesis kill --version 3.5.5

Violations of strict serializability (e.g. G-single-realtime) are fairly frequent--on my cluster, they take only a few minutes to show up. Violations of serializability (e.g. G1c, lost update) are less common; might take a few hours.

Anything else we need to know?

No response

Etcd version (please run commands below)

```console
$ etcd --version
etcd Version: 3.5.5
Git SHA: 19002cfc6
Go Version: go1.16.15
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.5
API version: 3.5
```

Etcd configuration (command line flags or environment variables)

2022-12-04 18:11:03 Jepsen starting etcd --enable-v2 --log-outputs stderr --logger zap --name n1 --listen-peer-urls http://192.168.122.101:2380 --listen-client-urls http://192.168.122.101:2379 --advertise-client-urls http://192.168.122.101:2379 --initial-cluster-state new --initial-advertise-peer-urls http://192.168.122.101:2380 --initial-cluster n1=http://192.168.122.101:2380,n2=http://192.168.122.102:2380,n3=http://192.168.122.103:2380,n4=http://192.168.122.104:2380,n5=http://192.168.122.105:2380 --snapshot-count 100

Etcd debug information (please run commands blow, feel free to obfuscate the IP address or FQDN in the output)

```
$ etcdctl member list -w table
root@n1:/opt/etcd# ./etcdctl --endpoints=n1:2379 member list -w table
+------------------+---------+------+-----------------------------+-----------------------------+------------+
|        ID        | STATUS  | NAME |         PEER ADDRS          |        CLIENT ADDRS         | IS LEARNER |
+------------------+---------+------+-----------------------------+-----------------------------+------------+
| 1153c9690d2b2284 | started | n3   | http://192.168.122.103:2380 | http://192.168.122.103:2379 |      false |
| 4824313a421b2502 | started | n5   | http://192.168.122.105:2380 | http://192.168.122.105:2379 |      false |
| 4d6e27d122507e9c | started | n4   | http://192.168.122.104:2380 | http://192.168.122.104:2379 |      false |
| a1ffd5acd6a88a6a | started | n2   | http://192.168.122.102:2380 | http://192.168.122.102:2379 |      false |
| afa39e55dee6dc2e | started | n1   | http://192.168.122.101:2380 | http://192.168.122.101:2379 |      false |
+------------------+---------+------+-----------------------------+-----------------------------+------------+

$ etcdctl --endpoints= endpoint status -w table
root@n1:/opt/etcd# ./etcdctl --endpoints=n1:2379 endpoint status -w table
+----------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| ENDPOINT |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+----------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
| n1:2379  | afa39e55dee6dc2e | 3.5.5   | 4.3 MB  | true      | false      |         4 |      56100 |              56100 |        |
+----------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
```

Relevant log output

No response

aphyr commented 1 year ago

Here's another case, this one on 3.5.6: http://jepsen.io.s3.amazonaws.com/analyses/etcd-3.5.6/20221204T185724.970-0500.zip

aphyr commented 1 year ago

Oh this is fascinating. Check this cycle (from the 3.5.6 case just above this comment) out:

Screenshot from 2022-12-04 19-34-17

Recall that we only ever write values once. The only write of key 15063 with value 23 was:

    {:member-id 5579440767535971996, :revision 220019, :raft-term 15},
     {:prev-kv
      [15063
       {:value 29,
        :version 20,
        :create-revision 219980,
        :mod-revision 220017}],
      :prev-kv? true,
      :header {:member-id 0, :revision 220019, :raft-term 0}}

Yet somehow we have two different reads of 15063 = 23:

(->> t :history h/client-ops h/oks (h/filter (fn [op] (some (fn [[f k v]] (and (= k 15063) (= v 23))) (:value op)))) pprint)
...
    {:member-id 5198333996179334402, :revision 220003, :raft-term 14},
     {:count 1,
      :kvs
      {15063
       {:value 23,
        :version 14,
        :create-revision 219980,
        :mod-revision 220002}}

    {:member-id 5579440767535971996, :revision 220022, :raft-term 15},
      {15063
       {:value 23,
        :version 21,
        :create-revision 219980,
        :mod-revision 220019}},

I've clipped these a bit for clarity: full logs are here:

```clj
jepsen.etcd=> (->> t :history h/client-ops h/oks (h/filter (fn [op] (some (fn [[f k v]] (and (= k 15063) (= v 23))) (:value op)))) pprint)
[{:process 484, :type :ok, :f :txn,
  :value [[:r 15062 29] [:r 15063 23] [:w 15062 32] [:r 15062 32]],
  :debug
  {:txn-res
   {:succeeded? true,
    :txns [],
    :header {:member-id 5198333996179334402, :revision 220003, :raft-term 14},
    :results
    [{:count 1, :kvs {15062 {:value 29, :version 16, :create-revision 219976, :mod-revision 219999}}, :more? false, :header {:member-id 0, :revision 220002, :raft-term 0}}
     {:count 1, :kvs {15063 {:value 23, :version 14, :create-revision 219980, :mod-revision 220002}}, :more? false, :header {:member-id 0, :revision 220002, :raft-term 0}}
     {:prev-kv [15062 {:value 29, :version 16, :create-revision 219976, :mod-revision 219999}], :prev-kv? true, :header {:member-id 0, :revision 220003, :raft-term 0}}
     {:count 1, :kvs {15062 {:value 32, :version 17, :create-revision 219976, :mod-revision 220003}}, :more? false, :header {:member-id 0, :revision 220003, :raft-term 0}}]}},
  :index 780977, :time 117944800025}
 {:process 438, :type :ok, :f :txn,
  :value [[:r 15063 23] [:w 15064 3]],
  :debug
  {:txn-res
   {:succeeded? true,
    :txns [],
    :header {:member-id 5579440767535971996, :revision 220022, :raft-term 15},
    :results
    [{:count 1, :kvs {15063 {:value 23, :version 21, :create-revision 219980, :mod-revision 220019}}, :more? false, :header {:member-id 0, :revision 220021, :raft-term 0}}
     {:prev-kv [15064 {:value 27, :version 10, :create-revision 220004, :mod-revision 220021}], :prev-kv? true, :header {:member-id 0, :revision 220022, :raft-term 0}}]}},
  :index 781064, :time 122404476575}
 {:process 433, :type :ok, :f :txn,
  :value [[:r 15062 32] [:w 15063 23] [:r 15063 23]],
  :debug
  {:txn-res
   {:succeeded? true,
    :txns [],
    :header {:member-id 5579440767535971996, :revision 220019, :raft-term 15},
    :results
    [{:count 1, :kvs {15062 {:value 32, :version 17, :create-revision 219976, :mod-revision 220003}}, :more? false, :header {:member-id 0, :revision 220018, :raft-term 0}}
     {:prev-kv [15063 {:value 29, :version 20, :create-revision 219980, :mod-revision 220017}], :prev-kv? true, :header {:member-id 0, :revision 220019, :raft-term 0}}
     {:count 1, :kvs {15063 {:value 23, :version 21, :create-revision 219980, :mod-revision 220019}}, :more? false, :header {:member-id 0, :revision 220019, :raft-term 0}}]}},
  :index 781066, :time 122404512264}
 {:process 435, :type :ok, :f :txn,
  :value [[:w 15064 1] [:r 15063 23]],
  :debug
  {:txn-res
   {:succeeded? true,
    :txns [],
    :header {:member-id -5790610604219245522, :revision 220024, :raft-term 15},
    :results
    [{:prev-kv [15064 {:value 31, :version 12, :create-revision 220004, :mod-revision 220023}], :prev-kv? true, :header {:member-id 0, :revision 220024, :raft-term 0}}
     {:count 1, :kvs {15063 {:value 23, :version 21, :create-revision 219980, :mod-revision 220019}}, :more? false, :header {:member-id 0, :revision 220024, :raft-term 0}}]}},
  :index 781067, :time 122404516208}]
```

This is impossible! The write didn't take place until term 15 and revision 220019, but somehow at term 14 and revision 220003 we've got a client reading the value 23, apparently written at revision 220002. OK, what transaction interacted with revision 220002?

jepsen.etcd=> (->> t :history h/client-ops h/oks (h/filter (fn [op] (-> op :debug :txn-res :header :revision (= 220002)))) pprint)
[]

Nothing we know of--could have been a timed out txn though. 220001?

jepsen.etcd=> (->> t :history h/client-ops h/oks (h/filter (fn [op] (-> op :debug :txn-res :header :revision (= 220001)))) pprint)
[{:process 382,
  :type :ok,
  :f :txn,
  :value [[:r 15062 29] [:r 15062 29] [:w 15063 22] [:r 15063 22]],
  :debug
  {:txn-res
   {:succeeded? true,
    :txns [],
    :header
    {:member-id 1248562974741439108, :revision 220001, :raft-term 14},
    :results
    [{:count 1,
      :kvs
      {15062
       {:value 29,
        :version 16,
        :create-revision 219976,
        :mod-revision 219999}},
      :more? false,
      :header {:member-id 0, :revision 220000, :raft-term 0}}
     {:count 1,
      :kvs
      {15062
       {:value 29,
        :version 16,
        :create-revision 219976,
        :mod-revision 219999}},
      :more? false,
      :header {:member-id 0, :revision 220000, :raft-term 0}}
     {:prev-kv
      [15063
       {:value 19,
        :version 12,
        :create-revision 219980,
        :mod-revision 220000}],
      :prev-kv? true,
      :header {:member-id 0, :revision 220001, :raft-term 0}}
     {:count 1,
      :kvs
      {15063
       {:value 22,
        :version 13,
        :create-revision 219980,
        :mod-revision 220001}},
      :more? false,
      :header {:member-id 0, :revision 220001, :raft-term 0}}]}},
  :index 780979,
  :time 117945424227}]

This is a write of 22 (not 23!) to key 15063. It executed at 220001, raft term 14. That write overwrote value 19 at version 12, mod-revision 220000, and returned a header with revision 220001.

Something feels deeply wrong here. We shouldn't have these two versions 15063=23 with different mod-revisions! It almost feels like the leader for term 14 somehow got a copy of the write of 23 and like... applied it at revision 220002, let that write be visible to a reader, but then the leader for term 15 applied it and won out later. Both versions have the same create-revision, which helps rule out something weird like state leaking over from an earlier version of the test cluster... what on earth is going on here?

serathius commented 1 year ago

Looks interesting. Would be happy to collaborate on reproducing this, however I'm struggling a bit to follow the Clojure. I'm working on introducing linearizability logs to etcd in https://github.com/etcd-io/etcd/issues/14045. However, we still haven't got to implementing TXN.

Can you give more context on how you validate linearizability of Txn requests?

ahrtr commented 1 year ago

@aphyr Thanks for reporting this. Before I have a deep dive into this, just a couple of quick questions:

  1. Is it easy to reproduce this issue?
  2. Do you have bandwidth to reproduce this issue using etcd 3.5.6, which includes two critical fixes?
  3. Is it possible for us to connect directly so as to better collaborate on this?
aphyr commented 1 year ago

Can you give more context on how you validate linearizability of Txn requests?

Yup! This is broadly the same strategy we used in the 2020 list append workload. I've also added a write-read register workload based on Elle.

Is it easy to reproduce this issue?

Yup! Takes about 5 minutes to hit a strict-1SR violation with the repro instructions I gave above. Maybe a bit longer for G1c specifically.

Do you have bandwidth to reproduce this issue using etcd 3.5.6, which includes two critical fixes?

As I mentioned above 3.5.6 exhibits the same behavior.

Is it possible for us to connect directly so as to better collaborate on this?

I'm afraid I'm in the middle of moving so my availability is going to be a hot mess for a few months. And, well... to be somewhat frank, my experience in #14092 was profoundly disappointing; I put weeks of time into describing and analyzing that behavior and it led nowhere.

serathius commented 1 year ago

@aphyr I have implemented a similar append workload in https://github.com/etcd-io/etcd/pull/14904. No reproduction. Please take a look.

I've also added a write-read register workload based on Elle.

Could you give more context about this?

ahrtr commented 1 year ago

I just got time to have a closer look at this issue. It turned out not to be an issue, in my view.

Note that there is no cyclic dependency between any two transactions. All transactions are executed sequentially in etcd. Let's use the second issue (under G1c) as an example. Obviously the TXN in process 73 happened before the TXN in process 106, because the revision in the header of process 73's TXN is smaller than that of process 106's. All the revisions (including the create-revisions, mod-revisions, and header revisions) make perfect sense to me.

Note:

  1. A TXN increases the revision by at most one. A read-only TXN doesn't increase the revision.
  2. The revision in the top-level header is equal to the revision in the header of the last response in the TXN. (Note: a TXN may include multiple responses. See the sketch below these notes.)
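A minimal way to observe both notes from a client, sketched with the Go clientv3 API purely for illustration (the endpoint is hypothetical and the key is arbitrary): a read-write TXN is applied at a single new revision, which shows up both in the top-level header and in the header of its last sub-response, while a read-only TXN returns the current revision without bumping it.

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.122.101:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()
	ctx := context.Background()

	// Read-write TXN: the whole TXN is applied at one new revision.
	rw, err := cli.Txn(ctx).Then(
		clientv3.OpGet("k"),
		clientv3.OpPut("k", "v"),
	).Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("txn header revision:      ", rw.Header.Revision)
	// The last response in this TXN is the put; per note 2, its header
	// revision should equal the top-level header revision.
	fmt.Println("last sub-response revision:", rw.Responses[1].GetResponsePut().Header.Revision)

	// Read-only TXN: no new revision is created, so the header revision
	// should still equal rw.Header.Revision if nothing else wrote meanwhile.
	ro, err := cli.Txn(ctx).Then(clientv3.OpGet("k")).Commit()
	if err != nil {
		panic(err)
	}
	fmt.Println("read-only txn revision:   ", ro.Header.Revision)
}
```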

jepsen_test_14890

@aphyr Please let me know if you have any comments or concern.

aphyr commented 1 year ago

@aphyr Please let me know if you have any comments or concern.

I have many concerns. At the risk of repeating myself, the cycle is right here:

44D6F093-2892-4D30-8600-D7C5AB3FE6F2

aphyr commented 1 year ago

Could you give more context about this?

Sure. You’ll find a detailed explanation of how Elle finds cycles via two properties we call recoverability and traceability in the Elle paper, also linked in the Elle README. Pardon my brevity here: my computer is in pieces as I prepare for moving.

We perform transactions comprised of unguarded writes and reads across a rotating pool of keys with exponentially distributed frequency. Choosing unique writes allows recoverability. We infer constraints on the version graph for each key using the known initial version nil, plus optional writes-follow-reads within a transaction, per-key sequential consistency, or per-key linearizability assumptions. We infer constraints on the transaction dependency graph using the version dependency graph for rw and ww edges, plus wr edges derived directly from recoverability. As in the list-append test, we then identify strongly connected components of the constrained transaction graph via Tarjan's algorithm, and use breadth-first search and a finite-state automaton to identify cycles in those components matching G1c, G-single, G-nonadjacent, and so on.
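To make the wr-edge part of that concrete, here is a toy Go sketch (not Elle itself, which is a Clojure library and does considerably more): because every written value is unique, each read of a (key, value) pair identifies the one transaction that wrote it, giving a write-read edge; a depth-first search then finds the cycle formed by the two G1c transactions from this report.

```go
package main

import "fmt"

// Op is a single micro-operation inside a transaction.
type Op struct {
	F string // "r" or "w"
	K int    // key
	V int    // value
}

// Txn is a completed transaction, identified here by its Jepsen process.
type Txn struct {
	Process int
	Ops     []Op
}

type kv struct{ k, v int }

func main() {
	// The two transactions from the G1c example in this issue.
	txns := []Txn{
		{Process: 106, Ops: []Op{{"w", 1756, 27}, {"w", 1757, 14}, {"r", 1754, 11}}},
		{Process: 73, Ops: []Op{{"r", 1756, 27}, {"w", 1757, 15}, {"w", 1754, 11}}},
	}

	// Recoverability: each value is written at most once, so a (key, value)
	// pair maps to a unique writing transaction.
	writer := map[kv]int{}
	for i, t := range txns {
		for _, op := range t.Ops {
			if op.F == "w" {
				writer[kv{op.K, op.V}] = i
			}
		}
	}

	// wr edges: writer -> reader for every read of a written value.
	edges := map[int][]int{}
	for i, t := range txns {
		for _, op := range t.Ops {
			if op.F == "r" {
				if w, ok := writer[kv{op.K, op.V}]; ok && w != i {
					edges[w] = append(edges[w], i)
				}
			}
		}
	}

	// Depth-first search for a cycle in the dependency graph.
	const (
		unvisited = 0
		onStack   = 1
		done      = 2
	)
	state := make([]int, len(txns))
	var visit func(int) bool
	visit = func(n int) bool {
		state[n] = onStack
		for _, m := range edges[n] {
			if state[m] == onStack || (state[m] == unvisited && visit(m)) {
				return true
			}
		}
		state[n] = done
		return false
	}
	for i := range txns {
		if state[i] == unvisited && visit(i) {
			fmt.Printf("write-read dependency cycle detected starting from process %d\n", txns[i].Process)
			return
		}
	}
	fmt.Println("no wr cycle found")
}
```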

serathius commented 1 year ago

@ahrtr I think there are multiple assumptions made in Jepsen that are not obvious. Each write writes a unique value, so the two transactions that write [k:1756 v:27] and read [k:1756 v:27] conflict with each other.

The write returned revision number 26052, so how did the read manage to get the same value [k:1756 v:27] with revision 25931, before the write, and a different mod-revision 25580? This creates a cycle. @aphyr did I understand it correctly?

@aphyr thanks for the context on the high-level approach. I think I have some basic level of understanding. What I'm missing is implementation details, such as how the Jepsen terms read and write translate into etcd terms. I think I miss some nuances when doing the translation myself. For example, you only presented the read/write requests that take part in the anomaly, and they are presented in a form that is hard to parse.

Would love to get a dump of etcd requests/responses in an easy-to-parse format. With them I should be able to validate what exactly happened. Maybe I missed them in the report, so could you point them out?

tjungblu commented 1 year ago

Thanks a lot for your work, Kyle. I've just briefly browsed the etcd logs from your s3 bucket.

One thing that I found strange would be on n5, where it says:

{"level":"info","ts":"2022-12-04T19:00:25.996-0500","caller":"etcdserver/server.go:2416","msg":"skip compaction since there is an inflight snapshot"}

and it constantly complains about this on every snapshot. The other nodes do not report any of that.

Aside from that, the T4 read that was found loopy ended up on n4, which during that time was actually restarted along with one other node. So I wouldn't rule out a split brain. Is there a way to visualize this a bit better? Maybe have the read/write plots also include which member they targeted?

ahrtr commented 1 year ago

I still do not see any issue. Please see the explanation below.

Key: 1756, value: 27 (under the G1c)

Just as I mentioned previously, the TXN in process 73 happened before the TXN in process 106.

Note that the same value 27 was written twice: the first time at revision 25580, the second time at revision 26052.

key_1756

Key: 1754, value: 11 (under the G1c)

Process 73 wrote the value 11 at revision 25932, and process 106 read the value 11 with exactly the same mod_revision 25932 at revision 26052.

key_1754

Key: 277, value: 1/4/11 (under the Lost Update)

There is no data loss. All the historical values are persisted in the db file. The reason the second TXN still read value 1 is that someone wrote value 1 at revision 3985 between the two TXNs.

key_277

I have many concerns.

@aphyr Please let me know if you still have any concerns, and I am happy to resolve them.

xiang90 commented 1 year ago

Process 73 wrote the value 11 at revision 25932, and process 106 read the value 11 with exactly the same mod_revision 25932 at revision 26052.

When Process 73 was writing the value, the version of the key was 15, and the value was 13. See https://github.com/etcd-io/etcd/issues/14890#issuecomment-1340300176

If you look at the etcd dump data, the version of the key should be 18 rather than 15. I am unsure if the graph you created is wrong or if the actual reply was problematic. But there is something wrong...

Maybe @aphyr (I am no longer on the maintainer list :P, so do not expect me to be very responsive) can clarify if Jepsen would write duplicate values to the same key. If the assumption is no, then either Jepsen is buggy (which is unlikely as a very simple producer...), or etcd does double writes somehow.

ahrtr commented 1 year ago

If you look at the etcd dump data, the version of the key should be 18 rather than 15.

Thanks for pointing this out. Yes, it's a typo. The version should be 18 instead of 15. @aphyr this doesn't affect my previous comment. Please still let me know if you have any concerns.


                                                                    {:prev-kv [1754
                                                                               {:value 13,
                                                                                :version 18,
                                                                                :create-revision 25544,
                                                                                :mod-revision 25794}],
                                                                     :prev-kv? true,
                                                                     :header {:member-id 0,
                                                                              :revision 25932,
                                                                              :raft-term 0}}]}}}
lavalamp commented 1 year ago

What is the status of this? Is it an actual problem? I can't quite tell from the comments :)

ahrtr commented 1 year ago

What is the status of this? Is it an actual problem? I can't quite tell from the comments :)

Based on my previous comment https://github.com/etcd-io/etcd/issues/14890#issuecomment-1346006405, I don't see any issue. I do not see any new feedback from the reporter @aphyr either.

I will close this ticket if no further feedback is provided in the following two weeks.

serathius commented 1 year ago

I have done extensive local testing of the same scenario using linearizability tests. I was not able to reproduce the issue even once. Possibly it's still a limitation of the framework; however, I'm starting to suspect that it's a bug in the Jepsen etcd client.

I plan to send a PR that adds this scenario for nightly linearizability tests that run for ~1h every day.

aphyr commented 1 year ago

I'm not sure how else I can convince y'all that these are serious issues, and the fact that etcd's maintainers do not see anything wrong here has me dismayed. Transactions in a serializable system are not supposed to observe transactions which did not succeed, to read their own effects, or to apparently diverge into multiple timelines. The entire point of etcd is to ensure linearizability and serializability, but after months of experimentation I cannot figure out how to get etcd to do those things. I think it very likely that either etcd or jetcd is badly broken. Either of these things is cause for alarm.

I've given up on this. I've put months of unpaid work into these behaviors and written a full report, but the entity which sponsored some of this work earlier last year (and allowed me to file some of these tickets) opted not to allow Jepsen to publicly acknowledge their support. Per Jepsen's ethics policy I'm forced to shelve the report. :-/

serathius commented 1 year ago

@aphyr I share your concern. I think the problem is a lack of maintainer familiarity with the tools you used, which makes it impossible for us to investigate. Can you send an email to etcd-maintainers@googlegroups.com so we can discuss options, including creating another report for etcd? It's a private mailing group for etcd maintainers.

lavalamp commented 1 year ago

@aphyr e.g. the "cycle" in https://github.com/etcd-io/etcd/issues/14890#issuecomment-1341028058 seems to be completely explained by the different versions of the key 1756 -- it had the same value at two different versions (a version is a point in time), and so it's not actually a cycle.

It sounds like you think the tool wouldn't write the same value twice? Maybe there is some other problem that confused the tool, but if there were a time-travel cycle the version numbers would line up, and they don't.

Personally I found the lost updates to be a drastically more plausible bug, a priori, but it seems like no discussion focused on that half of the report?

aphyr commented 1 year ago

It sounds like you think the tool wouldn't write the same value twice?

As I have already explained, this test never writes the same value twice.

ahrtr commented 1 year ago

Thanks @aphyr for the feedback.

I'm not sure how else I can convince y'all that these are serious issues, and the fact that etcd's maintainers do not see anything wrong here has me dismayed.

There are knowledge gaps: etcd maintainers do not know much about Jepsen, while Jepsen engineers do not know much about etcd. That is exactly what makes this conversation harder.

I think it very likely that either etcd or jetcd is badly broken. Either of these things is cause for alarm.

Based on my comment https://github.com/etcd-io/etcd/issues/14890#issuecomment-1346006405 (note there is a typo, see https://github.com/etcd-io/etcd/issues/14890#issuecomment-1347603606), I do not see any issue on the etcdserver side. So I suspect there might be some issue in jetcd or in the Jepsen test cases. But unfortunately, just as I mentioned above, the existing etcd maintainers do not know much about Jepsen. I doubt that jetcd is well maintained by the existing etcd maintainers either.

I've put months of unpaid work into these behaviors and written a full report

I am thinking we can probably raise a ticket with CNCF to extend the Jepsen support for some time? Anyway, I would like to get this ticket clearly clarified. cc @ptabor @spzala @serathius

ahrtr commented 1 year ago

cc @lburgazzoli

Ideally we can get the etcd maintainers, the jetcd maintainers, and the Jepsen engineers to work together to get this clarified.

lavalamp commented 1 year ago

As I have already explained, this test never writes the same value twice.

I mean, seeing the same value with different versions is extremely strong evidence that it did get written twice; the other cause I can think of is data corruption but that would be much more likely to just render it unreadable. Is Jepsen open source? I can't seem to find the source with a quick search.

I think the most likely explanation is that Jepsen attempted a write, it believed the write failed and therefore it was free to re-use the value, but etcd actually recorded the write. (e.g. maybe network failure during the response)

bsbds commented 2 months ago

Hi, I recently investigated these issues as they relate to Jepsen tests of our database. I discovered that the cyclic information flow anomaly was caused by the automatic retry mechanism of the etcd client.

Describe In Detail

From Jepsen's perspective, the cyclic history is as follows:

{:index 0 :time 0 :type :invoke :process 0 :f :txn :value [[:r 1756 27] [:w 1757 15] [:w 1754 11]]}
{:index 1 :time 1 :type :invoke :process 1 :f :txn :value [[:w 1756 27] [:w 1757 14] [:r 1754 11]]}

Let's denote the first txn as T0, and the second txn as T1.

Examining the actual history on etcd (represented in Jepsen history style), we could observe:

{:index 0 :time 0 :type :invoke :process 0 :f :txn :value [[:w 1756 27] [:w 1757 14] [:r 1754 initial-value-of-1754]]}
{:index 1 :time 1 :type :invoke :process 1 :f :txn :value [[:r 1756 27] [:w 1757 15] [:w 1754 11]]}
{:index 2 :time 2 :type :invoke :process 0 :f :txn :value [[:w 1756 27] [:w 1757 14] [:r 1754 11]]}

This history is valid, but we could see that T0 was executed twice by etcd. Why does this not appear in Jepsen's history?

Further investigation into the Jepsen test history reveals the following events:

{:type :ok, :f :txn, :value [[:w 1756 26] [:r 1756 26]], :time 80295095318, :process 140, :index 91318, :debug {:txn-res {:succeeded? true, :txns [], :header {:member-id -5790610604219245522, :revision 25579, :raft-term 8}, :results [{:prev-kv [1756 {:value 23, :version 13, :create-revision 25561, :mod-revision 25578}], :prev-kv? true, :header {:member-id 0, :revision 25579, :raft-term 0}} {:count 1, :kvs {1756 {:value 26, :version 14, :create-revision 25561, :mod-revision 25579}}, :more? false, :header {:member-id 0, :revision 25579, :raft-term 0}}]}}}
{:type :invoke, :f :txn, :value [[:r 1756 nil] [:r 1757 nil]], :time 80295701516, :process 187, :index 91319}
{:type :invoke, :f :txn, :value [[:w 1756 28] [:w 1756 29] [:r 1754 nil] [:r 1756 nil]], :time 80297678596, :process 128, :index 91320}
{:type :invoke, :f :txn, :value [[:r 1756 nil] [:w 1757 15] [:w 1754 11]], :time 80298215719, :process 73, :index 91321}
{:type :invoke, :f :txn, :value [[:r 1756 nil] [:r 1757 nil] [:r 1756 nil] [:r 1756 nil]], :time 80298436213, :process 134, :index 91322}
{:type :invoke, :f :txn, :value [[:w 1756 30]], :time 80299261993, :process 168, :index 91323}
{:type :invoke, :f :txn, :value [[:w 1756 31] [:w 1756 32] [:r 1754 nil] [:w 1758 1]], :time 80299888946, :process 140, :index 91324}
{:type :invoke, :f :txn, :value [[:r 1757 nil]], :time 80301011882, :process 120, :index 91325}
{:type :invoke, :f :txn, :value [[:r 1757 nil] [:r 1758 nil] [:r 1758 nil]], :time 80301327115, :process 144, :index 91326}
{:type :invoke, :f :txn, :value [[:r 1758 nil]], :time 80302766509, :process 114, :index 91327}
{:type :invoke, :f :txn, :value [[:w 1758 2]], :time 80303953141, :process 129, :index 91328}
{:type :invoke, :f :txn, :value [[:w 1754 12]], :time 80304989021, :process 149, :index 91329}
{:type :invoke, :f :txn, :value [[:r 1757 nil] [:w 1757 16]], :time 80308764346, :process 130, :index 91330}
{:type :invoke, :f :txn, :value [[:w 1758 3]], :time 80310859246, :process 182, :index 91331}
{:type :invoke, :f :txn, :value [[:r 1758 nil] [:w 1758 4] [:w 1757 17]], :time 80312295133, :process 108, :index 91332}
{:type :invoke, :f :txn, :value [[:w 1758 5] [:r 1758 nil]], :time 80313974064, :process 177, :index 91333}
{:type :invoke, :f :txn, :value [[:r 1758 nil] [:w 1754 13] [:r 1754 nil] [:r 1754 nil]], :time 80315621308, :process 150, :index 91334}
{:type :invoke, :f :txn, :value [[:r 1758 nil]], :time 80317302096, :process 85, :index 91335}
{:type :invoke, :f :txn, :value [[:r 1758 nil] [:w 1758 6]], :time 80318035351, :process 141, :index 91336}
{:type :invoke, :f :txn, :value [[:w 1758 7]], :time 80318588884, :process 197, :index 91337}
{:type :invoke, :f :txn, :value [[:w 1757 18]], :time 80319621346, :process 101, :index 91338}
{:type :invoke, :f :txn, :value [[:r 1758 nil] [:r 1754 nil] [:w 1757 19] [:w 1758 8]], :time 80321705952, :process 167, :index 91339}
{:type :invoke, :f :txn, :value [[:w 1758 9] [:r 1758 nil] [:r 1758 nil]], :time 80322186767, :process 136, :index 91340}
{:type :invoke, :f :txn, :value [[:w 1754 14] [:r 1757 nil] [:r 1757 nil]], :time 80324143189, :process 113, :index 91341}
{:type :info, :f :kill, :value {"n1" nil, "n2" nil, "n3" nil, "n4" nil, "n5" nil}, :time 82356631435, :process :nemesis, :index 91342}  ;; kill all etcd nodes
{:type :info, :f :start, :value :all, :time 82356715368, :process :nemesis, :index 91343}
{:type :info, :f :start, :value {"n1" :started, "n2" :started, "n3" :started, "n4" :started, "n5" :started}, :time 82443872654, :process :nemesis, :index 91344}
{:type :ok, :f :txn, :value [[:w 1757 18]], :time 82770715209, :process 101, :index 91345, :debug {:txn-res {:succeeded? true, :txns [], :header {:member-id -6773460376229475734, :revision 25581, :raft-term 9}, :results [{:prev-kv [1757 {:value 14, :version 4, :create-revision 25576, :mod-revision 25580}], :prev-kv? true, :header {:member-id 0, :revision 25581, :raft-term 0}}]}}}

The response with header revision 25580 is missing from the history, and Jepsen generated a kill operation before it could be returned. This indicates that etcd committed this command but was killed before it could return the result to the client.

As a result, the etcd client proceeded with a retry, which was applied at revision 26052, causing T0 to be executed twice:

{:type :ok, :f :txn, :value [[:w 1756 27] [:w 1757 14] [:r 1754 11]], :time 83810717854, :process 106, :index 93021, :debug {:txn-res {:succeeded? true, :txns [], :header {:member-id -6773460376229475734, :revision 26052, :raft-term 9}, :results [{:prev-kv [1756 {:value 30, :version 16, :create-revision 25561, :mod-revision 25975}], :prev-kv? true, :header {:member-id 0, :revision 26052, :raft-term 0}} {:prev-kv [1757 {:value 15, :version 16, :create-revision 25576, :mod-revision 25932}], :prev-kv? true, :header {:member-id 0, :revision 26052, :raft-term 0}} {:count 1, :kvs {1754 {:value 11, :version 19, :create-revision 25544, :mod-revision 25932}}, :more? false, :header {:member-id 0, :revision 26052, :raft-term 0}}]}}}

Conclusion

I believe this is an etcd issue, because the client retry is transparent to Jepsen. To address this issue, please consider implementing exactly-once semantics for etcd.

cc @lavalamp @ahrtr @serathius

bsbds commented 2 months ago

The lost update might be a false positive caused by Jepsen itself; please refer to this issue: https://github.com/jepsen-io/elle/issues/28

https://github.com/etcd-io/etcd/issues/14890#issuecomment-2247082363

serathius commented 2 months ago

OMG, huge thanks @bsbds for finding this and informing us that it might be a false positive!

ahrtr commented 2 months ago

Thanks @bsbds for the feedback.

I did not fully follow the details you provided, but at a high level it seems reasonable, and it aligns with what I analysed previously.

The etcd client SDK indeed has retry logic; see below: https://github.com/etcd-io/etcd/blob/d6c0127d264d18255090b4a51e04c5b8e84d5a05/client/v3/client.go#L273

A couple of personal thoughts:

$ git diff
diff --git a/client/v3/client.go b/client/v3/client.go
index 79654db9b..948f4d289 100644
--- a/client/v3/client.go
+++ b/client/v3/client.go
@@ -270,7 +270,7 @@ func (c *Client) dialSetupOpts(creds grpccredentials.TransportCredentials, dopts
                // Disable stream retry by default since go-grpc-middleware/retry does not support client streams.
                // Streams that are safe to retry are enabled individually.
                grpc.WithStreamInterceptor(c.streamClientInterceptor(withMax(0), rrBackoff)),
-               grpc.WithUnaryInterceptor(c.unaryClientInterceptor(withMax(unaryMaxRetries), rrBackoff)),
+               grpc.WithUnaryInterceptor(c.unaryClientInterceptor(withMax(0), rrBackoff)),
        )

        return opts
aphyr commented 2 months ago

I'm fairly sure the false positive @bsbds reported reflects a broken database or test harness, not a problem in Elle.

ahrtr commented 2 months ago

I'm fairly sure the false positive @bsbds reported reflects a broken database or test harness

IMO, most likely it's the latter (the test harness) or jetcd. I don't see any etcdserver-side issue so far; there is clear and strong evidence that the same key/value 1756/27 was written twice. I don't think client-side retrying is an issue either.

As mentioned in my previous comment, the etcd client SDK indeed has retry logic in the grpc-go interceptor for unary RPCs (see the code snippets below). But my understanding is that it's a synchronous invocation, meaning the retrying is transparent to the client: once the client gets a response (whether success or failure), the interceptor won't do anything more. @dfawley could you please double confirm this? If confirmed, is there any higher-level retry on the Jepsen side (e.g. retry if timed out)?

https://github.com/etcd-io/etcd/blob/9f58999f1b173d9ced04553104f65bc0d31f469d/client/v3/client.go#L273

https://github.com/etcd-io/etcd/blob/9f58999f1b173d9ced04553104f65bc0d31f469d/client/v3/retry_interceptor.go#L40-L51
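For readers unfamiliar with gRPC interceptors, the sketch below is a deliberately simplified, hypothetical unary retry interceptor, not etcd's actual retry_interceptor.go logic; it only illustrates the failure mode under discussion: when the connection drops after the server has applied a Txn but before the response arrives, the error looks retryable and the same non-idempotent request gets sent again.

```go
package main

import (
	"context"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// simplifiedRetryInterceptor re-invokes an RPC when the error looks
// transient. If the server already applied the request before the
// connection broke, the retry applies it a second time.
func simplifiedRetryInterceptor(maxAttempts int) grpc.UnaryClientInterceptor {
	return func(ctx context.Context, method string, req, reply interface{},
		cc *grpc.ClientConn, invoker grpc.UnaryInvoker, opts ...grpc.CallOption) error {
		var err error
		for attempt := 0; attempt < maxAttempts; attempt++ {
			err = invoker(ctx, method, req, reply, cc, opts...)
			if err == nil {
				return nil // a response arrived; no further retries
			}
			// Unavailable is ambiguous: the request may or may not have been
			// committed server-side. Retrying here is what can turn one
			// logical Txn into two applied Txns.
			if status.Code(err) != codes.Unavailable {
				return err
			}
		}
		return err
	}
}

func main() {
	// Such an interceptor would be installed at dial time, e.g.:
	//   grpc.Dial(target, grpc.WithUnaryInterceptor(simplifiedRetryInterceptor(3)))
	_ = simplifiedRetryInterceptor
}
```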

But again, Jepsen uses etcd java client sdk, so we need to analyze jetcd's behaviour. cc @lburgazzoli

bsbds commented 2 months ago

It turns out the lost update is not a problem in Elle either; it is also a client retry issue.

The issue arises because the following command was executed twice by the server.

{:type :invoke, :f :txn, :value [[:w 277 1]], :time 23632426147, :process 65, :index 14091}

In the Jepsen history, this would appear as two transactions reading the same value but updating it to different values.

ahrtr commented 2 months ago

The issue arises because the following command was executed twice by the server.

What's the "server"? etcdserver or part of Jepsen test?

bsbds commented 2 months ago

The issue arises because the following command was executed twice by the server.

What's the "server"? etcdserver or part of Jepsen test?

The etcd server. For the same reason, it got killed before returning the result.

ahrtr commented 2 months ago

It turns out the lost update is not a problem in Elle either; it is also a client retry issue.

The issue arises because the following command was executed twice by the server.

{:type :invoke, :f :txn, :value [[:w 277 1]], :time 23632426147, :process 65, :index 14091}

In the Jepsen history, this would appear as two transactions reading the same value but updating it to different values.

To be clearer, there are two issues in this ticket (see the description of this issue). We kept discussing the "cyclic information" issue above, but now you are talking about the other "lost update" issue.

If I understood your comment correctly, you mean that there is a race condition between multiple concurrent requests, caused by the client retry mechanism? Theoretically, it's possible (see the example below). But I don't think it's an etcdserver issue, nor an etcd client SDK issue. Instead, it's the application's responsibility to take care of it (e.g. avoid overwriting the same key by using a TXN similar to optimisticTxn).

Let's work with an example/workflow,

bsbds commented 1 month ago

It turns out the lost update is not a problem in Elle either; it is also a client retry issue.

The issue arises because the following command was executed twice by the server.

{:type :invoke, :f :txn, :value [[:w 277 1]], :time 23632426147, :process 65, :index 14091}

In the Jepsen history, this would appear as two transactions reading the same value but updating it to different values.

To be clearer, there are two issues in this ticket (see the description of this issue). We kept discussing the "cyclic information" issue above, but now you are talking about the other "lost update" issue.

If I understood your comment correctly, you mean that there is a race condition between multiple concurrent requests, caused by the client retry mechanism? Theoretically, it's possible (see the example below). But I don't think it's an etcdserver issue, nor an etcd client SDK issue. Instead, it's the application's responsibility to take care of it (e.g. avoid overwriting the same key by using a TXN similar to optimisticTxn).

Let's work with an example/workflow,

  • client 1 sends a request to write k/v 277/1;
  • client 2 sends a request to write k/v 277/4;
  • client 2 receives a success response; it means 277/4 was successfully persisted;
  • kill the etcdserver & restart the etcdserver;
  • the etcd client SDK retries the write of k/v 277/1, so it's also successfully persisted;
  • client 3 reads k 277, but gets 277/1 instead of 277/4.

Yes, it's a race condition caused by the client retry. In your example, client 1 sent two requests in total, and they both got committed and persisted by the etcdserver. However, only the second response is observed by Jepsen. So from Jepsen's perspective, the etcdserver committed a command it never requested.

I agree that it's not an etcdserver issue, but the client SDK must disable automatic retry for non-idempotent operations.
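For comparison, the guarded style of update ahrtr refers to above (a TXN similar to optimisticTxn, comparing on mod-revision before writing) at least turns a duplicated retry into a failed comparison instead of a silent second write. A minimal sketch using the Go clientv3 API, with the endpoint, key, and value chosen for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// casPut writes value to key only if the key's mod revision is still the one
// observed when it was read. A re-sent (retried) request re-runs the same
// comparison and fails instead of overwriting a newer write.
func casPut(ctx context.Context, cli *clientv3.Client, key, value string, observedModRev int64) (bool, error) {
	resp, err := cli.Txn(ctx).
		If(clientv3.Compare(clientv3.ModRevision(key), "=", observedModRev)).
		Then(clientv3.OpPut(key, value)).
		Commit()
	if err != nil {
		return false, err
	}
	return resp.Succeeded, nil
}

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"http://192.168.122.101:2379"}, // hypothetical endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Read the current state of the key, then attempt a guarded update.
	get, err := cli.Get(ctx, "277")
	if err != nil {
		panic(err)
	}
	if len(get.Kvs) == 0 {
		panic("key 277 does not exist yet; seed it first for this sketch")
	}
	ok, err := casPut(ctx, cli, "277", "4", get.Kvs[0].ModRevision)
	if err != nil {
		panic(err)
	}
	fmt.Println("guarded put applied:", ok)
}
```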

ahrtr commented 1 month ago

but the client SDK must disable automatic retry for non-idempotent operations.

If you understood my previous comment https://github.com/etcd-io/etcd/issues/14890#issuecomment-2244472376 correctly, the retry is transparent to users (e.g. Jepsen). Does Jepsen retry before it gets a response to the previous request?

Also as mentioned in https://github.com/etcd-io/etcd/issues/14890#issuecomment-2242527758, we might need to expose an interface to let users decide to enable/disable the retry.

serathius commented 1 month ago

I think we have enough evidence to conclude that the problem is related to the Java client and not to the etcdserver side. With that I think we can close the investigation, leaving follow-ups to separate issues.

If there is still any doubt, please feel free to continue the discussion and reopen the issue; however, at the current moment I think we can officially reject the Jepsen report as inaccurate.

ahrtr commented 1 month ago

at the current moment I think we can officially reject the Jepsen report as inaccurate.

Agreed.

that problem is related to Java client

The problem might be related to the etcd Java client or to Jepsen. Based on all the discussion so far, I would bet Jepsen.

aphyr commented 1 month ago

Based on all the discussion so far, I would bet Jepsen.

It's jetcd. https://jepsen.io/analyses/jetcd-0.8.2

ahrtr commented 1 month ago

It's jetcd. https://jepsen.io/analyses/jetcd-0.8.2

Looks good to me. The 3.2 Circular Information Flow means it breaks "Serializable", and Aborted Reads means it breaks "Read committed". But I don't buy the analysis in "3.1 Lost Update", given my comment above that the "retry is transparent to users".

I will raise a new issue to track the client retry issue.

ahrtr commented 1 month ago

Read https://github.com/etcd-io/etcd/issues/18424