cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

jepsen: bank/majority-ring failed with `wrong-total` #34321

Closed cockroach-teamcity closed 5 years ago

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/e10fb557b11b5ff1b8609aa963da23c37a1143c8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch1/bank/majority-ring PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1113854&tab=buildLog

The test failed on master:
    test.go:743,jepsen.go:247,jepsen.go:308: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1113854-jepsen-batch1:6 -- bash -e -c "\
        cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
         ~/lein run test \
           --tarball file://${PWD}/cockroach.tgz \
           --username ${USER} \
           --ssh-private-key ~/.ssh/id_rsa \
           --os ubuntu \
           --time-limit 300 \
           --concurrency 30 \
           --recovery-time 25 \
           --test-count 1 \
           -n 10.128.0.70 -n 10.128.0.92 -n 10.128.0.78 -n 10.128.0.29 -n 10.128.0.88 \
           --test bank --nemesis majority-ring \
        > invoke.log 2>&1 \
        " returned:
        stderr:

        stdout:
        Error:  exit status 1
        : exit status 1
bdarnell commented 5 years ago

This is a real failure:

INFO [2019-01-29 08:42:48,388] jepsen results - jepsen.store Wrote /mnt/data1/jepsen/cockroachdb/store/cockroachdb-bank:majring/20190129T083549.000Z/results.edn
INFO [2019-01-29 08:42:49,270] main - jepsen.core {:perf
 {:latency-graph {:valid? true},
  :rate-graph {:valid? true},
  :valid? true},
 :timeline {:valid? true},
 :details
 {:valid? false,
  :bad-reads
  [{:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 452,
     :time 233308728429,
     :value [0 11 20 9 13],
     :index 12746}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 387,
     :time 233308992089,
     :value [0 11 20 9 13],
     :index 12748}}
...
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 431,
     :time 296967625647,
     :value [15 0 18 3 17],
     :index 14615}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 541,
     :time 296967653011,
     :value [15 0 18 3 17],
     :index 14616}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 500,
     :time 296967702596,
     :value [15 0 18 3 17],
     :index 14617}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 529,
     :time 296967709008,
     :value [25 7 6 3 12],
     :index 14618}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 670,
     :time 296967733680,
     :value [24 2 17 3 7],
     :index 14620}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 462,
     :time 296986725618,
     :value [25 2 9 3 14],
     :index 14632}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 542,
     :time 296986915503,
     :value [4 9 25 3 12],
     :index 14634}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 510,
     :time 296986999786,
     :value [2 7 27 3 14],
     :index 14635}}
   {:type :wrong-total,
    :expected 50,
    :found 53,
    :op
    {:type :ok,
     :f :read,
     :process 547,
     :time 339949928998,
     :value [0 7 7 14 25],
     :index 14695}}]},
 :valid? false}

Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
bdarnell commented 5 years ago

Possibly has the same cause as the TPC-C errors @nvanbenschoten is looking at (#34228?)

This is a bank test with 5 accounts (and network partitions). At some point just prior to operation 12746 one of the transfers created $3 out of thin air, and all subsequent reads report this inconsistency. The nearest transfers of $3 were at indexes 12723 and 12725, just before a nemesis activation. The client ultimately got a timeout error on these connections, but not until 12898 (after the inconsistent reads, and long after the network partition healed at 12733)

{:type :invoke, :f :transfer, :value {:from 1, :to 0, :amount 1}, :process 426, :time 226181944237, :index 12721}
{:type :ok, :f :transfer, :value {:from 2, :to 3, :amount 4}, :process 298, :time 226194131568, :index 12722}
{:type :invoke, :f :transfer, :value {:from 4, :to 0, :amount 3}, :process 298, :time 226194266856, :index 12723}
{:type :ok, :f :read, :process 395, :time 226284989087, :value [4 7 17 9 13], :index 12724}
{:type :invoke, :f :transfer, :value {:from 1, :to 4, :amount 3}, :process 395, :time 226285168556, :index 12725}
{:type :info, :f ["majring" :start], :process :nemesis, :time 226399731031, :index 12726}
{:type :info, :f ["majring" :start], :process :nemesis, :time 226501818010, :value [:isolated {"10.128.0.29" #{"10.128.0.70" "10.128.0.78"}, "10.128.0.92" #{"10.128.0.88" "10.128.0.78"}, "10.128.0.70" #{"10.128.0.29" "10.128.0.88"}, "10.128.0.78" #{"10.128.0.29" "10.128.0.92"}, "10.128.0.88" #{"10.128.0.92" "10.128.0.70"}}], :index 12727}
{:type :info, :f :transfer, :value {:from 4, :to 0, :amount 2}, :process 379, :time 228551244509, :error :timeout, :index 12728}
{:type :info, :f :transfer, :value {:from 0, :to 2, :amount 4}, :process 279, :time 228551450787, :error :timeout, :index 12729}
{:type :invoke, :f :transfer, :value {:from 3, :to 1, :amount 2}, :process 409, :time 228553132511, :index 12730}
{:type :invoke, :f :transfer, :value {:from 3, :to 0, :amount 4}, :process 309, :time 228553133655, :index 12731}
{:type :info, :f ["majring" :stop], :process :nemesis, :time 231502289665, :index 12732}
{:type :info, :f ["majring" :stop], :process :nemesis, :time 231705236185, :value :network-healed, :index 12733}
{:type :info, :f :read, :process 342, :time 232676248887, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12734}
{:type :info, :f :read, :process 357, :time 232676303850, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12735}
{:type :info, :f :read, :process 422, :time 232677871811, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12736}
{:type :invoke, :f :read, :process 387, :time 232677734349, :index 12737}
{:type :invoke, :f :transfer, :value {:from 1, :to 0, :amount 2}, :process 372, :time 232677783855, :index 12738}
{:type :invoke, :f :read, :process 452, :time 232679451919, :index 12739}
{:type :info, :f :read, :process 321, :time 233248572734, :error [:psql-exception "ERROR: communication error: rpc error: code = Canceled desc = context canceled"], :index 12740}
{:type :invoke, :f :transfer, :value {:from 3, :to 4, :amount 4}, :process 351, :time 233250075018, :index 12741}
{:type :ok, :f :transfer, :value {:from 3, :to 4, :amount 4}, :process 351, :time 233267410923, :index 12742}
{:type :invoke, :f :transfer, :value {:from 4, :to 0, :amount 5}, :process 351, :time 233267588090, :index 12743}
{:type :ok, :f :transfer, :value {:from 1, :to 0, :amount 4}, :process 435, :time 233301067125, :index 12744}
{:type :invoke, :f :read, :process 435, :time 233301231876, :index 12745}
{:type :ok, :f :read, :process 452, :time 233308728429, :value [0 11 20 9 13], :index 12746}