cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: jepsen/3/register/split failed on master #29057

Closed (cockroach-teamcity closed 6 years ago)

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/a6e5d24201c45b49415d95be9b04d9d0a523e44c

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/3/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=859524&tab=buildLog

    test.go:494,jepsen.go:232,jepsen.go:285: /home/agent/work/.go/bin/roachprod run teamcity-859524-jepsen-3:6 -- bash -e -c "\
        cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
         ~/lein run test \
           --tarball file://${PWD}/cockroach.tgz \
           --username ${USER} \
           --ssh-private-key ~/.ssh/id_rsa \
           --os ubuntu \
           --time-limit 300 \
           --concurrency 30 \
           --recovery-time 25 \
           --test-count 1 \
           -n 10.128.0.19 -n 10.128.0.31 -n 10.128.0.12 -n 10.128.0.9 -n 10.128.0.26 \
           --test register --nemesis split \
        > invoke.log 2>&1 \
        ": exit status 1

tbg commented 6 years ago

ERROR [2018-08-25 09:12:06,195] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.BrokenBarrierException: null
    at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:250) ~[na:1.8.0_181]
    at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:362) ~[na:1.8.0_181]
    at jepsen.generator.Synchronize.op(generator.clj:664) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.generator.Concat.op(generator.clj:606) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.generator$op_and_validate.invokeStatic(generator.clj:34) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.generator$op_and_validate.invoke(generator.clj:30) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core.NemesisWorker.run_worker_BANG_(core.clj:442) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invokeStatic(core.clj:175) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invoke(core.clj:162) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$run_workers_BANG_$fn__4597$fn__4598.invoke(core.clj:228) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) ~[clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:646) ~[clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) ~[clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) ~[clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) ~[clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:650) ~[clojure-1.8.0.jar:na]
    at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.8.0.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.8.0.jar:na]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]

Is that a Jepsen problem or a result of a CockroachDB problem? 🤔

tbg commented 6 years ago

Jepsen's invoke log has this:

2018-08-25 09:11:57,701{GMT}    INFO    [jepsen worker 25] jepsen.util: 25  :invoke :read   [178 nil]
2018-08-25 09:11:57,701{GMT}    INFO    [jepsen worker 10] jepsen.util: 10  :invoke :cas    [179 [1 2]]
2018-08-25 09:11:57,702{GMT}    INFO    [jepsen worker 0] jepsen.util: 0    :invoke :cas    [177 [3 4]]
2018-08-25 09:11:57,702{GMT}    INFO    [jepsen worker 24] jepsen.util: 24  :invoke :cas    [178 [2 2]]
2018-08-25 09:11:57,702{GMT}    INFO    [jepsen worker 11] jepsen.util: 11  :invoke :write  [179 4]
2018-08-25 09:11:57,702{GMT}    WARN    [jepsen worker 1] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.31"]; reopening
2018-08-25 09:11:57,702{GMT}    INFO    [jepsen worker 5] jepsen.util: 5    :invoke :read   [177 nil]
2018-08-25 09:11:57,703{GMT}    WARN    [jepsen worker 9] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.26"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 3] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.9"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 16] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.31"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 13] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.9"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 27] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.12"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 17] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.12"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 29] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.26"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 26] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.31"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 25] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.19"]; reopening
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 0] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.19"]; reopening
2018-08-25 09:11:57,705{GMT}    WARN    [jepsen worker 10] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.19"]; reopening
2018-08-25 09:11:57,705{GMT}    WARN    [jepsen worker 11] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.31"]; reopening
2018-08-25 09:11:57,705{GMT}    WARN    [jepsen worker 24] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.26"]; reopening
2018-08-25 09:11:57,707{GMT}    WARN    [jepsen worker 5] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.19"]; reopening
2018-08-25 09:11:57,713{GMT}    INFO    [jepsen worker 7] jepsen.util: 7    :ok :read   [177 3]
2018-08-25 09:11:57,704{GMT}    WARN    [jepsen worker 8] jepsen.cockroach.register: Encountered error with conn [:cockroach "10.128.0.9"]; reopening
2018-08-25 09:11:57,726{GMT}    WARN    [jepsen worker 12] jepsen.core: Process 12 crashed
java.lang.InterruptedException: null
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[na:1.8.0_181]
    at java.util.concurrent.FutureTask.get(FutureTask.java:204) ~[na:1.8.0_181]
    at clojure.core$deref_future.invokeStatic(core.clj:2210) ~[clojure-1.8.0.jar:na]
    at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invokeStatic(core.clj:2232) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
    at jepsen.cockroach.register.AtomicClient$fn__1852$fn__1859.invoke(register.clj:43) ~[classes/:na]
    at jepsen.cockroach.register.AtomicClient$fn__1852.invoke(register.clj:43) ~[classes/:na]
    at jepsen.cockroach.register.AtomicClient.invoke_BANG_(register.clj:41) ~[classes/:na]
    at jepsen.core$invoke_op_BANG_$fn__4611.invoke(core.clj:260) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$invoke_op_BANG_.invokeStatic(core.clj:260) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$invoke_op_BANG_.invoke(core.clj:255) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core.ClientWorker.run_worker_BANG_(core.clj:391) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invokeStatic(core.clj:175) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invoke(core.clj:162) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$run_workers_BANG_$fn__4597$fn__4598.invoke(core.clj:228) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:646) [clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) [clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) [clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:650) [clojure-1.8.0.jar:na]
    at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) [clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) [clojure-1.8.0.jar:na]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
2018-08-25 09:11:57,728{GMT}    WARN    [jepsen worker 20] jepsen.core: Process 20 crashed
java.lang.InterruptedException: null
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[na:1.8.0_181]
    at java.util.concurrent.FutureTask.get(FutureTask.java:204) ~[na:1.8.0_181]
    at clojure.core$deref_future.invokeStatic(core.clj:2210) ~[clojure-1.8.0.jar:na]
    at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invokeStatic(core.clj:2232) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
    at jepsen.cockroach.register.AtomicClient$fn__1852$fn__1859.invoke(register.clj:43) ~[classes/:na]
    at jepsen.cockroach.register.AtomicClient$fn__1852.invoke(register.clj:43) ~[classes/:na]
    at jepsen.cockroach.register.AtomicClient.invoke_BANG_(register.clj:41) ~[classes/:na]
    at jepsen.core$invoke_op_BANG_$fn__4611.invoke(core.clj:260) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$invoke_op_BANG_.invokeStatic(core.clj:260) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$invoke_op_BANG_.invoke(core.clj:255) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core.ClientWorker.run_worker_BANG_(core.clj:391) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invokeStatic(core.clj:175) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invoke(core.clj:162) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$run_workers_BANG_$fn__4597$fn__4598.invoke(core.clj:228) [jepsen-0.1.9-SNAPSHOT.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:646) [clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) [clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) [clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:650) [clojure-1.8.0.jar:na]
    at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) [clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) [clojure-1.8.0.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) [clojure-1.8.0.jar:na]
    at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
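
For context on the `BrokenBarrierException` in the first trace: in `java.util.concurrent`, a `CyclicBarrier` is put into the broken state when a thread waiting on it is interrupted (or times out, or the barrier is reset), and every other waiter then gets `BrokenBarrierException`. The sketch below only demonstrates that JDK behavior. The class and method names (`BarrierDemo`, `observe`) are made up for illustration, and nothing in the logs above establishes that the interrupted client workers were waiting on the same barrier as the nemesis, so this is one possible mechanism, not a diagnosis.

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;

// Hypothetical stand-alone demo; not Jepsen or CockroachDB code.
class BarrierDemo {
    /** Returns what each of the two waiters observed, as "first / second". */
    static String observe() throws InterruptedException {
        // Three parties are required but only two threads ever arrive,
        // so both block inside await() until something breaks the barrier.
        CyclicBarrier barrier = new CyclicBarrier(3);
        String[] seen = new String[2];
        Thread[] waiters = new Thread[2];

        for (int i = 0; i < waiters.length; i++) {
            final int id = i;
            waiters[i] = new Thread(() -> {
                try {
                    barrier.await();
                    seen[id] = "passed";
                } catch (InterruptedException e) {
                    seen[id] = "InterruptedException";
                } catch (BrokenBarrierException e) {
                    seen[id] = "BrokenBarrierException";
                }
            });
            waiters[i].start();
        }

        // Wait until both threads are parked at the barrier.
        while (barrier.getNumberWaiting() < 2) {
            Thread.sleep(10);
        }

        // Interrupting one waiter breaks the barrier for the other one too.
        waiters[0].interrupt();
        for (Thread t : waiters) {
            t.join();
        }
        return seen[0] + " / " + seen[1];
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(observe());
    }
}
```

The interrupted thread sees `InterruptedException` while the thread that was merely waiting alongside it sees `BrokenBarrierException`, which is the same pairing of exception types that shows up across the two logs above.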

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/f18466169337ddc4476613a0c324432854062d59

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/3/register/split PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=876542&tab=buildLog

bdarnell commented 6 years ago

This is the same as #26279.