cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.22k stars 3.82k forks source link

roachtest: jepsen interruptions java.lang.InterruptedException: null #30527

Closed cockroach-teamcity closed 4 years ago

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/18519949df8e717c31a636bb153e526f45375ca8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/1/bank-multitable/subcritical-skews PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=919791&tab=buildLog

The test failed on master:
    test.go:498,jepsen.go:232,jepsen.go:285: /home/agent/work/.go/bin/roachprod run teamcity-919791-jepsen-1:6 -- bash -e -c "\
        cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
         ~/lein run test \
           --tarball file://${PWD}/cockroach.tgz \
           --username ${USER} \
           --ssh-private-key ~/.ssh/id_rsa \
           --os ubuntu \
           --time-limit 300 \
           --concurrency 30 \
           --recovery-time 25 \
           --test-count 1 \
           -n 10.128.0.8 -n 10.128.0.9 -n 10.128.0.6 -n 10.128.0.7 -n 10.128.0.5 \
           --test bank-multitable --nemesis subcritical-skews \
        > invoke.log 2>&1 \
        " returned:
        stderr:

        stdout:
        Error:  exit status 255
        : exit status 1
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/18519949df8e717c31a636bb153e526f45375ca8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/1/bank-multitable PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=919791&tab=buildLog

The test failed on master:
cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/18519949df8e717c31a636bb153e526f45375ca8

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stressrace instead of stress and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
make stress TESTS=jepsen/1 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-stderr=false -maxtime 20m -timeout 10m'

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=919791&tab=buildLog

The test failed on master:
bdarnell commented 6 years ago

Another rare variant of the interrupted-at-shutdown problem (triple-posted). This one's on the nemesis thread, which doesn't really make sense because the nemesis is the thread that interrupts the others; I'm not sure where this interruption is coming from.

java.lang.InterruptedException: null
    at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:404) ~[na:1.8.0_181]
    at java.util.concurrent.FutureTask.get(FutureTask.java:191) ~[na:1.8.0_181]
    at clojure.core$deref_future.invokeStatic(core.clj:2208) ~[clojure-1.8.0.jar:na]
    at clojure.core$future_call$reify__6962.deref(core.clj:6688) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invokeStatic(core.clj:2228) ~[clojure-1.8.0.jar:na]
    at clojure.core$deref.invoke(core.clj:2214) ~[clojure-1.8.0.jar:na]
    at clojure.core$map$fn__4785.invoke(core.clj:2644) ~[clojure-1.8.0.jar:na]
    at clojure.lang.LazySeq.sval(LazySeq.java:40) ~[clojure-1.8.0.jar:na]
    at clojure.lang.LazySeq.seq(LazySeq.java:49) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RT.seq(RT.java:521) ~[clojure-1.8.0.jar:na]
    at clojure.core$seq__4357.invokeStatic(core.clj:137) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$seq_reduce.invokeStatic(protocols.clj:24) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$fn__6738.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$fn__6738.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
    at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
    at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
    at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
    at jepsen.control$on_nodes.invokeStatic(control.clj:373) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.control$on_nodes.invoke(control.clj:357) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.control$on_nodes.invokeStatic(control.clj:362) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.control$on_nodes.invoke(control.clj:357) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.cockroach.auto$reset_clocks_BANG_.invokeStatic(auto.clj:223) ~[classes/:na]
    at jepsen.cockroach.auto$reset_clocks_BANG_.invoke(auto.clj:220) ~[classes/:na]
    at jepsen.cockroach.nemesis$bump_time$reify__1252.teardown_BANG_(nemesis.clj:255) ~[classes/:na]
    at jepsen.cockroach.nemesis$restarting$reify__1240.teardown_BANG_(nemesis.clj:199) ~[classes/:na]
    at jepsen.nemesis$teardown_compat_BANG_.invokeStatic(nemesis.clj:50) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.nemesis$teardown_compat_BANG_.invoke(nemesis.clj:44) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.nemesis$compose$reify__4401$fn__4407.invoke(nemesis.clj:193) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.util$map_vals$fn__1319.invoke(util.clj:503) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at clojure.core.reducers$map$fn__554$fn__555.invoke(reducers.clj:167) ~[knossos-0.3.3.jar:na]
    at clojure.core.protocols$iter_reduce.invokeStatic(protocols.clj:49) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$fn__6742.invokeStatic(protocols.clj:75) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$fn__6742.invoke(protocols.clj:75) ~[clojure-1.8.0.jar:na]
    at clojure.core.protocols$fn__6684$G__6679__6697.invoke(protocols.clj:13) ~[clojure-1.8.0.jar:na]
    at clojure.core.reducers$folder$reify__537.coll_reduce(reducers.clj:126) ~[knossos-0.3.3.jar:na]
    at clojure.core$reduce.invokeStatic(core.clj:6545) ~[clojure-1.8.0.jar:na]
    at clojure.core$into.invokeStatic(core.clj:6610) ~[clojure-1.8.0.jar:na]
    at clojure.core$into.invoke(core.clj:6604) ~[clojure-1.8.0.jar:na]
    at jepsen.util$map_kv.invokeStatic(util.clj:498) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.util$map_kv.invoke(util.clj:494) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.util$map_vals.invokeStatic(util.clj:503) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.util$map_vals.invoke(util.clj:500) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.nemesis$compose$reify__4401.teardown_BANG_(nemesis.clj:193) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.nemesis$teardown_compat_BANG_.invokeStatic(nemesis.clj:50) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.nemesis$teardown_compat_BANG_.invoke(nemesis.clj:44) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core.NemesisWorker.teardown_worker_BANG_(core.clj:452) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_$fn__4574.invoke(core.clj:192) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invokeStatic(core.clj:191) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$do_worker_BANG_.invoke(core.clj:162) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at jepsen.core$run_workers_BANG_$fn__4597$fn__4598.invoke(core.clj:228) ~[jepsen-0.1.9-SNAPSHOT.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:152) ~[clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyTo(AFn.java:144) ~[clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:646) ~[clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1881) ~[clojure-1.8.0.jar:na]
    at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1881) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:425) ~[clojure-1.8.0.jar:na]
    at clojure.lang.AFn.applyToHelper(AFn.java:156) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.applyTo(RestFn.java:132) ~[clojure-1.8.0.jar:na]
    at clojure.core$apply.invokeStatic(core.clj:650) ~[clojure-1.8.0.jar:na]
    at clojure.core$bound_fn_STAR_$fn__4671.doInvoke(core.clj:1911) ~[clojure-1.8.0.jar:na]
    at clojure.lang.RestFn.invoke(RestFn.java:397) ~[clojure-1.8.0.jar:na]
    at clojure.lang.AFn.run(AFn.java:22) ~[clojure-1.8.0.jar:na]
    at java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_181]
andreimatei commented 6 years ago

I'm seeing two types of barrier exceptions coming from Jepsen. Just checking - are these both known and equally ignorable? Cause I want to make our roachtest ignore them.

138 ERROR [2018-10-22 20:25:35,556] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
139 java.lang.InterruptedException: null
140   at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:211) ~[na:1.8.0_181]
141   at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:362) ~[na:1.8.0_181]
142   at jepsen.generator.Synchronize.op(generator.clj:667) ~[jepsen-0.1.9-SNAPSHOT.jar:na]

ERROR [2018-10-20 03:44:39,113] main - jepsen.cli Oh jeez, I'm sorry, Jepsen broke. Here's why:
java.util.concurrent.BrokenBarrierException: null
        at java.util.concurrent.CyclicBarrier.dowait(CyclicBarrier.java:250) ~[na:1.8.0_181]
        at java.util.concurrent.CyclicBarrier.await(CyclicBarrier.java:362) ~[na:1.8.0_181]
andreimatei commented 6 years ago

Ben said they're both the same thing.

bdarnell commented 6 years ago

Yeah, they're basically the same thing. When one thread is interrupted, any other threads waiting on the same barrier get BrokenBarrierExceptions. These exceptions then race up to the top-level exception logger. Usually the InterruptedException wins because it's the root cause of all the BrokenBarrierExceptions, but sometimes it doesn't.

andreimatei commented 6 years ago

Note for the future: the Jepsen roachtest now has specific code for ignoring these errors that should go away once this issue is fixed: https://github.com/cockroachdb/cockroach/blob/b2ce21d89d8ad3c9fd58e260e011f5ca2eed91f8/pkg/cmd/roachtest/jepsen.go#L230-L231

nvanbenschoten commented 4 years ago

Closing as stale.