cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

roachtest: jepsen-batch3/register/majority-ring failed #34567

Closed cockroach-teamcity closed 5 years ago

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/af891db5e120ccc272bb9a10482ac42d263a185e

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch3/register/majority-ring PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1124507&tab=buildLog

The test failed on release-2.1:
    test.go:743,jepsen.go:247,jepsen.go:308: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1124507-jepsen-batch3:6 -- bash -e -c "\
        cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
         ~/lein run test \
           --tarball file://${PWD}/cockroach.tgz \
           --username ${USER} \
           --ssh-private-key ~/.ssh/id_rsa \
           --os ubuntu \
           --time-limit 300 \
           --concurrency 30 \
           --recovery-time 25 \
           --test-count 1 \
           -n 10.128.0.60 -n 10.128.0.59 -n 10.128.0.57 -n 10.128.0.58 -n 10.128.0.56 \
           --test register --nemesis majority-ring \
        > invoke.log 2>&1 \
        " returned:
        stderr:

        stdout:
        Error:  exit status 255
        : exit status 1
petermattis commented 5 years ago

@andreimatei, @nvanbenschoten, @bdarnell can one of you triage this issue?

tbg commented 5 years ago

No artifacts, so I doubt they can. The SHA is from release-2.1 which is worrying. OTOH, hopefully this is just a fluke? I wonder if we can tell from the build log output:

[07:44:17] --- FAIL: jepsen-batch3/register/majority-ring (135.13s)
[07:44:17]     test.go:743,jepsen.go:247,jepsen.go:308: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1124507-jepsen-batch3:6 -- bash -e -c "\
[07:44:17]         cd /mnt/data1/jepsen/cockroachdb && set -eo pipefail && \
[07:44:17]          ~/lein run test \
[07:44:17]            --tarball file://${PWD}/cockroach.tgz \
[07:44:17]            --username ${USER} \
[07:44:17]            --ssh-private-key ~/.ssh/id_rsa \
[07:44:17]            --os ubuntu \
[07:44:17]            --time-limit 300 \
[07:44:17]            --concurrency 30 \
[07:44:17]            --recovery-time 25 \
[07:44:17]            --test-count 1 \
[07:44:17]            -n 10.128.0.60 -n 10.128.0.59 -n 10.128.0.57 -n 10.128.0.58 -n 10.128.0.56 \
[07:44:17]            --test register --nemesis majority-ring \
[07:44:17]         > invoke.log 2>&1 \
[07:44:17]         " returned:
[07:44:17]         stderr:
[07:44:17]
[07:44:17]         stdout:
[07:44:17]         Error:  exit status 255
[07:44:17]         : exit status 1
bdarnell commented 5 years ago

Artifacts gone so soon?

"Real" jepsen failures have exit status 1 (in both of the "exit status" lines in the build log); I believe exit status 255 is always associated with various kinds of flukes, such as #30527. We have hacky attempts to suppress these interruption failures with error-message matching, but maybe we're not catching enough of them. In any case, there's nothing else we can do here.

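The exit-status rule of thumb above can be sketched as a small shell helper. This is a minimal sketch, not part of roachtest: the function name `classify_jepsen_exit` and its output labels are hypothetical, and it assumes the log contains lines of the form `exit status N`, as in the build log quoted in this thread. The reading of 255 as a fluke is the belief stated above, not a guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of roachtest): classify a jepsen roachtest
# failure from build-log text read on stdin.
# Prints one of: real-failure, likely-fluke, unknown.
classify_jepsen_exit() {
  local statuses
  # Collect every number that follows "exit status", one per line.
  # "|| true" keeps the assignment from failing when nothing matches.
  statuses=$(grep -oE 'exit status [0-9]+' | awk '{print $3}' || true)

  if [ -z "$statuses" ]; then
    # No "exit status" lines at all: nothing to go on.
    echo "unknown"
  elif grep -qx '255' <<<"$statuses"; then
    # Assumption from the thread: 255 indicates an interruption-style fluke.
    echo "likely-fluke"
  elif ! grep -qvx '1' <<<"$statuses"; then
    # Every reported status is 1: matches the "real" failure pattern.
    echo "real-failure"
  else
    # Some other status code that the rule of thumb does not cover.
    echo "unknown"
  fi
}
```

For example, piping the two `exit status` lines from this build log (255 and 1) into the function prints `likely-fluke`.
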
tbg commented 5 years ago

I think we delete artifacts after ~4 days. This was probably originally due to the large amount of artifacts we collected; I think it's gotten a lot better in roachtest since we stopped retaining the artifacts for passed tests. If we can up the artifact retention period for roachtests independently, we should do so.