cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

jepsen: /root/lein: No such file or directory #37831

Closed (cockroach-teamcity closed this issue 5 years ago)

cockroach-teamcity commented 5 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/db9c1217a6967fcac2d135cf0f24a4265dc76d77

Parameters:

To repro, try:

# Don't forget to check out a clean suitable branch and experiment with the
# stress invocation until the desired results present themselves. For example,
# using stress instead of stressrace and passing the '-p' stressflag which
# controls concurrency.
./scripts/gceworker.sh start && ./scripts/gceworker.sh mosh
cd ~/go/src/github.com/cockroachdb/cockroach && \
stdbuf -oL -eL \
make stressrace TESTS=jepsen-batch1/bank-multitable/start-kill-2 PKG=roachtest TESTTIMEOUT=5m STRESSFLAGS='-maxtime 20m -timeout 10m' 2>&1 | tee /tmp/stress.log

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=1310296&tab=buildLog

The test failed on branch=master, cloud=gce:
    cluster.go:1516,jepsen.go:149,jepsen.go:173,jepsen.go:324,test.go:1251: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod run teamcity-1310296-jepsen-batch1:6 -- bash -e -c "cd /mnt/data1/jepsen/jepsen && ~/lein install && rm -f /mnt/data1/jepsen/cockroachdb/invoke.log" returned:
        stderr:

        stdout:
        bash: /root/lein: No such file or directory
        Error:  ssh verbose log retained in /root/.roachprod/debug/ssh_35.231.24.25_2019-05-26T10:37:19Z: exit status 1
        : exit status 1

tbg commented 5 years ago

^- this hit all (or at least a lot of) runs that night. Is this just a rare fluke? Were there recent changes that could've caused this? cc @bdarnell There haven't been any recent changes to the jepsen infra that I'm aware of.

bdarnell commented 5 years ago

lein is the Clojure package manager/runner/kitchen sink, installed by this line. There were no changes on our side, nor have there been any recent releases of lein. I think it's probably a rare GitHub/network flake combined with error handling that's not quite right (it looks like we need to add the -f flag to curl so that it reports failure as expected).

bdarnell commented 5 years ago

Oh, the problem is that if we hit the apt-get failure (which has recently been skipped, as of #37430), we still write the jepsen_initialized file (from a defer), which causes subsequent test runs to proceed on the incompletely initialized cluster (the apt step comes before the lein installation). I have no idea why that's a defer; it looks like the file should only be written on successful initialization.