
teamcity: failed tests: cluster-recovery #29811

cockroach-teamcity closed this issue 6 years ago.

cockroach-teamcity commented 6 years ago

The following tests appear to have failed:

#891733:

--- FAIL: roachtest/acceptance/bank/cluster-recovery (600.028s)
    test.go:498,cluster.go:837,bank.go:224: /go/bin/roachprod start local:3 --args --host=127.0.0.1 returned:
        stderr:
        word.
        * - Any user, connecting as root, can read or write any data in your cluster.
        * - There is no network encryption nor authentication, and thus no confidentiality.
        * 
        * Check out how to secure your cluster: https://www.cockroachlabs.com/docs/v2.2/secure-a-cluster.html
        *
        *
        * ERROR: could not cleanup temporary directories from record file: could not lock temporary directory /home/roach/local/3/data/cockroach-temp417971504, may still be in use: IO error: While lock file: /home/roach/local/3/data/cockroach-temp417971504/TEMP_DIR.LOCK: Resource temporarily unavailable
        *
        Failed running "start"
        *
        * ERROR: exit status 1
        *
        Failed running "start"

        github.com/cockroachdb/roachprod/install.Cockroach.Start.func6
            /go/src/github.com/cockroachdb/roachprod/install/cockroach.go:363
        github.com/cockroachdb/roachprod/install.(*SyncedCluster).Parallel.func1.1
            /go/src/github.com/cockroachdb/roachprod/install/cluster_synced.go:1072
        runtime.goexit
            /usr/local/go/src/runtime/asm_amd64.s:2361: 
        2018/09/07 16:24:57 command failed

        stdout:
        local: starting
        : exit status 1
    test.go:776: test timed out (10m0s)
    test.go:498,bank.go:276,bank.go:353,acceptance.go:59: context canceled

Please assign, take a look and update the issue accordingly.

petermattis commented 6 years ago

This should be impossible. The test ran roachprod stop local:3; roachprod start local:3, and the start failed with this error. roachprod stop kills the process with SIGKILL (kill -9). The error message indicates that we're definitely trying to start a cockroach process in the right directory (.../local/3/...). I'm going to try stopping and restarting a node in a loop to see if I can reproduce this.
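
For concreteness, here is a minimal Go sketch of that stop/start loop, not the actual reproduction script: it assumes roachprod is on the PATH and that a local cluster already exists, and the run helper is hypothetical.

package main

import (
	"log"
	"os/exec"
)

// run invokes roachprod with the given arguments and returns its error,
// logging the combined output on failure. (Hypothetical helper.)
func run(args ...string) error {
	out, err := exec.Command("roachprod", args...).CombinedOutput()
	if err != nil {
		log.Printf("roachprod %v failed: %v\n%s", args, err, out)
	}
	return err
}

func main() {
	for i := 1; ; i++ {
		// roachprod stop delivers SIGKILL (kill -9) to the cockroach process.
		if err := run("stop", "local:3"); err != nil {
			log.Fatal(err)
		}
		// A start that hit the TEMP_DIR.LOCK error would exit non-zero here.
		if err := run("start", "local:3"); err != nil {
			log.Fatalf("start failed on iteration %d: %v", i, err)
		}
	}
}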

petermattis commented 6 years ago

> I'm going to try stopping and restarting a node in a loop to see if I can reproduce this.

Nope.

An interesting bit in the above is that the test timed out. I think that might be the key to what is going on here: the timeout causes the context to be canceled, and the next subtest starts while the timed-out test hasn't finished cleaning up yet.

petermattis commented 6 years ago

> An interesting bit in the above is that the test timed out. I think that might be the key to what is going on here: the timeout causes the context to be canceled, and the next subtest starts while the timed-out test hasn't finished cleaning up yet.

I attempted to verify this theory by setting the timeout for these subtests to 10s so that they would continually be canceled and restarted, but I failed to reproduce the failure. Looking at the code, it is clear that the test waits for the "chaos monkey" goroutine to exit before the test is allowed to exit, and this is true no matter how the test exits. Back to the drawing board.
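
To make that invariant concrete, here is a minimal Go sketch, assuming a structure like roachtest's (the names chaosMonkey and runClusterRecovery are hypothetical, not from the actual code): the test body blocks on the chaos goroutine's done channel before returning, even when its context is canceled by a timeout.

package main

import (
	"context"
	"time"
)

// chaosMonkey stands in for the goroutine that stops and restarts nodes;
// this body is hypothetical and simply runs until the context is canceled.
func chaosMonkey(ctx context.Context) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Second):
			// a real implementation would stop and restart a node here
		}
	}
}

func runClusterRecovery(ctx context.Context) {
	done := make(chan struct{})
	go func() {
		defer close(done)
		chaosMonkey(ctx)
	}()
	// ... the bank workload would run here until ctx is done ...
	<-done // the test cannot return before the chaos monkey has exited
}

func main() {
	// The timeout cancels ctx, but runClusterRecovery still waits for the
	// chaos monkey to exit before returning.
	ctx, cancel := context.WithTimeout(context.Background(), 3*time.Second)
	defer cancel()
	runClusterRecovery(ctx)
}

If cleanup really is synchronized this way, a timed-out subtest should never leave a node process holding TEMP_DIR.LOCK when the next subtest starts, which is consistent with the theory not panning out.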

tbg commented 6 years ago

Is there maybe something silly about using kill -9 while also running cockroach start --background, which somehow forks at exactly the right moment to avoid the kill?

petermattis commented 6 years ago

Possibly. Something else to consider: I was attempting reproduction on my laptop (Mac), so perhaps the failure is Linux-specific. I'll try reproducing on a GCE worker today.

petermattis commented 6 years ago

No luck reproducing on a GCE worker so far. I'll keep trying.

@tschottdorf Have you ever been able to reproduce this error?

tbg commented 6 years ago

Nope. Have been distracted.

petermattis commented 6 years ago

I'm back to attempting reproduction of this test failure.

petermattis commented 6 years ago

No luck in reproducing so far (I'm testing on 9c8037d906633d0c952fe07f49eb792592fe5a8f).

petermattis commented 6 years ago

I've been running the following concurrently on both my laptop and gce-worker:

# Rerun the test until roachtest exits non-zero, i.e. until it fails.
while [ $? -eq 0 ]; do
    rm -fr artifacts
    ../bin/roachtest run --local --debug acceptance/bank/cluster-recovery
done 2>&1 | tee out

So far, this has gone through 330 runs total without a failure. I'm going to leave this running and will declare victory after another few hours of runs.

petermattis commented 6 years ago

500+ runs without a failure. I wonder if this was fixed by #30990. This is roughly the same test as node-restart (it just restarts more of the nodes). I'm going to close this for now. If we see another failure we'll jump on it.