cockroach-teamcity closed 6 years ago
This should be impossible. The test ran roachprod stop local:3; roachprod start local:3 and the start failed with this error. roachprod stop sends the process kill -9. The error message indicates that we're definitely trying to start a cockroach process in the right directory (.../local/3/...). I'm going to try stopping and restarting a node in a loop to see if I can reproduce this.
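A minimal sketch of such a stop/restart loop, assuming the two roachprod commands quoted above are the only steps involved. The function name restart_loop and the STOP_CMD / START_CMD variables are invented for this sketch and are not part of roachprod or roachtest.

```shell
# Hypothetical helper: restart one node repeatedly until a command fails.
# STOP_CMD / START_CMD default to the commands quoted from the test;
# both variable names are assumptions made for this sketch.
STOP_CMD=${STOP_CMD:-"roachprod stop local:3"}
START_CMD=${START_CMD:-"roachprod start local:3"}

restart_loop() {
  rl_iterations=$1
  rl_i=1
  while [ "$rl_i" -le "$rl_iterations" ]; do
    # stop sends kill -9, so the old process should be gone before start runs.
    $STOP_CMD || { echo "stop failed on iteration $rl_i" >&2; return 1; }
    $START_CMD || { echo "start failed on iteration $rl_i" >&2; return 1; }
    rl_i=$((rl_i + 1))
  done
  echo "completed $rl_iterations restarts without a failure"
}
```

Calling restart_loop 100 against a running local cluster would stop at the first failed start and report the iteration, which is the interesting data point for a race like this one.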
Nope.
An interesting bit in the above is that the test timed out. I think that might be the key to what is going on here: the timeout cancels the context, and the next subtest starts while the timed-out test hasn't finished cleaning up yet.
I attempted to verify this theory by setting the timeout for these subtests to 10s so that they were continually being canceled and restarted, but I failed to reproduce the problem. Looking at the code, it is clear that the test waits for the "chaos monkey" goroutine to exit before the test is allowed to exit, and this is true no matter how the test exits. Back to the drawing board.
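To illustrate why that guarantee rules out the overlap theory, here is a shell analogue of the pattern. It is not the roachtest Go code: the function name, the marker files, and the use of SIGTERM as the "cancel" signal are all assumptions made for this sketch. The point is only that the "test" cannot return until its background worker has exited, whatever status the test itself reports.

```shell
# Shell analogue (not the roachtest implementation): the test body may pass
# or fail, but the function only returns after the worker has exited.
run_test_with_worker() {
  state_dir=$1      # scratch directory for marker files (hypothetical)
  test_status=$2    # exit status the simulated test body should report

  (
    # Worker: on cancellation, record that cleanup ran, then exit.
    trap 'touch "$state_dir/worker_cleaned_up"; exit 0' TERM
    touch "$state_dir/worker_ready"
    while :; do sleep 0.1; done
  ) &
  worker_pid=$!

  # Wait until the worker has installed its handler before cancelling it.
  while [ ! -e "$state_dir/worker_ready" ]; do sleep 0.1; done

  # Cancel the worker and block until it is really gone.
  kill -TERM "$worker_pid"
  wait "$worker_pid"

  return "$test_status"
}
```

Because wait blocks until the worker has finished its cleanup, a following "subtest" started after run_test_with_worker returns can never overlap with the previous worker, regardless of whether the simulated test passed or failed.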
Is there maybe something silly about using kill -9 but also running cockroach start --background, which somehow forks at exactly the right moment to avoid the kill?
Possibly. Another thing to consider is that I was attempting reproduction on my laptop (Mac). Perhaps the failure is Linux-specific. I'll try reproducing on a GCE worker today.
No luck reproducing on a GCE worker so far. I'll keep trying.
@tschottdorf Have you ever been able to reproduce this error?
Nope. Have been distracted.
I'm back to attempting reproduction of this test failure.
No luck in reproducing so far (I'm testing on 9c8037d906633d0c952fe07f49eb792592fe5a8f).
I've been running the following concurrently on both my laptop and gce-worker:
while [ $? -eq 0 ]; do
  rm -fr artifacts
  ../bin/roachtest run --local --debug acceptance/bank/cluster-recovery
done 2>&1 | tee out
So far, this has gone through 330 runs total without a failure. I'm going to leave this running and will declare victory after another few hours of runs.
500+ runs without a failure. I wonder if this was fixed by #30990. This is roughly the same test as node-restart (it just restarts more of the nodes). I'm going to close this for now. If we see another failure we'll jump on it.
The following tests appear to have failed:
#891733:
Please assign, take a look and update the issue accordingly.