cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

jepsen: Failure after "everything looks good" #20497

Closed bdarnell closed 6 years ago

bdarnell commented 6 years ago

Jepsen tests are occasionally reporting failure after an apparently successful run, with no apparent cause of failure:

[10:40:11][/majority-ring] Everything looks good! ヽ(‘ー`)ノ
[10:40:11][/majority-ring] +(teamcity-jepsen-run-one.sh) progress Creating archive from controller output
[10:40:11][/majority-ring] +(teamcity-jepsen-run-one.sh) printf '##%s[progressMessage '\''%s'\'']\n' teamcity 'Creating archive from controller output'
[10:40:11][/majority-ring] +(teamcity-jepsen-run-one.sh) ssh -o ServerAliveInterval=60 -o 'StrictHostKeyChecking no' -i /home/agent/.ssh/google_compute_engine ubuntu@104.196.219.45 'tar -chj --ignore-failed-read -f- jepsen/cockroachdb/store/latest jepsen/cockroachdb/invoke.log'
[10:40:17][/majority-ring] +(teamcity-jepsen-run-one.sh) progress Resetting latest run for next test
[10:40:17][/majority-ring] +(teamcity-jepsen-run-one.sh) printf '##%s[progressMessage '\''%s'\'']\n' teamcity 'Resetting latest run for next test'
[10:40:17][/majority-ring] Resetting latest run for next test
[10:40:17][/majority-ring] +(teamcity-jepsen-run-one.sh) ssh -o ServerAliveInterval=60 -o 'StrictHostKeyChecking no' -i /home/agent/.ssh/google_compute_engine ubuntu@104.196.219.45 'rm -f jepsen/cockroachdb/store/latest'
[10:40:18][/majority-ring] +(teamcity-jepsen-run-one.sh) tc Failed Monotonic/majority-ring
[10:40:18][/majority-ring] +(teamcity-jepsen-run-one.sh) printf '##%s[test%s name='\''Jepsen%s'\'']\n' teamcity Failed Monotonic/majority-ring
bdarnell commented 6 years ago

Looking back further in the logs, we see that this was a "silent for too long" failure:

[10:39:04][/majority-ring] Jepsen test was silent for too long, aborting
[10:40:11][/majority-ring] ssh: connect to host 104.196.219.45 port 22: Connection timed out
[10:40:11][/majority-ring] Test failed: exit code 142. Grabbing artifacts from controller...
[10:40:11][/majority-ring] +(teamcity-jepsen-run-one.sh) progress 'Test failed: exit code 142. Grabbing artifacts from controller...'
[10:40:11][/majority-ring] +(teamcity-jepsen-run-one.sh) printf '##%s[progressMessage '\''%s'\'']\n' teamcity 'Test failed: exit code 142. Grabbing artifacts from controller...'

We don't have complete logs available for this run, but the jepsen controller appears to have finished one minute before the "silent too long" failure.

[10:40:11][/majority-ring] INFO [2017-11-20 10:38:03,267] main - jepsen.core {:timeline {:valid? true},
[10:40:11][/majority-ring]  :perf
[10:40:11][/majority-ring]  {:latency-graph {:valid? true},
[10:40:11][/majority-ring]   :rate-graph {:valid? true},
[10:40:11][/majority-ring]   :valid? true},
[10:40:11][/majority-ring]  :g2
[10:40:11][/majority-ring]  {:valid? true,
[10:40:11][/majority-ring]   :key-count 2833,
[10:40:11][/majority-ring]   :legal-count 328,
[10:40:11][/majority-ring]   :illegal-count 0,
[10:40:11][/majority-ring]   :illegal {}},
[10:40:11][/majority-ring]  :valid? true}
[10:40:11][/majority-ring] 
[10:40:11][/majority-ring] 
[10:40:11][/majority-ring] Everything looks good! ヽ(‘ー`)ノ

The ssh timeout at 10:40:11 must be this line; the following scp to grab the logs worked. So this looks like a network fluke. Since it hasn't recurred, I think this issue can be closed.