roachtest: many jepsen failures

cockroach-teamcity commented 6 years ago

SHA: https://github.com/cockroachdb/cockroach/commits/0ed007055c52c396a7474a12387b1de1a7b359c9

Parameters:

Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=666851&tab=buildLog

    jepsen.go:235,jepsen.go:279: timed out

bdarnell commented 6 years ago

Looks like we got unlucky and GCE terminated one of our instances at 8:25. It was restarted at 8:37, but by then most of the tests had failed.

This resulted in a couple of different failure modes. This one (register/subcritical-skews) was running when the instance died, and it ultimately hit its 20-minute timeout. For the others, cluster.InternalIP() returned an empty string, which, due to sloppy command-line handling, made the tests fail immediately when they tried to resolve -n as a hostname.

It's probably more trouble than it's worth to try and handle losing instances gracefully here, although we may want to think about only posting one issue for multiple subtest failures. When multiple jepsen tests fail in the same run, they are much more likely to share one root cause than to have several separate causes.

petermattis commented 6 years ago

> It's probably more trouble than it's worth to try and handle losing instances gracefully here, although we may want to think about only posting one issue for multiple subtest failures.

Did you have a heuristic in mind? Always bundle all of the subtest failures into a single issue?

bdarnell commented 6 years ago

Yeah, I was just thinking of bundling all the failures into one issue per top-level test.
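A rough sketch of that bundling heuristic in Go, assuming subtest names are slash-separated with the top-level test first (for example "jepsen/register/subcritical-skews"). The failure type, the bundleByTopLevel helper, and the sample names and messages are all hypothetical; this is not how the roachtest issue poster is actually implemented.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// failure is a hypothetical record of one failed subtest.
type failure struct {
	Test    string // full name, e.g. "jepsen/register/subcritical-skews"
	Message string
}

// bundleByTopLevel groups failures by the part of the test name before
// the first "/", so each top-level test yields a single bundle.
func bundleByTopLevel(failures []failure) map[string][]failure {
	bundles := make(map[string][]failure)
	for _, f := range failures {
		top := strings.SplitN(f.Test, "/", 2)[0]
		bundles[top] = append(bundles[top], f)
	}
	return bundles
}

func main() {
	// Illustrative data only.
	failures := []failure{
		{Test: "jepsen/register/subcritical-skews", Message: "timed out"},
		{Test: "jepsen/bank/majority-ring", Message: "could not resolve host"},
		{Test: "othertest", Message: "unrelated failure"},
	}

	bundles := bundleByTopLevel(failures)

	// Sort the keys so the output order is stable.
	tops := make([]string, 0, len(bundles))
	for top := range bundles {
		tops = append(tops, top)
	}
	sort.Strings(tops)

	// One issue per top-level test, listing every subtest failure in it.
	for _, top := range tops {
		fmt.Printf("issue for %s (%d failures)\n", top, len(bundles[top]))
		for _, f := range bundles[top] {
			fmt.Printf("  %s: %s\n", f.Test, f.Message)
		}
	}
}
```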

tbg commented 6 years ago

Doesn't seem useful to keep this open; please chime in if you disagree, @bdarnell.

bdarnell commented 6 years ago

I agree; I was planning to close the old jepsen flakes and start fresh now that it seems healthier.

cockroachdb / cockroach

roachtest: many jepsen failures #25692