Closed cockroach-teamcity closed 6 years ago
Looks like we got unlucky and GCE terminated one of our instances at 8:25. It was restarted at 8:37, but by then most of the tests had failed.
This resulted in a couple of different failure modes: This one (register/subcritical-skews) was the one running when the instance died, and it ultimately hit its 20-minute timeout. For the others, cluster.InternalIP() returned an empty string, which, due to sloppy command-line handling, made the tests fail immediately when trying to resolve -n as a hostname.
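One way to make this failure mode more obvious is to validate the address before it ever reaches the command line. The following is only a minimal sketch: `nodeArgs` is a hypothetical helper, not something from the actual test harness, and it assumes the address is passed as the value of a `-n` flag.

```go
package main

import (
	"fmt"
	"net"
)

// nodeArgs is a hypothetical helper (an assumption for illustration, not
// part of the real harness). It builds the "-n <ip>" argument pair and
// returns an error when the IP is empty or malformed, so a missing value
// is reported directly instead of producing a confusing hostname lookup.
func nodeArgs(ip string) ([]string, error) {
	if ip == "" {
		return nil, fmt.Errorf("internal IP is empty; the instance may be down")
	}
	if net.ParseIP(ip) == nil {
		return nil, fmt.Errorf("internal IP %q is not a valid address", ip)
	}
	return []string{"-n", ip}, nil
}

func main() {
	// The second entry simulates an empty result from an IP lookup.
	for _, ip := range []string{"10.0.0.5", ""} {
		args, err := nodeArgs(ip)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(args)
	}
}
```

With a check like this, a lost instance would surface as an explicit "internal IP is empty" error rather than a hostname resolution failure.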
It's probably more trouble than it's worth to try to handle losing instances gracefully here, although we may want to think about only posting one issue for multiple subtest failures. When multiple jepsen tests fail in the same run, they are much more likely to share a single root cause than to have several separate causes.
> It's probably more trouble than it's worth to try and handle losing instances gracefully here, although we may want to think about only posting one issue for multiple subtest failures.
Did you have a heuristic in mind? Always bundle all of the subtest failures into a single issue?
Yeah, I was just thinking of bundling all the failures into one issue per top-level test.
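A rough sketch of that bundling heuristic, for illustration only: `bundleByTopLevel` is a hypothetical function, and the "top-level/subtest" naming with `/` as the separator (for example `jepsen/register/subcritical-skews`) is an assumption about how test names are structured, not something confirmed in this thread.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// bundleByTopLevel groups failed subtest names by the segment before the
// first "/". Both the function and the naming scheme are assumptions made
// for this sketch.
func bundleByTopLevel(failed []string) map[string][]string {
	bundles := make(map[string][]string)
	for _, name := range failed {
		top := strings.SplitN(name, "/", 2)[0]
		bundles[top] = append(bundles[top], name)
	}
	return bundles
}

func main() {
	// Hypothetical failure list from a single run.
	failed := []string{
		"jepsen/register/subcritical-skews",
		"jepsen/register/majority-ring",
		"jepsen/bank/split",
	}
	bundles := bundleByTopLevel(failed)

	// Sort the keys so the output order is deterministic.
	tops := make([]string, 0, len(bundles))
	for top := range bundles {
		tops = append(tops, top)
	}
	sort.Strings(tops)

	// One issue per top-level test, listing every failed subtest.
	for _, top := range tops {
		fmt.Printf("issue for %s: %d failed subtests\n", top, len(bundles[top]))
		for _, name := range bundles[top] {
			fmt.Println("  -", name)
		}
	}
}
```

The issue poster would then file one issue per map key instead of one per failed subtest.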
Doesn't seem useful to keep this open, please chime in if you disagree @bdarnell.
I agree; I was planning to close the old jepsen flakes and start fresh now that it seems healthier.
SHA: https://github.com/cockroachdb/cockroach/commits/0ed007055c52c396a7474a12387b1de1a7b359c9
Parameters:
Failed test: https://teamcity.cockroachdb.com/viewLog.html?buildId=666851&tab=buildLog