cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

teamcity: failed tests on master: Jepsen/JepsenSetupCluster #20115

Closed: cockroach-teamcity closed this issue 6 years ago

cockroach-teamcity commented 6 years ago

The following tests appear to have failed:

#412373:

--- FAIL: Jepsen/JepsenSetupCluster (5.957s)
None

Please assign, take a look and update the issue accordingly.

bdarnell commented 6 years ago

@nvanbenschoten This is because you had a manual test running at the time the nightly was triggered, right?

nvanbenschoten commented 6 years ago

It looks like it is, yes. I'm not sure how that happened, though, because my tests were failing pretty quickly last night due to the Unknown option: "--tarball" error (the --tarball/--package-url flag issue). I'm kicking off a run on master now to see if it's affected.

nvanbenschoten commented 6 years ago

Saw this again when I tried to run Jepsen on master. #20125.

bdarnell commented 6 years ago

Looks like a run failed to clean up after itself. Did you cancel a run at some point?

I'm deleting the GCE resources and trying another build.

nvanbenschoten commented 6 years ago

Yes, I think I did cancel a run sometime yesterday, once it became clear that the run wasn't going to succeed.

bdarnell commented 6 years ago

OK, Jepsen versioning is very confusing: it looks like a monorepo but doesn't act like one, because the subpackages depend on published releases of each other, so my fix wasn't actually getting applied. I've got #20129 to fix this by adapting to the new flag name instead of fixing it upstream.

bdarnell commented 6 years ago

This happened again, with another cancelled manual build. We need to either make sure that this cleanup happens even if a build is cancelled (maybe move from terraform to roachprod?), or at least that we don't leave the orphaned machines around blocking future test runs (and costing money) for a week at a time.

We should probably also move to randomized resource names so we're not limited to one instance of these tests running at a time, but only after we've made sure they'll get cleaned up reliably.
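To make the randomized-name idea concrete, here is a minimal sketch, not the project's actual tooling. The function names make_resource_name and is_stale, the "jepsen" base, and the base-timestamp-suffix layout are all assumptions for illustration. Embedding the creation time in the name lets a later sweep identify orphans by age without any other bookkeeping.

```python
import secrets
import time
from typing import Optional


def make_resource_name(base: str, now: Optional[float] = None) -> str:
    """Build a name of the form '<base>-<unix seconds>-<6 hex chars>'.

    Hypothetical naming scheme: the random suffix lets several runs coexist,
    and the timestamp records when the resource was created.
    """
    created = int(time.time() if now is None else now)
    suffix = secrets.token_hex(3)  # 3 random bytes -> 6 hex characters
    return f"{base}-{created}-{suffix}"


def is_stale(name: str, max_age_seconds: int, now: Optional[float] = None) -> bool:
    """Return True if the timestamp embedded in the name is older than max_age_seconds."""
    current = time.time() if now is None else now
    # rsplit from the right so a base that itself contains '-' still parses.
    created = int(name.rsplit("-", 2)[1])
    return current - created > max_age_seconds
```

A periodic sweep could list resources whose names start with the base and delete those for which is_stale returns True, which bounds how long an orphaned machine can linger.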

bdarnell commented 6 years ago

While the build linked in my previous comment was manual, its cancellation was not: Canceled with comment: Agent removed. It got caught up in other teamcity operations.

Maybe we should just run the cleanup step at the start of the process in addition to the end. This will ensure that we don't leave the orphaned resources around for more than a day. But I'm not sure if that works, since terraform destroy relies on local state to know which resources have been created.
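The clean-up-at-both-ends idea can be sketched as follows. This is an illustration under assumptions, not the real build script: run_with_cleanup and its two callables are hypothetical names, and the cleanup callable is assumed to be idempotent and able to find leftover resources without local state (for example by a known name prefix), which is exactly the open question with terraform destroy.

```python
from typing import Callable


def run_with_cleanup(cleanup: Callable[[], None], run_tests: Callable[[], None]) -> None:
    """Run cleanup before the tests and again afterwards.

    The first call removes leftovers from an earlier run that was cancelled
    or killed. The call in 'finally' covers normal completion and test
    failures, but is not reached if the process itself is killed, which is
    why the up-front call is needed as well.
    """
    cleanup()
    try:
        run_tests()
    finally:
        cleanup()
```

With this shape, an orphaned cluster survives at most until the next scheduled run starts, provided the cleanup step can actually discover it.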

bdarnell commented 6 years ago

#20515