
roachtest: nightly master fails with errors in roachprod create every night #66184

tbg closed this issue 3 years ago

tbg commented 3 years ago
[Step 2/3]   |  - The zone 'projects/cockroach-ephemeral/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.: exit status 1)

https://teamcity.cockroachdb.com/viewLog.html?buildId=3058235&tab=buildLog&_focus=4077

tbg commented 3 years ago

Also seeing master runs die, but I'm not clearly seeing the out-of-quota error here, so it might be something else:

[10:20:44] : [Step 2/3] Worker 1 returned with error. Quiescing. Error: cloud cpu pool closed: Worker 13 returned with error. Quiescing. Error: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod create teamcity-3051465-1622873101-54-n7cpu16-geo -n 7 --clouds=gce --gce-machine-type=n1-highcpu-16 --gce-zones=us-central1-a,us-central1-b,us-central1-c --geo --lifetime=12h0m0s --local-ssd-no-ext4-barrier returned: exit status 1

https://teamcity.cockroachdb.com/viewLog.html?buildId=3051465&buildTypeId=Cockroach_Nightlies_WorkloadNightly&tab=buildLog&branch_Cockroach_Nightlies=master&filter=debug&_focus=3634

That test has a 21h timeout and had been running for only 7 minutes, so it was nowhere close to timing out and shouldn't have been interrupted like that.

tbg commented 3 years ago

Same in https://teamcity.cockroachdb.com/viewLog.html?tab=buildLog&logTab=tree&filter=debug&expand=all&buildId=3053607&_focus=4677 (the build after that)

So effectively it looks like roachtest hasn't run fully for two consecutive nights on master.

jlinder commented 3 years ago

Hrm. The only quota that was even close to being reached in the cockroach-ephemeral project was for local SSD (90% in us-east1; that quota has now been raised).

The max CPU quota usage in the last 7 days wasn't even 50% in us-central1.

[Screenshot, 2021-06-08: CPU quota usage over the last 7 days]
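For posterity, the same quota-vs-usage numbers can be pulled from the CLI instead of the console. A minimal sketch (the output shape varies by gcloud version):

$ # Project-wide quotas (CPUs, local SSD, ...) with current usage:
$ gcloud compute project-info describe --project cockroach-ephemeral \
    --format="yaml(quotas)"
$ # Per-region quotas for the region in question:
$ gcloud compute regions describe us-central1 --project cockroach-ephemeral \
    --format="yaml(quotas)"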

tbg commented 3 years ago

Hmm, maybe the message is telling us that the zone itself just didn't have enough resources? I.e., it's not about our configured quota; GCE was simply out of vCPUs in that zone?
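For what it's worth, the two failure modes should produce different error text, so a minimal create in the affected zone can tell them apart. A sketch (instance name hypothetical; the quota message is paraphrased from memory):

$ # A quota violation would come back as something like
$ #   "Quota 'CPUS' exceeded. Limit: ... in region us-central1."
$ # whereas a zone stockout comes back as the message in the logs above:
$ #   "The zone '...' does not have enough resources available to fulfill the request."
$ gcloud compute instances create tbg-stockout-check \
    --project cockroach-ephemeral --zone us-central1-a \
    --machine-type n1-highcpu-16 --local-ssd interface=NVME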

jlinder commented 3 years ago

Possibly. I had that thought earlier but discounted it because the problem persisted over multiple days. On reflection, though, persistence over multiple days makes sense: it would likely take Google days or even weeks to install more capacity. So it does fit the problem.

tbg commented 3 years ago

This is still going on, and it's costing us test coverage on master every night. Here's the error from last night; it looks familiar:


Error: in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-2004-focal-v20210325 --image-project ubuntu-os-cloud --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=NVME --machine-type n1-highcpu-16 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script499219989 --project cockroach-ephemeral --boot-disk-size=10GB]: exit status 1
(1) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/vm.ForProvider
  |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/vm/vm.go:285
  | [...repeated from below...]
Wraps: (2) in provider: gce
Wraps: (3) attached stack trace
  -- stack trace:
  | github.com/cockroachdb/cockroach/pkg/cmd/roachprod/vm/gce.(*Provider).Create.func2
  |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/vm/gce/gcloud.go:463
  | golang.org/x/sync/errgroup.(*Group).Go.func1
  |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
  | runtime.goexit
  |     /usr/local/go/src/runtime/asm_amd64.s:1374
Wraps: (4) Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-2004-focal-v20210325 --image-project ubuntu-os-cloud --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=NVME --machine-type n1-highcpu-16 --labels lifetime=12h0m0s --metadata-from-file startup-script=/home/agent/temp/buildTmp/gce-startup-script499219989 --project cockroach-ephemeral --boot-disk-size=10GB]
  | Output: WARNING: Some requests generated warnings:
  |  - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210325' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210610'.
  |
  | ERROR: (gcloud.compute.instances.create) Could not fetch resource:
  |  - The zone 'projects/cockroach-ephemeral/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.

tbg commented 3 years ago

This corresponds to roachprod create teamcity-3087613-1623823848-59-n7cpu16-geo -n 7 --clouds=gce --local-ssd=true --gce-machine-type=n1-highcpu-16 --gce-zones=us-central1-a,us-central1-b,us-central1-c --geo --lifetime=12h0m0s --local-ssd-no-ext4-barrier. It fails the same way locally:

$ roachprod create tobias-3087613-1623823848-59-n7cpu16-geo -n 7 --clouds=gce --local-ssd=true --gce-machine-type=n1-highcpu-16 --gce-zones=us-central1-a,us-central1-b,us-central1-c --geo --lifetime=12h0m0s --local-ssd-no-ext4-barrier
Creating cluster tobias-3087613-1623823848-59-n7cpu16-geo with 7 nodes
Cleaning up partially-created cluster (prev err: in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-2004-focal-v20210325 --image-project ubuntu-os-cloud --boot-disk-type pd-ssd --service-account 21965078311-compute@developer.gserviceaccount.com --local-ssd interface=NVME --machine-type n1-highcpu-16 --labels lifetime=12h0m0s --metadata-from-file startup-script=/tmp/gce-startup-script121523923 --project cockroach-ephemeral --boot-disk-size=10GB]
Output: WARNING: Some requests generated warnings:
 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210325' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210610'.
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - The zone 'projects/cockroach-ephemeral/zones/us-central1-a' does not have enough resources available to fulfill the request.  '(resource type:compute)'.: exit status 1)

tbg commented 3 years ago

The same creation works when I set GCE_PROJECT=andrei-jepsen. What is different about that project? It's the same zone.
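For reference, the cross-project check is the same invocation with the project overridden via GCE_PROJECT. A sketch (cluster name hypothetical):

$ GCE_PROJECT=andrei-jepsen roachprod create tobias-xproj-test -n 7 --clouds=gce \
    --local-ssd=true --gce-machine-type=n1-highcpu-16 \
    --gce-zones=us-central1-a,us-central1-b,us-central1-c \
    --geo --lifetime=12h0m0s --local-ssd-no-ext4-barrier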

erikgrinaker commented 3 years ago

This appears to be specific to --local-ssd interface=NVME in us-central1-a. However, it works in a different project, so that's weird.
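A minimal pair that isolates the flag (a sketch; instance names hypothetical):

$ # Fails under cockroach-ephemeral in us-central1-a:
$ gcloud compute instances create erik-nvme-repro \
    --project cockroach-ephemeral --zone us-central1-a \
    --machine-type n1-highcpu-16 --local-ssd interface=NVME
$ # Succeeds once the NVME local SSD is dropped:
$ gcloud compute instances create erik-no-nvme-repro \
    --project cockroach-ephemeral --zone us-central1-a \
    --machine-type n1-highcpu-16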

tbg commented 3 years ago

Erik confirmed this: under the cockroach-ephemeral project we can't create even a single such node in us-central1-a. The same creation does work under the andrei-jepsen project, and it also works when we drop the --local-ssd interface=NVME flag.

So the zone does have the resources; it seems GCE is just refusing to give more NVME local SSD to the cockroach-ephemeral project.

We also checked: the zone is empty on our side, so nothing of ours is hogging resources.

tbg commented 3 years ago

We're working around this in roachtest for now (using us-central1-f instead), but I think dev-inf should reach out to Google support to learn what's up with this. We don't want it to start happening in other zones as well.
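Concretely, the workaround amounts to swapping the stockout zone out of the --gce-zones list. A sketch of the adjusted invocation (cluster name hypothetical):

$ roachprod create tobias-zone-workaround -n 7 --clouds=gce --local-ssd=true \
    --gce-machine-type=n1-highcpu-16 \
    --gce-zones=us-central1-f,us-central1-b,us-central1-c \
    --geo --lifetime=12h0m0s --local-ssd-no-ext4-barrier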

tbg commented 3 years ago

I am also now unable to create some clusters in us-east1-b under the andrei-jepsen project:

 - The resource 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210603' is deprecated. A suggested replacement is 'projects/ubuntu-os-cloud/global/images/ubuntu-2004-focal-v20210610'.
ERROR: (gcloud.compute.instances.create) Could not fetch resource:
 - The zone 'projects/andrei-jepsen/zones/us-east1-b' does not have enough resources available to fulfill the request.  Try a different zone, or try again later.: exit status 1)
Cleaning up OK
Error: in provider: gce: Command: gcloud [compute instances create --subnet default --maintenance-policy MIGRATE --scopes default,storage-rw --image ubuntu-2004-focal-v20210603 --image-project ubuntu-os-cloud --boot-disk-type pd-ssd --local-ssd interface=NVME --machine-type n1-highcpu-16 --labels lifetime=12h0m0s --metadata-from-file startup-script=/tmp/gce-startup-script792645345 --project andrei-jepsen --boot-disk-size=10GB]: exit status 1

This fails reliably. The andrei-jepsen project has no compute running right now. Some more experiments:

Create machines with local SSD: fails

GCE_PROJECT=andrei-jepsen roachprod create tobias-1624357819-04-n5cpu16 -n 5 --clouds=gce --local-ssd=true --gce-machine-type=n1-highcpu-16 --lifetime=12h0m0s --local-ssd-no-ext4-barrier

Create a single machine without local SSD: fails

GCE_PROJECT=andrei-jepsen roachprod create tobias-x -n 1 --clouds=gce --local-ssd=false --gce-machine-type=n1-highcpu-16

So I think we're just completely unable to use this zone too, at least with that machine type? But then: the same creation works under the cockroach-ephemeral project, so clearly the zone has the resources and is just unwilling to give them to andrei-jepsen. It's the same behavior as above, only with an even tighter restriction, and hitting a different project.
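The control pair, for the record (a sketch; cluster names hypothetical, and the default zone list apparently includes us-east1-b, per the error above):

$ # Fails under andrei-jepsen:
$ GCE_PROJECT=andrei-jepsen roachprod create tobias-control-a -n 1 \
    --clouds=gce --local-ssd=false --gce-machine-type=n1-highcpu-16
$ # The identical creation succeeds under cockroach-ephemeral:
$ GCE_PROJECT=cockroach-ephemeral roachprod create tobias-control-b -n 1 \
    --clouds=gce --local-ssd=false --gce-machine-type=n1-highcpu-16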

This is more evidence that we need to reach out to Google Support ASAP to figure out what's going on here. We can't run our nightly tests in such an environment.

tbg commented 3 years ago

The quotas look fine here too (as they should; there's nothing running in this project): https://console.cloud.google.com/compute/instances?project=andrei-jepsen

tbg commented 3 years ago

By the way, there's an ongoing outage affecting SSD creation (https://status.cloud.google.com/incidents/YtfZu9rttTf5zDYGe57n), but then why does it work in the other project? Maybe I got lucky?

rickystewart commented 3 years ago

I didn't want to file a support ticket last week because of the ongoing SSD-creation incident: I didn't have enough evidence that we were hitting a distinct issue unrelated to it. This week the nightlies look fine, and I can't find a recent instance of this error in the build logs.

I think the most appropriate course of action at this point is to close this for now. If it pops up again, we can reopen.