EngineerBetter / concourse-up

Deprecated - use Control Tower instead
https://github.com/EngineerBetter/control-tower
Apache License 2.0
203 stars 28 forks

Timeout when trying to provision new workers on GCP #106

Open ashea-code opened 5 years ago

ashea-code commented 5 years ago

I know this repo is becoming out of date, but I'm trying to re-run the concourse-up command on an existing deployment in GCP. I have exported WORKERS=2 to add an extra worker to the pool.

However, I get this as an error:

Task 1175

Task 1175 | 23:02:09 | Preparing deployment: Preparing deployment (00:02:16)
                    L Error: worker/ed0abfe3-0867-49e1-9092-d242f833bd74: Timed out sending 'get_state' to instance: 'worker/ed0abfe3-0867-49e1-9092-d242f833bd74', agent-id: 'be09322d-de63-4f1a-9d55-54925a64270a' after 45 seconds
Task 1175 | 23:04:26 | Error: worker/ed0abfe3-0867-49e1-9092-d242f833bd74: Timed out sending 'get_state' to instance: 'worker/ed0abfe3-0867-49e1-9092-d242f833bd74', agent-id: 'be09322d-de63-4f1a-9d55-54925a64270a' after 45 seconds

Task 1175 Started  Mon Jun  3 23:02:09 UTC 2019
Task 1175 Finished Mon Jun  3 23:04:26 UTC 2019
Task 1175 Duration 00:02:17
Task 1175 error

Updating deployment:
  Expected task '1175' to succeed but state is 'error'

Exit code 1

Any idea on what is timing out here? GCP is known to be a bit slow on provisioning.

DanielJonesEB commented 5 years ago

Hmm, I'm not sure. We've seen the GCP CPI time out regularly and intermittently (we need to bump the version of the CPI to fix it) but normally with different errors:

Task 10 | 12:28:32 | Error: CPI error 'Bosh::Clouds::CloudError' with message 'Creating vm: Failed to find Google Image 'stemcell-e5d99deb-c5b4-4f5f-53ad-87ef7e71d15a': Get https://www.googleapis.com/compute/v1/projects/ps-amcginlay/global/images/stemcell-e5d99deb-c5b4-4f5f-53ad-87ef7e71d15a?alt=json: oauth2: cannot fetch token: Post https://accounts.google.com/o/oauth2/token: dial tcp 108.177.111.84:443: i/o timeout' in 'create_vm' CPI method (CPI request ID: 'cpi-653711')

Is the issue intermittent for you?

ashea-code commented 5 years ago

This issue isn't intermittent, and trying to make a fresh install also presents me with:

Task 10

Task 10 | 21:46:23 | Preparing deployment: Preparing deployment (00:00:01)
Task 10 | 21:46:24 | Preparing deployment: Rendering templates (00:00:02)
Task 10 | 21:46:26 | Preparing package compilation: Finding packages to compile (00:00:00)
Task 10 | 21:46:26 | Creating missing vms: web/ecade835-5d6e-416e-830c-fb8d648e99ef (0)
Task 10 | 21:46:26 | Creating missing vms: worker/1115b42a-0163-487f-8938-0e23fd19f6c8 (0)
Task 10 | 21:46:26 | Creating missing vms: worker/83502149-1c15-41eb-8838-0af800d0a49f (2)
Task 10 | 21:46:26 | Creating missing vms: worker/dc3fdd8f-26d6-4cbf-88a5-678e21a5dddf (1)
Task 10 | 21:47:28 | Creating missing vms: web/ecade835-5d6e-416e-830c-fb8d648e99ef (0) (00:01:02)
Task 10 | 21:47:49 | Creating missing vms: worker/1115b42a-0163-487f-8938-0e23fd19f6c8 (0) (00:01:23)
Task 10 | 21:47:50 | Creating missing vms: worker/dc3fdd8f-26d6-4cbf-88a5-678e21a5dddf (1) (00:01:24)
Task 10 | 21:47:50 | Creating missing vms: worker/83502149-1c15-41eb-8838-0af800d0a49f (2) (00:01:24)
Task 10 | 21:47:51 | Updating instance web: web/ecade835-5d6e-416e-830c-fb8d648e99ef (0) (canary)
Task 10 | 21:47:51 | Updating instance worker: worker/1115b42a-0163-487f-8938-0e23fd19f6c8 (0) (canary) (00:01:07)
Task 10 | 21:48:58 | Updating instance worker: worker/83502149-1c15-41eb-8838-0af800d0a49f (2)
Task 10 | 21:48:58 | Updating instance worker: worker/dc3fdd8f-26d6-4cbf-88a5-678e21a5dddf (1) (00:01:02)
Task 10 | 21:50:01 | Updating instance worker: worker/83502149-1c15-41eb-8838-0af800d0a49f (2) (00:01:03)
Task 10 | 21:59:49 | Updating instance web: web/ecade835-5d6e-416e-830c-fb8d648e99ef (0) (canary) (00:11:58)
                   L Error: 'web/ecade835-5d6e-416e-830c-fb8d648e99ef (0)' is not running after update. Review logs for failed jobs: atc, grafana
Task 10 | 21:59:49 | Error: 'web/ecade835-5d6e-416e-830c-fb8d648e99ef (0)' is not running after update. Review logs for failed jobs: atc, grafana

Task 10 Started  Tue Jun  4 21:46:23 UTC 2019
Task 10 Finished Tue Jun  4 21:59:49 UTC 2019
Task 10 Duration 00:13:26
Task 10 error

Updating deployment:
  Expected task '10' to succeed but state is 'error'

Exit code 1
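
For triage, the failing instance and job names can be pulled out of task output like the above with a short script. This is only a sketch: `failed_jobs` and its regex are hypothetical helpers, not part of BOSH or concourse-up, and they assume the "is not running after update" line keeps the wording shown in this log.

```python
import re
from typing import Dict, List

# Hypothetical helper, not part of BOSH or concourse-up.
# Assumes the error line has the same wording as in the task output above:
#   Error: '<instance>' is not running after update. Review logs for failed jobs: a, b
FAILED_JOBS = re.compile(
    r"Error: '(?P<instance>[^']+)' is not running after update\. "
    r"Review logs for failed jobs: (?P<jobs>.+)$"
)


def failed_jobs(task_output: str) -> Dict[str, List[str]]:
    """Map each failing instance to the jobs reported as failed."""
    result: Dict[str, List[str]] = {}
    for line in task_output.splitlines():
        match = FAILED_JOBS.search(line)
        if match:
            jobs = [job.strip() for job in match.group("jobs").split(",")]
            # The same error is printed twice in the log; the dict keeps one entry.
            result[match.group("instance")] = jobs
    return result


if __name__ == "__main__":
    sample = (
        "Task 10 | 21:59:49 | Error: 'web/ecade835-5d6e-416e-830c-fb8d648e99ef (0)' "
        "is not running after update. Review logs for failed jobs: atc, grafana"
    )
    print(failed_jobs(sample))
```

The job names it returns (here atc and grafana) are the ones whose logs on the web VM are worth reading first.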

Has something changed on GCP?