EngineerBetter / concourse-up

Deprecated - use Control Tower instead
https://github.com/EngineerBetter/control-tower
Apache License 2.0

renew-cert fails #94

Closed phynias closed 5 years ago

phynias commented 5 years ago

renew-cert has suddenly stopped working. Concourse version: v4.0.0

+ cd concourse-up-release
+ chmod +x concourse-up-linux-amd64
+ ./concourse-up-linux-amd64 deploy wizr

WARNING: adding record ci.wizr.com to Route53 hosted zone wizr.com ID: ZDNOZ43D75NB8

aws_eip.nat: Refreshing state... (ID: eipalloc-057b63ab844ef0a17)
aws_eip.atc: Refreshing state... (ID: eipalloc-0f64eb3e16a756bd7)
aws_vpc.default: Refreshing state... (ID: vpc-0c6847dba480011a9)
aws_s3_bucket.blobstore: Refreshing state... (ID: concourse-up-wizr-us-east-2-blobstore)
aws_key_pair.default: Refreshing state... (ID: concourse-up-wizr20180817015833012200000001)
aws_iam_user.blobstore: Refreshing state... (ID: concourse-up-wizr-us-east-2-blobstore)
aws_eip.director: Refreshing state... (ID: eipalloc-0bf53a0578941a26a)
aws_iam_user.bosh: Refreshing state... (ID: concourse-up-wizr-us-east-2-bosh)
aws_iam_user_policy.bosh: Refreshing state... (ID: concourse-up-wizr-us-east-2-bosh:concourse-up-wizr-us-east-2-bosh)
aws_iam_access_key.bosh: Refreshing state... (ID: AKIAJEAGHHOFSDDMQ6NA)
aws_iam_access_key.blobstore: Refreshing state... (ID: AKIAJ7PZRVL7D3OXHVRQ)
aws_route53_record.concourse: Refreshing state... (ID: ZDNOZ43D75NB8_ci_A)
aws_subnet.private: Refreshing state... (ID: subnet-0f330eb34d099cbf5)
aws_subnet.rds_a: Refreshing state... (ID: subnet-07abd765f0a0415b9)
aws_subnet.rds_b: Refreshing state... (ID: subnet-0fc084561ba5e5d19)
aws_security_group.rds: Refreshing state... (ID: sg-0e35eb7f2b150e518)
aws_internet_gateway.default: Refreshing state... (ID: igw-0b379a40dbfba10eb)
aws_security_group.vms: Refreshing state... (ID: sg-00ec7cb724c8143b8)
aws_subnet.public: Refreshing state... (ID: subnet-0ca4fbd59935a5c6b)
aws_route_table.rds: Refreshing state... (ID: rtb-04c171e5a89c6314e)
aws_route.internet_access: Refreshing state... (ID: r-rtb-07fa8825f067808db1080289494)
aws_nat_gateway.default: Refreshing state... (ID: nat-005abba6360c79e63)
aws_route_table_association.rds_b: Refreshing state... (ID: rtbassoc-0ff21f37e060b97c0)
aws_db_subnet_group.default: Refreshing state... (ID: concourse-up-wizr)
aws_route_table_association.rds_a: Refreshing state... (ID: rtbassoc-02f036c671fdd88b3)
aws_security_group.director: Refreshing state... (ID: sg-0a49842457a887881)
aws_route_table.private: Refreshing state... (ID: rtb-06f334b91af4d5298)
aws_route_table_association.private: Refreshing state... (ID: rtbassoc-0382d0556bb5785e4)
aws_security_group.atc: Refreshing state... (ID: sg-0725811022c0cf1a0)
aws_db_instance.default: Refreshing state... (ID: terraform-20180817015841858400000002)
aws_iam_user_policy.blobstore: Refreshing state... (ID: concourse-up-wizr-us-east-2-blobstore:concourse-up-wizr-us-east-2-blobstore)
aws_db_instance.default: Modifying... (ID: terraform-20180817015841858400000002)
  engine_version: "9.6.11" => "9.6.6"

Error: Error applying plan:

1 error(s) occurred:

* aws_db_instance.default: 1 error(s) occurred:

* aws_db_instance.default: Error modifying DB Instance terraform-20180817015841858400000002: InvalidParameterCombination: Cannot upgrade postgres from 9.6.11 to 9.6.6
    status code: 400, request id: 9a17b2f0-3dec-487d-bbd7-37f3d74a8d10

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

exit status 1

crsimmons commented 5 years ago

It looks like terraform is trying to downgrade the RDS Postgres version from 9.6.11 to 9.6.6. I'm not sure why you would be running 9.6.11 because we pin to 9.6.6 in our terraform (and have done since March 2018).

What version of Concourse-up are you using?

crsimmons commented 5 years ago

I've done a bit more research and have found that one of our deployments is using 9.6.11. I haven't figured out why or how yet though.

crsimmons commented 5 years ago

It seems that RDS will do a minor version upgrade in the maintenance window unless you tell it not to. In Terraform we aren't specifying auto_minor_version_upgrade, and it defaults to true. I think we've deployed against our 9.6.11 deployment before, so I'm not sure why your Terraform is breaking in this way. Does it work if you manually run concourse-up deploy <your deployment>?

sureshgoli81 commented 5 years ago

Hi, today we started updating concourse-up and found the same issue as described above. BTW, AWS has stopped supporting 9.6.6, and the current minor version for PostgreSQL is now 9.6.11.

I would suggest keeping engine_version as 9.6 in Terraform. auto_minor_version_upgrade is true by default and is not explicitly set to false.

As of now, our concourse-up deployment has become unstable due to the upgrade failure.

DanielJonesEB commented 5 years ago

@sureshgoli81 Thanks for reporting - we're looking into this currently.

Can you please explain exactly what you mean by

> As of now, our concourse-up deployment has become unstable due to the upgrade failure.

Is your Concourse still running, and can builds execute? Does the deployment fail?

sureshgoli81 commented 5 years ago

Hi, no, Concourse is not running, because our build failed with the error "failed to obtain lock on concourse deployment". When we tried again with concourse-up deploy <project name>, we got a Terraform error related to engine_version.

DanielJonesEB commented 5 years ago

Thanks @sureshgoli81.

This should be a recoverable situation. "failed to obtain lock on concourse deployment" is a BOSH error: one deployment was already in progress on the concourse deployment (which the renew-certs job will have started), and presumably another deployment was attempted.

If you have BOSH CLI access, I'd recommend running bosh vms -d concourse to see whether any VMs are unhappy, and considering a bosh restart on the web VM. Then pause the renew-certs pipeline until we get the issue fixed.

sureshgoli81 commented 5 years ago

Hi, yes. That was our initial idea; however, we weren't able to log in to the BOSH director due to an X509 error related to the BOSH CA cert. After looking at the BOSH creds file in our bucket, we noticed it was still holding the old BOSH director CA cert values. We have now destroyed the setup and are recreating Concourse.

sureshgoli81 commented 5 years ago

Just noticed: when provisioning PostgreSQL in AWS RDS, it still allows creating a DB with 9.6.6. Since auto_minor_version_upgrade is true by default, RDS gets updated to 9.6.11 by the second run of concourse-up deploy, and on the third run of concourse-up deploy we started getting the engine version error when Terraform checks the provisioned resources. So there are two options: either set auto_minor_version_upgrade to false, or specify only 9.6 in engine_version.

crsimmons commented 5 years ago

My understanding is that auto_minor_version_upgrade being true means that RDS will update your engine in the instance's maintenance window when Amazon detects there is a meaningful minor update. I would expect the first deploy to set 9.6.6, and every subsequent deploy until your first maintenance window to work. After RDS does maintenance and updates your instance, you will see the Terraform error.

We're currently considering bumping the engine version to 9.6.11 and setting auto_minor_version_upgrade to false so this doesn't happen again. I've put a separate story in our backlog for increasing the hardcoded version when new versions become available.
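
A minimal sketch of what that change could look like in Terraform, assuming the database is the aws_db_instance.default resource shown in the log above. Only engine_version and auto_minor_version_upgrade come from this thread; the engine line and the omission of every other argument are illustrative, so treat this as a fragment rather than the project's actual configuration:

```hcl
# Hypothetical fragment: the real resource needs further arguments
# (instance class, storage, credentials, subnet group, and so on).
resource "aws_db_instance" "default" {
  engine = "postgres"

  # Pin to the version RDS has already upgraded the instance to,
  # so the plan no longer asks for a downgrade to 9.6.6.
  engine_version = "9.6.11"

  # Defaults to true; setting it to false is intended to stop RDS from
  # applying minor upgrades in the maintenance window and drifting
  # away from the pinned engine_version.
  auto_minor_version_upgrade = false
}
```

With the flag set to false, the pinned engine_version should keep matching the running instance until the pin is bumped deliberately.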

I wouldn't expect this terraform issue to negatively impact BOSH. In Concourse-Up the terraform runs first and therefore errors out before anything happens with BOSH. I'm not sure why you would have different bosh director ca cert values. If they are being updated then terraform failing would not stop eval "$(concourse-up info --env <deployment>)" from working. I agree with @DanielJonesEB that your problem sounds unrelated to this issue.

DanielJonesEB commented 5 years ago

> Hi, yes. That was our initial idea; however, we weren't able to log in to the BOSH director due to an X509 error related to the BOSH CA cert. After looking at the BOSH creds file in our bucket, we noticed it was still holding the old BOSH director CA cert values.

I'm not sure why that would have happened; we'll keep an eye out for similar issues. For future reference, another way around this would have been to forcibly terminate the web VM through the AWS console and wait for BOSH's resurrector to recreate it.

sureshgoli81 commented 5 years ago

After re-paving concourse-up, we are now seeing another strange error, this time related to CredHub. We see the error below when logging in to CredHub:

credhub api
Setting the target url: https://<domain name>:8844/
Error connecting to the targeted API: "Get https://<domain name>:8844/info: x509: certificate is not valid for any names, but wanted to match <<domain Name>>". Please validate your target and retry your request.

These are the steps I followed during and after deployment of concourse-up:

concourse-up deploy apci --region eu-central-1 --domain <domain name> --workers 3 --web-size xlarge --db-size medium

Post-deployment steps:

Log in to Concourse to set up the pipeline:
fly --target apci login --insecure --concourse-url https://<domain name> --username admin --password <<PWD>>
The login succeeds, and I am able to list workers as well.

Log in to CredHub:
eval "$(concourse-up info --region eu-central-1 --iaas AWS --env apci)"
credhub api
This fails with the error above. I am also unable to set parameters in CredHub.

phynias commented 5 years ago

@crsimmons when I try to do a manual deploy, I get the following:


  engine_version: "9.6.11" => "9.6.6"

Error: Error applying plan:

1 error(s) occurred:

* aws_db_instance.default: 1 error(s) occurred:

* aws_db_instance.default: Error modifying DB Instance terraform-20180817015841858400000002: InvalidParameterCombination: Cannot upgrade postgres from 9.6.11 to 9.6.6
    status code: 400, request id: 4a35af32-14ef-4ec4-8481-9a663d83ad73

Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.

DanielJonesEB commented 5 years ago

Hi @phynias, we've got a fix for this incoming which is just awaiting PM acceptance. If you're feeling brave, you can build from commit 98030bad595ac8704b7f2e6c120d6e2550af6ae1 and use that to deploy. It's passed all system tests.

evadinckel commented 5 years ago

Hi everyone,

Thank you for reporting the issue to us. This is to let you know that a patch release containing the fix has just been published (@phynias): https://github.com/EngineerBetter/concourse-up/releases/tag/0.20.1

Closing this thread now, as the conversation on the other issue reported on this page has been carried out separately: https://github.com/EngineerBetter/concourse-up/issues/97

Best regards, Eva