Closed · RichardBradley closed this issue 4 years ago
The "pinned version X not available" error happens when a resource that a pipeline job uses is pinned to a version that Concourse can't fetch for some reason.
On your update pipeline, is the control-tower-release resource checking successfully? If you click on it in the UI you should see a list of available versions. We don't pin versions in this pipeline, but someone could have pinned it manually; if it is pinned, the resource box will be purple in the UI. You should be able to force Concourse to detect new versions with fly -t <target> check-resource -r control-tower-self-update/control-tower-release.
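If a manual pin does turn out to be the culprit, fly can inspect and clear it as well. A sketch, assuming a reasonably recent fly/Concourse version (the unpin-resource subcommand is not available in very old releases); <target> is your fly target:

```shell
# List the pipeline's resources; a pinned one is flagged in the listing.
fly -t <target> resources -p control-tower-self-update

# Clear a manual pin so Concourse goes back to tracking the latest version.
fly -t <target> unpin-resource -r control-tower-self-update/control-tower-release

# Then force an immediate check for new versions.
fly -t <target> check-resource -r control-tower-self-update/control-tower-release
```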
Thanks for your reply!
Nothing appears to be pinned in the UI and the control-tower-release resource looks good to me.
I will poke about in fly and see if I can unstick anything.
Here's my control-tower-release resource in the UI:
And here's my self-update job (build?) in the UI:
Self update is successfully discovering all its inputs (see the check mark next to that line in your screenshot). I would guess that the self-update job is paused.
I think it was, thanks!
I ran fly -t xxx unpause-job -j control-tower-self-update/self-update and things have definitely changed.
I think this is related to https://github.com/concourse/concourse/issues/1915 -- "Paused jobs should indicate that they are, well, paused". After reading that bug I now know how to tell if a job is paused or not. It turns out that a "pause" symbol on a job details page (that page is hidden behind a not obviously clickable job title) means that it is live (you can click to pause), and a "play" symbol means that it is paused (you can click to play). I /think/. Who would have guessed that?
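For anyone else confused by the play/pause icons, a less ambiguous way to check is fly's jobs listing, which prints a dedicated paused column. A sketch using the pipeline and job names from this thread; <target> is your fly target:

```shell
# List all jobs in the self-update pipeline; the output includes a
# "paused" column, so there's no icon guessing involved.
fly -t <target> jobs -p control-tower-self-update

# Unpause the job directly if it shows as paused.
fly -t <target> unpause-job -j control-tower-self-update/self-update
```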
I suppose the self update jobs start paused on a new deployment, so that everyone gets a nice surprise in 3 months (6 months?) when their server stops working as the cert expires? ;-)
Thanks for your help with this!
> It turns out that a "pause" symbol on a job details page (that page is hidden behind a not obviously clickable job title) means that it is live (you can click to pause), and a "play" symbol means that it is paused (you can click to play). I /think/. Who would have guessed that?
Ha - I remember being flummoxed by this particular UI design choice when making a website about 20 years ago that played background MIDI files...
> I suppose the self update jobs start paused on a new deployment, so that everyone gets a nice surprise in 3 months (6 months?) when their server stops working as the cert expires? ;-)
We did this to avoid users having downtime that wasn't under their control. Do you think it would be more valuable to have it enabled by default?
The self-update pipeline shouldn't be paused - just the self-update job. The renew-cert job is supposed to trigger every day.
On the UI you can tell if something is paused because it will be light blue.
> On the UI you can tell if something is paused because it will be light blue.
As you can see from the above screenshots, the paused job appeared grey in the UI on my instance. I don't know why it was not blue.
I ran the self-update job and it killed the whole instance :-)
I got the following output and the hostname no longer resolves in DNS. I will poke about and see if I can rebuild it and report back here if it is interesting.
waiting for docker to come up...
Pulling engineerbetter/pcf-ops@sha256:7cab6efb45f85bb59eafe31b6107b73e78c668eda857c20cd5326dfca90fcc36...
sha256:7cab6efb45f85bb59eafe31b6107b73e78c668eda857c20cd5326dfca90fcc36: Pulling from engineerbetter/pcf-ops
4d65b6a51407: Pulling fs layer
007bb40a3d29: Pulling fs layer
....
d528521d0fe2: Pull complete
Digest: sha256:7cab6efb45f85bb59eafe31b6107b73e78c668eda857c20cd5326dfca90fcc36
Status: Downloaded newer image for engineerbetter/pcf-ops@sha256:7cab6efb45f85bb59eafe31b6107b73e78c668eda857c20cd5326dfca90fcc36
Successfully pulled engineerbetter/pcf-ops@sha256:7cab6efb45f85bb59eafe31b6107b73e78c668eda857c20cd5326dfca90fcc36.
+ cd control-tower-release
+ chmod +x control-tower-linux-amd64
+ ./control-tower-linux-amd64 deploy ci.xxx
USING PREVIOUS DEPLOYMENT CONFIG
WARNING: adding record ci.xxx to DNS zone ci.xxx with name Z14SSNJQU991QA
aws_iam_user.blobstore: Refreshing state... (ID: control-tower-ci.xxx-eu-west-1-blobstore)
aws_s3_bucket.blobstore: Refreshing state... (ID: control-tower-ci.xxx-eu-west-1-blobstore)
aws_vpc.default: Refreshing state... (ID: vpc-030402d397718204d)
aws_key_pair.default: Refreshing state... (ID: control-tower-ci.xxx20191104001031012700000001)
aws_iam_user.bosh: Refreshing state... (ID: control-tower-ci.xxx-eu-west-1-bosh)
data.aws_availability_zones.available: Refreshing state...
aws_iam_access_key.blobstore: Refreshing state... (ID: AKIAVD4WQFCT5R2NJCGV)
aws_iam_user_policy.bosh: Refreshing state... (ID: control-tower-ci.xxx-eu-west...tower-ci.xxx-eu-west-1-bosh)
aws_iam_access_key.bosh: Refreshing state... (ID: AKIAVD4WQFCT4SL2ZFXD)
aws_iam_user_policy.blobstore: Refreshing state... (ID: control-tower-ci.xxx-eu-west...-ci.xxx-eu-west-1-blobstore)
aws_subnet.public: Refreshing state... (ID: subnet-0f18ca3d8a54bd9e2)
aws_route_table.rds: Refreshing state... (ID: rtb-0eeab1acf99d9a74d)
aws_security_group.rds: Refreshing state... (ID: sg-078bb5da6f4af3c0e)
aws_subnet.rds_a: Refreshing state... (ID: subnet-05ee1739a5c667359)
aws_security_group.vms: Refreshing state... (ID: sg-09ca977c374da12ef)
aws_internet_gateway.default: Refreshing state... (ID: igw-0b187a2dcaddfed38)
aws_subnet.private: Refreshing state... (ID: subnet-0ed40eb9df7900ecd)
aws_subnet.rds_b: Refreshing state... (ID: subnet-0ef7d833f9a51c5f5)
aws_route_table_association.rds_a: Refreshing state... (ID: rtbassoc-091b31a9d74b3c998)
aws_eip.director: Refreshing state... (ID: eipalloc-0aa6085c8905747ca)
aws_eip.atc: Refreshing state... (ID: eipalloc-03400ea3546e80921)
aws_route.internet_access: Refreshing state... (ID: r-rtb-0e7198a8281ed49f71080289494)
aws_eip.nat: Refreshing state... (ID: eipalloc-014f4204a13dbd0fc)
aws_db_subnet_group.default: Refreshing state... (ID: control-tower-ci.xxx)
aws_route_table_association.rds_b: Refreshing state... (ID: rtbassoc-0d61c1f742b3b798f)
aws_nat_gateway.default: Refreshing state... (ID: nat-091321c149c88f2f1)
aws_db_instance.default: Refreshing state... (ID: terraform-20191104001038363100000002)
aws_route53_record.concourse: Refreshing state... (ID: Z14SSNJQU991QA_ci.xxx_A)
aws_security_group.director: Refreshing state... (ID: sg-0a8f426dd911b42f1)
aws_route_table.private: Refreshing state... (ID: rtb-0ca454bef44939b85)
aws_route_table_association.private: Refreshing state... (ID: rtbassoc-0e688e067a960001f)
aws_security_group.atc: Refreshing state... (ID: sg-05389c3dd5a020c21)
aws_route53_record.concourse: Destroying... (ID: Z14SSNJQU991QA_ci.xxx_A)
aws_eip.atc: Modifying... (ID: eipalloc-03400ea3546e80921)
tags.Name: "" => "control-tower-ci.xxx-atc"
tags.name: "control-tower-ci.xxx-atc" => ""
aws_eip.director: Modifying... (ID: eipalloc-0aa6085c8905747ca)
tags.Name: "" => "control-tower-ci.xxx-director"
tags.name: "control-tower-ci.xxx-director" => ""
aws_eip.nat: Modifying... (ID: eipalloc-014f4204a13dbd0fc)
tags.Name: "" => "control-tower-ci.xxx-nat"
tags.name: "control-tower-ci.xxx-nat" => ""
aws_eip.nat: Modifications complete after 0s (ID: eipalloc-014f4204a13dbd0fc)
aws_eip.director: Modifications complete after 0s (ID: eipalloc-0aa6085c8905747ca)
aws_eip.atc: Modifications complete after 0s (ID: eipalloc-03400ea3546e80921)
aws_nat_gateway.default: Modifying... (ID: nat-091321c149c88f2f1)
tags.Name: "" => "control-tower-ci.xxx"
tags.name: "control-tower-ci.xxx" => ""
aws_nat_gateway.default: Modifications complete after 0s (ID: nat-091321c149c88f2f1)
aws_route53_record.concourse: Still destroying... (ID: Z14SSNJQU991QA_ci.xxx_A, 10s elapsed)
aws_route53_record.concourse: Still destroying... (ID: Z14SSNJQU991QA_ci.xxx_A, 20s elapsed)
aws_route53_record.concourse: Still destroying... (ID: Z14SSNJQU991QA_ci.xxx_A, 30s elapsed)
aws_route53_record.concourse: Still destroying... (ID: Z14SSNJQU991QA_ci.xxx_A, 40s elapsed)
aws_route53_record.concourse: Destruction complete after 48s
Error: Error applying plan:
1 error(s) occurred:
* aws_route53_record.concourse: aws_route53_record.concourse: diffs didn't match during apply. This is a bug with Terraform and should be reported as a GitHub Issue.
Please include the following information in your report:
Terraform Version: 0.11.11
Resource ID: aws_route53_record.concourse
Mismatch reason: attribute mismatch: name
Diff One (usually from plan): *terraform.InstanceDiff{mu:sync.Mutex{state:0, sema:0x0}, Attributes:map[string]*terraform.ResourceAttrDiff{"zone_id":*terraform.ResourceAttrDiff{Old:"Z14SSNJQU991QA", New:"Z14SSNJQU991QA", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "records.#":*terraform.ResourceAttrDiff{Old:"1", New:"1", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "records.4246710445":*terraform.ResourceAttrDiff{Old:"63.35.140.121", New:"63.35.140.121", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "name":*terraform.ResourceAttrDiff{Old:"ci.xxx", New:"", NewComputed:false, NewRemoved:false, NewExtra:"", RequiresNew:true, Sensitive:false, Type:0x0}, "fqdn":*terraform.ResourceAttrDiff{Old:"ci.xxx", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "type":*terraform.ResourceAttrDiff{Old:"A", New:"A", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "ttl":*terraform.ResourceAttrDiff{Old:"60", New:"60", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "allow_overwrite":*terraform.ResourceAttrDiff{Old:"true", New:"true", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}}, Destroy:false, DestroyDeposed:false, DestroyTainted:false, Meta:map[string]interface {}(nil)}
Diff Two (usually from apply): *terraform.InstanceDiff{mu:sync.Mutex{state:0, sema:0x0}, Attributes:map[string]*terraform.ResourceAttrDiff{"records.#":*terraform.ResourceAttrDiff{Old:"", New:"1", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "records.4246710445":*terraform.ResourceAttrDiff{Old:"", New:"63.35.140.121", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "type":*terraform.ResourceAttrDiff{Old:"", New:"A", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "allow_overwrite":*terraform.ResourceAttrDiff{Old:"", New:"true", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "fqdn":*terraform.ResourceAttrDiff{Old:"", New:"", NewComputed:true, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}, "zone_id":*terraform.ResourceAttrDiff{Old:"", New:"Z14SSNJQU991QA", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:true, Sensitive:false, Type:0x0}, "ttl":*terraform.ResourceAttrDiff{Old:"", New:"60", NewComputed:false, NewRemoved:false, NewExtra:interface {}(nil), RequiresNew:false, Sensitive:false, Type:0x0}}, Destroy:false, DestroyDeposed:false, DestroyTainted:false, Meta:map[string]interface {}(nil)}
Also include as much context as you can about your config, state, and the steps you performed to trigger this error.
Terraform does not automatically rollback in the face of errors.
Instead, your Terraform state file has been partially updated with
any resources that successfully completed. Please address the error
above and apply again to incrementally change your infrastructure.
exit status 1
I was planning to rebuild this instance in a different account soon anyway. Maybe that will be easier than fixing this one now. I will remember to turn on the auto-update on the new instance.
Fixed now.
I reached the server by IP address (adding the hostname to my hosts file to trick HTTPS) and was able to re-run the self-update job, which worked the second time.
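For anyone hitting the same expired-cert lockout, the hosts-file trick can also be done per-request with curl's --resolve flag, without editing system files. A sketch; the hostname and IP below are placeholders for your own Concourse domain and its elastic IP:

```shell
# Map the Concourse hostname to its IP for this request only, and skip
# certificate validation with -k since the cert has expired.
# ci.example.com and 203.0.113.10 are placeholders.
curl --resolve ci.example.com:443:203.0.113.10 -k https://ci.example.com/api/v1/info
```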
Then GitHub auth failed, but re-applying the same team permissions seemed to fix that.
Thanks for your help with this
The HTTPS certificate on my control-tower Concourse UI is expired. It looks like there is a job which ought to auto-renew this called "renew-https-cert", but it is hanging with the following output:
Can anyone help me understand what's wrong here? What does the "pinned version is not available" message mean? How can I fix it?
After I fix that, will the permanently spinning task "discovering any new versions of control-tower-release" unblock, or do I have two problems?
Thanks!
Rich