cloud-gov / product

Program-level artifacts, workflow and issues for cloud.gov
Creative Commons Zero v1.0 Universal

plan-cf-production fails with "Unable to render instance groups" #2378

Closed: jameshochadel closed this issue 1 year ago

jameshochadel commented 1 year ago

Previous discussion, possibly of same issue: https://gsa-tts.slack.com/archives/C0ENP71UG/p1531176154000006

That thread contains the same error. It was prompted by hosts staying online longer than they should have and not picking up new stemcells.

jameshochadel commented 1 year ago

Solved. The logs for the failing plan were:

  instance_groups:
  - name: api
    jobs:
    - name: route_registrar
      properties:
        route_registrar:
          routes:
+         - name: "<redacted>"
+           registration_interval: "<redacted>"
+           server_cert_domain_san: "<redacted>"
+           tls_port: "<redacted>"
+           uris:
+           - "<redacted>"
  - name: router
    jobs:
    - name: gorouter
      properties:
        router:
          backends:
+           ca: "<redacted>"
          ca_certs:
+         - "<redacted>"

- manifest_version: v24.0.0
+ manifest_version: v24.6.0

Task 20653525

Task 20653525 | 16:56:27 | Preparing deployment: Preparing deployment (00:00:37)
Task 20653525 | 16:57:04 | Preparing deployment: Rendering templates (00:02:42)
                         L Error: Unable to render instance groups for deployment. Errors are:
  - Unable to render jobs for instance group 'router'. Errors are:
    - Unable to render templates for job 'gorouter'. Errors are:
      - Failed to find variable '<redacted for posting on GitHub>' from config server: HTTP Code '404', Error: 'The request could not be completed because the credential does not exist or you do not have sufficient authorization.'
Task 20653525 | 16:59:46 | Error: Unable to render instance groups for deployment. Errors are:
  - Unable to render jobs for instance group 'router'. Errors are:
    - Unable to render templates for job 'gorouter'. Errors are:
      - Failed to find variable '<redacted for posting on GitHub>' from config server: HTTP Code '404', Error: 'The request could not be completed because the credential does not exist or you do not have sufficient authorization.'

Note the changes to the manifest in the diff above. The cause of the problem was a change to the upstream routing-release manifests. We have an opsfile that removed a certificate from the CA list, but when the upstream manifest changed that list from a comma-separated string to a YAML array, our opsfile no longer applied. The fix was to change the opsfile so that it no longer tries to remove the value. This allowed BOSH to create the certificate in CredHub based on the options key in the upstream manifest.

We only saw this error in production because the certificate had been deleted from production CredHub when our opsfile was originally implemented, but was never deleted from dev or staging. So when the YAML was reformatted in v24.6.0 and BOSH tried to interpolate the variable again, it found the value in dev and staging, but not in production.
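This kind of drift between environments could be caught before a deploy by comparing the variables a manifest references against the credentials each environment actually holds. The following is a minimal sketch, not tooling we run: the manifest snippet, the credential names, and the per-environment sets are all hypothetical, and in practice the sets would have to be gathered from each environment's CredHub by some other means. It relies only on the BOSH `((name))` placeholder syntax, and it simplifies by treating everything after the first dot as a sub-key of the credential.

```python
import re

# BOSH manifests reference config-server variables as ((name)) or ((name.key)).
PLACEHOLDER = re.compile(r"\(\(\s*([^()\s]+)\s*\)\)")


def referenced_variables(manifest_text):
    """Return the set of credential names referenced in a manifest.

    Simplification: anything after the first dot is treated as a sub-key
    (for example ".certificate") and dropped.
    """
    return {match.split(".", 1)[0] for match in PLACEHOLDER.findall(manifest_text)}


def missing_by_environment(manifest_text, credentials_by_env):
    """Map each environment name to the sorted list of variables it lacks."""
    needed = referenced_variables(manifest_text)
    return {
        env: sorted(needed - set(present))
        for env, present in credentials_by_env.items()
    }


# Hypothetical manifest fragment; the variable names are illustrative only.
EXAMPLE_MANIFEST = """
properties:
  router:
    ca_certs:
    - ((example_backend_ca.certificate))
    - ((example_other_ca.certificate))
"""

# Hypothetical credential inventories, one set of names per environment.
EXAMPLE_CREDENTIALS = {
    "dev": {"example_backend_ca", "example_other_ca"},
    "staging": {"example_backend_ca", "example_other_ca"},
    "production": {"example_other_ca"},
}

if __name__ == "__main__":
    report = missing_by_environment(EXAMPLE_MANIFEST, EXAMPLE_CREDENTIALS)
    for env, missing in report.items():
        print(env, "missing:", missing or "nothing")
```

With the hypothetical data above, only production reports a missing credential, which mirrors what happened here: the same manifest rendered in dev and staging but returned a 404 from the config server in production.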