Closed jameshochadel closed 1 year ago
Solved. The logs for the failing plan were:
  instance_groups:
  - name: api
    jobs:
    - name: route_registrar
      properties:
        route_registrar:
          routes:
+         - name: "<redacted>"
+           registration_interval: "<redacted>"
+           server_cert_domain_san: "<redacted>"
+           tls_port: "<redacted>"
+           uris:
+           - "<redacted>"
  - name: router
    jobs:
    - name: gorouter
      properties:
        router:
          backends:
+           ca: "<redacted>"
          ca_certs:
+         - "<redacted>"
- manifest_version: v24.0.0
+ manifest_version: v24.6.0
Task 20653525
Task 20653525 | 16:56:27 | Preparing deployment: Preparing deployment (00:00:37)
Task 20653525 | 16:57:04 | Preparing deployment: Rendering templates (00:02:42)
L Error: Unable to render instance groups for deployment. Errors are:
- Unable to render jobs for instance group 'router'. Errors are:
- Unable to render templates for job 'gorouter'. Errors are:
- Failed to find variable '<redacted for posting on GitHub>' from config server: HTTP Code '404', Error: 'The request could not be completed because the credential does not exist or you do not have sufficient authorization.'
Task 20653525 | 16:59:46 | Error: Unable to render instance groups for deployment. Errors are:
- Unable to render jobs for instance group 'router'. Errors are:
- Unable to render templates for job 'gorouter'. Errors are:
- Failed to find variable '<redacted for posting on GitHub>' from config server: HTTP Code '404', Error: 'The request could not be completed because the credential does not exist or you do not have sufficient authorization.'
Note the changes to the manifest. The cause of the problem was a change to the upstream routing-release manifests. We have an opsfile that removed a certificate from the CA list, but when the upstream manifest changed that list from a comma-separated string to a YAML array, our opsfile no longer applied. The fix was to change the opsfile so that it stops trying to remove the value. This allowed BOSH to create the certificate in CredHub based on the options key in the upstream manifest.
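For readers unfamiliar with how BOSH generates credentials, here is a minimal sketch of what a generatable certificate looks like in a manifest's `variables` block. The variable name and option values are hypothetical placeholders, not the actual entry from the upstream manifest.

```yaml
# Hypothetical sketch of a BOSH manifest `variables` entry.
# The name and option values are placeholders, not taken from the upstream manifest.
variables:
- name: example_backend_ca          # hypothetical variable name
  type: certificate
  options:                          # generation parameters BOSH hands to the config server (CredHub)
    is_ca: true
    common_name: exampleBackendCA   # hypothetical
```

With a declaration like this in place, a reference such as `((example_backend_ca.certificate))` can be satisfied by having CredHub generate the credential when it is missing, instead of the render step failing with a 404.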
We only saw this error in production because the certificate had been removed from production CredHub but was never removed from dev or staging when our opsfile was originally implemented. So when the YAML was reformatted in 24.6.0 and BOSH tried to interpolate the variable again, it found the value in dev and staging, but not in production.
Previous discussion, possibly of same issue: https://gsa-tts.slack.com/archives/C0ENP71UG/p1531176154000006
The thread contains the same error, and was prompted by hosts staying online longer than they should have and not picking up new stemcells.