We have created an issue in Pivotal Tracker to manage this:
https://www.pivotaltracker.com/story/show/176786535
The labels on this github issue will be updated when the story is started.
We are seeing the same issue after updating from BOSH DNS 1.27.0 => 1.28.0. As well as the warnings above, we see:

```
==> bosh_dns_health.stderr.log <==
2021/02/04 08:38:19 http: TLS handshake error from 192.168.2.20:34388: remote error: tls: bad certificate
2021/02/04 08:38:22 http: TLS handshake error from 192.168.2.201:30496: remote error: tls: bad certificate
2021/02/04 08:38:24 http: TLS handshake error from 192.168.2.211:41362: remote error: tls: bad certificate
2021/02/04 08:38:25 http: TLS handshake error from 192.168.2.16:39920: remote error: tls: bad certificate
2021/02/04 08:38:25 http: TLS handshake error from 192.168.2.202:6492: remote error: tls: bad certificate
2021/02/04 08:38:25 http: TLS handshake error from 192.168.2.212:14834: remote error: tls: bad certificate
```
Update: I've confirmed our DNS cert config does not include SANs, so I will try updating the certs so that the SAN matches the common name, and report back.
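For anyone checking their own environment: one way to confirm whether an existing cert carries SANs is to pull it from CredHub and inspect it with openssl. A sketch only, assuming a logged-in `credhub` CLI; the exact credential path depends on how the variable is namespaced in your director:

```sh
# Fetch just the certificate field and look for a SAN extension.
credhub get -n /dns_healthcheck_server_tls -k certificate > healthcheck.pem
openssl x509 -in healthcheck.pem -noout -text | grep -A1 'Subject Alternative Name'
# No output here means the cert relies on the Common Name alone,
# which Go 1.15+ no longer honors for hostname verification.
```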
Current (broken) cert config:
```yaml
- type: replace
  path: /variables?/-
  value:
    name: dns_healthcheck_tls_ca
    options:
      common_name: dns-healthcheck-tls-ca
      is_ca: true
      duration: 365
    type: certificate
- type: replace
  path: /variables?/-
  value:
    name: dns_healthcheck_server_tls
    options:
      ca: dns_healthcheck_tls_ca
      common_name: health.bosh-dns
      duration: 365
      extended_key_usage:
      - server_auth
    type: certificate
- type: replace
  path: /variables?/-
  value:
    name: dns_healthcheck_client_tls
    options:
      ca: dns_healthcheck_tls_ca
      common_name: health.bosh-dns
      duration: 365
      extended_key_usage:
      - client_auth
    type: certificate
```
Update: based on testing, adding a SAN to the certs does solve the issue and restores functionality without adding a GODEBUG line like this one: https://github.com/cloudfoundry/loggregator-agent-release/blob/ae131388bdaff57493967088e3fc9438837a185c/jobs/loggr-forwarder-agent-windows/monit#L31
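For reference, that GODEBUG escape hatch is an environment variable set before the Go process starts (the linked monit file does this for the loggregator agent). A sketch of what it would look like; note the flag only exists in Go 1.15/1.16 and was removed in Go 1.17, so the SAN fix is the durable one:

```sh
# Stopgap only: re-enable legacy Common Name matching in Go 1.15/1.16
# binaries. Would need to be set in the job's start script / monit wrapper.
export GODEBUG="x509ignoreCN=0"
```

The working config with SANs added: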
```yaml
variables:
- name: "/dns_healthcheck_tls_ca"
  options:
    common_name: dns-healthcheck-tls-ca
    is_ca: true
  type: certificate
- name: "/dns_healthcheck_server_tls"
  options:
    ca: "/dns_healthcheck_tls_ca"
    common_name: health.bosh-dns
    alternative_names: # NEW LINE
    - health.bosh-dns  # NEW LINE
    extended_key_usage:
    - server_auth
  type: certificate
- name: "/dns_healthcheck_client_tls"
  options:
    ca: "/dns_healthcheck_tls_ca"
    common_name: health.bosh-dns
    alternative_names: # NEW LINE
    - health.bosh-dns  # NEW LINE
    extended_key_usage:
    - client_auth
  type: certificate
- name: "/dns_api_tls_ca"
  options:
    common_name: dns-api-tls-ca
    is_ca: true
  type: certificate
- name: "/dns_api_server_tls"
  options:
    ca: "/dns_api_tls_ca"
    common_name: api.bosh-dns
    alternative_names: # NEW LINE
    - api.bosh-dns     # NEW LINE
    extended_key_usage:
    - server_auth
  type: certificate
- name: "/dns_api_client_tls"
  options:
    ca: "/dns_api_tls_ca"
    common_name: api.bosh-dns
    alternative_names: # NEW LINE
    - api.bosh-dns     # NEW LINE
    extended_key_usage:
    - client_auth
  type: certificate
```
For my testing, the `/dns_api_tls_ca` and `/dns_healthcheck_tls_ca` certificates were still valid, so I deleted the api and health TLS certs, updated the `variables:` section as above, and re-deployed. This caused CredHub to create new TLS certs from the same CAs, so as the deploy rolled through, clients trusted both the old and the new certs. After the certs were updated I moved bosh-dns back to 1.28.0; the errors were gone and clients could resolve bosh-dns-based records again.
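In command form, the rotation above was roughly the following sketch (variable names from the `variables:` section; assumes a logged-in `credhub` CLI, and the usual caveat that deleting credentials should be done deliberately):

```sh
# Delete only the leaf certs; the CAs stay put.
credhub delete -n /dns_healthcheck_server_tls
credhub delete -n /dns_healthcheck_client_tls
credhub delete -n /dns_api_server_tls
credhub delete -n /dns_api_client_tls
# The next deploy regenerates them (now with SANs) from the same CAs,
# so old and new certs share a trust root while the change rolls through.
```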
This issue was marked as `Stale` because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days it will automatically be closed. To prevent this from happening, remove the `Stale` label or comment below.
This issue was closed because it has been labeled `Stale` for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.
This AM we rolled the 1.28.0 release into our dev environment and it broke the latest cf-deployment release. Rolling back to 1.27.0 solved the issue for now, and we have pinned our bosh-deployment pipeline back a release to stay on 1.27.0.
The issue was that bosh-dns clients could not resolve any internal service names, because the various bosh-dns clients running on the CF VMs could not talk to each other due to TLS cert errors. The only change was rolling to 1.28.0; the DNS health/api certs in credhub have not changed for some time and still had valid date ranges.
Affected clients had this error in their `bosh_dns.stdout.log` (IPs replaced with `<internal ip>` to redact our internal VM IP scheme). This appears to be related to the 1.28.0 release bumping Go to 1.15.
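For context (this exact line is not from our logs, but it is the characteristic Go 1.15 failure): a Go 1.15 client verifying a cert that only sets a Common Name fails with an error along the lines of:

```
x509: certificate relies on legacy Common Name field, use SANs or temporarily
enable Common Name matching with GODEBUG=x509ignoreCN=0
```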
Checking the current bosh-deployment repo, there are no changes to the runtime config for BOSH DNS apart from the release bump, and no credhub variable definition changes for the bosh-dns certs.
I will test later creating some new bosh-dns certs with SANs that match the common name field, to see whether that works.
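One way to test that without a full redeploy might be regenerating a cert directly with the credhub CLI. A sketch only; the flag spellings are from the credhub CLI docs as I remember them, so verify against `credhub generate --help` first:

```sh
# Hypothetical one-off regeneration with a SAN that matches the CN.
credhub generate -t certificate -n /dns_healthcheck_server_tls \
  --ca /dns_healthcheck_tls_ca \
  --common-name health.bosh-dns \
  --alternative-name health.bosh-dns \
  --ext-key-usage server_auth
```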