cloudfoundry / bosh-dns-release

BOSH DNS release
Apache License 2.0

bosh-dns 1.28.0 - dns cert errors #76

Closed ChrisMcGowan closed 3 years ago

ChrisMcGowan commented 3 years ago

This morning we rolled the 1.28.0 release into our dev environment and it broke the latest cf-deployment release. Rolling back to 1.27.0 solved the issue for now, and we have pinned our bosh-deployment pipeline back a release to stay on 1.27.0.

The bosh-dns client could not resolve any internal service names because the bosh-dns clients running on the CF VMs could not talk to each other due to TLS cert errors. The only change was rolling to 1.28.0; the DNS health/api certs in credhub had not changed for some time and still had valid date ranges.

Affected clients had this error in their bosh_dns.stdout.log (internal VM IPs are redacted as <internal ip>):

[HealthChecker] 2021-02-03T15:55:38.564150000Z WARN - network error connecting to <internal ip>: Performing GET request: Get "https://<internal ip>:8853/health": x509: certificate relies on legacy Common Name field, use SANs or temporarily enable Common Name matching with GODEBUG=x509ignoreCN=0

This appears to be related to the 1.28.0 release bumping Go to 1.15.
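The behavior change can be reproduced outside bosh-dns with a minimal Go sketch. It assumes the health client verifies the peer against the name health.bosh-dns, which is a guess based on the cert config and not confirmed from the bosh-dns source. The helper `selfSigned` is hypothetical and exists only for this illustration. On Go 1.15 or later, a certificate that carries the name only in its Common Name should fail hostname verification, while one that also lists it as a SAN should pass.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// selfSigned is a hypothetical helper: it builds a throwaway self-signed
// certificate with the given Common Name and (possibly empty) DNS SANs.
func selfSigned(commonName string, dnsNames []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(1),
		Subject:      pkix.Name{CommonName: commonName},
		DNSNames:     dnsNames,
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

func main() {
	// Same name in both cases; only the presence of a SAN differs.
	cnOnly, err := selfSigned("health.bosh-dns", nil)
	if err != nil {
		panic(err)
	}
	withSAN, err := selfSigned("health.bosh-dns", []string{"health.bosh-dns"})
	if err != nil {
		panic(err)
	}

	// VerifyHostname returns nil on a match and an error otherwise.
	fmt.Println("CN only: ", cnOnly.VerifyHostname("health.bosh-dns"))
	fmt.Println("with SAN:", withSAN.VerifyHostname("health.bosh-dns"))
}
```

The exact error text depends on the Go version, so the sketch only prints whatever `VerifyHostname` returns instead of matching on a message.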

Checking the current bosh-deployment repo, there are no changes to the runtime-config for BOSH DNS other than the release bump, and no credhub variable definition changes for the bosh-dns certs.

Later I will test creating new bosh-dns certs with SANs that match the common name field to see whether that works.

cf-gitbot commented 3 years ago

We have created an issue in Pivotal Tracker to manage this:

https://www.pivotaltracker.com/story/show/176786535

The labels on this github issue will be updated when the story is started.

peterellisjones commented 3 years ago

We are seeing the same issue after updating from BOSH DNS 1.27.0 to 1.28.0. In addition to the warnings above, we see:

==> bosh_dns_health.stderr.log <==
2021/02/04 08:38:19 http: TLS handshake error from 192.168.2.20:34388: remote error: tls: bad certificate
2021/02/04 08:38:22 http: TLS handshake error from 192.168.2.201:30496: remote error: tls: bad certificate
2021/02/04 08:38:24 http: TLS handshake error from 192.168.2.211:41362: remote error: tls: bad certificate
2021/02/04 08:38:25 http: TLS handshake error from 192.168.2.16:39920: remote error: tls: bad certificate
2021/02/04 08:38:25 http: TLS handshake error from 192.168.2.202:6492: remote error: tls: bad certificate
2021/02/04 08:38:25 http: TLS handshake error from 192.168.2.212:14834: remote error: tls: bad certificate

peterellisjones commented 3 years ago

Update: I've confirmed our DNS cert config does not include SANs, so I will try updating the certs so that the SAN matches the common name and report back.

Current (broken) cert config:

- type: replace
  path: /variables?/-
  value:
    name: dns_healthcheck_tls_ca
    options:
      common_name: dns-healthcheck-tls-ca
      is_ca: true
      duration: 365
    type: certificate

- type: replace
  path: /variables?/-
  value:
    name: dns_healthcheck_server_tls
    options:
      ca: dns_healthcheck_tls_ca
      common_name: health.bosh-dns
      duration: 365
      extended_key_usage:
      - server_auth
    type: certificate

- type: replace
  path: /variables?/-
  value:
    name: dns_healthcheck_client_tls
    options:
      ca: dns_healthcheck_tls_ca
      common_name: health.bosh-dns
      duration: 365
      extended_key_usage:
      - client_auth
    type: certificate

ChrisMcGowan commented 3 years ago

Update:

Based on testing, adding a SAN to the certs does solve the issue and restores functionality without needing a GODEBUG line like this one: https://github.com/cloudfoundry/loggregator-agent-release/blob/ae131388bdaff57493967088e3fc9438837a185c/jobs/loggr-forwarder-agent-windows/monit#L31

variables:
- name: "/dns_healthcheck_tls_ca"
  options:
    common_name: dns-healthcheck-tls-ca
    is_ca: true
  type: certificate
- name: "/dns_healthcheck_server_tls"
  options:
    ca: "/dns_healthcheck_tls_ca"
    common_name: health.bosh-dns
    alternative_names:  #NEW LINE
    - health.bosh-dns  #NEW LINE
    extended_key_usage:
    - server_auth
  type: certificate
- name: "/dns_healthcheck_client_tls"
  options:
    ca: "/dns_healthcheck_tls_ca"
    common_name: health.bosh-dns
    alternative_names:  #NEW LINE
    - health.bosh-dns  #NEW LINE
    extended_key_usage:
    - client_auth
  type: certificate
- name: "/dns_api_tls_ca"
  options:
    common_name: dns-api-tls-ca
    is_ca: true
  type: certificate
- name: "/dns_api_server_tls"
  options:
    ca: "/dns_api_tls_ca"
    common_name: api.bosh-dns
    alternative_names:  #NEW LINE
    - api.bosh-dns  #NEW LINE
    extended_key_usage:
    - server_auth
  type: certificate
- name: "/dns_api_client_tls"
  options:
    ca: "/dns_api_tls_ca"
    common_name: api.bosh-dns
    alternative_names: #NEW LINE
    - api.bosh-dns   #NEW LINE
    extended_key_usage:
    - client_auth
  type: certificate

For my testing, my /dns_api_tls_ca and /dns_healthcheck_tls_ca certificates were still valid, so I deleted the api and health TLS certs, updated the variables: section, and re-deployed. This caused credhub to create new TLS certs from the same CAs, so as the deploy rolled through, the clients still trusted both the old and new certs. After the certs were updated I moved bosh-dns back to 1.28.0; the errors were gone and clients could resolve bosh-dns based records again.
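Why the rolling update stayed healthy can be illustrated with a small Go sketch: leaf certs issued before and after the change both chain to the same CA, so a client holding only that CA trusts either one. The helpers `newCA`, `issueLeaf`, and `chainTrusted` are hypothetical names for this illustration, and the certs are throwaway ones generated in memory, not the credhub-generated certs. Only chain trust is checked here; hostname matching is a separate step.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// newCA (hypothetical helper) creates a throwaway self-signed CA and its key.
func newCA(cn string) (*x509.Certificate, *ecdsa.PrivateKey, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: cn},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		IsCA:                  true,
		BasicConstraintsValid: true,
		KeyUsage:              x509.KeyUsageCertSign,
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return nil, nil, err
	}
	cert, err := x509.ParseCertificate(der)
	return cert, key, err
}

// issueLeaf (hypothetical helper) signs a server certificate with the CA.
func issueLeaf(ca *x509.Certificate, caKey *ecdsa.PrivateKey, serial int64, cn string, sans []string) (*x509.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber: big.NewInt(serial),
		Subject:      pkix.Name{CommonName: cn},
		DNSNames:     sans,
		NotBefore:    time.Now().Add(-time.Hour),
		NotAfter:     time.Now().Add(time.Hour),
		KeyUsage:     x509.KeyUsageDigitalSignature,
		ExtKeyUsage:  []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
	}
	der, err := x509.CreateCertificate(rand.Reader, tmpl, ca, &key.PublicKey, caKey)
	if err != nil {
		return nil, err
	}
	return x509.ParseCertificate(der)
}

// chainTrusted reports whether leaf chains to ca. No DNSName is set in the
// verify options, so hostname matching is deliberately skipped.
func chainTrusted(ca, leaf *x509.Certificate) bool {
	pool := x509.NewCertPool()
	pool.AddCert(ca)
	_, err := leaf.Verify(x509.VerifyOptions{Roots: pool})
	return err == nil
}

func main() {
	ca, caKey, err := newCA("dns-healthcheck-tls-ca")
	if err != nil {
		panic(err)
	}
	// "Old" cert: Common Name only. "New" cert: same name added as a SAN.
	oldLeaf, err := issueLeaf(ca, caKey, 2, "health.bosh-dns", nil)
	if err != nil {
		panic(err)
	}
	newLeaf, err := issueLeaf(ca, caKey, 3, "health.bosh-dns", []string{"health.bosh-dns"})
	if err != nil {
		panic(err)
	}
	fmt.Println("old cert chains to CA:", chainTrusted(ca, oldLeaf))
	fmt.Println("new cert chains to CA:", chainTrusted(ca, newLeaf))
}
```

If the CA itself had been regenerated, the old and new leaf certs would no longer share a trust root, and a rolling deploy would likely need the usual multi-step CA rotation instead.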

bosh-admin-bot commented 3 years ago

This issue was marked as Stale because it has been open for 21 days without any activity. If no activity takes place in the coming 7 days, it will automatically be closed. To prevent this from happening, remove the Stale label or comment below.

bosh-admin-bot commented 3 years ago

This issue was closed because it has been labeled Stale for 7 days without subsequent activity. Feel free to re-open this issue at any time by commenting below.