Openshift DNS resolution issues detected on all clusters due to DNS server migration to newer OS

Describe the issue

Internal monitoring for the OpenShift clusters SILVER and GOLD noted an increased DNS resolution failure rate starting on the morning of Saturday, February 25th, 2023. The failures were not considered impactful at the time, but that assessment changed on February 27th when a user (61d198-prod on SILVER) reported failing services. Troubleshooting traced this to an issue specific to one image, caused by that image's base OS and how it handled the DNS failures.

Further investigation by Platform Operations found that the primary DNS server for all of the OpenShift managed clusters in Private Cloud had been upgraded this past weekend by the OCIO DNS team to a newer operating system. The newer server changed how DNS queries for missing IPv6 records are handled. Prior to the change, such a query returned an NXDOMAIN response, which is an error but one that allows operations to continue. The upgraded DNS server now returns a SERVFAIL response, which is handled differently: the resolver does not continue on to the IPv4 query, which would normally succeed for a functioning DNS record.
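The difference between the two responses can be sketched as a toy model. This is only an illustration of the behavior described in this issue, not the logic of any particular resolver library; the `resolve` function and its parameters are hypothetical names introduced for the example.

```python
from enum import Enum


class Rcode(Enum):
    """Standard DNS response codes relevant to this issue."""
    NOERROR = 0
    SERVFAIL = 2
    NXDOMAIN = 3


def resolve(aaaa_rcode, a_rcode, a_addresses):
    """Toy model (hypothetical) of a dual-stack lookup as described in this issue.

    aaaa_rcode:  response code returned for the IPv6 query
    a_rcode:     response code returned for the IPv4 query
    a_addresses: addresses the IPv4 query would return on success
    """
    # SERVFAIL on the IPv6 query is treated as a hard failure:
    # the lookup stops and the IPv4 answer is never used.
    if aaaa_rcode is Rcode.SERVFAIL:
        raise LookupError("IPv6 query returned SERVFAIL; lookup aborted")

    # NXDOMAIN on the IPv6 query is an error, but the lookup continues
    # and the IPv4 answer is used if it is available.
    if a_rcode is Rcode.NOERROR and a_addresses:
        return list(a_addresses)

    raise LookupError("no usable IPv4 record")
```

With this model, an NXDOMAIN on the IPv6 query still yields the IPv4 addresses, while a SERVFAIL on the IPv6 query fails the whole lookup even though the IPv4 record is healthy. The address used in any call (for example `192.0.2.10`) would be a placeholder, not a real cluster address.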

Due to the nature of the technology, OpenShift is a native IPv6 system, so we cannot turn off the IPv6 portion of the DNS queries. We will need to work with the OCIO DNS team to find out why the handling of missing IPv6 queries changed and have them restore the original behavior.

For now, we have applied a workaround: since only one of the two DNS resolvers has been upgraded, the un-upgraded resolver is now listed as the primary (first) DNS resolver.
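The reordering idea behind the workaround can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `prioritize_unupgraded` is hypothetical, the resolver addresses are placeholders from a documentation range, and the actual change was made through the cluster configuration via the RFCs listed below, not with this code.

```python
def prioritize_unupgraded(resolvers, upgraded):
    """Return the resolver list with un-upgraded servers first.

    resolvers: resolver addresses in their current order
    upgraded:  set of addresses known to run the upgraded OS

    sorted() is stable, so the relative order within each group is kept.
    False sorts before True, so un-upgraded resolvers come first.
    """
    return sorted(resolvers, key=lambda address: address in upgraded)


# Placeholder addresses (assumed for illustration only).
current_order = ["192.0.2.1", "192.0.2.2"]
upgraded_servers = {"192.0.2.1"}
new_order = prioritize_unupgraded(current_order, upgraded_servers)
```

Here `new_order` lists `192.0.2.2` first, so the upgraded server is only consulted as the secondary resolver.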

Blocked By

The next step is with the OCIO DNS management team, who have agreed to halt server upgrades involving DNS resolver services until this issue is addressed and are actively investigating why the server upgrade causes the bad DNS resolution behavior.

Additional context

Rocket Chat thread from user reporting issues caused by DNS resolution: https://chat.developer.gov.bc.ca/channel/devops-sos?msg=DALTXJMuhtHFscFjL
INC0076535: P2 incident ticket created for the errors noted on the GOLD and SILVER OpenShift clusters. It remained open until the emergency RFCs were successfully applied to both clusters, was closed on February 28th, and was followed by a Problem ticket.
CHG0046143, CHG0046145: Emergency RFCs for remediating the SILVER and GOLD clusters. Both were scheduled and completed on February 28th.
PRB0040825: Problem ticket opened after the incident ticket was closed. It will be used to track the ongoing investigation into the underlying issue (the OCIO DNS team's upgraded DNS servers handle missing IPv6 DNS queries differently and need to be reconfigured to act as before, if possible).
CHG TBD: RFC to be created for remediating GOLDDR this Thursday.

How does this benefit the users of our platform?

Ensuring that consumed shared services such as DNS behave as expected, which maximizes stability for our managed OpenShift clusters.

Definition of done

[x] Create INC ticket and escalate from P3 to P2.
[x] Identify issue (failed IPv6 DNS queries now handled differently) and devise workaround (change DNS resolver lists so that un-upgraded DNS servers are listed first).
[x] Create emergency RFCs and execute for SILVER and GOLD clusters.
[x] Create PRB ticket for follow-up investigation work and to act as the driver for any further non-emergency remediation RFCs.
[x] Create and execute RFC for remediating GOLDDR, aiming for a change window this Thursday morning.
[x] Execute RFC for remediating EMERALD the same week as the OCP upgrade RFC for EMERALD. OCP upgrade will drive the DNS change refresh.
[x] Follow up with OCIO DNS to determine next steps for resolving the DNS issue introduced by upgrading the DNS servers.

BCDevOps / developer-experience

Openshift DNS resolution issues detected on all clusters due to DNS server migration to newer OS #3565