Closed wmhutchison closed 1 year ago
Issue is now resolved; copy/pasting the last email regarding this below.
Sent: March 13, 2023 9:36 AM
Subject: RE: DNS response for AAAA
Good morning all,
After reviewing the DNS configuration on the two upgraded DCs, it was noted that a registry setting was missed in the initial configuration. This registry setting (redacted) was added over the weekend and DNS was restarted. This additional configuration step has been added to our 2012 – 2019 build documentation and will be carried across all remaining DC upgrades.
Describe the issue
Internal monitoring for the OpenShift clusters SILVER and GOLD noted an increased failure rate for DNS resolution starting Saturday morning (February 25th, 2023). This was not considered impactful at the time, but that assessment changed when a user reported failed services (61d198-prod on SILVER, on February 27th). Troubleshooting traced those failures to a specific image, due to how the baseline OS of that image handled the DNS failures.
Further investigation by Platform Operations found that the primary DNS server for all of the OpenShift managed clusters in Private Cloud had been upgraded this past weekend by the OCIO DNS team to a newer operating system. The newer server changed how IPv6 (AAAA) queries for names with no IPv6 record are handled. Before the change, such queries returned an NXDOMAIN response, which is an error but one that allows operations to continue. The upgraded DNS server now returns a SERVFAIL response, which is handled differently: the resolver does not go on to try the IPv4 query, which would normally succeed for a functioning DNS record.
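To compare what each resolver returns for an AAAA query, the response code can be read straight from the DNS reply header. The following is a minimal sketch using only the Python standard library. The function names (`build_query`, `parse_rcode`, `probe_aaaa`) are hypothetical and not part of any tooling used in this issue, and the server address and hostname in the usage comment are placeholders. It only reports the response code; it does not reproduce how any particular image or resolver library reacts to it.

```python
import socket
import struct

# DNS response codes (RFC 1035) relevant to this issue.
RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN"}
QTYPE_AAAA = 28  # IPv6 address record
QCLASS_IN = 1    # Internet class


def build_query(hostname: str, qtype: int = QTYPE_AAAA, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet containing a single question."""
    # Header: id, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, QCLASS_IN)


def parse_rcode(response: bytes) -> str:
    """Return the response code name from a raw DNS reply."""
    if len(response) < 12:
        raise ValueError("DNS response shorter than the 12-byte header")
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F  # RCODE is the low 4 bits of the flags field
    return RCODE_NAMES.get(rcode, f"RCODE{rcode}")


def probe_aaaa(server: str, hostname: str, timeout: float = 2.0) -> str:
    """Send one AAAA query over UDP to `server` and return the response code name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(hostname), (server, 53))
        response, _ = sock.recvfrom(512)
    return parse_rcode(response)


# Usage (placeholder values, assumed for illustration):
#   print(probe_aaaa("192.0.2.53", "service.example.internal"))
```

Running `probe_aaaa` against each of the two resolvers for the same hostname would show whether one answers with SERVFAIL where the other does not.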
OpenShift is by design a native IPv6 system, so we cannot turn off the IPv6 portion of the DNS queries. We will need to work with the OCIO DNS team to get to the bottom of the change in how missing IPv6 queries are handled and have them restore the original behavior.
For now, we have applied a workaround: since only one of the two DNS resolvers has been upgraded, the un-upgraded resolver is now the primary (first) DNS resolver used.
Blocked By
Right now the ball is firmly in the court of the OCIO DNS management team, who have agreed to halt server upgrades involving DNS resolver services until this issue is addressed and are actively investigating why the server upgrade causes the bad DNS resolution behavior.
Additional context
How does this benefit the users of our platform?
Ensuring that consumed shared services such as DNS work in an expected fashion, thus ensuring maximum stability for our managed OpenShift clusters.
Definition of done