Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL

tnqn commented 3 months ago

Describe what you are trying to solve

I was working with @scoobed to debug an issue with a NetworkPolicy FQDN rule in a cluster where a Pod intermittently failed to connect to the FQDN. After realizing the application was Java-based, I found that in many cases the JVM enables a DNS cache that uses a configured TTL, as described below, instead of respecting the TTL value in the DNS response.

networkaddress.cache.ttl

Specified in java.security to indicate the caching policy for successful name lookups from the name service. The value is specified as integer to indicate the number of seconds to cache the successful lookup. A value of -1 indicates "cache forever". The default behavior is to cache forever when a security manager is installed, and to cache for an implementation specific period of time, when a security manager is not installed.

How the problem typically happened:

Pod made a DNS request for an FQDN.
Antrea inspected the DNS response and associated the FQDN with the IPs in the response.
Pod connected to one of the IPs successfully because Antrea was aware of the IP.
Antrea refreshed the FQDN resolution and found that the previous IPs were no longer present in the response, so it removed those IPs once the TTL set in the previous response expired.
Pod tried to connect to the FQDN again, but skipped querying the FQDN's IP because of its own cache (with a fixed TTL). The connection failed because the cached IP was no longer allowed by the datapath.
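
The failure window in the steps above can be illustrated with a small model. This is only a sketch under simplifying assumptions: time is counted in whole seconds from the first DNS query, the upstream record changes right after that query, and the function name `connectionSucceeds` and the TTL values (5s in the DNS response, 30s in the application cache) are hypothetical, not taken from Antrea or any specific JVM.

```go
package main

import "fmt"

// connectionSucceeds models a connection attempt made t seconds after the
// Pod's first DNS query. datapathTTL is how long the datapath keeps allowing
// the originally resolved IP; appCacheTTL is how long the application keeps
// reusing that IP without querying DNS again. All values are hypothetical.
func connectionSucceeds(t, datapathTTL, appCacheTTL int) bool {
	if t >= appCacheTTL {
		// The application cache expired, so the Pod queries DNS again.
		// The response is inspected and the new IP is allowed.
		return true
	}
	// The application reuses its cached IP, which is only allowed
	// while the datapath entry has not expired.
	return t < datapathTTL
}

func main() {
	const datapathTTL, appCacheTTL = 5, 30
	for _, t := range []int{3, 10, 31} {
		fmt.Printf("t=%ds allowed=%v\n", t, connectionSucceeds(t, datapathTTL, appCacheTTL))
	}
}
```

In this model, any attempt made after the datapath entry expires but before the application cache expires is rejected, which matches the intermittent failures described above.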

@scoobed also confirmed that the problem went away when using NodeLocal DNS. This is most likely due to special handling in the buildpack, which disables the JVM DNS cache when it detects that the DNS server is a link-local address: https://github.com/paketo-buildpacks/libjvm/blob/79182aa17fa3e49424f511dd0070dd66bdc1a3ec/helper/link_local_dns.go#L34-L64
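
For readers unfamiliar with this kind of check, here is a minimal sketch of how link-local DNS detection could look. It is not the buildpack's code: the function name `hasLinkLocalNameserver` is hypothetical, and the sketch assumes the DNS servers are listed as `nameserver` lines in resolv.conf-style text. It relies only on the Go standard library (`net.ParseIP` and `IP.IsLinkLocalUnicast`).

```go
package main

import (
	"bufio"
	"fmt"
	"net"
	"strings"
)

// hasLinkLocalNameserver reports whether any "nameserver" entry in the given
// resolv.conf-style content is a link-local unicast address
// (169.254.0.0/16 for IPv4, fe80::/10 for IPv6).
func hasLinkLocalNameserver(resolvConf string) bool {
	scanner := bufio.NewScanner(strings.NewReader(resolvConf))
	for scanner.Scan() {
		fields := strings.Fields(scanner.Text())
		if len(fields) < 2 || fields[0] != "nameserver" {
			continue
		}
		if ip := net.ParseIP(fields[1]); ip != nil && ip.IsLinkLocalUnicast() {
			return true
		}
	}
	return false
}

func main() {
	// Example inputs; the addresses are illustrative only.
	nodeLocal := "search default.svc.cluster.local\nnameserver 169.254.20.10\n"
	clusterDNS := "search default.svc.cluster.local\nnameserver 10.96.0.10\n"
	fmt.Println(hasLinkLocalNameserver(nodeLocal))
	fmt.Println(hasLinkLocalNameserver(clusterDNS))
}
```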

As this may affect many Java-based applications and not all clusters enable NodeLocal DNS, I have been thinking about how to better support this scenario without requiring all application developers to disable their DNS cache or to respect the TTL in DNS responses (which is even harder than the former). One solution I came up with is to provide a configuration like minTTL, which determines the minimum time for which DNS resolutions are cached. If a DNS response's TTL is less than minTTL, the actual TTL in the datapath will be minTTL. Note that the TTL cache is not per Pod, so minTTL will be a global configuration that applies to all Pods (I can't think of any actual drawback caused by this, other than slightly higher memory consumption). Even if different Pods have different hard-coded DNS cache TTLs, minTTL can simply be set to the maximum of them. Typically it could just be set to the default JVM DNS cache TTL or a larger value.

Note that this still requires the application DNS cache not to cache forever.

Describe how your solution impacts user flows

The cluster admin should configure minTTL to be equal to or larger than the maximum TTL value of the application DNS caches.

Alternative solutions that you considered

Require users to disable application-level DNS cache.

Test plan

e2e: validate that applications with a DNS cache can reliably access the target FQDN while the FQDN resolution changes frequently.

tnqn commented 3 months ago

@jianjuns @antoninbas @Dyanngg please let me know what you think about the proposal.

jianjuns commented 3 months ago

The proposal sounds good to me.

aerosouund commented 8 hours ago

Hello everyone, I would be interested in working on this as part of the LFX mentorship in the upcoming term. I will investigate the problem further using the available resources and come back if I have any questions.

antoninbas commented 2 hours ago

We have submitted this issue as a project idea for the LFX mentorship program: https://github.com/cncf/mentoring/pull/1278.

See https://docs.linuxfoundation.org/lfx/mentorship for more information on the program.

Assuming our proposal is accepted, we will publish instructions for candidates here (as a new issue comment) with a link to a test task to be completed as part of the application. The test task helps us in several different ways: 1) it ensures that applicants have read this issue and become familiar with the goals, 2) it ensures that applicants have shown some interest in the project and have the basic skills required to build Antrea and contribute to it, and 3) it helps us (the mentors) with candidate selection as we can look at the overall quality of submissions. Note that we do not expect the test task to take more than 1 or 2 hours, as our goal is not to impose a big burden on all applicants.

We hope that there will be a lot of interest in this issue and the mentorship program. However, we ask that candidate mentees do not comment on this issue just to express their interest / desire to work on the issue. This can create a lot of noise. An upstream issue like this is primarily meant for technical discussion around the issue and the proposed solution. We want to keep the discussion thread relevant and easy to navigate for maintainers and contributors. Please post any questions about the LFX program and how to apply on the mentorship discussion forums.

antrea-io / antrea

Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL #6229