Kong / lua-resty-dns-client

Lua DNS client, load balancer, and utility library
http://kong.github.io/lua-resty-dns-client/topics/README.md.html
Apache License 2.0

Kong upstream health flip-flopping due to TTL=0 handling #131

Open onematchfox opened 3 years ago

onematchfox commented 3 years ago

Sorry, this is a long one!

I am in the process of upgrading from Kong 1.5.0 to Kong 2.4.1. During load testing we noticed a regression that results in upstream health status flip-flopping between HEALTHY and UNHEALTHY. In further testing, we noticed that this behaviour occurs not only under load but also when Kong is idling (a local container left running overnight). I tracked this down to the ttl=0 handling in this library, which seems to have changed (regressed?) between versions 4.1.2 and 6.0.0.

This issue was previously reported in https://github.com/Kong/kong/issues/3641 and https://github.com/Kong/lua-resty-dns-client/issues/51. Also related is https://github.com/Kong/kong/issues/5477.

One important(?) observation I have made in relation to the above issues is that obtaining multiple ttl=0 records in succession can be reproduced relatively easily using the DNS name of any public AWS load balancer. Running the script:

#!/bin/bash

# Repeatedly resolve the (placeholder) ALB hostname for 60 seconds,
# querying roughly every 100ms to capture the TTL countdown.
timeout=$(date -d "+60 seconds" +%s)
while [ "$(date +%s)" -lt "$timeout" ]; do
  dig myalb-9999999999.us-east-1.elb.amazonaws.com
  sleep 0.1
done

produces

dig.txt

; <<>> DiG 9.16.1-Ubuntu <<>> myalb-9999999999.us-east-1.elb.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10029
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;myalb-9999999999.us-east-1.elb.amazonaws.com. IN A

;; ANSWER SECTION:
myalb-9999999999.us-east-1.elb.amazonaws.com. 1 IN A 0.0.0.0
myalb-9999999999.us-east-1.elb.amazonaws.com. 1 IN A 0.0.0.0

;; Query time: 4 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: tor maj 27 13:55:55 CEST 2021
;; MSG SIZE  rcvd: 113

; <<>> DiG 9.16.1-Ubuntu <<>> myalb-9999999999.us-east-1.elb.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 211
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;myalb-9999999999.us-east-1.elb.amazonaws.com. IN A

;; ANSWER SECTION:
myalb-9999999999.us-east-1.elb.amazonaws.com. 1 IN A 0.0.0.0
myalb-9999999999.us-east-1.elb.amazonaws.com. 1 IN A 0.0.0.0

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: tor maj 27 13:55:55 CEST 2021
;; MSG SIZE  rcvd: 113

; <<>> DiG 9.16.1-Ubuntu <<>> myalb-9999999999.us-east-1.elb.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 4801
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;myalb-9999999999.us-east-1.elb.amazonaws.com. IN A

;; ANSWER SECTION:
myalb-9999999999.us-east-1.elb.amazonaws.com. 0 IN A 0.0.0.0
myalb-9999999999.us-east-1.elb.amazonaws.com. 0 IN A 0.0.0.0

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: tor maj 27 13:55:55 CEST 2021
;; MSG SIZE  rcvd: 113

.................
.................
.................

; <<>> DiG 9.16.1-Ubuntu <<>> myalb-9999999999.us-east-1.elb.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 15075
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;myalb-9999999999.us-east-1.elb.amazonaws.com. IN A

;; ANSWER SECTION:
myalb-9999999999.us-east-1.elb.amazonaws.com. 0 IN A 0.0.0.0
myalb-9999999999.us-east-1.elb.amazonaws.com. 0 IN A 0.0.0.0

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: tor maj 27 13:55:56 CEST 2021
;; MSG SIZE  rcvd: 113

; <<>> DiG 9.16.1-Ubuntu <<>> myalb-9999999999.us-east-1.elb.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 51043
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;myalb-9999999999.us-east-1.elb.amazonaws.com. IN A

;; ANSWER SECTION:
myalb-9999999999.us-east-1.elb.amazonaws.com. 60 IN A 52.22.203.100
myalb-9999999999.us-east-1.elb.amazonaws.com. 60 IN A 0.0.0.0

;; Query time: 60 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: tor maj 27 13:55:56 CEST 2021
;; MSG SIZE  rcvd: 113

; <<>> DiG 9.16.1-Ubuntu <<>> myalb-9999999999.us-east-1.elb.amazonaws.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64411
;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 65494
;; QUESTION SECTION:
;myalb-9999999999.us-east-1.elb.amazonaws.com. IN A

;; ANSWER SECTION:
myalb-9999999999.us-east-1.elb.amazonaws.com. 59 IN A 0.0.0.0
myalb-9999999999.us-east-1.elb.amazonaws.com. 59 IN A 0.0.0.0

;; Query time: 0 msec
;; SERVER: 127.0.0.53#53(127.0.0.53)
;; WHEN: tor maj 27 13:55:56 CEST 2021
;; MSG SIZE  rcvd: 113

So it would seem that this is not exactly an edge case, and that Amazon's DNS resolution follows more of a floor (rather than round) algorithm when reporting TTLs. Thus, I also believe that the fix in #56 (checking for two responses in a row with ttl=0) is brittle and likely not sufficient in the current version of the library.
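For what it's worth, the same TTL behaviour can be observed through this library itself rather than via dig. Below is a minimal sketch (run under the resty CLI or another OpenResty context) using the client's documented init/resolve API; the hostname is the same placeholder as in the script above:

local client = require "resty.dns.client"

-- Default options; the options table is omitted for brevity.
assert(client.init())

-- Poll the (placeholder) ALB name for ~60 seconds and print the raw TTL
-- of each answer, to see whether consecutive ttl=0 responses show up.
for _ = 1, 600 do
  local answers, err = client.resolve("myalb-9999999999.us-east-1.elb.amazonaws.com")
  if not answers then
    print("resolve failed: ", err)
  else
    for _, record in ipairs(answers) do
      print(record.address or record.target, " ttl=", record.ttl)
    end
  end
  ngx.sleep(0.1)  -- non-blocking sleep, ~100ms between queries as in the script
end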

I suspect that the change in timing of the refresh interval introduced in #112 (reduced from 1 second, which would allow time for the 0 to roll over to n, down to 0.05, and later increased to 0.5 in #116) is the cause of this regression. Given the above, it seems plausible that two queries occurred within the same second, resulting in two consecutive ttl=0 records. However, increasing the interval to 1 second still did not resolve the issue for me, so I wonder if something else is at play. A number of fairly material changes have been made to the way this library operates between these versions, so perhaps one of the maintainers has an idea of why I would see multiple lookups occurring within the space of 1 second despite the change above?

@Tieske (Hi! Been a while! Sorry to drag you into this mess again). In the issue above you mention that this should be reported to Amazon. I'd be happy to try doing this, but not being any kind of DNS expert, I feel a little like this might be a battle of David vs Goliath. If you have any hard reference I could use as ammunition to make the case that Amazon is in the wrong, I would be grateful for your input.

To that end, I have tried reading up on DNS and was quite surprised to find how contentious the question of how to interpret ttl=0 is. However, it also led me to question this library's implementation of that handling. In general, this library seems to be in agreement with RFC 1034, section 3.6, which states:

While short TTLs can be used to minimize caching, and a zero TTL prohibits caching

However, the implementation of this (the "'abuse[d]' virtual SRV record" approach) seems problematic, and can even be seen to conflict with RFC 1123, section 6.1.2.1:

All DNS name servers and resolvers MUST properly handle RRs with a zero TTL: return the RR to the client but do not cache it

 DISCUSSION:
        Zero TTL values are interpreted to mean that the RR can
        only be used for the transaction in progress, and
        should not be cached; they are useful for extremely
        volatile data.

I acknowledge that this library is perhaps not strictly a resolver. However, in some ways it could be viewed as one, in that it provides an optimised caching layer over DNS lookups. The reason I say it conflicts with the above statement is that when an existing non-SRV record is returned with a TTL of 0, this library does not return the RR to the client. Instead, it creates a record of a new type, which in effect disregards the response that was received; and because of the change in type, the existing, valid record is evicted and the upstream is marked UNHEALTHY in Kong. This resonated with me, having read an opinion expressed in an old blog post, Never use DNS TTL of zero:

worst was Microsoft's DNS server, which apparently thought zero meant don't give this out as this answer is expired
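To make the behaviour I am describing concrete, here is a rough Lua paraphrase of how I understand the current ttl=0 branch to behave. All names here are invented for illustration; this is not the library's actual code:

local TTL_0_RETRY = 30  -- retry window for ttl=0 records; value assumed for illustration

-- Hypothetical paraphrase: on a ttl=0 answer, rather than returning the
-- RR, the library substitutes a "virtual" SRV record pointing back at the
-- queried name, cached for TTL_0_RETRY seconds so every use re-resolves.
local function on_ttl0_answer(qname, answer)
  assert(answer.ttl == 0)
  -- Because the cached record's type changes (A -> SRV), the balancer sees
  -- the existing, valid A record disappear: it is evicted rather than
  -- returned, and Kong flips the upstream to UNHEALTHY.
  return {
    type = "SRV",      -- type differs from the original A record
    name = qname,
    target = qname,    -- resolving the SRV target re-queries the same name
    ttl = TTL_0_RETRY,
  }
end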

Thus, I made another attempt to resolve this issue. In this approach, I maintained the existing functionality for a record that is explicitly set with ttl=0 (https://github.com/Kong/kong/issues/5477), i.e. caching the ttl=0 for TTL_0_RETRY seconds whilst returning an SRV record to ensure a lookup on every request. In that case, it is expected that the record will already have ttl=0 the first time it is resolved. The change comes in how an existing record that drops to ttl=0 is treated: the code now proceeds as normal, with the record subsequently being rescheduled for resolution in 0.5 seconds' time. Testing shows that this change works for me/has resolved my issue. However, it does mean that if a record were to change from having a TTL (one that reduces over time) to not having a TTL (permanently ttl=0), this would introduce additional CPU load on the Kong server, as that record would not receive the benefit of the TTL_0_RETRY caching. Or, I guess, it could even introduce issues if that record changes within the ~0.5 seconds between refreshes. In my mind, changing a record from having a TTL to not having one seems like more of an edge case than a record dropping to 0 only to go back up again, but I could be completely wrong here.
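Roughly, the decision logic I am proposing looks like this. Again, this is only a sketch: make_virtual_srv and schedule_refresh are hypothetical stand-ins for the library's actual machinery, and the constants are the ones discussed above:

local TTL_0_RETRY = 30        -- existing retry window; value assumed for illustration
local REFRESH_INTERVAL = 0.5  -- the ~0.5s refresh interval from #116

-- Hypothetical helpers standing in for the real implementation.
local function make_virtual_srv(answer, ttl)
  return { type = "SRV", name = answer.name, target = answer.name, ttl = ttl }
end

local function schedule_refresh(record, delay)
  -- stand-in for the library's timer-based requery of this record
end

local function handle_ttl0(existing_record, new_answer)
  if existing_record == nil then
    -- First resolution already returns ttl=0: keep the current behaviour,
    -- i.e. a virtual SRV record cached for TTL_0_RETRY seconds, so every
    -- request triggers a fresh lookup.
    return make_virtual_srv(new_answer, TTL_0_RETRY)
  end
  -- An existing record merely dropped to ttl=0 (the AWS "floor" case):
  -- keep it as-is instead of evicting it, and reschedule resolution
  -- shortly, so a transient 0 never flips upstream health.
  schedule_refresh(existing_record, REFRESH_INTERVAL)
  return existing_record
end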

So, I was hoping to get some input from the maintainers here. I would be happy to submit my suggestion as a PR if it would be considered. But, as is hopefully clear, I am not a DNS expert, so I am open to discussing the options here (including trying to take this up with Amazon, despite my preconception that it is likely to be a fruitless discussion).

Finally, I am attaching a sanitised log from a test run with the refresh interval set to 1, where the container was subsequently left idle overnight.

kong.load-test-ttl=0-refresh=1.log

DNS entries have been adjusted to reflect whether they are AWS records (Route53 vs. public LB) or registered with another provider.

The issue can be observed: