Closed jwardle closed 3 years ago
Further to the above, I have found that one of the nameservers in /etc/resolv.conf within the containers does not expose a DNS server; the second nameserver does (it's the Service Fabric host node, with the DNSService running). Altering resolv.conf to remove the faulty nameserver results in no requests being successful, not even intermittently. Setting the Kong resolver IP to the 'correct' nameserver IP via the KONG_DNS_RESOLVER env var has the same effect. I find this strange: when I remove the faulty nameserver from resolv.conf, dig returns correct results and nslookup continues to function; the only thing that cannot resolve the URL is Kong! With the faulty nameserver left in resolv.conf, dig returns SERVFAIL (as it hits the first nameserver, which does not actually expose the DNS service, I assume).
/etc/resolv.conf
nameserver 10.14.2.101 <--- Working DNS server, able to resolve upstream.hostname
nameserver 172.20.48.1 <--- Faulty IP without a DNS server exposed. I have removed/reordered it in resolv.conf, however no luck
search searchdomain.local
Successful Dig results for the upstream domain
/ # dig upstream.hostname
; <<>> DiG 9.11.6-P1 <<>> upstream.hostname
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50344
;; flags: qr aa rd ra ad; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 8192
;; QUESTION SECTION:
;upstream.hostname. IN A
;; ANSWER SECTION:
upstream.hostname. 1 IN A 10.14.2.101
upstream.hostname. 1 IN A 10.14.2.100
;; Query time: 3 msec
;; SERVER: 10.14.2.101#53(10.14.2.101)
;; WHEN: Sun Jun 23 16:41:09 UTC 2019
;; MSG SIZE rcvd: 154
I'm really struggling to understand why what dig/nslookup sees is different from what Kong is seeing. Is there a way to log the DNS requests/responses, or at least the IP being queried?
log line in your original post:
[
"(short)upstream.hostname:(na) - cache-miss",
"upstream.hostname:1 - cache-miss/querying/dns server error: 2 server failure",
"upstream.hostname.searchdomain.local:1 - cache-miss/querying/41: removed/dns server error: 3 name error",
"upstream.hostname:33 - cache-miss/querying/dns server error: 2 server failure",
"upstream.hostname.searchdomain.local:33 - cache-miss/querying/dns server error: 2 server failure",
"upstream.hostname:5 - cache-miss/querying/41: removed/dns client error: 101 empty record received",
"upstream.hostname.searchdomain.local:5 - cache-miss/querying/dns server error: 2 server failure"
]
This is a mixed bag of errors.
"2 server failure": clearly indicates that the DNS server failed (server indicated this)
"3 name error": usually the name is not found (server indicated this)
"101 empty record": means the server responded, but with an empty answer
"41: removed": I'd need to look up.
Kong will retry if it hits a bad server or a server failure. Kong will not retry on a "3 name error", since that is the DNS server telling Kong the name doesn't exist, hence that is a valid answer as far as Kong knows.
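To make the try-list above easier to read, here is a minimal Python sketch that splits each entry into the queried name, the record type, and the status steps. The record type numbers (1 = A, 5 = CNAME, 33 = SRV, 41 = OPT) are the standard IANA assignments. The function name decode_try and the RETRY_ON table are hypothetical; the table only restates the retry behaviour described above and is not taken from Kong's code.

```python
# Standard DNS record type numbers (IANA assignments, not Kong-specific).
RECORD_TYPES = {1: "A", 5: "CNAME", 33: "SRV", 41: "OPT"}

# Retry behaviour as described in the comment above (illustrative only).
RETRY_ON = {
    "dns server error: 2 server failure": True,
    "dns server error: 3 name error": False,
}


def decode_try(entry):
    """Split one try-list entry into name, record type and status steps."""
    target, _, status = entry.partition(" - ")
    name, _, qtype = target.rpartition(":")
    if qtype.isdigit():
        # Fall back to the raw number for types not in the table.
        qtype = RECORD_TYPES.get(int(qtype), qtype)
    steps = status.split("/")
    outcome = steps[-1]
    return {
        "name": name,
        "type": qtype,
        "steps": steps,
        "outcome": outcome,
        # None means the description above does not say whether Kong retries.
        "retry": RETRY_ON.get(outcome),
    }
```

Read this way, the log shows A, SRV and CNAME lookups being attempted for both the bare name and the search-domain variant, with the "41: removed" step appearing only on some of them.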
Since there are a lot of server errors, you should probably check the dns server logs.
this is weird:
2019/06/22 13:45:53 [debug] 1#0: [lua] client.lua:453: init(): [dns-client] noSynchronisation = true
Since you have:
"dns_no_sync":false,
in the yaml.
It seems the logs and the yaml do not match?
Is there a way to see which DNS server IP Kong is making the lookup request to?
Will Kong retry against the same server (i.e. let's say the first server in the resolver list), or will it fail and then retry on the next nameserver in the list? As per my second post, one of the nameserver IPs is not responding (we're looking into the "Docker on Windows networking issue" here), however the second nameserver responds perfectly every time we have tested via nslookup/dig/ping from within the Kong container. When setting the healthy DNS server as the KONG_DNS_RESOLVER, however, no DNS requests from Kong appear to succeed. Deleting the faulty DNS server from resolv.conf and reloading Kong has the same effect. Strange, and we're trying to understand it.
We're trying to track down decent logging of DNS requests in Service Fabric to enlighten us.
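One way to see how each nameserver behaves on its own, independent of Kong and of the system resolver, is to send a plain A query to every entry in resolv.conf and print the response code. The following is a minimal sketch using only the Python standard library. The helper names (parse_nameservers, build_query, response_code, probe, report) are made up for illustration, and it assumes Python is available somewhere that can reach the same nameservers; the stock Kong image may not ship it.

```python
import socket
import struct


def parse_nameservers(resolv_conf_text):
    """Return the nameserver IPs in the order they appear in resolv.conf."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(hostname, query_id=0x1234, qtype=1):
    """Build a plain DNS query (no EDNS) for the given name; qtype 1 = A."""
    # Header: id, flags (only "recursion desired" set), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN


def response_code(packet):
    """Extract the RCODE (0 = NOERROR, 2 = SERVFAIL, 3 = NXDOMAIN) from a reply."""
    flags = struct.unpack("!H", packet[2:4])[0]
    return flags & 0x000F


def probe(server, hostname, timeout=2.0):
    """Send one UDP query to a single nameserver and describe the outcome."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(hostname), (server, 53))
            packet, _ = sock.recvfrom(4096)
        except OSError as exc:  # covers timeouts and unreachable hosts
            return "no response ({})".format(exc)
    return "rcode {}".format(response_code(packet))


def report(hostname, path="/etc/resolv.conf"):
    """Probe every nameserver listed in resolv.conf, one at a time."""
    with open(path) as handle:
        for server in parse_nameservers(handle.read()):
            print(server, probe(server, hostname))
```

Calling report("upstream.hostname") prints one line per nameserver, which should make it obvious which server times out and which one answers with SERVFAIL or NOERROR.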
So if you go into your container, you should be able to find the file /usr/local/share/lua/5.1/resty/dns/client.lua. Edit that file and replace all occurrences of --[[ by ---[[.
Then do a kong reload to effectuate the change. This should now provide very verbose logging of all DNS queries. Recreate the problem and show the logs.
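For context: in Lua, --[[ opens a block comment, whereas ---[[ is just a single-line comment, so the code on the following lines becomes live. If you would rather script the edit than do it by hand, a minimal Python sketch could look like the following. The function names are hypothetical, and unlike a blind search-and-replace it skips openers that already have three dashes, so running it twice does not keep adding dashes.

```python
import re

# Match "--[[" only when it is not already preceded by a dash,
# so an existing "---[[" is left untouched.
BLOCK_COMMENT_OPEN = re.compile(r"(?<!-)--\[\[")


def enable_debug_blocks(lua_source):
    """Turn every '--[[' block-comment opener into the line comment '---[['."""
    return BLOCK_COMMENT_OPEN.sub("---[[", lua_source)


def patch_file(path):
    """Rewrite the file in place and report whether anything changed."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    patched = enable_debug_blocks(original)
    if patched != original:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(patched)
    return patched != original
```

You would call patch_file on the client.lua path mentioned above and then do the kong reload. Keeping a copy of the original file first makes it easy to revert.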
Trying to figure out the "41: removed" in the logs.
This code generates that message: https://github.com/Kong/lua-resty-dns-client/blob/master/src/resty/dns/client.lua#L633-L637
So it tells me that the record type is 41 and the name field is an empty string. Checking that reveals that 41 is an "option" pseudo record used for EDNS (see https://en.wikipedia.org/wiki/List_of_DNS_record_types#Other_types_and_pseudo_resource_records).
So it seems to me you are using EDNS, which Kong does not support.
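To illustrate why such a record shows up with type 41 and an empty name, here is a small Python sketch that builds an EDNS OPT pseudo-record by hand and parses its header back. The layout follows the EDNS(0) specification: the name is the root (a single zero byte), the type is 41, and the class field is reused to carry the advertised UDP payload size. The function names are made up for illustration and this is not how the Kong client parses responses.

```python
import struct

OPT_TYPE = 41  # EDNS "option" pseudo-record


def build_opt_record(udp_payload_size=4096):
    """Build a minimal OPT record with no options attached."""
    # NAME: root domain, encoded as one zero byte (an empty name).
    # TYPE: 41. CLASS: UDP payload size. TTL: extended rcode, version, flags (all 0).
    # RDLENGTH: 0, since no options follow.
    return b"\x00" + struct.pack("!HHIH", OPT_TYPE, udp_payload_size, 0, 0)


def parse_record_header(data):
    """Return (name, type, class) for a record with an uncompressed name."""
    labels = []
    pos = 0
    while data[pos] != 0:
        length = data[pos]
        labels.append(data[pos + 1:pos + 1 + length].decode("ascii"))
        pos += 1 + length
    pos += 1  # skip the terminating zero byte of the name
    rtype, rclass = struct.unpack("!HH", data[pos:pos + 4])
    return ".".join(labels), rtype, rclass
```

Parsing build_opt_record(8192) yields an empty name, type 41 and a class value of 8192, which lines up with the "udp: 8192" shown in the OPT PSEUDOSECTION of the dig output earlier in this thread.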
Hi Thijs, I attached the verbose kong-lua-dbg-logs.txt log output. Could you please have a quick look to see whether there is anything else apart from the record type 41? Thanks (I work with James)
@psrank @jwardle just got pinged on this, apologies for the long delay. Did you ever resolve the issue?
I had a look at the logs, but couldn't make any sense of it. Still seems to me that EDNS is expected?
Closing due to no response.
Summary
Kong 1.2.0 intermittently fails to resolve the upstream service's hostname, and therefore fails to process the request, resulting in either an "An unexpected error occurred" or a "name resolution failure" message being returned to the client. The resolution failure appears to last for 5-10 seconds, after which requests are serviced as expected for ~30 seconds, and then it fails again. This cycle is continuous and I do not understand what is driving it yet.
The kong Docker image 1.2.0-alpine is being used, deployed within a Standalone Azure Service Fabric cluster of 3 nodes. The DnsService is running within the cluster, and when queried from the localhost or from within the Kong container the upstream service's hostname is successfully resolved, i.e. calling nslookup upstream.host always succeeds, and the same goes for a curl upstream.host:8100. The resolv.conf of the Kong container has the correct nameservers inherited from Service Fabric, as expected given that nslookup etc. succeeds every time.
I am using a kong.yml (below) declarative configuration in DB-less mode with a basic configuration.
The error I am repeatedly seeing is (also in the logs below):
Steps To Reproduce
kong.yml
(in Service Fabric, however I don't believe this is the cause currently, as DNS settings appear to be working consistently whilst this issue is being experienced from a Kong perspective)
Additional Details & Logs
(1.2.0)
(1.2.0-alpine), running on Azure SF Windows 2019
Note: The below configuration & logs have been sanitised for sensitive info.
Kong YML
Kong Configuration from Admin endpoint
Kong Logs
Note: When you see a 404 response in the below logs this is a successful response from the upstream backend server being returned. Disregard the fact it's actually a 404 error.