coredns / coredns

CoreDNS is a DNS server that chains plugins
https://coredns.io
Apache License 2.0

Forward plugin sometimes returns FORMERR #1899

Closed zloo closed 6 years ago

zloo commented 6 years ago

Hi there,

we're using CoreDNS in our Kubernetes clusters as the resolver. We have a rather basic setup:

    . {
      errors
      health
      prometheus 0.0.0.0:9153
      log . "{remote}:{port} - [{when}] {>id} {type} {class} {name} {proto} {size} {>do} {>bufsize} {rcode} {>rflags} {rsize} {duration}" {
        class error
      }
      cache 30
      kubernetes tt-k8s1.ko.seznam.cz 172.16.0.0/12 {
        tls /coredns/tls/coredns.pem /coredns/tls/coredns-key.pem /coredns/tls/ca.pem
        upstream 10.250.20.222 10.250.50.222
      }
      forward . 10.250.20.222 10.250.50.222
    }

with 10.250.20.222 and 10.250.50.222 being our internal DNS servers, handling recursive DNS queries for all other domains - internal domains as well as all outgoing queries.

We're using coredns 1.1.2 at the moment.

Sometimes FORMERR responses start showing up and keep appearing for a few minutes, then disappear after less than 10 minutes. They show up like this in the log:

10.64.38.104:59837 - [27/Jun/2018:12:13:14 +0200] 56625 A IN www.mapy.cz. udp 29 false 512 FORMERR qr,rd 29 205.372µs
10.66.206.49:37658 - [27/Jun/2018:12:12:51 +0200] 31387 AAAA IN grafana.com. udp 29 false 512 FORMERR qr,rd 29 219.374µs

It seems that the requests are served from cache, judging by the µs latencies - if they actually went to the upstream DNS servers, it would take at least a millisecond.

I managed to do a raw dig for a failing domain when the issue was happening:

$ dig puzzle.ng.seznam.cz @10.65.142.169

; <<>> DiG 9.10.3-P4-Ubuntu <<>> puzzle.ng.seznam.cz @10.65.142.169
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 46786
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; WARNING: EDNS query returned status FORMERR - retry with '+noedns'

;; QUESTION SECTION:
;puzzle.ng.seznam.cz.       IN  A

;; Query time: 1 msec
;; SERVER: 10.65.142.169#53(10.65.142.169)
;; WHEN: Wed Jun 27 12:12:56 CEST 2018
;; MSG SIZE  rcvd: 37

and I tried a subsequent query with +noedns:

$ dig +noedns puzzle.ng.seznam.cz @10.65.142.169

; <<>> DiG 9.10.3-P4-Ubuntu <<>> +noedns puzzle.ng.seznam.cz @10.65.142.169
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 47898
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;puzzle.ng.seznam.cz.       IN  A

;; Query time: 1 msec
;; SERVER: 10.65.142.169#53(10.65.142.169)
;; WHEN: Wed Jun 27 12:13:05 CEST 2018
;; MSG SIZE  rcvd: 37

I'm 100% certain the upstream DNS resolvers were working correctly, because I issued the same query directly against them and they returned the answer correctly.

I also don't see anything else in the log, just the FORMERR errors. I would expect to see some upstream connection errors if CoreDNS considered the upstream servers unavailable.

I've only seen this happen to a single coredns instance in the cluster at the same time.

It seems to me that the CoreDNS instance thinks that the upstream DNS servers are unavailable, when in reality they are fine. I'd appreciate any help or advice :) Thank you

miekg commented 6 years ago

That shouldn't result in a FORMERR, though. Is it possible to capture a tcpdump?


zloo commented 6 years ago

I'll try to, though this usually stops happening faster than I can react and dumping all of the DNS traffic all the time is impossible. I'll let you know once I have the dump.

miekg commented 6 years ago

Pondering whether CoreDNS could automatically write a pcap in such cases.

zloo commented 6 years ago

that would be an awesome feature!

I forgot to mention that this usually happens only once or twice per week, so it might take some time to catch the tcpdump.

also, thank you and the whole team for this awesome project :)

miekg commented 6 years ago

> that would be an awesome feature!
>
> I forgot to mention that this usually happens only once or twice per week, so it might take some time to catch the tcpdump.

per week! OK.

> also, thank you and the whole team for this awesome project :)

Thanks!

miekg commented 6 years ago

What should dig puzzle.ng.seznam.cz return in the normal case, btw? I'm getting NXDOMAIN, so I'm assuming it is not publicly available.

zloo commented 6 years ago

yes, that's an internal domain, which is actually a CNAME to another domain. But the same happens with all the other domains as well - some of them internal, some publicly available, some from the open internet / outside of our company.

I've combed through the logs, and most (95%+) of the domains in the logs are ones resolved through these upstream DNS servers, but sometimes there's even an internal Kubernetes domain. However, all of the internal Kubernetes domains that had the FORMERR log line were non-existent domains.

It seems that this is not related to the forward plugin after all.

miekg commented 6 years ago

Hmmm. Odd. The cache doesn't cache error responses: https://github.com/coredns/coredns/blob/master/plugin/cache/cache.go#L67

So that would imply the original response is correct from that perspective, but is then garbled. The EDNS error from dig points to something like DNS Cookies or some other EDNS0 option that we don't handle completely correctly.

zloo commented 6 years ago

It might also be something weird with our upstream DNS servers. There is a new initiative - https://dnsflagday.net/ - that seeks to eliminate workarounds for broken DNS servers; we might be hitting something like that with our upstream DNS servers. I have no idea how to debug that, though.

miekg commented 6 years ago

Yeah, I know. It might just as well be CoreDNS that's broken here, although I don't suspect that.

miekg commented 6 years ago

But coming back to what you see: CoreDNS hands out an answer that is FORMERR for the client (at least in the EDNS0 part, if I interpret that correctly). The EDNS0 record is not cached, but added when we reply. What I'm trying to say is that this should never happen, but it obviously does. Extremely hard to tell with a tcpdump.

zloo commented 6 years ago

you mean "Extremely hard to tell without a tcpdump", right?

I agree with everything you said though.

rdrozhdzh commented 6 years ago

@zloo, do you see the FORMERR for each and every query during those several minutes, or do you get the error only for some queries while other queries work fine?

chrisohaver commented 6 years ago

Not related at all to the FORMERR issue you are seeing... But if I recall correctly, the tls directive in the kubernetes plugin is ignored unless you are running CoreDNS from outside the cluster. In this case, you are running CoreDNS in a pod in the cluster, so it is using the certificates set up in the pod. So you should be able to remove that and shorten the config a bit.
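
For illustration, the kubernetes block from the Corefile above would then shrink to something like this (a sketch only; everything else in the config stays as posted):

    kubernetes tt-k8s1.ko.seznam.cz 172.16.0.0/12 {
      upstream 10.250.20.222 10.250.50.222
    }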

zloo commented 6 years ago

@rdrozhdzh that's what i'm going to find out next time i catch it. I was only able to try the two digs above, and then the instance healed itself somehow.

@chrisohaver thanks :)

miekg commented 6 years ago

ha, wait (I was just scanning the code for something unrelated), we do this:

        // Check if the reply is correct; if not return FormErr.
        if !state.Match(ret) {
            formerr := state.ErrorMessage(dns.RcodeFormatError)
            w.WriteMsg(formerr)
            return 0, nil
        }

This means the reply isn't what we are expecting, meaning someone is doing something nasty.
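
For context, a check like state.Match is essentially asking "does this reply actually belong to the request we sent?". Here is a minimal sketch of that kind of check using the github.com/miekg/dns types - an illustration of the idea, not the actual CoreDNS implementation:

    package forwardcheck

    import (
        "strings"

        "github.com/miekg/dns"
    )

    // replyMatches reports whether reply plausibly answers req: it must be
    // marked as a response, carry the same message ID, and ask the same question.
    func replyMatches(req, reply *dns.Msg) bool {
        if reply == nil || !reply.Response {
            return false
        }
        if reply.Id != req.Id {
            return false
        }
        if len(req.Question) != 1 || len(reply.Question) != 1 {
            return false
        }
        q, r := req.Question[0], reply.Question[0]
        return strings.EqualFold(q.Name, r.Name) && q.Qtype == r.Qtype && q.Qclass == r.Qclass
    }

If a reply fails a check like this, the upstream (or something on the path) sent back a message that doesn't correspond to the query, which is why the code above answers the client with FORMERR.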

zloo commented 6 years ago

That's interesting. I stumbled upon this part before. I couldn't find out what truncated actually does, though.

It would also be helpful to log upstreamErr - if the FORMERRs actually come from this code, we could use a bit more logging here :) If I understand the code correctly, the code at lines 149-151 is never called.

miekg commented 6 years ago

No, there is a continue way up, so we can fall out of the for loop when the deadline is exceeded and execute this bit.

krzysztof-bronk commented 6 years ago

I've stumbled upon the exact same issue, except in one case the error was persistent. I had 5 identical CoreDNS pods and 1 of them was consistently throwing FORMERR. I also thought it was related to forward/proxy. Nothing else in the system/chain seemed bugged. Pod recreation fixed the issue...

miekg commented 6 years ago

OK, so we should log here (the problem is that this opens a DoS vector, but we can probably live with that for now). And maybe we should retry the query when we see this.

It's really odd that caching seems to be involved here, though. I'll try to set up a test to replicate it.

miekg commented 6 years ago

As said previously, a tcpdump would help; you can potentially patch in #1902.

miekg commented 6 years ago

I've merged #1902. If you compile a new binary (from master) and enable debug, something should - hopefully - show up.
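
For anyone following along: enabling debug just means adding the debug plugin to the server block. A trimmed sketch of what that looks like for the Corefile from this issue (only the relevant plugins shown):

    . {
      debug
      errors
      log
      cache 30
      forward . 10.250.20.222 10.250.50.222
    }

With debug enabled, the extra output added by #1902 should show up in the logs.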

zloo commented 6 years ago

I just returned to this issue, and I see that you already released version 1.2.0 today. I assume that it contains #1902 as well?

chrisohaver commented 6 years ago

I assume that [1.2.0] contains #1902 as well?

yes

krzysztof-bronk commented 6 years ago

I've just had it happen in production, but I did not have a dump prepared, so all I have for you is similar to what zloo posted.

Again, a single pod out of 5 was having this issue. Domain resolution was pretty much randomly returning FORMERR. Note: it was working fine some of the time.

; <<>> DiG 9.9.4-RedHat-9.9.4-51.amzn2.2 <<>> my.internal.domain @172.30.80.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: FORMERR, id: 36471
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; WARNING: EDNS query returned status FORMERR - retry with '+noedns'

;; QUESTION SECTION:
;my.internal.domain.       IN  A

;; Query time: 2 msec
;; SERVER: 172.30.80.1#53(172.30.80.1)
;; WHEN: Thu Jul 12 15:42:41 CEST 2018
;; MSG SIZE  rcvd: 42

=============================================================

; <<>> DiG 9.9.4-RedHat-9.9.4-51.amzn2.2 <<>> my.internal.domain @172.30.80.1
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22325
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4000
;; QUESTION SECTION:
;my.internal.domain.       IN  A

;; ANSWER SECTION:
my.internal.domain. 0   IN  CNAME   lb.service.consul.
lb.service.consul.  0   IN  A       10.147.0.152
lb.service.consul.  0   IN  A       10.147.0.62
lb.service.consul.  0   IN  A       10.147.0.103

;; Query time: 2 msec
;; SERVER: 172.30.80.1#53(172.30.80.1)
;; WHEN: Thu Jul 12 15:42:50 CEST 2018
;; MSG SIZE  rcvd: 159

miekg commented 6 years ago

My gut feeling is that some upstream is acting up sometimes, or something. Again, hard to say without more evidence. Can you run with debug on for a couple of days? It should not really impact performance or anything.


zloo commented 6 years ago

we're currently running 1.2.0 with debug on all our instances. When the issue pops up again, I'll supply the logs.

miekg commented 6 years ago

I should have added logging of which upstream replied here. Anyway, let's hope something turns up soon, so we can get to the bottom of this.

krzysztof-bronk commented 6 years ago

I got inside the pod and ran a bunch of nslookups against 127.0.0.1 and the upstreams in /etc/resolv.conf.

127.0.0.1 (CoreDNS) was throwing FORMERR about 60% of the time. Sometimes nslookup would FORMERR straight away; sometimes it would return the CNAME and even the A records and then still end up with FORMERR. Sometimes there would be a warning about an ID mismatch.

Resolution against the upstreams directly was 100% successful at the same time...

Not terribly useful, but it might be telling that the upstreams are fine, and also that the other pods are 100% fine at the same time.

miekg commented 6 years ago

Can you tcpdump?

miekg commented 6 years ago

If I suspect anything (hard to say without more evidence), it could possibly be the connection cleanup - it seems we're getting out of sync. Again, not sure if this is the case!

Can you set expire in config to an insanely high value? expire 3000h or something?
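
For reference, expire is an option of the forward plugin (it controls how long cached upstream connections are kept). Applied to the Corefile from this issue, the suggestion would look roughly like this:

    forward . 10.250.20.222 10.250.50.222 {
      expire 3000h
    }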


miekg commented 6 years ago

I've got a PR that simplifies this code quite a bit.


zloo commented 6 years ago

Let me wait until Monday to see whether the issue is magically fixed somehow, so we know whether it still manifests after the upgrade.

Regarding logging which upstream sent the reply - it really doesn't matter in our case, as they all use the same IP addresses.

miekg commented 6 years ago

See https://github.com/coredns/coredns/pull/1973

So, inconclusive local testing seems to show an improvement.

I wonder how I can properly unit test this.

zloo commented 6 years ago

Just following up - the issue has not popped up since upgrading to 1.2.0. I'll let you know as soon as it manifests.

miekg commented 6 years ago

Odd... Can't recall any major changes. From what version did you upgrade again?


zloo commented 6 years ago

we were using 1.1.2

I didn't see any major updates either; maybe the bug just hasn't manifested yet. Let's give it some more time - it usually manifested once per week, so it's not that uncommon not to see it for this long.

rdrozhdzh commented 6 years ago

I suspect that #1889 might fix the problem.

miekg commented 6 years ago

If that is true, we are probably wrong in returning FORMERR for such replies.

rdrozhdzh commented 6 years ago

just found a link to https://kevinlocke.name/bits/2017/01/20/formerr-from-microsoft-dns-server-for-dig/ - could this be related to our problem?
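
For background, that article describes a DNS server returning FORMERR when dig sends an EDNS COOKIE option, which newer dig versions do by default. A quick way to spot-check whether an upstream chokes on EDNS in general or on the COOKIE option specifically (the +nocookie flag assumes a reasonably recent dig, roughly 9.11+; substitute your own upstream address):

    dig my.internal.domain @10.250.20.222            # default query: EDNS0, plus COOKIE on newer dig
    dig +nocookie my.internal.domain @10.250.20.222  # EDNS0 without the COOKIE option
    dig +noedns my.internal.domain @10.250.20.222    # plain DNS, no OPT record at all

If the first command fails while the +nocookie one succeeds, the upstream most likely mishandles the COOKIE option rather than EDNS0 as a whole.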

miekg commented 6 years ago

Thanks for the details!

FORMERR on EDNS can be a possibility, but we don't use funky options, just the normal standard ones - this could still be an issue, but it would be weird.

Glad 1.2.0 didn't fix it, because that would be super strange.


miekg commented 6 years ago

We should probably retry without EDNS if we get a FORMERR from an upstream.
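
As an illustration of that retry idea (a sketch using the github.com/miekg/dns client, not the forward plugin's actual code): send the query with an OPT record and, if the upstream answers FORMERR, strip EDNS and ask once more.

    package main

    import (
        "fmt"
        "log"

        "github.com/miekg/dns"
    )

    // queryWithEDNSFallback sends an EDNS0 query and, if the server answers
    // FORMERR, retries the same question once without an OPT record.
    func queryWithEDNSFallback(name, server string) (*dns.Msg, error) {
        c := new(dns.Client)

        m := new(dns.Msg)
        m.SetQuestion(dns.Fqdn(name), dns.TypeA)
        m.SetEdns0(4096, false) // advertise a 4096-byte UDP buffer, DO bit unset

        r, _, err := c.Exchange(m, server)
        if err != nil {
            return nil, err
        }
        if r.Rcode != dns.RcodeFormatError {
            return r, nil
        }

        // The upstream rejected the EDNS0 query; retry as plain DNS.
        plain := new(dns.Msg)
        plain.SetQuestion(dns.Fqdn(name), dns.TypeA)
        r, _, err = c.Exchange(plain, server)
        return r, err
    }

    func main() {
        // 10.250.20.222 is one of the upstreams from this issue; substitute your own.
        r, err := queryWithEDNSFallback("example.org", "10.250.20.222:53")
        if err != nil {
            log.Fatal(err)
        }
        fmt.Println(dns.RcodeToString[r.Rcode])
    }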


zloo commented 6 years ago

I'm not sure if working around upstream server issues is the correct thing to do. That could lead down a rabbit hole, which is what led some DNS server vendors to drop support for these workarounds, as I linked before - https://dnsflagday.net/. If you still want to work around these issues, I think it shouldn't be the default, but something explicit in the configuration, like workaroundBrokenUpstreamServers.

miekg commented 6 years ago

Is this FORMERR thing tied to Windows upstreams? Or is it a more general problem?

The DNS spec actually has text saying you should retry without EDNS on FORMERR, so it's not super whacky.

zloo commented 6 years ago

our upstream resolvers run on Linux, although they do resolve anything on the internet, so they could be reaching out to Windows servers. The first hop is always Linux, though, so the issue should not be tied to Windows upstreams.

krzysztof-bronk commented 6 years ago

Ours are Windows indeed

miekg commented 6 years ago

this will be fun then: https://ednscomp.isc.org/compliance/summary.html

Can some of you run your upstream servers through that test and report back what they do?

zloo commented 6 years ago

hello again,

my last followup, as I'm leaving the company tomorrow: since running 1.2.0, the issue has never manifested.

my colleagues are monitoring the issue and if it resurfaces, they should report here.

we're also in the process of migrating our upstream DNS servers to other software for unrelated reasons, so if our upstream DNS was causing these issues, they should also disappear soon.

thank you everyone for your help.

miekg commented 6 years ago

thanks for pinging back, appreciated.

miekg commented 6 years ago

Haven't seen this come back. Closing.