On first contact with a server, BIND will advertise an EDNS buffer size of 512, as this is expected to have the best chance of success on the first try. If successful, it then increases the advertised buffer size on subsequent contacts, to 1232, then 1432, then 4096 (but never exceeding the configured value of edns-udp-size).
It used to drop to plain DNS if it failed with EDNS@512. That step was removed after the 2019 flag day but the rest of the code is still there.
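Purely as an illustration of the progression just described (a minimal sketch, not BIND's actual code; the function and variable names are made up):

```python
# Hypothetical sketch of the advertised-size progression described above;
# not BIND's implementation.

ADVERTISED_SIZES = [512, 1232, 1432, 4096]

def next_advertised_size(current, edns_udp_size=4096):
    """Return the next EDNS buffer size to advertise to a server that has
    answered successfully, never exceeding the configured edns-udp-size."""
    for size in ADVERTISED_SIZES:
        if size > current:
            return min(size, edns_udp_size)
    return min(ADVERTISED_SIZES[-1], edns_udp_size)

# Example: first contact advertises 512; each success steps up the ladder.
size = 512
for _ in range(4):
    print(size)
    size = next_advertised_size(size, edns_udp_size=1432)
# prints 512, 1232, 1432, 1432
```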
Willem Toorop wrote on 2019-09-05 01:26:
@vttale Incorrect. See @ralphdolmans' earlier message:
Unbound currently starts, by default, with 4096 and if that fails will try 1232 for IPv6 and 1472 for IPv4
what's the reasoning? 1232 allows space for IPv6 extension headers, but 1472 does not allow space for IPv4 options.
shouldn't our numbers for both IPv4 and IPv6 be based on either the largest-possible or smallest-possible option sizing?
in IPv6 with no extension headers, we'd pick 1452. in IPv4 with maximum option size, we'd pick 1432.
so i could understand 1452 (V6) and 1472 (V4), or 1232 (V6) and 1432 (V4), but i am struggling with 1232 (V6) and 1472 (V4). help?
-- P Vixie
Well, that's the past. I'd focus on the future, which is currently proposed to use the same default limit for both (1232 or close to that).
Vladimír Čunát wrote on 2019-09-05 10:26:
Well, that's the past. I'd focus on the future, ...
i'd like to understand the mistakes of the past often enough to avoid repeating them. sorry to distract!
which is currently proposed to use the same /default/ limit for both (1232 or close to that).
that's an arbitrary limit which by definition is not merit-based. i'm not going to argue for a merit-based approach, but i will say that if we're going to be fully arbitrary, we could just as easily pick 1200 or 1024, either of which is easier to remember. remember, CF's soft limit on responses is 512, regardless of the initiator's bufsize.
i'm fine with evan's text, except i'd like the limit to be described as arbitrary, and except the part where this is wrongly referred to as a flag day, which it is not. if we're going to change EDNS to set DF=1, then we could call this a flag day. and if we continue to call these "flag days" when they aren't, we'll be "crying 'wolf!'".
-- P Vixie
i'm fine with evan's text, except i'd like the limit to be described as arbitrary
I don't think it is arbitrary. The IPv6 standard requires networks to support at least 1280-byte frames, so we can assume at least that much will be available on modern networks. Subtracting the IPv6 and UDP headers leaves 1232. This might tax some V4-only networks with more TCP fallback, but IMHO those are the right ones to bear the burden.
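For reference, the arithmetic behind the various numbers floated in this thread (a quick illustrative computation, nothing normative):

```python
# Header arithmetic behind the candidate EDNS buffer sizes discussed here.
IPV6_MIN_MTU = 1280   # RFC 8200: every IPv6 link must carry 1280-byte packets
ETHERNET_MTU = 1500
IPV6_HDR, IPV4_HDR, IPV4_HDR_MAX, UDP_HDR = 40, 20, 60, 8

print(IPV6_MIN_MTU - IPV6_HDR - UDP_HDR)      # 1232: IPv6 minimum MTU, no extension headers
print(ETHERNET_MTU - IPV6_HDR - UDP_HDR)      # 1452: 1500-byte MTU, IPv6, no extension headers
print(ETHERNET_MTU - IPV4_HDR - UDP_HDR)      # 1472: 1500-byte MTU, IPv4, no options
print(ETHERNET_MTU - IPV4_HDR_MAX - UDP_HDR)  # 1432: 1500-byte MTU, IPv4, maximum options
```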
and except the part where this is wrongly referred to as a flag day, which it is not.
I am increasingly sympathetic to the argument against calling this a flag day, specifically because so many people seem to have assumed we're doing something that we aren't. This isn't like last year's event; very few domains will fail to resolve as a result of anything we're doing. However, I also think "DNS Flag Day" and the dnsflagday.net website, having been so successful last year, are a familiar and useful brand.
In my proposed revision I made a few feints in the direction of treating it that way - capitalizing it, referring to it by the proper name DNS Flag Day rather than generically as a flag day, etc. I'd like to suggest we go further in that direction, and refer to the 2020 event, not as a second flag day, but as what it really is - a set of recommendations and default configuration settings that have been agreed upon through continued consultation between the participants in the 2019 flag day. I think we can spin it to take advantage of the familiarity of "DNS Flag Day" without inducing unnecessary blood pressure spikes in people who know what "flag day" normally means.
However, that conversation probably belongs in a different github issue; this one is supposed to be about the recommended default buffer size.
On Thursday, 5 September 2019 20:25:48 UTC Evan Hunt wrote:
i'm fine with evan's text, except i'd like the limit to be described as arbitrary

I don't think it is arbitrary. The IPv6 standard requires networks to support at least 1280-byte frames, so we can assume at least that much will be available on modern networks. Subtracting the IPv6 and UDP headers leaves 1232.
understood. this is not arbitrary, though it is misleading (*).
This might tax some V4-only networks with more TCP fallback, but IMHO those are the right ones to bear the burden.
if you're going to use an ipv6-derived value for ipv4 packets, it will be arbitrary. in IPv4 there are two limits, 68 (minimum link MTU) and 576 (minimum reassembly capability). the 1280 number from ipv6 corresponds to the 68 number from ipv4.
i'm not saying don't be arbitrary. i'm saying be honest about it. we could just as easily subtract 48 from 1500 and tell everyone to use that default -- since 40+8 is larger than 20+8 and we're trying to avoid using different numbers for V4 and V6 since that's not a common-enough config knob.
(*) i say that 1232 is misleading because there often will be IP6 extension headers or IP options, which are not accounted for in this calculation, but because most link MTUs are 1500 or larger, the gap between 1232 and 1500 is available as "slop" to contain IP4 options or IP6 extension headers. so, 1232 is not arbitrary, since we know how it's derived. but it is misleading, because the reason it works has nothing to do with how it's derived. if there were a significant number of MTU 1280 links, then 1232 would fail, often.
again, i'm not arguing for merit. i'm arguing for honesty about arbitrariness.
-- Paul
If this change will result in little breakage, it will in part be because pressure is already being exerted on authoritative DNS services to fix themselves.
1.1.1.1 uses 1452.
As Habbie said earlier, PowerDNS recently started using 1232. (This has already been deployed to 9.9.9.9's PowerDNS Recursor fleet, but they use other implementations too.)
IPv6 packets with extension headers have problems passing through the internet, similar to fragmented packets, I've heard. Trying to minimize these problems is among the primary motivations of "Flag Day 2020".
On Friday, 6 September 2019 05:54:27 UTC Vladimír Čunát wrote:
IPv6 packets with extension headers have problems passing through the internet, similar to fragmented packets, I've heard. Trying to minimize these problems is among the primary motivations of "Flag Day 2020".
this reverses causality in one way, and overstates the case in another way.
one reason why some packets with extension headers aren't delivered is the assumptions people make about extension headers being bad, and the assumptions other people make about the size of an ipv6 header or full ipv6 packet. in other words, well-intentioned teams like this one sometimes fulfill their own prophecies and the prophecies of others, by believing in them too soon or without critical thought.
not all extension headers are equally dangerous nor equally mistreated. we should not say more than we know about whether an extension header or all extension headers are useful or are used or will be useful or will be used.
if CF's 1.1.1.1 is using ~1450 for its cache miss transactions, that teaches us something. but since CF enforces ~512 on its authoritative answers, that teaches us something else.
it's clear that we're going to use an arbitrary value. if we're voting, i like 1200 and 1024, because extremism of this kind will motivate a better fix.
-- Paul
For what it's worth: tinydnssec (a fork of djbdns with dnssec support) has a current upper limit of 4000 bytes and a minimum of 1220 bytes.
That is, if the resolver sends out an EDNS message of whatever buffer size and doesn't get a response (say, because a frag was dropped so the OS couldn't reassemble the datagram) then it's just treated like a normal DNS timeout and the resolver moves on to the next server, if any.
PowerDNS never had buffer size fallbacks; if the configured size timed out, before the flag day we would retry without EDNS, and since the flag day we just consider that server down and indeed move on to the next one.
min_len = do_dnssec ? 1220 : 512;
Does this ignore the client bufsize if it's lower than 1220?
1232 allows space for IPv6 extension headers
It does not: 1280 - 40 - 8 = 1232
edit: I see now that this was answered already
gdnsd already caps it to 1024:
https://github.com/gdnsd/gdnsd/blob/d076ff6ea15e3a49fc0bcd8bf73421e08a0c34ed/src/dnswire.h#L30
// However, we only advertise a buffer size of 1024, to be absolutely sure that
// even in the face of an IPv6 min-MTU link and lots of extra headers and
// whatnot, it will always be a single fragment.
// We use this size as our recvmsg() limit as well, discarding anything larger
// to save ourselves processing it. And in the TCP case, we immediately close
// if a size greater than this is sent as the message length field.
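As an aside, the receive-side cap gdnsd describes (a fixed recvmsg() limit, discarding anything larger) can be illustrated with a short sketch; this is not gdnsd's code, just a generic demonstration of the idea using MSG_TRUNC (availability of that flag is platform-dependent):

```python
# Illustrative only: cap UDP reads at a fixed size and discard anything larger,
# in the spirit of the gdnsd comment quoted above.  Not gdnsd's implementation.
import socket

MAX_MSG = 1024  # never accept more than this over UDP

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("127.0.0.1", 5300))

data, ancdata, msg_flags, addr = sock.recvmsg(MAX_MSG)
if msg_flags & socket.MSG_TRUNC:
    # Datagram was larger than MAX_MSG: the kernel truncated it, so drop it.
    pass
else:
    print(f"{len(data)}-byte query from {addr}")
```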
And in the TCP case, we immediately close if a size greater than this (1024) is sent as the message length field.
I can't immediately see this as a real problem for authoritative-only servers, but even so the sentence makes me a little anxious.
Loganaden Velvindron wrote on 2019-09-07 10:00:
gdnsd already caps it to 1024: ...

// However, we only advertise a buffer size of 1024, to be absolutely sure that
// even in the face of an IPv6 min-MTU link and lots of extra headers and
// whatnot, it will always be a single fragment.
// We use this size as our recvmsg() limit as well, discarding anything larger
// to save ourselves processing it. And in the TCP case, we immediately close
// if a size greater than this is sent as the message length field.

the important limit is what you impose on the answers you send. are you using an effective buffer size which is MIN(1024, query_bufsize) to guide the construction of your answers?
i think your described reasoning for 1024 is perfect, since the 1280 limit doesn't really exist, but there will be IP options often, and encapsulations sometimes.
-- P Vixie
@vixie, not my reasoning. It's the comment from the gdnsd developer :-)
1024 default: I'd be really careful not to switch a notable amount of traffic to TCP, and around this value I'd be afraid.
Details: many zones (e.g. the .com TLD) will need around a thousand bytes for NXDOMAIN answers with +dnssec, and NXDOMAIN surely isn't an exceptional condition (7-8% in .cz last month). On the resolver side this might even provide a (further) incentive against validating DNSSEC. After most of the world switches to elliptic-curve algorithms it might be fine, but I don't think we'll be there in 2020 yet. (For example, in .cz all answers are well below 800 bytes, except ANY.)
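One way to sanity-check such numbers (a rough sketch using dnspython; the probe name, resolver address, and reported sizes are only examples):

```python
# Sketch: measure the wire size of a DNSSEC-enabled NXDOMAIN response.
# Requires dnspython; the queried name is a made-up nonexistent label.
import dns.message
import dns.query

q = dns.message.make_query("name-that-surely-does-not-exist-1234.com.", "A",
                           use_edns=0, payload=4096, want_dnssec=True)
r = dns.query.udp(q, "8.8.8.8", timeout=5)   # any DNSSEC-aware resolver
print(r.rcode(), len(r.to_wire()))           # signed .com NXDOMAINs often land near 1000 bytes
```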
On Sunday, 8 September 2019 08:30:02 UTC Vladimír Čunát wrote:
1024 default: I'd be really careful not to switch a notable amount of traffic to TCP, and around this value I'd be afraid.
i think you're right and i withdraw my vote for 1024. my second choice was 1200 and is now my first choice. unless (like TCP) we know the size of the IP header including options that will be used on a datagram, we have to leave room for some options, which 1232 (1280 - 40 - 8) does not do, if we believe that there are MTU 1280 links in actual existence.
(noting, if we don't believe there are actual MTU 1280 links in existence, we would be subtracting 40, 8, and some options space from 1500, not 1280, and in that case i'd be comfortable with 1400 as the capped bufsize if the request specifies something larger than that.)
-- Paul
@Habbie no it doesn't.
@vcunat Would that much TCP actually be a problem?
No one would like it, but if deployments can handle it, what's the big deal?
There are a number of TLDs that use 2048-bit RSA ZSKs, producing ~1550 byte NXDOMAINs. Not a lot of them! But their operators obviously find it possible.
On Sunday, 8 September 2019 19:55:57 UTC Matt Nordhoff wrote:
@vcunat Would that much TCP actually be a problem?
No one would like it, but if deployments can handle it, what's the big deal?
state has mass. we don't know if the global tcp protocol control block capacity is in the same order of magnitude as potential demand, and there's only one way (the hard way) to find out. so, we avoid obvious paths toward lots-of-tcp. eventually the combination of DoT and TCPFO will reduce the state load, but we are a lot of years short of being able to count on that.
There are a number of TLDs that use 2048-bit RSA ZSKs, producing ~1550 byte NXDOMAINs. Not a lot of them! But their operators obviously find it possible.
it's possible that EDNS0 @4096 is working often enough to avoid such pain, even though it's clearly failing often enough to cause other pain. since we're rebuilding an airplane in flight, we should predict and manage and balance the pain. it's always possible that such TLDs will just continue to respect 4096 from clients, without clamping it to 1200 or whatever we end up recommending.
-- Paul
@mnordhoff: I'd prefer to avoid incentives against validation on resolvers, against following the flag-day default, etc. And so far I can't see what we expect to gain from pushing the limit so far.
If you turn off DNSSEC validation, this extra cost will disappear and normal stuff will all fit into UDP. Given that validation already needs some additional expense (latency on cold cache, bandwidth, memory, processing power), I wouldn't want to increase it too much... because I'm afraid that realistically validation doesn't improve common end-user experience or even security that much (in the presence of HTTPS). Ballpark: 7-8% of traffic gets switched to TCP, which is reportedly several times more expensive than UDP, and you're getting quite a significant fraction of the whole expense. There are also people "obsessed" with latency, though probably more on the web-browsing side.
@mnordhoff I've tested, and it's indeed capped for 1.1.1.1. However, I'm not sure about Quad9:

;; ANSWER SECTION:
rs.dns-oarc.net. 54 IN CNAME rst.x4090.rs.dns-oarc.net.
rst.x4090.rs.dns-oarc.net. 54 IN CNAME rst.x4058.x4090.rs.dns-oarc.net.
rst.x4058.x4090.rs.dns-oarc.net. 55 IN CNAME rst.x4064.x4058.x4090.rs.dns-oarc.net.
rst.x4064.x4058.x4090.rs.dns-oarc.net. 57 IN TXT "Tested at 2019-09-09 12:09:57 UTC"
rst.x4064.x4058.x4090.rs.dns-oarc.net. 57 IN TXT "196.49.9.220 DNS reply size limit is at least 4090"
rst.x4064.x4058.x4090.rs.dns-oarc.net. 57 IN TXT "196.49.9.220 sent EDNS buffer size 4096"
Do you have a URL where Quad9 announces that it's implementing this?
@loganaden
Try sending several queries to 9.9.9.9. You might not have hit a PowerDNS node.
Do you have a URL where Quad9 announces that it's implementing this?
No! It's just an observation of their deployments; I don't know what their official plans are. They're using default settings for now.
@mnordhoff Indeed. After several attempts, it does work for Quad9:

;; ANSWER SECTION:
rs.dns-oarc.net. 60 IN CNAME rst.x1188.rs.dns-oarc.net.
rst.x1188.rs.dns-oarc.net. 59 IN CNAME rst.x1198.x1188.rs.dns-oarc.net.
rst.x1198.x1188.rs.dns-oarc.net. 58 IN CNAME rst.x1204.x1198.x1188.rs.dns-oarc.net.
rst.x1204.x1198.x1188.rs.dns-oarc.net. 57 IN TXT "196.49.9.220 DNS reply size limit is at least 1204"
rst.x1204.x1198.x1188.rs.dns-oarc.net. 57 IN TXT "196.49.9.220 sent EDNS buffer size 1232"
rst.x1204.x1198.x1188.rs.dns-oarc.net. 57 IN TXT "Tested at 2019-09-09 12:33:05 UTC"
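For anyone who wants to repeat this check, the output above comes from the DNS-OARC reply-size test service (rs.dns-oarc.net); a rough dnspython equivalent, with the resolver address as a placeholder, might look like:

```python
# Sketch: run the DNS-OARC reply-size test through a chosen recursive resolver.
import dns.message
import dns.query

q = dns.message.make_query("rs.dns-oarc.net.", "TXT", use_edns=0, payload=4096)
r = dns.query.udp(q, "9.9.9.9", timeout=10)  # resolver under test
for rrset in r.answer:
    print(rrset)  # CNAME chain plus TXT records reporting the measured limit
```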
Not to derail the fascinating conversation about the specific sizes, but I'd like to get a bit of clarification about where the final consensus size(s) should be applied. Which of the following are considered to be in scope for flag day 2020 (or whatever name is preferred)?
(1) For authoritative servers, a maximum DNS message size applied to all responses (regardless of the EDNS0 UDP buffer size offered by the client) with truncation to fit in that size.
(2) For recursive resolvers, a maximum EDNS0 UDP buffer size sent to authoritative servers.
(3) For recursive resolvers, a maximum DNS message size applied to all responses (regardless of the EDNS0 UDP buffer size offered by the client) with truncation to fit in that size.
(4) For stub resolvers, a maximum EDNS0 UDP buffer size sent to recursive resolvers.
The first two seem to be pretty clearly in scope, but it is unclear whether (3) is covered by the text "Requirements on the resolver side are more or less the same as for authoritative" and the section about authoritative servers that says "You should also configure your servers to negotiate an EDNS buffer size that will not cause fragmentation."
A viewpoint that says this flag day (just like the last one) only applies to communications between recursive resolvers and authoritative name servers would conclude that (3) is out of scope and that no recommendation or requirement is made. For what it's worth, the "Test your resolver" function only tests the behavior between last resolver and authoritative name servers.
As for (4), the section for DNS software vendors would imply that the default should be 1232 (or whatever number it is), but there's no specific call for stub resolver clients to change (and in any case, I think there is little possibility of significantly affecting the installed base in any way besides the software defaults).
What do others think?
On Wednesday, 11 September 2019 17:45:11 UTC Alexander Dupuy wrote:
Not to derail the fascinating conversation about the specific sizes, but I'd like to get a bit of clarification about where the final consensus size(s) should be applied. Which of the following are considered to be in scope for flag day 2020 (or whatever name is preferred)?
...
initiators (clients) should change nothing. this includes stubs, recursives, and authorities when doing SOA lookups. bufsize should indicate the size above which a recvfrom() syscall will truncate network data to fit in the landing zone.
responders (servers) should change to clamp the response size to the minimum of the desired complete size, or the offered buffer size, or the effective PMTU. the effective PMTU can be the discovered actual PMTU, or the interface MTU of the name server's sending socket, or the interface MTU of the outbound route's interface, or the route MTU of the outbound route, or ~1200 (or ~1400).
the client will never send a large enough packet to the server to discover PMTU, so there's no reason to expect that PMTUD will ever function in that direction, if indeed PMTUD ever functions at all, which would be a surprise.
the goal is to avoid fragmentation. the initiator can have no knowledge of the responder's various limitations, and so should not alter its offered buffer size in order to fit under some ceiling (any ceiling).
-- Paul
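The responder-side clamp described two paragraphs up can be restated as a short sketch (all inputs and the fallback value below are placeholders, not a recommendation):

```python
# Sketch of the responder-side clamp: send no more UDP payload than the smallest
# of the desired answer size, the client's offered buffer size, and what the path
# can carry without fragmenting.  Everything here is a placeholder.

DEFAULT_PATH_GUESS = 1232  # conservative fallback when no PMTU information exists

def max_udp_payload(desired_size, offered_bufsize, effective_pmtu=None):
    ip_and_udp_overhead = 48  # IPv6 header (40) + UDP header (8); worst of v4/v6
    if effective_pmtu:
        path_limit = effective_pmtu - ip_and_udp_overhead
    else:
        path_limit = DEFAULT_PATH_GUESS
    return min(desired_size, offered_bufsize, path_limit)

# If the clamp is smaller than the full answer, the server truncates and sets TC=1.
print(max_udp_payload(1800, 4096))        # 1232 -> truncate, TC=1
print(max_udp_payload(900, 4096, 1500))   # 900  -> fits, sent whole
```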
I think the recommended upper bound should apply to everyone sending or receiving DNS messages over UDP, by default. I edited my unclear wording.
I am with Vláďa here, both server and client are in the scope.
There won’t be any sudden deployment for all the servers, so even client defaults need to be changed to have better impact and coverage.
I agree with the above. The client can't know whether the auth server conforms to our anti-fragmentation recommendations. Since the goal is to avoid fragmentation either way, the client should request a buffer size commensurate with MTU (or 1200ish if it doesn't have access to that, as discussed above).
<<but ATM I see no motivation not to try applying the default "everywhere".>> i'm using language very carefully here. dns is not a simple client/server protocol; a NOTIFY initiator will often be an IXFR responder. since EDNS buffer size is an initiator parameter, it's rare to see the responder's buffer size used -- although some UPDATE initiators use a QUERY to discover the responder's buffer size before deciding whether to use UDP vs. TCP on the actual UPDATE.
my motives for suggesting no change in initiator behaviour are as follows.
there are a lot of initiators -- every stub resolver for example. if we're going to touch them all, that is a very high-scale cost, and we should get something better than "offer a smaller EDNS buffer size" in return for that cost. for example, i'd prefer to teach them all DoT and (where available) TCPFO, so they could use persistent encrypted TCP/853 connections opportunistically. and i'd like to see DNS cookies across the stub population.
changes to the initiator population take a long time to roll out, especially for IoT and ICS. this means two things if we change the offered buffer size to a fragmentation-avoiding value: first, that change will be effectively permanent, regardless of inevitable changes that will come to link MTU over the coming decades, and will thus complicate DNS key roll events where more than one RRSIG has to be present simultaneously long after we all learn to think of 1500 the way we currently see the IBM PC's memory limit of 640Kbytes, and second, the reduction in fragmentation would still take quite a long time. so: long cost and slow benefit.
an initiator might be in a fragmentation-favorable environment like a campus or enterprise network where all of its responders are on fragmentation-tolerant paths. we won't be adding a proposed EDNS buffer size as a DHCP parameter, so the initiator's offered buffer size is mostly going to be set in code rather than configuration. that code is in the best position to know the response size it expects, which is what the EDNS protocol defines the buffer size parameter to actually mean.
responders are upgraded far more regularly than initiators, and are far more likely to be monitored by experts. they can also make informed tradeoffs about TC=1, TCP state mass, and additional data policy. a responder who chooses for reasons of its own to send a smaller UDP payload than the initiator's offered buffer size can be continuously improved. if the initiators select an offered buffer size to avoid fragmentation, the responder's options are more limited.
responders are by definition stateful, and any PMTU discovery logic that evolves later, or PMTU configuration information statically input by a system or network administrator, will appear first for responders and perhaps never for initiators. the scheme of "use this conservative estimate for your offered buffer size unless you have better information available" only makes sense if there is some theoretical future timeline on which better information will be available. it won't be for the vast majority of initiators.
i hope this helps explain my earlier comments.
Thanks all for the feedback. The motivation for my question was whether case 3 (recursive resolver as a server) was in scope for the proposed changes, and there was unanimous response that it was (along with case 1 for authoritative servers) so I now have an answer.
As far as the question of cases 2 and 4 (clients), my feeling is that changes should be made by the more centralized/homogenous group. Between stub resolver clients and recursive resolver servers it seems clear that the burden of mitigating fragmentation problems should fall on the recursive resolver servers, so I would give stub resolver clients a free pass and say that case 4 is out of scope.
Between recursive resolver clients and authoritative name servers it is not so clear, but I suspect that the total population of authoritative name servers has much more implementation diversity, if for no other reason than that writing an authoritative (non-DNSSEC-signing and non-delegating) DNS server requires much less understanding of DNS protocol and operation than writing a recursive resolver, which has to handle delegation referral responses, query for name server IP addresses, and perform some sort of sensible name server selection process. I wouldn't give authoritative servers a free pass (they should support truncation and TCP for responses larger than ~1x00 bytes) but again the larger burden of mitigating fragmentation problems should fall on the recursive resolvers (this time as clients). This implies that DNS client software might want to have different default behaviors depending on whether it is talking to a recursive resolver or an authoritative name server, which may make this position less popular.
Paul's terminology of initiators and responders also usefully points out that my taxonomy missed two other cases: NOTIFY / XFR, and UPDATE (as initiators and responders). For the NOTIFY / XFR case, since both sides are authoritative name servers, my centralized/homogenous criterion has no preference, and Paul's argument to put the burden on responders seems compelling enough to me. For UPDATE, I am really not familiar enough with the range of implementations to have an informed opinion.
Looks like we are coming to the end of this discussion, currently 4 suggestions for 1232 and 1 for 1200. I apologize if I've missed any suggestion, please restate it if so.
I'd like to close this thread on 4th Oct noon CEST, two weeks from today, and if you'd like to voice your suggestion please do it by then.
Thanks to everyone for participating in this discussion!
Looks like we are coming to the end of this discussion, currently 4 suggestions for 1232 and 1 for 1200. I apologize if I've missed any suggestion, please restate it if so.
To make things explicit, this states setting the TC bit on replies that exceed those sizes in the server?
To make things explicit, this states setting the TC bit on replies that exceed those sizes in the server?
Yes.
To make things explicit, this states setting the TC bit on replies that exceed those sizes in the server?
Yes.
OK, I will not implement that behavior in CoreDNS
We're still dealing with crazy musl libc clients that don't even implement retry over TCP. And the server is often so far detached from the actual network that there is no way of getting an MTU.
On Fri, Sep 20, 2019, 09:05 Peter van Dijk notifications@github.com wrote:
To make things explicit, this states setting the TC bit on replies that exceed those sizes in the server?
Yes.
That is actually not specified; what is said is:
Authoritative DNS Operators:
Authoritative DNS servers MUST NOT send answers larger than the requested EDNS buffer size!
DNS Resolver Operators:
Resolvers MUST resend queries over TCP if they receive a truncated UDP response (with TC=1 set)!
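To make those two requirements concrete, here is a minimal dnspython sketch of the resolver side (the server address, query name, and 1232 default are placeholders):

```python
# Sketch of the resolver-side requirement: retry over TCP when a UDP response
# comes back truncated (TC=1).  Addresses, names, and sizes are placeholders.
import dns.flags
import dns.message
import dns.query

def query_with_tcp_fallback(qname, rdtype, server, bufsize=1232):
    q = dns.message.make_query(qname, rdtype, use_edns=0, payload=bufsize)
    r = dns.query.udp(q, server, timeout=5)
    if r.flags & dns.flags.TC:              # answer didn't fit in bufsize
        r = dns.query.tcp(q, server, timeout=5)
    return r

r = query_with_tcp_fallback("example.com.", "A", "192.0.2.53")
```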
If you'd like to discuss this further, please make a new issue.
I submitted draft-fujiwara-dnsop-avoid-fragmentation-01. https://tools.ietf.org/html/draft-fujiwara-dnsop-avoid-fragmentation-01 Differences are:
This issue serves as a public, open to all, discussion forum for what the recommended EDNS buffer size should be for DNS Flag Day 2020.
Note that most of the text on dnsflagday.net mentions 1220 bytes.
EDNS buffer size suggestions:
1232 @ralphdolmans @pspacek @oerdnj @Habbie
1200 @vixie
References & Software:
Presentation by Fujiwara-san (slide 19) recommended 1220 which comes from RFC 4035 section 3:
CoreDNS uses these values, which come from NSD:
PowerDNS
BIND's current buffer size negotiation uses 512, 1232, 1432 and 4096.
libc uses 1200
Unbound
Knot Resolver
tinydnssec upper limit of 4000 bytes and a minimum of 1220 bytes
gdnsd caps at 1024
Other Proposals:
@vixie EDNS BUFSIZE should be calculated exactly the same way as TCP MSS (comment)
@fweimer generating atomic fragments by default, to support stateless IPv6 UDP service (comment)
@mnordhoff Dual-stack resolvers should use 512 bytes for IPv4 and 1232 bytes for IPv6, and weight their server selection algorithm to prefer IPv6 (comment)