dns-violations / dnsflagday

DNS flag day
https://dnsflagday.net/

flag day 2020: Recommended EDNS buffer size #125

Closed. jelu closed this issue 4 years ago.

jelu commented 4 years ago

This issue serves as a public, open to all, discussion forum for what the recommended EDNS buffer size should be for DNS Flag Day 2020.

Please note that the exact recommended EDNS buffer size has not been agreed upon; the current ballpark around 1200 bytes (1220, 1232, …) is meant to limit the risk of fragmentation over IPv6.

Note that most of the text on dnsflagday.net mentions 1220 bytes.

EDNS buffer size suggestions:

References & Software:

Other Proposals:

pspacek commented 4 years ago

Last presentation by Fujiwara-san recommended 1220: Slide 19 https://indico.dns-oarc.net/event/31/contributions/692/attachments/660/1115/fujiwara-5.pdf

Why not use that? I think it is better to use a value from researchers instead of randomly selecting another value "because we can".

miekg commented 4 years ago

[ Quoting notifications@github.com in "Re: [dns-violations/dnsflagday] fla..." ]

Last presentation by Fujiwara-san recommended 1220: Slide 19 https://indico.dns-oarc.net/event/31/contributions/692/attachments/660/1115/fujiwara-5.pdf

Why not use that? I think it is better to use a value from researchers instead of randomly selecting another value "because we can".

CoreDNS uses these values, which come from NSD:

//    .., 1480 (EDNS/IPv4), 1220 (EDNS/IPv6), or the advertised EDNS buffer size if that is
//    smaller than the EDNS default.
// See: https://open.nlnetlabs.nl/pipermail/nsd-users/2011-November/001278.html

Although this doesn't trigger anything like setting TC or some such.

/Miek

-- Miek Gieben
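
For illustration, a minimal Go sketch of the NSD-style clamp described above (the names and structure are illustrative, not CoreDNS's actual code):

package main

import "fmt"

const (
	ednsDefaultV4 = 1480 // EDNS/IPv4 default, per the NSD advice quoted above
	ednsDefaultV6 = 1220 // EDNS/IPv6 default, per the NSD advice quoted above
)

// responseBufSize picks the reply buffer limit: the per-family default,
// or the client's advertised EDNS buffer size if that is smaller.
func responseBufSize(advertised uint16, ipv6 bool) uint16 {
	def := uint16(ednsDefaultV4)
	if ipv6 {
		def = ednsDefaultV6
	}
	if advertised < def {
		return advertised
	}
	return def
}

func main() {
	fmt.Println(responseBufSize(4096, true)) // -> 1220
	fmt.Println(responseBufSize(512, true))  // -> 512
}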

Habbie commented 4 years ago

PowerDNS Recursor 4.2 and Authoritative 4.2 (both released recently) have lowered the relevant options from 1680 (which is a nonsensical number no matter which way you look at it) to 1232 (which we got from the ISC Knowledge Base, described as 'to allow for an IPv4/IPv6 encapsulated UDP message to be sent without fragmentation at Ethernet and IPv6 network minimum MTU sizes').

Slide 19 https://indico.dns-oarc.net/event/31/contributions/692/attachments/660/1115/fujiwara-5.pdf

Which actually comes from RFC 4035, section 3:

A security-aware name server MUST support the EDNS0 ([RFC2671]) message size extension, MUST support a message size of at least 1220 octets, ...

mnordhoff commented 4 years ago

For reference, BIND's current buffer size negotiation uses 512, 1232, 1432 and 4096.

The values 1232 and 1432 are chosen to allow for an IPv4/IPv6 encapsulated UDP message to be sent without fragmentation at the minimum MTU sizes for Ethernet and IPv6 networks.

https://ftp.isc.org/isc/bind9/9.15.3/doc/arm/Bv9ARM.ch05.html
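
A toy Go sketch of what such a step-down ladder looks like; queryWithBufsize here is a stand-in, not BIND's actual negotiation logic:

package main

import "fmt"

// ladder mirrors the sizes BIND negotiates through, largest first.
var ladder = []uint16{4096, 1432, 1232, 512}

// queryWithBufsize is a placeholder for "send the query advertising this
// EDNS buffer size and wait for a reply"; for the demo it pretends that
// anything above 1232 is lost to dropped fragments.
func queryWithBufsize(size uint16) bool {
	return size <= 1232
}

func main() {
	for _, size := range ladder {
		if queryWithBufsize(size) {
			fmt.Println("answered at buffer size", size)
			return
		}
		// timeout: assume fragments were dropped, step down and retry
	}
	fmt.Println("all sizes failed; fall back to TCP")
}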

vixie commented 4 years ago

no fixed number is appropriate, any more than 512 was appropriate in DNS over IPv4 UDP. EDNS BUFSIZE should be calculated exactly the same way as TCP MSS, and for identical reasons. i operate networks with MTU 8192 and i will not thank you if you make me use TCP for DNS payloads in the size range between ~1200 and the EDNS recommended default (4096). please, please, please, remember the lesson of 640K in the IBM PC, and avoid unnecessary hard limits.

TCP MSS is calculated by looking at the routing table to find the endpoint-specific MTU (if known) or else the interface MTU (if local) or else the default route MTU (which is often 1500) and then subtracting a fudge factor (40 octets for IPv4, ~300 for IPv6) to allow for transport and protocol headers. this is working -- TCP avoids fragmentation. this is future-proof -- any network that has a larger MTU gets to use it, and any endpoint-specific knowledge (such as PMTUD) is communicated to fragmentation-sensitive transports like TCP, or applications like DNS, via the local routing table.

we should not needlessly invent something that's different from TCP MSS. just use what works.

edit:

see also: https://www.mail-archive.com/dnsop@ietf.org/msg20846.html
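
As a sketch of the computation being proposed (Go; the route-MTU lookup itself is assumed, and the fudge factors are the ones given above):

package main

import "fmt"

// ednsBufSize derives an EDNS buffer size the way TCP derives MSS:
// the best-known MTU toward the destination, minus a per-protocol
// fudge factor (40 octets for IPv4, ~300 for IPv6, per the comment
// above). Finding routeMTU in the routing table is assumed, not shown.
func ednsBufSize(routeMTU int, ipv6 bool) int {
	fudge := 40
	if ipv6 {
		fudge = 300
	}
	return routeMTU - fudge
}

func main() {
	fmt.Println(ednsBufSize(1500, false)) // default route, IPv4 -> 1460
	fmt.Println(ednsBufSize(9000, true))  // jumbo-frame VLAN, IPv6 -> 8700
}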

fweimer commented 4 years ago

We used 1200 in glibc, but I can switch to 1220 if that's the consensus.

What about generating atomic fragments by default, to support stateless IPv6 UDP service? Or not generating atomic fragments under any circumstances? Would that be a separate discussion?

mnordhoff commented 4 years ago

My wild proposal:

Dual-stack resolvers should use 512 bytes for IPv4 and 1232 bytes for IPv6, and weight their server selection algorithm to prefer IPv6.

If it proves necessary, resolvers that only support IPv4 could use 1232 to reduce the amount of TCP fallback.

When configuring software that does not support IP version-specific buffer size settings, users could be recommended to choose 1232, to reduce the amount of TCP fallback (if it proves necessary, or for the sake of IPv6).

jelu commented 4 years ago

@vixie Just a note, the 2020 work is to set a recommended, configurable default that helps avoid fragmentation on the most common networks. You're free to configure the software on your network as you want.

@fweimer IPv6 atomic fragments seem more like a network stack/kernel feature; how do you suggest this would be used in user-land DNS software/services?

vixie commented 4 years ago

On Monday, 2 September 2019 09:08:16 UTC Jerry Lundström wrote:

@vixie Just a note, the 2020 work is to set a recommended, configurable default that helps avoid fragmentation on the most common networks. You're free to configure the software on your network as you want.

i must have misspoken. when speaking to hosts on my LAN i want the LAN's MTU to be used, minus the per-protocol fudge factor. this might be 9000.

when speaking to hosts elsewhere on my campus i want the campus's MTU to be used, again minus some UDP and V4/V6 header estimate (fudge factor). this may be 1500, 4K, 8K, or 9000, depending on what part of the campus i'm reaching.

when speaking through the default gateway i want that gateway route's MTU to be used, it will usually be 1500, minus the same fudge factor TCP MSS would subtract.

when speaking to certain more distant locations (same ISP) i will want the POS MTU to be used, this is usually around 4K.

the way i expect to be able to tell an application what nonfragmentable packet size i want it to use is to enter routes (perhaps static, perhaps dynamic) in my routing table.

TCP will use this information to compute its MSS, on a per-destination basis.

so should DNS for UDP. because all nonfragmenting protocols should use the same basic method of deciding what maximum packet size to offer.

there is no one number i can enter into my name server configuration which will serve me. and indeed, it would be wrong to hard code any such number, for the same reason it was wrong to hardcode 640K in the original IBM PC.

please ack that you understand why your suggestion is a non sequitur, and why DNS has no basis for using a method that differs from TCP's for MSS.

-- Paul

jelu commented 4 years ago

@vixie

please ack that you understand why your suggestion is a non sequitur

If you're directing this at me personally, please note that this is a community effort and I (OARC) am only acting in a supporting role.

ralphdolmans commented 4 years ago

For reference; Unbound currently starts, by default, with 4096 and if that fails will try 1232 for IPv6 and 1472 for IPv4.

@pspacek, afaik the 1220 from Fujiwara is not based on his own research but was chosen because it is the minimum in RFC 4035.

fweimer commented 4 years ago

@fweimer IPv6 atomic fragments seem more like a network stack/kernel feature; how do you suggest this would be used in user-land DNS software/services?

@jelu The expectation is that the kernel keeps track of which destinations requested atomic fragments and produces atomic fragments for them. It requires the kernel to keep per-destination state, so it has scalability issues. It is also hard to make sure that the ICMP responses from the IPv6 network are routed to the appropriate node in a clustered configuration, so that they trigger atomic fragments in future replies.

All this is somewhat similar to path MTU discovery in a UDP context, which is not generally implemented for DNS, of course.

pspacek commented 4 years ago

Personally I do not care that much if it is 1220 or 1232. The only reason why I like 1220 is that it has a reference we can quote.

If consensus leans toward 1232 then Knot Resolver will follow, but we really need a public record of why it is 1232 - preferably something better than "a magic number pulled out of my hat".

In general, using the same algorithm as for the initial TCP MSS value sounds good to me, but I do not have implementation experience with it. I think this can be implemented later as an optimization. The purpose of DNS Flag Day 2020 is to fix a particular interoperability problem with lost fragments; optimizations can come later if there is need/interest in them.

ralphdolmans commented 4 years ago

@pspacek, 1232 is the minimum IPv6 MTU (1280) - IPv6 header (40) - UDP header (8).

I do not have strong feelings about the exact number either, but I find 1232 easier to justify than 1220.
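
Spelled out, the arithmetic behind 1232 (a Go sketch; the constants come from the IPv6 and UDP specifications):

package main

import "fmt"

const (
	ipv6MinMTU  = 1280 // minimum link MTU IPv6 requires (RFC 8200)
	ipv6Header  = 40   // fixed IPv6 header
	udpHeader   = 8
	ednsBufSize = ipv6MinMTU - ipv6Header - udpHeader // = 1232
)

func main() {
	fmt.Println(ednsBufSize) // 1232
}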

pspacek commented 4 years ago

Fine with me!

wtoorop commented 4 years ago

I looked into Path MTU discovery for DNS in 2012 & 2013. At the time I found it doable for IPv6, see:

- blogpost: https://medium.com/nlnetlabs/using-pmtud-for-a-higher-dns-responsiveness-60e129917665
- presentation at 5th CENTR R&D: https://www.nlnetlabs.nl/downloads/presentations/pmtud4dns.pdf

I did not measure how much ICMP was dropped at authoritatives. It would perhaps be good to redo the work (but structurally, within dnsthought) and also do a non-ATLAS-based measurement (maybe with OpenINTEL) of ICMP dropping.

oerdnj commented 4 years ago

@vixie Could you describe a practical scenario (and not a hypothetical one) that is common enough to justify the additional complexity brought by MTU negotiation? E.g., under what configuration do you need 9k UDP on the local network and/or campus that at the same time can't be served by TCP fallback without performance degradation?

You seem to talk a lot about what you want, but the goal of the DNS Flag Day is to best serve the Internet (users) at large, and the general consensus is that users would benefit from eliminating fragmentation in the default configurations.

oerdnj commented 4 years ago

I am not sure where the justification for 1220 comes from, as VPNs and/or GRE add more than 12 octets of overhead. Thus making the default 1232 makes the most sense to me.

fweimer commented 4 years ago

@wtoorop To me, this work looks like it assumes that the problem with reflection attacks is the amplification. But I'm not sure that is actually true. Even if there is no amplification, reflecting the traffic off a popular server (traffic from which cannot be blocked) could be problematic. The proposed hack of extracting destination addresses from the ICMPv6 payload (instead of the outer IPv6 header) breaks BCP38 filters, if I'm reading the proposal correctly.

vixie commented 4 years ago

On Monday, 2 September 2019 09:35:31 UTC Jerry Lundström wrote:

@vixie

please ack that you understand why your suggestion is a non sequitur

If you're directing this at me personally, ...

i was responding to text you sent. was it your text or were you repeating someone else's views?

-- Paul

EDITED: i apologize for the rudeness of my earlier reply in which i used the word "parrotting". thank you to ondrej sury for pointing out my error. i will strive to avoid similar errors in the future.

vixie commented 4 years ago

On Monday, 2 September 2019 10:35:13 UTC Willem Toorop wrote:

I looked into Path MTU discovery for DNS in 2012 & 2013. At the time I found it doable for IPv6, see: - blogpost at https://medium.com/nlnetlabs/using-pmtud-for-a-higher-dns-responsiveness-60e129917665 - Presentation at 5th CENTR R&D: https://www.nlnetlabs.nl/downloads/presentations/pmtud4dns.pdf

...

please note that i am not asking for PMTUD(6), which involves as you point out ICMPv6, which is an unsolved problem. i am not asking this task force to take on an unsolved problem.

rather, i am noting that the communications fabric between any and all sources of MTU information is the routing table. this is where the sysadmin puts some static information, it's where PMTUD(6) would put information if discovered, and it is where TCP MSS looks for endpoint-specific MTU information.

i am not asking the DNS community to help discover this information. rather, i am asking the DNS community to look for this information in the place where TCP MSS would look for it. so, read-only access to information whose utility and purpose has already been demonstrated.

any much older network like SLIP whose MTU is lower than 1220 will thank us. but more importantly, we would be seen as cooperative with metadata that the Internet System already has a place for.

we can heighten the utility of some day fixing PMTUD(6) if we add a consumer for the output it would produce. and we will heighten the utility of using larger MTUs if we would opportunistically transmit larger unfragmentable IP DNS messages.

this is what EDNS0 should have defined for bufsize originally. history is watching us as we consider hardcoding more limits based on IEEE-802 parameters which have not been appropriate since FastE first arrived.

-- Paul

vixie commented 4 years ago

On Monday, 2 September 2019 10:56:22 UTC Ondřej Surý wrote:

@vixie Could you describe a practical scenario (and not some hypothetical) that’s common enough that would justify the additional complexity brought by the MTU negotiation. E.g. under what configuration you need 9k UDP at the local network and/or campus that at the same time can’t be served by TCP fallback without performance degradation?

i have VLANs whose MTU is 9000 ("jumbogram") because it speeds up NFS, whether over UDP or TCP, to be able to fit page-aligned 4K or 8K pages as payloads. on those VLANs, no device which can't have its MTU told to it (so, IoT) is permitted.

when i operated a global ISP backbone in 2001, we had some customers who used an MTU of 4470 when speaking to us, because we had a SONET/SDH backhaul. see:

https://www.juniper.net/documentation/en_US/junos/topics/reference/general/interfaces-media-mtu-size-by-interface-type.html

and they were gluing their enterprise together using transit from my company.

You seem to talk a lot about what do you want, but the goal of the DNS Flag Day is to best serve the Internet (users) at large and the general consensus is that the users would benefit from eliminating the fragmentation in the default configurations.

if what you want is the simplest thing for members of DNS-OARC to do for today, and with no thought of either the constraints you'll be placing on the larger (unrepresented) community, or on the (yet unborn) future community, then by all means use 640K, or whatever 640K means in the situation you're in.

i am asking that we avoid wholly unnecessary constraints, and follow the same reasoning as was used in RFC 6691, which is what EDNS0 should have specified originally. we're fixing a mistake in how EDNS0 was specified. let's fix it correctly.

those who were in the room in bangkok or who are on this mailing list today are neither all-seeing (with regard to the future) nor all-knowing (with regard to the present). when we reason about the unknown, we have a responsibility to keep track of what already is known and what need not be known.

-- Paul

vixie commented 4 years ago

@fweimer All this is somewhat similar to path MTU discovery in a UDP context, which is not generally implemented for DNS, of course.

i think the presumption has to be DF (don't fragment), unless this setting has been overridden by the system administrator (perhaps through name server configuration, or perhaps through static route assignments) or has been overridden by automation (for example, if PMTUD(6) is someday made to work by a future generation). i should have said this in RFC 2671 ("EDNS0"), because even at that time fragmentation was known not to work off-LAN, and i should have specified that !DF should only be set in an EDNS0 transaction if there was specific guiding knowledge (such as working PMTUD, which has not happened and may never happen.)
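
On Linux, a resolver could presume DF along these lines: force the don't-fragment flag on its UDP socket via IP_MTU_DISCOVER. A sketch using golang.org/x/sys/unix (Linux-specific, minimal error handling):

package main

import (
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	conn, err := net.ListenUDP("udp4", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	raw, err := conn.SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	// IP_PMTUDISC_DO sets DF on all outgoing packets; the kernel will
	// then reject (EMSGSIZE) datagrams bigger than the known path MTU.
	raw.Control(func(fd uintptr) {
		err = unix.SetsockoptInt(int(fd), unix.IPPROTO_IP,
			unix.IP_MTU_DISCOVER, unix.IP_PMTUDISC_DO)
	})
	if err != nil {
		log.Fatal(err)
	}
}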

oerdnj commented 4 years ago

i have VLANs whose MTU is 9000 ("jumbogram") because it speeds up NFS, whether over UDP or TCP, to be able to fit page-aligned 4K or 8K pages as payloads. on those VLANs, no device which can't have its MTU told to it (so, IoT) is permitted.

when i operated a global ISP backbone in 2001, we had some customers who used an MTU of 4470 when speaking to us, because we had a SONET/SDH backhaul. see:

https://www.juniper.net/documentation/en_US/junos/topics/reference/general/interfaces-media-mtu-size-by-interface-type.html

and they were gluing their enterprise together using transit from my company.

In a DNS context...

oerdnj commented 4 years ago

i was responding to text you sent. was it your text or were you parroting?

Paul, stop being rude.

vixie commented 4 years ago

On Monday, 2 September 2019 19:25:46 UTC Ondřej Surý wrote:

...

when i operated a global ISP backbone in 2001, we had some customers who used an MTU of 4470 when speaking to us, because we had a SONET/SDH backhaul. see:

https://www.juniper.net/documentation/en_US/junos/topics/reference/general/interfaces-media-mtu-size-by-interface-type.html

and they were gluing their enterprise together using transit from my company.

In a DNS context...

they were, and i'm sure still are, using DNS. avoiding TC=1 and the resulting 3-way handshake, and avoiding the state costs of TCP protocol control blocks, are all computable virtues.

if we do the 640K thing, whatever limit we recommend will in a DNS context keep anyone from offering any answers larger than that, now and forever. i argue that the future deserves better consideration from us than that.

-- Paul

vixie commented 4 years ago

On Monday, 2 September 2019 19:26:03 UTC Ondřej Surý wrote:

i was responding to text you sent. was it your text or were you parroting?

Paul, stop being rude.

if someone in this thread dismisses a proposal for reasons they give, but then claims to be representing the will of the community, i think the rudeness ship has sailed. we should be transparent as to what we personally think, or whose views (with names and references) we are representing.

in realpolitik, it is common to speak words like "i've got people looking at this in hawaii right now, and you won't believe what they're finding" when referring for example to a political opponent's birth certificate. technical debate is not realpolitik and should not be made so.

if this decision has already been reached via secret handshake, and this whole debate is just designed to create the false impression of consensus, please just say so, and we'll all stop pretending that alternative proposals, or technical merit, are at all relevant.

-- Paul

oerdnj commented 4 years ago

if someone in this thread dismisses a proposal for reasons they give, but then claims to be representing the will of the community, i think the rudeness ship has sailed.

@vixie, so far, you are the only one being rude to other people. I have politely asked you to stop, and I would expect you to apologise to Jerry, not give me a list of reasons why you can be rude here. So, I am going to ask you again to not be rude to others and treat other people with respect even if you disagree with them or the course of action.

vixie commented 4 years ago

On Monday, 2 September 2019 10:03:46 UTC Petr Špaček wrote:

In general, using the same algorithm as for the initial TCP MSS value sounds good to me, but I do not have implementation experience with it. I think this can be implemented later as an optimization. The purpose of DNS Flag Day 2020 is to fix a particular interoperability problem with lost fragments; optimizations can come later if there is need/interest in them.

if there will be guidance published, can it be in an RFC?

can the guidance be "use the best information you have about the PMTU between yourself and the remote address, which may just be the MTU associated with the destination route, and if you have no information whatsoever, use 1200 octets"?

i really think we should avoid 640K, in all its forms.

-- Paul

vixie commented 4 years ago

On Monday, 2 September 2019 19:45:45 UTC Ondřej Surý wrote:

...

@vixie, so far, you are the only one being rude to other people. I have politely asked you to stop, and I would expect you to apologise to Jerry, not give me a list of reasons why you can be rude here. So, I am going to ask you again to not be rude to others and treat other people with respect even if you disagree with them or the course of action.

if you check your timestamps, you'll see that i edited my rude post to include an apologia to all and a thank-you to yourself, shortly after our last exchange. if you think i haven't gone far enough, please be specific.

-- Paul

oerdnj commented 4 years ago

if you check your timestamps, you'll see that i edited my rude post to include an apologia to all and a thank-you to yourself, shortly after our last exchange. if you think i haven't gone far enough, please be specific.

Thank you, I haven't noticed that. This is much appreciated.

vixie commented 4 years ago

On Tuesday, 3 September 2019 07:25:49 UTC Ondřej Surý wrote:

... Thank you, I haven't noticed that. This is much appreciated.

when i'm wrong, i'll own it.

but in addition, i'd still like answers to the surrounding questions. there is broad consensus that RFC 2671 should not have defined a protocol which required fragmentation, but there is no consensus on whether the problem was the fixed number itself (4096) or having a fixed number at all. my own view is that RFC 2671 ought to have said, do what TCP does for MSS. i've also found that windows and linux both have user mode access to the nec'y information, and i already knew that BSD has it, so the only question mark is mac/os/x.

may 2020 is not yet tomorrow. what can we do that the future will thank us for?

-- Paul

oerdnj commented 4 years ago

but in addition, i'd still like answers to the surrounding questions. there is broad consensus that RFC 2671 should not have defined a protocol which required fragmentation, but there is no consensus on whether the problem was the fixed number itself (4096) or having a fixed number at all. my own view is that RFC 2671 ought to have said, do what TCP does for MSS. i've also found that windows and linux both have user mode access to the nec'y information, and i already knew that BSD has it, so the only question mark is mac/os/x.

The DNS is a lightweight protocol with low-latency requirements where the answers are generally larger than the queries. Doing what TCP does for the MSS is worthwhile for long-lived connections, which usually don't happen in the DNS, and we generally want to avoid the increased latency caused by PMTUD. The other property of the DNS is that most answers fit within the 1232-byte boundary, and there's a reliable fallback to TCP for the edge cases.

The Linux getsockopt() IP_MTU value doesn't appear on the socket out of nowhere, and so far there has been no working code or research proving that implementing PMTUD in DNS is actually helpful on a practical level compared to reducing the default EDNS buffer size to 1232.

oerdnj commented 4 years ago

That said, the lightweight handling of the IP fragmentation problem proposed here is not something that can't ever be changed by future generations.

oerdnj commented 4 years ago

And just to be absolutely sure that we are talking about the same thing, here's the excerpt from the man 7 ip:

       IP_MTU (since Linux 2.2)
              Retrieve the current known path MTU of the current socket. Returns an integer.

              IP_MTU is valid only for getsockopt(2) and can be employed only when the socket has been connected.

       IP_MTU_DISCOVER (since Linux 2.2)
              Set or receive the Path MTU Discovery setting for a socket. When enabled, Linux will perform Path MTU Discovery
              as defined in RFC 1191 on SOCK_STREAM sockets. For non-SOCK_STREAM sockets, IP_PMTUDISC_DO forces the
              don't-fragment flag to be set on all outgoing packets. It is the user's responsibility to packetize the data in
              MTU-sized chunks and to do the retransmits if necessary. The kernel will reject (with EMSGSIZE) datagrams that
              are bigger than the known path MTU. IP_PMTUDISC_WANT will fragment a datagram if needed according to the path
              MTU, or will set the don't-fragment flag otherwise.

              The system-wide default can be toggled between IP_PMTUDISC_WANT and IP_PMTUDISC_DONT by writing (respectively,
              zero and nonzero values) to the /proc/sys/net/ipv4/ip_no_pmtu_disc file.

              Path MTU discovery value   Meaning
              IP_PMTUDISC_WANT           Use per-route settings.
              IP_PMTUDISC_DONT           Never do Path MTU Discovery.
              IP_PMTUDISC_DO             Always do Path MTU Discovery.
              IP_PMTUDISC_PROBE          Set DF but ignore Path MTU.

              When PMTU discovery is enabled, the kernel automatically keeps track of the path MTU per destination host.
              When it is connected to a specific peer with connect(2), the currently known path MTU can be retrieved
              conveniently using the IP_MTU socket option (e.g., after an EMSGSIZE error occurred). The path MTU may change
              over time. For connectionless sockets with many destinations, the new MTU for a given destination can also be
              accessed using the error queue (see IP_RECVERR). A new error will be queued for every incoming MTU update.

              While MTU discovery is in progress, initial packets from datagram sockets may be dropped. Applications using
              UDP should be aware of this and not take it into account for their packet retransmit strategy.

              To bootstrap the path MTU discovery process on unconnected sockets, it is possible to start with a big datagram
              size (headers up to 64 kilobytes long) and let it shrink by updates of the path MTU.

              To get an initial estimate of the path MTU, connect a datagram socket to the destination address using
              connect(2) and retrieve the MTU by calling getsockopt(2) with the IP_MTU option.

              It is possible to implement RFC 4821 MTU probing with SOCK_DGRAM or SOCK_RAW sockets by setting a value of
              IP_PMTUDISC_PROBE (available since Linux 2.6.22). This is also particularly useful for diagnostic tools such
              as tracepath(8) that wish to deliberately send probe packets larger than the observed Path MTU.
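
To make that concrete, a minimal Go sketch of the connect-then-getsockopt(IP_MTU) probe the man page describes (Linux-only, via golang.org/x/sys/unix; 192.0.2.53 is a documentation placeholder, not a real server):

package main

import (
	"fmt"
	"log"
	"net"

	"golang.org/x/sys/unix"
)

func main() {
	// Connecting a UDP socket makes IP_MTU valid for it.
	conn, err := net.Dial("udp4", "192.0.2.53:53")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	raw, err := conn.(*net.UDPConn).SyscallConn()
	if err != nil {
		log.Fatal(err)
	}
	var mtu int
	var gerr error
	raw.Control(func(fd uintptr) {
		mtu, gerr = unix.GetsockoptInt(int(fd), unix.IPPROTO_IP, unix.IP_MTU)
	})
	if gerr != nil {
		log.Fatal(gerr)
	}
	fmt.Println("path MTU toward destination:", mtu)
	fmt.Println("implied IPv4 EDNS bufsize:", mtu-20-8) // minus IPv4+UDP headers
}
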
mnordhoff commented 4 years ago

I think this conversation has focused on non-adversarial MTU issues, but fragmentation attacks can still be a problem.

It doesn't matter if you can correctly negotiate 1400, or whatever, if you have a vulnerable endpoint that an attacker can poison to use 296.

AIUI, the best the DNS community can do about that is to ensure that as many partial mitigations as can be found are deployed as widely as possible. 512 in IPv4. 1232 in IPv6. IPv6. DNSSEC. ECDSA and NSEC. Alternatively, no DNSSEC and responses as small as possible. TCP. Secure IPIDs. Imploring people to deploy fixes for vulnerabilities such as CVE-2019-10638.

Habbie commented 4 years ago

512 in IPv4. 1232 in IPv6.

Assuming servers start to set DF on all responses, right?

vixie commented 4 years ago

Matt Nordhoff wrote on 2019-09-03 09:44:

I think this conversation has focused on non-adversarial MTU issues, but fragmentation attacks can still be a problem.

to be clear, RFC 2671 was wrong to specify fragmentability (IP.DF=0) and to truly get out of the bad era thus begun we will have to set IP.DF=1 -- and on the plus side, this would actually qualify as a "flag day" which merely changing the default buffer size would not.

-- P Vixie

oerdnj commented 4 years ago

Assuming servers start to set DF on all responses, right?

With DF on all responses, you either have to go really low with IPv4, which kills performance because it will switch a lot of traffic to TCP, or implement PMTUD, which kills a lot of other things: making things simpler rather than more complex (the main goal of the DNS Flag Days), ephemeral-port connections, macOS (which has no API for this), and ECMP-enabled networks (because PMTUD is really broken there).

And all this for the little benefit of delivering those few DNS messages with payloads bigger than 1232.

Habbie commented 4 years ago

or implement PMTUD

Which indeed is entirely pointless on UDP DNS responses.

Habbie commented 4 years ago

I am not sure where the justification for 1220 comes from, as VPNs and/or GRE add more than 12 octets of overhead. Thus making the default 1232 makes the most sense to me.

I agree with this. 1220 has no benefit over 1232, so I support 1232.

oerdnj commented 4 years ago

they were, and i'm sure still are, using DNS. avoiding TC=1 and the resulting 3-way handshake, and avoiding the state costs of TCP protocol control blocks, are all computable virtues.

I am sorry, but this is still a hypothetical scenario. I, for example, am aware of the org IN DNSKEY issue that comes from the real world (and not the "what the heck, let's put everything into DNS" world), and it can be solved by:

  1. limiting the number of published DNSKEY records,
  2. limiting the number of published RRSIGs (to just the KSK), or
  3. switching to ECC keys.

It's certain that we need more measurements and data on this, but we should optimise for the things we can actually measure and that real people actually use, not for edge conditions that can be handled by turning knobs. We should not break the edge cases (and we don't), but the real focus must be on the fast path and the majority.
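
For the record, that case is easy to measure; a sketch using the github.com/miekg/dns library (the resolver address is a placeholder):

package main

import (
	"fmt"
	"log"

	"github.com/miekg/dns"
)

func main() {
	m := new(dns.Msg)
	m.SetQuestion("org.", dns.TypeDNSKEY)
	m.SetEdns0(1232, true) // advertise a 1232-byte buffer, set the DO bit

	c := new(dns.Client)
	r, _, err := c.Exchange(m, "192.0.2.53:53") // placeholder resolver
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("answer is %d bytes, truncated: %v\n", r.Len(), r.Truncated)
}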

vixie commented 4 years ago

Ondřej Surý wrote on 2019-09-03 14:32:

Assuming servers start to set DF on all responses, right?

right. because fragmentation is either a goal, or a non-goal. to remove it as an attack vector, it has to become an explicit non-goal.

With DF on all responses, you either have to go really low with IPv4, which kills performance because it will switch a lot of traffic to TCP

it should be fine with us if V4 becomes a legacy protocol, slower than V6.

or implement PMTUD ...

i am not proposing that we do that. as defined, PMTUD cannot work, and PMTUD6 cannot be secure. what i'm looking for is an open path from this present to some future where something like PMTUD has been made to work. to do that, we have to know the MTU of the interface we will use to transmit the packet, and of the interface the operating system will use to forward the packet (if any), so that we exceed neither. for extra credit we should look at the route the operating system will use to forward the packet, which may have an even smaller MTU, if one has been set by the operator, or discovered by some future PMTUD-like system.

i am not proposing PMTUD. only that we use the information we have, which may some day be informed by something like PMTUD, but is already informed by static configuration knowledge.

And all this for the little benefit of delivering those few DNS messages with payloads bigger than 1232.

this is the topic i tried to address in bangkok but the discussion closed. 1232 is like 640K -- reasonable in its day, unreasonable in its era. at 10Mbit/sec we got about 10K PPS. since the MTU hasn't been able to evolve, at 10Gbit/sec we're seeing about 10M PPS. that's a lot of PPS, and can't be expected to scale indefinitely. MTUs will increase.

on a network with an MTU of 64K or larger, fragmentation disappears, because IP's maximum packet size is 64K.

in a DNS with stub validation, it would be nice to send a full signature and certificate chains without first creating TCP state.

we don't know that these things will be done. but we have no good cause to prohibit them a-priori in the next move after RFC 2671's EDNS0.

-- P Vixie

vixie commented 4 years ago

Peter van Dijk wrote on 2019-09-03 14:33:

or implement PMTUD

Which indeed is entirely pointless on DNS responses.

i am not calling for PMTUD, which is known not to work in today's IPv4 or IPv6.

however, if something like it is developed in the future, it will not be at all pointless for DNS responses.

all we have to do right now is not make it less relevant or more difficult.

-- P Vixie

oerdnj commented 4 years ago

any much older network like SLIP whose MTU is lower than 1220 will thank us. but more importantly, we would be seen as cooperative with metadata that the Internet System already has a place for.

JFTR I strongly believe that any much older network like SLIP should be using a custom setup (e.g. resolver to resolver) and that we should not optimise networking protocols nor our code for it. The resources for BIND 9 development are scarce and we must pick our targets so as not to waste them.

oerdnj commented 4 years ago

however, if something like it is developed in the future, it will not be at all pointless for DNS responses.

all we have to do right now is not make it less relevant or more difficult.

I am genuinely confused about what you propose, then. It might be a language barrier, but it seems to me that you are either proposing to do nothing or to develop something that doesn't exist.

If something like it is developed in the future, I am sure all the DNS developers will gladly use it, because it will improve the experience of the DNS user base. But this is not a new problem, and it has lasted for years now, so I don't see how keeping the DNS broken any longer would incite any development in this area, or how setting the payload size to 1232 would make it less relevant than it is today, given that of all the people on Earth who even remotely care about the issue, 90% are here.

vttale commented 4 years ago

Since the 2019 flag day, my impression -- which I am totally prepared to discover is faulty -- is that the various participating implementations don't have any sort of EDNS fallback retry mechanisms now w.r.t. buffer sizes. That is, if the resolver sends out an EDNS message of whatever buffer size and doesn't get a response (say, because a frag was dropped so the OS couldn't reassemble the datagram) then it's just treated like a normal DNS timeout and the resolver moves on to the next server, if any.

Is this correct for BIND, Unbound, Knot, ...?

vcunat commented 4 years ago

Knot Resolver never had this auto-adjustment of EDNS buffer sizes (default 4096 from the RFC, so far), but it tries TCP after four UDP packets without a reply (those may all go to different IPs). This strategy was unchanged in 2019 as well, even though it can probably save many of these cases. Still, we are preparing large changes in the server-selection approach, unrelated to the Flag Days...

wtoorop commented 4 years ago

@vttale Incorrect. See @ralphdolmans earlier message:

Unbound currently starts, by default, with 4096 and if that fails will try 1232 for IPv6 and 1472 for IPv4

vcunat commented 4 years ago

2019 was about the fallback that turns off EDNS completely even without a FORMERR reply (the correct way to indicate non-support for EDNS). Going without EDNS requires a limit of 512, among other implications.