PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0

Better options for DDOS mitigation with DoT #12164

Open jacobbunk opened 1 year ago

jacobbunk commented 1 year ago

Short description

Recently the random subdomain attacks we see from time to time against our auth DNS service have also started hitting our DoT endpoint set up with dnsdist. That causes very high CPU usage on our dnsdist instances even with just a few thousand QPS over DoT, because the attacker only does a single query on each connection.

I would love to see some better tools for rate limiting new TCP/TLS connections per IP or IP subnet.

Usecase

Rate limiting new TLS connections per IP or IP subnet would probably help protect the service against malicious users opening many new connections.

Description

During a DDoS against the DNS service at $work we saw just a few thousand QPS over DoT drive CPU usage up to 2000+% on some of our DNS servers running on an AMD EPYC 7501, a rather powerful CPU. Running perf top -p <pid of dnsdist> I learned that 80% of the CPU time was spent in BN_X931_generate_prime_ex in libcrypto.so.1.1. I suspect that's related to setting up new TLS connections. After the attack I looked at the dnsdist_frontend_tcpavgconnectionduration metrics and saw that average connection durations had dropped to almost 0 during the attack, which is why I suspect each connection was only used for a single query.

I suggest adding the possibility to set a rate limit on new TLS connections per IP or IP subnet, much like the limit you can set on query rates using MaxQPSIPRule(), except it would probably have to be an option on the addTLSLocal() function since the high load is triggered well before any queries get to be processed by the dnsdist rules.

Alternatively this could be done in iptables using an approach like the one outlined here: https://making.pusher.com/per-ip-rate-limiting-with-iptables/ - which may be the preferred approach for busy servers?

johnhtodd commented 1 year ago

I see a few items embedded here that may need to be taken apart as possible discrete methods to mitigate attacks or misconfiguration fault events. There is the concept of purely volumetric rate limiting, where the number of new encryption events is counted on a per-IP or per-subnet basis (probably needs to be per-subnet, given the behavior of v6 hosts). There is the concept of looking at existing volumes of connections and creating some maximum CPU (difficult!) or quantity thresholds, where the system can have a maximum of N encryption sessions in Y seconds (or just a maximum number of established sessions) before refusing or delaying others. Then there is the concept of looking at the average number of queries in a session, and (for example) rejecting client connections from subnets/hosts whose average is less than Y queries per encryption setup event in the last Z seconds.

I'm sure there are other ideas for this as well, but it is true that encryption exposes dnsdist to a much, MUCH larger risk of resource exhaustion from what would otherwise be easily-deflected attacks over UDP. Note that I'm also being somewhat broad with the term "encryption", since I could see these metrics being applied not just to DoT and DoH but potentially also to DoQ.

rgacogne commented 1 year ago

I wonder how the HTTPS world deals with that kind of issue. It would be very useful to know what kind of mitigations exist in haproxy and nginx, for example, so that we don't reinvent the wheel on our own.

mnordhoff commented 1 year ago

What even is BN_X931_generate_prime_ex() and what's it being used for? Is perf identifying functions correctly?

It sounds like something about generating RSA keys according to a particular obsolete specification.

This may be beside the point; TLS is always going to use some CPU.

hlindqvist commented 1 year ago

I wonder how the HTTPS world deals with that kind of issue. It would be very useful to know what kind of mitigations exist in haproxy and nginx, for example, so that we don't reinvent the wheel on our own.

I don't know how strong the DDoS focus should be (the title says so, but maybe it makes sense to back off slightly from that), but my distinct impression from the HTTPS world is a strong tendency to leap to "let's employ the services of Cloudflare/Akamai/..." and similar drastic changes of architecture/topology. That makes sense in the context of directly addressing the distributed nature of such an attack, but of course it doesn't particularly help improve the capabilities of dnsdist.

As for local (D)DoS mitigations, the already-mentioned potential mitigation strategies make sense to me (as does peeking at e.g. haproxy), and overall a "dynblocks"-style approach (based on TLS-related metrics instead) could make sense for abusive TLS handshakes.

What I'd also like to add is the matter of being careful about how one sets up TLS in the first place, in order to minimize the cost of the handshakes you do go through with (more of a documentation issue, if considered to be in scope at all?). Things like how going with ECDSA over RSA is notably more performant for the server side in a typical setup, and, if you have any kind of hardware acceleration or particularly optimized crypto library builds, the importance of picking TLS parameters that match those capabilities.

jacobbunk commented 1 year ago

I had a chat with my local Linux kernel developer at a conference today - we discussed the options for rate limiting new TCP connections using BPF, and he said that it should be fairly easy (at least if you are a kernel developer working on those parts). Since dnsdist already has some integration with BPF, I wonder if that would be something to consider doing as part of the dnsdist config?

@hlindqvist - good input about going for more efficient crypto algorithms than RSA - I'll give that a spin at my earliest convenience.

rgacogne commented 1 year ago

I suggest adding the possibility to set a rate limit on new TLS connections per IP or IP subnet, much like the limit you can set on query rates using MaxQPSIPRule(), except it would probably have to be an option on the addTLSLocal() function since the high load is triggered well before any queries get to be processed by the dnsdist rules.

That does not sound too hard to implement: we could keep a per-IP map, likely with LRU so that cleaning stays efficient, and allow a number of new sessions per second (token-based, perhaps). I'm not sure whether we want that per addTLSLocal() or global, and we might want different rate limits for full handshakes vs session resumption.

rgacogne commented 1 year ago

I had a chat with my local Linux kernel developer at a conference today - we discussed the options for rate limiting new TCP connections using BPF, and he said that it should be fairly easy (at least if you are a kernel developer working on those parts). Since dnsdist already has some integration with BPF, I wonder if that would be something to consider doing as part of the dnsdist config?

I like BPF a lot, but using BPF as a devops engineer is very different from shipping it inside an application. We don't want to embed the whole clang runtime, and shipping compiled eBPF like we do for dynamic blocking is not great because of kernel version compatibility issues.

setharnold commented 1 year ago

Admins looking to solve a problem immediately might benefit from using plain old firewalling tools, too: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-using_nftables_to_limit_the_amount_of_connections
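For DoT on TCP/853, a minimal nftables sketch of that kind of per-source connection cap could look roughly like this (untested as written; the port, limits and set sizes are placeholder values to adjust):

    table inet dns_limits {
        set dot_clients_v4 {
            type ipv4_addr
            size 65535
            flags dynamic
        }

        set dot_clients_v6 {
            type ipv6_addr
            size 65535
            flags dynamic
        }

        chain input {
            type filter hook input priority filter; policy accept;
            # cap concurrent DoT connections per source address; entries go away
            # automatically once the tracked connections are gone
            tcp dport 853 ct state new add @dot_clients_v4 { ip saddr ct count over 10 } counter reject with tcp reset
            tcp dport 853 ct state new add @dot_clients_v6 { ip6 saddr ct count over 10 } counter reject with tcp reset
        }
    }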

phonedph1 commented 1 year ago

https://indico.dns-oarc.net/event/46/contributions/978/attachments/947/1754/Cache%20Poisoning%20Protection%20-%20Deployment%20Experience.pdf page 13 is also semi-relevant to this

jacobbunk commented 1 year ago

https://indico.dns-oarc.net/event/46/contributions/978/attachments/947/1754/Cache%20Poisoning%20Protection%20-%20Deployment%20Experience.pdf page 13 is also semi-relevant to this

We did get a mention on slide 12. Google's use of our ADoT service is what caused me to create this ticket in the first place, since they - from my perspective - do it in a sub-optimal way where Google's systems open thousands of mostly idle TLS connections and then only run queries on them every 5-20 seconds.

I have been in touch with Puneeth and Tianhao, but they have stopped responding to my emails, I guess because fixing what I'm asking them to fix (use fewer TLS connections and don't leave them idle for extended periods) would require massive changes on their end.

thestinger commented 3 months ago

@rgacogne

I wonder how the HTTPS world deals with that kind of issue. It would be very useful to know what kind of mitigations exist in haproxy and nginx, for example, so that we don't reinvent the wheel on our own.

As we've learned the hard way, nginx's approach can't mitigate layer 7 HTTPS attacks decently beyond scaling up your infrastructure to handle all the TLS connections. The nginx features protect a backend able to handle far fewer connections than nginx but don't do much to deal with being overwhelmed at the nginx layer. The approach seems to predate having TLS everywhere. It still works fine for protecting backends... but you'd have to massively scale up to handle a layer 7 TLS DDoS yourself unless you do total connection and new connection rate limiting at a low level combined with disabling keepalive while under attack. Forcing the attacker to make new TCP connections for each request helps a ton and then you can deal with it at the firewall layer via synproxy + connection/rate limits.

Connection limits in nginx via limit_conn and rate limits via limit_req deny connections after the TLS connection has been brought up, when they should really be denied without establishing a TCP connection in the first place. The nginx connection and rate limiting is useful for dealing with repeated requests over the same connections, along with multiplexing for HTTP/2 and HTTP/3. You still need low-level connection limits and rate limiting when under attack or it falls down really easily.

Admins looking to solve a problem immediately might benefit from using plain old firewalling tools, too: https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/security_guide/sec-using_nftables_to_limit_the_amount_of_connections

@setharnold Doing it the way that's proposed nearly everywhere, by adding/enforcing based on ct count in a set for every single new connection (every SYN packet), is a bad idea. Every inbound connection should really either be notrack or go through synproxy to avoid DoS via filling the conntrack table and these connection/rate limiting sets using spoofed packets. Their proposed approach makes an entry in the set for every SYN packet, which can be spoofed. IPs should be added to the set on the other side of synproxy. The same thing applies to rate limiting to an even greater extent, since you wouldn't want spoofed packets resulting in hosts being blocked despite not sending anything. Using synproxy allows the rate limiting to happen behind synproxy, which ensures an attacker can't spoof packets to get hosts blocked. As with connection limits triggered behind synproxy, the blocks can then be activated in front of synproxy to prevent establishing any new TCP connections.

There's very poor documentation on combining synproxy and connection limiting. We couldn't find any good examples. We recently deployed it to production for SSH and we're likely going to deploy it for most of our services by default. Windows unfortunately doesn't enable TCP timestamps by default, which mixes very badly with Linux-style SYN cookies and cripples TCP performance, so we're likely only going to use it by default for services not accessed via web browsers. The synproxy SYN cookies work better than native Linux SYN cookies, with some of the major drawbacks eliminated, but you still depend on having TCP timestamps to avoid losing a bunch of TCP performance.

Here's what we've started with for SSH connection limits, which we can make super aggressive (1 connection) since it's something only we use ourselves and we can use multiplexed connections:

https://github.com/GrapheneOS/infrastructure/compare/16ef317460b07517788079bc5d643f51f8e8e6f0...8c929f02ac18f21c8954b34cd2103faab05030e0
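For anyone who doesn't want to read the diff, the synproxy part of it has roughly the following shape (heavily simplified, with placeholder values rather than our actual rules; the per-source connection limit sets from the diff are then only populated for connections that made it past synproxy):

    table inet ssh_protect {
        chain prerouting {
            type filter hook prerouting priority raw; policy accept;
            # don't create conntrack entries for raw SYNs to the protected port;
            # synproxy answers them with SYN cookies instead
            tcp dport 22 tcp flags syn notrack
        }

        chain input {
            type filter hook input priority filter; policy accept;
            # hand the untracked handshake packets to synproxy; only clients that
            # complete the cookie handshake ever reach the real listener
            tcp dport 22 ct state invalid,untracked synproxy mss 1460 wscale 7 timestamp sack-perm
            ct state invalid drop
            # per-source connection limit sets would be enforced here, behind synproxy
        }
    }

This also requires setting net.netfilter.nf_conntrack_tcp_loose to 0 so conntrack doesn't pick up connections mid-stream and bypass synproxy.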

We plan to do the same for DNS-over-TCP and DNS-over-TLS for our servers with PowerDNS too.

We're also testing rate limiting with the same approach. It just needs to be cleaned up and optimized a bit by splitting out chains to avoid a bunch of redundant checks.

DNS resolvers mostly have timestamps enabled and aren't using tons of bandwidth over a single connection anyway, so I don't think always-on SYN cookies will hurt much. I don't see a good way to do this at the nftables layer without always-on synproxy.

BPF could theoretically be a lot better than nftables, because nftables is a separate network stack acting as if it's a dedicated firewall computer even when it's being used as a host-based firewall on the same machine. Synproxy establishes the connection all by itself and then spoofs packets via loopback to establish the connection with the backend, which is how you can rate limit connections behind it: by adding to a set and rejecting over the threshold in the loopback input path for the spoofed packets. However, I doubt it's really possible to do much better in practice, especially since native Linux SYN cookies are essentially buggy.

Our infrastructure repository is MIT licensed and we're going to be putting this stuff together there. We're dealing with a lot of DDoS attacks but don't want to simply stick everything behind a service doing TLS interception, such as Cloudflare's main reverse HTTPS proxy or their Spectrum (stream) TLS termination support. We were considering using path.net via BuyVM with the synproxy filter, but path.net seems to be dying at the moment. A lot of these services are quite expensive, and many, including path.net and even Cloudflare to an extent, are shady companies with ties to groups doing DDoS attacks, etc.

https://github.com/GrapheneOS/infrastructure/blob/main/LICENSE

It's not great that it's becoming increasingly hard to keep services exposed to the internet when someone wants to harm you, without simply giving in and routing all the traffic through Cloudflare, which then has access to all of it too.

thestinger commented 3 months ago

Since Linux TCP SYN cookies are now based on SipHash, the only real downsides are a tiny bit of overhead for legitimate connections and, more importantly, clients without TCP timestamps not being able to negotiate window scaling / SACK. All reasonable clients have TCP timestamps, but Windows disabled them at some point to save 12 bytes per packet. I don't know if they ever enabled them again; I expect they're still disabled there. Some people may also disable them on Linux, FreeBSD, etc. due to tutorials recommending it.

We've extended the synproxy + connection limiting to DNS-over-TCP and DNS-over-TLS as an always-enabled protection now:

https://github.com/GrapheneOS/infrastructure/compare/6b573fe227d0daadfe955f1aa542ac577fd5aba4...811fcf593e82d05d73fe4ea5af11b81a5cc3f1c7

There will be some follow-up changes to factor out the common checks. We also plan to add rate limiting of new connections behind synproxy, which can result in short blocks getting applied in front of synproxy too. We're currently testing this kind of rate limiting with our staging services and certain production services.

It's unfortunate that TCP timestamps aren't universally enabled, because it blocks us from using this approach for our web sites as a default baseline. Instead, we'll need to have it as something that gets toggled on when under a particularly bad attack, probably manually at first, to avoid providing an easy way to degrade performance for Windows users and some other users. I don't think it's an issue for DNS though, since it's not as if you need a large amount of per-connection throughput for it in normal use cases.

It would be nice if Linux used the same hacks FreeBSD uses to provide SACK and coarse window scaling support with SYN cookies when TCP timestamps aren't enabled. We could simply be using this approach across the board if that happened.

It's possible to rate limit synproxy, so that SYN packets below the rate limit bypass it, and to enforce the connection limit in both places, but that makes things a bit more complex and requires choosing a rate limit. We can investigate it in production now that we have this deployed for a service with some steady traffic.
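On the prerouting side that would look something like this (simplified; the port, rate and burst are placeholders that would need tuning per service):

    chain prerouting {
        type filter hook prerouting priority raw; policy accept;
        # instead of unconditionally untracking SYNs, only untrack (and therefore
        # synproxy) the ones above a global rate; traffic below the limit keeps
        # going through the regular stack with no SYN cookies involved
        tcp dport 853 tcp flags syn limit rate over 500/second burst 1000 packets notrack
    }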

rgacogne commented 3 months ago

Thanks a lot for sharing your experience, Daniel, that really is much appreciated!

Do you think there are some mitigations that would be useful to implement in dnsdist itself, or does it make more sense to let existing tools like nftables/SYN cookies/synproxy do their job?

thestinger commented 3 months ago

Having a very efficient authoritative server with an in-memory backend able to handle incremental updates to the static configuration, rather than needing a database, seems like the best way to deal with things. I think the best DDoS mitigation is simply having a highly efficient static service which can essentially be deployed as a CDN by just adding more instances which are all the same. That's how we're trying to do things. We currently only have ADoT as a bonus forward-looking feature, where the worst case for a DDoS attack on it is that it stops working and clients fall back to UDP/TCP. In the future it would matter a lot more if it becomes possible to enforce using it in practice.

We don't plan on using QUIC in the foreseeable future because it makes DDoS resistance much harder, due to being UDP at the kernel / firewall level and having lower-level encryption. The way it's integrated into applications in practice removes a lot of existing tools. It will work fine at Google scale, but it's going to make attacks easier on small deployments.

We currently simply use the authoritative server directly, but the TCP implementation doesn't scale well. We don't yet care much about ADoT scaling because it's mostly just used opportunistically by Google. We're currently providing it with nginx, which is not great because it can't do the queries to the authoritative server via UDP. We could consider using dnsdist for it, and maybe it would also work better for handling TCP. We'd rather just have a more efficient/scalable authoritative server with DoT built in so we could avoid another layer on top.

Going off a bit into a slightly different but related topic: we're currently using the geoip backend and we're concerned about it forcing NSEC white lies, which means attackers can force uncached DNSSEC signing, with the cache growing until it hits the limit. It also prevents resolvers from caching ranges of NXDOMAIN responses. We don't actually use the geoip backend's GeoDNS logic anymore since we fully moved to Lua records, where we can do both GeoDNS and failover together, so we could move to the bind backend instead, but that requires figuring out how to move over everything we have. We'd prefer it if the GeoIP backend just supported having static records with normal NSEC.

So, the 3 big issues we have right now are: 1) poor TCP scaling, which could perhaps be improved with dnsdist in front, but that seems fundamentally worse than better TCP handling in auth; 2) no clear best way to provide ADoT, which might be better provided via dnsdist than our current nginx approach since nginx forces using TCP behind it, though ideally auth would have a great implementation itself; 3) NSEC white lies being bad for scaling / caching.

For 3):

https://datatracker.ietf.org/doc/html/rfc8198

This seems very useful, but unfortunately with the geoip backend we lose the chance to let resolvers do this. It would be a great way of mitigating attacks which go through DNS resolvers like Google Public DNS.

thestinger commented 3 months ago

There's now an improved implementation of this DDoS protection available at https://github.com/GrapheneOS/infrastructure/blob/b21ea0a23f4d59b7774f4f2ac3dfa4cee7d2597b/nftables-ns1.conf which only uses synproxy to handle SYN packets beyond a rate limit, chosen so that the tracked SYNs below it can't exhaust the conntrack table. To avoid spoofed SYN packets counting towards the connection limit, a similar approach is used of only adding connections to the connection limit sets after they're established, but with the limits also enforced for new connections, so that in practice most connections beyond the limit get rejected before they're established. Conntrack marking is used for this instead of checking the sets for every single packet of established connections. Due to the improved approach, we've been able to deploy it for all of our TCP services instead of only specific ones such as DNS-over-TCP and DNS-over-TLS.

It's unfortunate that there are essentially no guides on doing this properly. Nearly everything tells you to apply connection limits directly to new connections (SYN packets), which will add spoofed SYN packets to the connection limit set and allow DoS in 3 new ways: exhausting the connection limit set (failing closed is even worse than failing open), exhausting the conntrack table (due to not falling back to synproxy), and targeting specific IPs to get them banned by spamming a huge number of spoofed packets from them to keep their connection limit maxed out, since spoofed entries still count towards the limit until the RST for the SYN-ACK is received even though the targeted host likely rejects the SYN-ACK packets. Hope this is useful for other people too.

We plan to add proper documentation via comments about all this, but first we had to figure out the best way to do it and deploy that to production.

One more thing to consider: for enforcing the limits on new connections, where it only consults the sets rather than adding connections to them, it may be better in some cases to drop the packets when over the connection limit rather than rejecting them. That forces a timeout, which results in falling back to another IP address even for clients that don't do so on TCP RST. Clients without happy eyeballs may hit the IPv6 /64 connection limit and give up right away when sharing a block with the bad clients, so it may be best to enforce a stricter /128 limit in addition to the /64 limit. It's mainly an issue because lots of providers only give out a /128 instead of a /64, particularly for VPS instances used for VPNs, etc.