auth: Improved tcp stack

RobinGeuze commented 3 years ago

Program: Authoritative
Issue type: Feature request

Short description

Currently the authoritative server uses a very simple TCP connection model. Each connection gets its own thread and that thread is discarded once the connection is closed. This works fine when dealing with a small number of connections, however TCP is seeing increasing use in DNS and as a result the number of active connections is growing. This leads to the problem that when dealing with high connection counts the actual performance decreases due to the high number of thread context switches.

To solve this a more "event" based model could be a solution. A limited number of threads that use polling to identify whether data is available. I will use this issue to work out some of the ideas for this.

Usecase

Ideally the new stack will:

assign work on a per query basis to worker threads rather than on a connection basis
only delegate full queries to worker threads (to prevent slow loris)
spread both connection acceptance and data reading from connections over a number of threads
split AXFR queries to separate workers (due to potentially long query duration)

Description

I have 2 possible "solution" directions:

Solution 1

Use a readily available event library, like libevent, libev or libuv.

Pros:

easy to implement
already tested/in use
Cons:
yet another dependency
less control

Solution 2

The other route is to build a custom solution. My proposed solution would be this:

create a multithreaded mplexer
- epoll and kqueue appear to support multithreaded use, however this should be thoroughly tested (and for epoll you probably need to use edge triggered)
- other implementations can use a simple lock structure, where connections get removed from the check array once a thread picks them up and readded once the thread is done with its processing
have a bunch of connection workers
- these use the multithreaded mplexer to wait for activity on client and listening sockets
- they will accept new sockets on the listening sockets, creating client info and inserting them into the mplexer
- they will read packet data and once a full DNS packet has been received do the initial parsing
- they can then determine whether its an *XFR query, dns update or a "normal" query and dispatch it via a simple MPMC queue to a bunch of worker threads for that type of query
worker threads poll for work from the queue and handle generating the answer. They also send the answer, using a simple mutex on the socket to prevent scrambled queries

A small variation on this solution is that you could potentially combine the connection workers and the "normal" query workers.

If you use non-blocking sockets you can do some smart logic, like setting an initial watermark of 2 bytes on the TCP connections and once the length has been read but the packet is incomplete set a watermark at the amount of bytes still needed to complete the packet.

Pros:

lots of flexibility
can be highly optimised
Cons:
might require a lot of research to get "exactly" right
complex setup, potentially prone to bugs

rgacogne commented 3 years ago

I'm not sure I understand the need for the "multithreaded mplexer" part? dnsdist does the accept() in a separate thread before dispatching the connections to workers, and we could probably get rid of that with the current architecture and directly do the accept() in the worker threads, either with a per-thread socket (reuseport) or a shared socket (which leads to some thundering herd issues, granted).

RobinGeuze commented 3 years ago

The idea behind the threaded multiplexer is basically to prevent unbalanced situations. In the reuseport scenario you can potentially have one thread thats completely choked while the others are idle. I'm not so worried about the accept part, its more that if you have a threaded multiplexer anyway why not throw the listen sockets in there as well. The main reason for the threaded multiplexer is that it prevents the theoretical scenario where a single connection is starving a certain thread (by sending huge packets in many parts for example) and thereby also affecting any other connections assigned to that thread.

A more simple solution (that might have the single thread starvation problem, but the question is, how big is the problem), would be to have a bunch of threads that have individual accept sockets and mplexers.

rgacogne commented 3 years ago

The main reason for the threaded multiplexer is that it prevents the theoretical scenario where a single connection is starving a certain thread (by sending huge packets in many parts for example) and thereby also affecting any other connections assigned to that thread.

Right, I see your point. I think we can mitigate that pretty easily by setting a maximum time/number of queries over a single TCP connection, especially since individual DNS messages are capped to 65535 bytes, but perhaps it's not enough in certain cases. The drawback of the threaded multiplexer is that, in addition to some event systems requiring a lock, we will need to be careful with more than one thread accessing the same connection, so it might require locking at the connection level as well.

RobinGeuze commented 3 years ago

Yes indeed, would be a good idea to do some performance testing. Another solution to the problem I'm describing could potentially be some form of work stealing, where work (eg connections) is redivided over the threads in case one thread is significantly more heavily loaded than the rest, however that might get complex as well.

thestinger commented 3 years ago

https://blog.cloudflare.com/the-sad-state-of-linux-socket-balancing/ is a nice article showing the issues with the available approaches on Linux. I think the FreeBSD situation is similar.

By default, nginx uses a shared socket with EPOLLEXCLUSIVE which uses LIFO rather than FIFO wakeups so it keeps assigning everything to the same thread(s) as much as possible. It's easy to verify how terrible it is with any multi-core nginx HTTPS setup using 1 worker per core. It only starts distributing load once it's already overloaded and harming latency, and it trickles down poorly to other workers. It's very bad at taking advantage of multi-core.

The available workaround is using reuseport which naively distributes them across threads and that's generally better than overloading threads. It's also lower overhead from not doing (as much) shared locking but that's only an issue at ridiculously high load. The best approach would normally be a shared socket with EPOLLEXCLUSIVE and a sadly non-existent FIFO wake mode so it would be giving connections to the workers waiting for events longest. It's sad that this isn't available.

Linux has a way to attach an eBPF program for reuseport load balancing and that's probably the best way to deal with this. You could start with direct use of naive reuseport and then migrate to a less naive form of it understanding worker load to some extent. It would be quite hard to do work stealing and I don't really think it's important as long as you had a way to adjust the incoming load to balance it out, which non-eBPF reuseport usage wouldn't be doing.

thestinger commented 3 years ago

The available workaround is using reuseport which naively distributes them across threads and that's generally better than overloading threads.

The reason it doesn't work well for Cloudflare's test may be because nginx has blocking open calls and I think they hadn't yet implemented their non-blocking open. Otherwise, it'd almost always be better to naively spread things out rather than the mess EPOLLEXCLUSIVE causes via LIFO.

RobinGeuze commented 3 years ago

As far as I can tell its not really necessary to use EPOLLEXCLUSIVE? You can just use a single epoll FD with EPOLLET set. There is a nasty corner case with ET that I still need to figure out, but mostly it should work. Also the CF article concerns multiple worker processes, whereas powerdns uses threads, which somewhat improves the situation.

thestinger commented 3 years ago

Threads vs. processes isn't really any different as far as Linux is concerned.

If there's no EPOLLEXCLUSIVE then isn't it waking up every waiting worker for new connections and they race to handle it? That spreads things out more evenly than waking only the last worker to start waiting for new events but it's way worse than just waking up one since it wastes the time of workers that were waiting for new events and it wastes CPU time.

thestinger commented 3 years ago

nginx just uses multi-process so they can use an allocator without any locking, etc. with very careful use of threads only for things like AIO thread pools carefully avoiding their normal APIs. Linux considers threads/processes to be essentially the same thing beyond a process group ID existing matching the main thread task ID. You can optionally share the address space across processes if you spawn them that way. It's just a flag for clone. It doesn't really impact stuff like epoll, etc. only futex with FUTEX_PRIVATE_FLAG, etc.

thestinger commented 3 years ago

I really think the best bet is just using reuseport and then eventually adding an eBPF program to spread them out more evenly with heuristics (thread with least connections, optionally with the threads setting an integer indicating how active they are to make it more sophisticated, etc.). It will be fastest to spread out the connections with reuseport for cases with extreme load so in the end it's going to be the best approach with a good enough heuristic. You aren't stuck with the standard hash-based load balancing. As long as the threads don't actually do blocking operations it should work well.

RobinGeuze commented 3 years ago

With EPOLLET it only wakes up one thread waiting on a epoll FD as far as I understood (an event is only returned once as far as I can tell).

The "big" problem I have with REUSEPORT is that the actual client FD's are permanently distributed over the threads, rather than all threads cooperating to process all the work.

rgacogne commented 2 years ago

The "big" problem I have with REUSEPORT is that the actual client FD's are permanently distributed over the threads, rather than all threads cooperating to process all the work.

On the other hand, not sharing the listening socket and having only one thread responsible for a connection simplifies the design significantly, and does not require any locking in the TCP workers :)

RobinGeuze commented 2 years ago

Yes. I've been thinking about this for a while and the "simplest" design I can come up with that is also properly portable across platforms would be to have a single "polling" thread that simply checks whether data is available and then dispatches it as "work" to a worker thread. The worker thread then reads whatever data is available, and potentially handles an actual query (or dispatches an AXFR query to the AXFR workers), before pushing it back into the epoll queue with a proper watermark set. This avoids the need to use edge triggered epoll, which is a lot more complicated to work with than "normal" epoll. For other implementations like kqueue and select you could have a simple locked vector with all sockets that the polling thread should also look at next time around it starts a poll interval. In theory you could split the listening socket into a separate thread since that can just use the same methodology for adding new sockets to the poll as the worker threads do (you need a bit of magic to prevent the listening socket from hogging the lock around the sockets vector, but its easy to solve).

The only thing that doesn't really fit into this model is io_uring, but I'm not sure thats an issue right now.

klaus-nicat commented 2 years ago

fyi - just a comment how other software does it (maybe it helps, maybe no): Kamailio SIP Proxy has a single process which does all the TCP handling. It handles FDs and gives them to a configureable number of worker processes only doing TCP. So a worker handles multiple TCP connections. Once a request is processed the FDs is given back to the single TCP handler process which decides what to do with the connection. If it receives again data on an open TCP connection, the FD is again given to a worker process.

RobinGeuze commented 2 years ago

So that matches pretty much the idea I outlined right?

rgacogne commented 2 years ago

For what it's worth, dnsdist has N threads accepting new connections using SO_REUSEPORT, dispatching them right after the accept to one of the M TCP worker threads. A TCP worker thread can handle a very large number of connections using the multiplexer abstraction over epoll/kqueue/devpoll/etc, and will handle a connection from beginning to end.

RobinGeuze commented 2 years ago

Yes but the difference is that dnsdist does very little work itself, its mainly a passthrough, whereas the auth needs to do actual local processing that can be quite slow, depending on the backend and the type of processing (eg AXFR is terribly slow).

rgacogne commented 2 years ago

Oh, so you want to do the actual processing of the query in the TCP workers? I thought we were going to reuse the backend threads that we already have for UDP processing for that, ie the TCP workers would pass a fully received query to a backend thread and asynchronous get a response that they would forward to the client.

omoerbeek commented 2 years ago

dnsdist also has to deal with all complexity of connection management. With the cross protocol and out-of-order handling that is also complex and way more than just pass through.

RobinGeuze commented 2 years ago

My original idea was to have a bunch of TCP workers that do all the reading and then hand of the work to some a pool of worker threads (which could be shared with UDP I guess). However, it might be simpler to have a single "poller" thread which simply checks if there is data available on a connection and then hands it off to a worker thread to actually fetch that data and handle a request if its available. The advantage of that design is that it means that during the processing the connection is fully owned by the worker thread, eg it can handle connection drops and such itself, without having to inform anyone. The downside is that it doesn't do out of order processing, but I'm not sure that is really an issue.

rgacogne commented 2 years ago

How does the poller thread know whether the full query is available without actually reading it?

RobinGeuze commented 2 years ago

It doesn't, it just hands the connection of to the worker, which reads in a non-blocking fashion whatever data is available. If it has a fully query it handles it, if it doesn't it hands it back to the poller. To prevent lots of connection passing around you can start with an initial low watermark of 2 (eg the packet length bytes) and after you have the packet length just put that as the low watermark.

rgacogne commented 2 years ago

I see. The back and forth between the poller and the worker seems a bit silly, though? Why not read the query in the poller and only pass it along once it's ready?

klaus-nicat commented 2 years ago

Maybe it is better if the readed message is already in the memory of the worker process/thread? I think it depends on how you want to optimize. Usually DNS requests are small. Hence, with a single read from the socket there should be the whole message. On the other hand that could be exploited by an attacker sending byte after byte and hence blocking the TCP worker, if the worker only handles a single TCP connection at a time.

RobinGeuze commented 2 years ago

The idea of not reading is that in that case the poller thread is solely responsible for work distribution, its basically a (very dumb) task scheduler. It might be useful to experiment with this to see what solution is most resistant to TCP attack approaches like slow-loris, connection spamming, etc...

rgacogne commented 2 years ago

Attacks like slowloris would clearly trigger a lot of context switches between the poller and the TCP worker in that model. Also note that if we ever want to support incoming DoT directly in the authoritative server, we will also need to handle having to write, and sometimes blocking on that, while reading the query.

rgacogne commented 2 years ago

(and for DoT you can forget being able to read everything in one piece, by the way)

omoerbeek commented 2 years ago

The idea of not reading is that in that case the poller thread is solely responsible for work distribution, its basically a (very dumb) task scheduler. It might be useful to experiment with this to see what solution is most resistant to TCP attack approaches like slow-loris, connection spamming, etc...

After accept it makes no sense to me to have all I/O go trough a single poll thread that does not do reading as well.

I don't see how a dedicated poller/task scheduler for all open connections benefits a connection oriented protocol. A thread doing the reads has to maintain state to handle partial reads and things like proxy protocol. That state is associated with a connected fd. So the proposed thread doing the polling has to know the state and pass the state and fd to a thread doing the reading. That thread might as well maintain the state and do the polling itself.

IMO polling and reading should be done in the same thread.

RobinGeuze commented 2 years ago

No they wouldn't context switch a lot, 2 times per valid query packet at most (assuming we ram the door shut once we get an invalid packet). Once for the first 2 bytes, and once for whatever number of bytes was specified in the length bytes, you can regulate that using a low watermark on the socket.

omoerbeek commented 2 years ago

Setting the low watermark for each packet sounds wasteful.

Habbie commented 2 years ago

IMO polling and reading should be done in the same thread.

Agreed, and that thread should not be a thread that is hogging a database connection.

thestinger commented 2 years ago

I'm sure you still want to have worker threads for serious work beyond handling the events. A worker per core with the load spread across the workers and connections remaining in a worker is the typical approach. A more sophisticated approach could have work stealing to redistribute connections since a worker may end up with a bunch of long-lived connections.

There are 3 reasonable ways to spread out the load on Linux: a socket in one thread handling new connections and distributing them (slowest), shared socket across workers with epoll with EPOLLEXCLUSIVE (still has locking in the kernel and wakes in LIFO order which is a terrible approach for spreading load) or socket per worker with REUSEPORT (round-robin load distribution without locking across the workers). The single socket connection handler approach would be 100% obsolete if they added a way to make EPOLLEXCLUSIVE use FIFO order instead of LIFO order, and then that would probably be the best approach overall. REUSEPORT is the most scalable but is oblivious to whether the workers are busy so it will happily give new work to a worker that's overloaded. epoll isn't smart but at least it only gives it to a worker that's waiting, but the standard LIFO order of EPOLLEXCLUSIVE means it overloads workers by design and then starts moving on to overloading others one by one.

Set up nginx with an automatic worker count and TLS for a load test. It uses EPOLLEXCLUSIVE by default and you can see it doesn't properly leverage multiple cores. Switch it to REUSEPORT and it scales far better and the kernel also doesn't have the locking overhead across the processes. It also has an accept mutex option which will do the same thing as EPOLLEXCLUSIVE more slowly in userspace but it uses FIFO order so the load is spread out much better. There is no best option.

REUSEPORT can be enhanced with eBPF. You can attach an eBPF load balancing program which you could teach to give connections to the workers with the least connections instead of round-robin, etc. Lots of options on how to do that. Can do much better load balancing than simply FIFO waking of an idle waiter.

thestinger commented 2 years ago

Work stealing is hard to implement since you need to avoid it adding significant overhead and need to support moving the state of a connection to another worker. REUSEPORT + work stealing is pretty much as good as it gets though.

IMO, just start with REUSEPORT since it's very simple to use and is mandatory for an ideal approach on Linux and now FreeBSD since they added SO_REUSEPORT_LB.

There is no perfect way to distribute connections in advance so there's no substitute for some form of work stealing if you want the best 99th percentile latency, etc. by dealing with the load becoming unbalanced intelligently without serializing stuff with a global lock.

thestinger commented 2 years ago

Noticed this ugly hack in nginx mainline:

https://github.com/nginx/nginx/commit/96c342e56035a9676180d03b4659d5b05b9c6b07

PowerDNS / pdns