PowerDNS / pdns

PowerDNS Authoritative, PowerDNS Recursor, dnsdist
https://www.powerdns.com/
GNU General Public License v2.0
3.7k stars 908 forks source link

When the recursor is configured to use the TCP protocol for outbound query, TCP fd leakage problem will occur #13422

Closed zjs604381586 closed 4 days ago

zjs604381586 commented 1 year ago

Short description

When the recursor is configured to use the pure TCP protocol for outbound parsing, after one parsing is completed, the TCP connection will be placed in the TCPOutConnectionManager. When the back-end authority closes the TCP connection, after a period of time more than 10s(tcp-out-max-idle-ms=10000), the houseKeeping scheduling thread executes the The connection cleaning function in TCPOutConnectionManager, with TCPOutConnectionManager::cleanup() method, but the smart pointer of Connection.d_handler is not released at this time because its reference count is 2. At this time, check the fd status on the machine, and the recursor will have a CLOSE-WAIT status. image

My troubleshooting:

According to my troubleshooting, the reference count of Connection.d_handler is increased to 2 after the first TCP read in the tcpsendrecv() function in the lwres.cc file, code is ret = arecvtcp(packet, 2, connection.d_handler, false); Causes the connection to not be closed during cleanup(). When requesting parsing out of the network again, the last connection.d_handler smart pointer will be released, and the fd in the CLOSE_WAIT state will also be closed. It feels like the Key is cached in MTasker. Is this d_eventkey Member variables?

Environment

Steps to reproduce

  1. Force recursor to use pure TCP protocol when querying authority when going out of the network

Expected behaviour

TCP fd can be closed normally, there is no fd in CLOSE_WAIT state

Actual behaviour

TCP fd failed to close normally, there is fd in CLOSE_WAIT state

Other information

omoerbeek commented 1 year ago

When I did the ougoing DoT code I made sure to verify there where no leaks. It can take a while for an TCP connection to be cleaned, but it will be closed at some point.

Are you actually seeing growing fd counts over a long period of time?

zjs604381586 commented 1 year ago

I'm forcing the use of bare TCP protocol, not DOT. When there is no second parsing of the same request, the FD status CLOSE_WAIT will always exist. The log I printed in the TCPOutConnectionManager::cleanup() function shows that cleanup has been performed.

You can change the code to use the naked TCP protocol and then dig

omoerbeek commented 1 year ago

I will do some tests soon.

omoerbeek commented 1 year ago

Here's wat I observed:

on a recursor that does nothing after a connection is marked CLOSE_WAIT, it does not get cleaned. The reason is that via PacketId and MTasker there are cases where a ref to that connection remains. Only when some queries are processed that lead to mthreads being created (so no packet cache hits), MTasker will release the ref to that PacketID, which results in actual cleanup.

Note that MTasker is a thread local object, so the activity has to be by the thread that holds the ref. That thread could be a thread that does not a lot of resolving work, such as the handler thread (aka. web+stat) thread, which only does root hints resolving and the occasional task execution.

Summary: it can take a long while, especially on a mostly idle recursor, but the connection will get cleaned in the end.

For a busy recursor, the situation does not occur

zjs604381586 commented 1 year ago

Yes, MTasker holds a reference to the smart pointer. When the worker thread makes a request, the cleanup is quite timely. If there is no parsing request, fd will be occupied

omoerbeek commented 1 year ago

Even though I do not think this is a real problem, I am thinking about triggering MTasker in housekeeping, so that on an idle recursor the CLOSE_WAIT connections and their corresponding file descriptors are cleaned up more quickly. But it is not an urgent thing.

BTW: parsing is not exactly the right term, processing is better.