Closed zjs604381586 closed 4 days ago
When I did the ougoing DoT code I made sure to verify there where no leaks. It can take a while for an TCP connection to be cleaned, but it will be closed at some point.
Are you actually seeing growing fd counts over a long period of time?
I'm forcing the use of bare TCP protocol, not DOT. When there is no second parsing of the same request, the FD status CLOSE_WAIT will always exist. The log I printed in the TCPOutConnectionManager::cleanup() function shows that cleanup has been performed.
You can change the code to use the naked TCP protocol and then dig
I will do some tests soon.
Here's wat I observed:
on a recursor that does nothing after a connection is marked CLOSE_WAIT
, it does not get cleaned. The reason is that via PacketId
and MTasker
there are cases where a ref to that connection remains. Only when some queries are processed that lead to mthreads being created (so no packet cache hits), MTasker
will release the ref to that PacketID
, which results in actual cleanup.
Note that MTasker
is a thread local object, so the activity has to be by the thread that holds the ref. That thread could be a thread that does not a lot of resolving work, such as the handler thread (aka. web+stat) thread, which only does root hints resolving and the occasional task execution.
Summary: it can take a long while, especially on a mostly idle recursor, but the connection will get cleaned in the end.
For a busy recursor, the situation does not occur
Yes, MTasker holds a reference to the smart pointer. When the worker thread makes a request, the cleanup is quite timely. If there is no parsing request, fd will be occupied
Even though I do not think this is a real problem, I am thinking about triggering MTasker
in housekeeping, so that on an idle recursor the CLOSE_WAIT
connections and their corresponding file descriptors are cleaned up more quickly. But it is not an urgent thing.
BTW: parsing is not exactly the right term, processing is better.
Short description
When the recursor is configured to use the pure TCP protocol for outbound parsing, after one parsing is completed, the TCP connection will be placed in the TCPOutConnectionManager. When the back-end authority closes the TCP connection, after a period of time more than 10s(
tcp-out-max-idle-ms=10000
), the houseKeeping scheduling thread executes the The connection cleaning function in TCPOutConnectionManager, withTCPOutConnectionManager::cleanup()
method, but the smart pointer of Connection.d_handler is not released at this time because its reference count is 2. At this time, check the fd status on the machine, and the recursor will have a CLOSE-WAIT status.My troubleshooting:
According to my troubleshooting, the reference count of Connection.d_handler is increased to 2 after the first TCP read in the tcpsendrecv() function in the lwres.cc file, code is
ret = arecvtcp(packet, 2, connection.d_handler, false)
; Causes the connection to not be closed during cleanup(). When requesting parsing out of the network again, the last connection.d_handler smart pointer will be released, and the fd in the CLOSE_WAIT state will also be closed. It feels like the Key is cached in MTasker. Is this d_eventkey Member variables?Environment
Steps to reproduce
Expected behaviour
TCP fd can be closed normally, there is no fd in CLOSE_WAIT state
Actual behaviour
TCP fd failed to close normally, there is fd in CLOSE_WAIT state
Other information