aanm opened 1 month ago
This is a concurrent read/write on `Conn.UDPSize` in the dns library. It has not surfaced before because we did not previously try to read and write on a `Conn` at the same time.
The specific write above indicates that the forwarded DNS request was a DNSSec one, as on the current main branch the line `vendor/github.com/cilium/dns/client.go:242` is:
```go
func (c *Client) SendContext(ctx context.Context, m *Msg, co *Conn, t time.Time) error {
	opt := m.IsEdns0()
	// If EDNS0 is used use that for size.
	if opt != nil && opt.UDPSize() >= MinMsgSize {
		co.UDPSize = opt.UDPSize() // <<<< HERE
	}
	// Otherwise use the client's configured UDP size.
	if opt == nil && c.UDPSize >= MinMsgSize {
		co.UDPSize = c.UDPSize
	}
```
In order to fix this we may have to assume an as-large-as-possible UDP buffer size (64k, or as large as ever needed, maybe 4k) for the shared client, and change the setting of `Conn.UDPSize` to happen only when the `Conn` is first created. `Client.DialContext()` already sets `Conn.UDPSize`, so if it is set on the `Client` we could change `SendContext` to only change `Conn.UDPSize` if it is not already big enough. Besides, for the shared client, setting this value after the start of the receive loop is not going to take effect anyway.
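The "only grow" rule could be expressed roughly like this (a sketch; `growUDPSize` is a hypothetical helper, not part of the library's API):

```go
package main

import "fmt"

const MinMsgSize = 512

// growUDPSize sketches the "only grow" rule: never shrink the buffer size
// of a shared Conn, so it keeps the largest size any query has advertised.
// (Hypothetical helper, not part of the library.)
func growUDPSize(current, requested uint16) uint16 {
	if requested >= MinMsgSize && requested > current {
		return requested
	}
	return current
}

func main() {
	size := uint16(MinMsgSize)
	size = growUDPSize(size, 4096) // EDNS0 query advertising 4k
	size = growUDPSize(size, 512)  // a later plain query must not shrink it
	fmt.Println(size)
}
```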
I've taken a look and written down what I've understood.
A `dns.Conn` holds state related to a single DNS exchange (query/response). Specifically, it contains a `UDPSize` field, which determines the size of the buffer allocated to read the response from the OS. If this buffer is too small, `conn.Read(buf)` will silently truncate the message to fit the buffer. The `conn.UDPSize` field is set when sending the query. DNS messages are normally limited to 512 bytes, but DNS has an extension mechanism (called EDNS0) which allows increasing this limit to 65K. Implementation-wise, this sets `conn.UDPSize` to the specified maximal supported buffer size, so that a large enough buffer is allocated when receiving the response message.
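The truncation behaviour can be illustrated with a small simulation (`readInto` is a stand-in for a UDP `conn.Read`, not real library code):

```go
package main

import "fmt"

// readInto simulates conn.Read(buf) on a UDP socket: at most len(buf)
// bytes of the datagram are copied and the remainder is silently dropped.
func readInto(buf, datagram []byte) int {
	return copy(buf, datagram)
}

func main() {
	resp := make([]byte, 1024) // an EDNS0 response larger than 512 bytes
	buf := make([]byte, 512)   // buffer sized from conn.UDPSize
	n := readInto(buf, resp)
	fmt.Printf("read %d of %d bytes\n", n, len(resp)) // the tail is lost
}
```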
Our shared client shares a single `dns.Conn` between multiple goroutines, each responsible for a single DNS exchange. When sharing the `dns.Conn` object, the `UDPSize` field is written by multiple goroutines. The race detector has flagged this unsynchronised read/write of the `UDPSize` field. Given the right sequence of events, the shared client's reading goroutine reads from the network with a buffer that is too small for the incoming message, leading to truncation.
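The shape the race detector flags is essentially the following (a toy reproduction, not the library code; running this pattern under `go run -race` typically reports a data race):

```go
package main

import (
	"fmt"
	"sync"
)

type conn struct{ UDPSize uint16 }

// raceyUpdate reproduces the flagged shape: one goroutine writes UDPSize
// while another reads it to size a receive buffer, with no synchronisation
// between the two accesses.
func raceyUpdate() uint16 {
	c := &conn{UDPSize: 512}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.UDPSize = 4096 }()            // sending goroutine
	go func() { defer wg.Done(); _ = make([]byte, c.UDPSize) }() // reading goroutine
	wg.Wait()
	return c.UDPSize
}

func main() {
	fmt.Println(raceyUpdate())
}
```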
Furthermore, `dns.Conn` also contains DNSSec state, the so-called `Tsig` state. This state is also written to when sending a query. I don't know DNSSec, but it seems unlikely that sharing this state works as designed.
I don't know whether mixing message sizes is common (or even allowed), or how common usage of EDNS0 is. It seems to be used in DNSSec, however. On the other hand, I don't know of people using DNSSec inside k8s clusters, and I'm not sure CoreDNS would handle forwarding DNSSec requests either.
To make the race detector happy, it would AFAICT be enough to change `UDPSize` to be an atomic uint. However, this does not solve the actual problem, since it is still possible for an undersized buffer to be allocated. Consider the following scenario:
1. A normal query is sent, setting UDPSize to 512
2. The client allocates a 512 byte buffer to read the response, and blocks on reading
3. A query is sent specifying large buffers, setting UDPSize to 4096 (since we are sharing the conn, no other read buffer allocation occurs until a message is received)
4. The server sends back a large response (say 1024 bytes) to the second query.
5. Reading unblocks, and attempts to read 1024 bytes into the 512 byte buffer, truncating the message.
Since the buffer allocation and blocking read occur before the second query is even sent, the atomicity of `UDPSize` is not sufficient to prevent the issue, even though the race detector would AFAIK be okay with reads/writes to an atomic.
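A compressed replay of the five steps, with `UDPSize` made atomic, shows why (a sketch with a hypothetical `conn` type, the copy standing in for the blocked read completing):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type conn struct{ udpSize atomic.Uint32 }

// simulate replays the scenario above with an atomic UDPSize: the race
// detector is satisfied, but the buffer was already allocated at 512 bytes
// before the larger size was stored, so the 1024-byte response still truncates.
func simulate() (got, want int) {
	c := &conn{}
	c.udpSize.Store(512)                  // step 1: normal query sent
	buf := make([]byte, c.udpSize.Load()) // step 2: 512-byte buffer, read blocks
	c.udpSize.Store(4096)                 // step 3: second query advertises 4096
	resp := make([]byte, 1024)            // step 4: large response arrives
	return copy(buf, resp), len(resp)     // step 5: truncated read
}

func main() {
	got, want := simulate()
	fmt.Printf("read %d of %d bytes\n", got, want)
}
```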
As mentioned above, one valid approach is to always allocate 65K buffers. This is pretty wasteful. A memory/CPU tradeoff can be made by having a single 65K buffer per conn and copying the actual data once the actual length is known.
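That tradeoff could look something like this (a sketch; the `conn` type and `receive` helper are hypothetical, and the scratch buffer is assumed to be owned by the single receive loop):

```go
package main

import "fmt"

const maxUDPSize = 65535

// conn owns a single 65K scratch buffer; every datagram is read into it
// and then copied out into a right-sized message, trading one copy for
// not allocating 65K per in-flight exchange.
type conn struct{ scratch [maxUDPSize]byte }

func (c *conn) receive(read func([]byte) int) []byte {
	n := read(c.scratch[:]) // always enough room, so never truncates
	msg := make([]byte, n)  // allocate only what was actually received
	copy(msg, c.scratch[:n])
	return msg
}

func main() {
	c := &conn{}
	fromNetwork := func(buf []byte) int { return copy(buf, make([]byte, 1024)) }
	fmt.Println(len(c.receive(fromNetwork)))
}
```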
A second approach: as far as I understand the DNS RFCs, as a proxy we would also be allowed to clamp the advertised max buffer size to what the proxy can handle. I don't like semantically changing the data we're proxying, though.
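The clamping itself is trivial; in the forked library it would presumably mean rewriting the size on the forwarded message's OPT record before sending. A sketch (`clampUDPSize` and `maxProxySize` are hypothetical):

```go
package main

import "fmt"

const maxProxySize = 4096 // hypothetical limit this proxy can receive

// clampUDPSize sketches the clamping approach: before forwarding, lower
// the EDNS0-advertised buffer size to the proxy's own limit, leaving
// smaller advertisements untouched.
func clampUDPSize(advertised, limit uint16) uint16 {
	if advertised > limit {
		return limit
	}
	return advertised
}

func main() {
	fmt.Println(clampUDPSize(65535, maxProxySize)) // oversized advert is lowered
	fmt.Println(clampUDPSize(1232, maxProxySize))  // smaller advert passes through
}
```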
A third solution: large responses can only come from the server after we have sent a query which advertises a large buffer. Hence, we can locally delay sending such a query, abort a potential pending read and restart it with a large enough buffer, and only then send out the query. As long as we ensure that the buffer size only grows per conn, I believe this is sufficient. It's not possible to implement this using the existing interfaces of the DNS library, but given we have a fork already, we can adapt the interfaces to fit.
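The core of that approach might be a grow-before-send check like the following (a sketch with a hypothetical `sharedConn`; in a real implementation the `restartRead` signal would abort the pending read, e.g. via a deadline, before the query goes out):

```go
package main

import (
	"fmt"
	"sync"
)

// sharedConn sketches the third approach: the buffer size only ever grows,
// and a sender must wait for the receive loop to restart its read with a
// bigger buffer before sending a query that advertises the larger size.
type sharedConn struct {
	mu      sync.Mutex
	udpSize uint16
}

// ensureSize grows the buffer size monotonically and reports whether the
// receive loop must restart its pending read with a bigger buffer.
func (c *sharedConn) ensureSize(want uint16) (restartRead bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if want > c.udpSize {
		c.udpSize = want
		return true
	}
	return false
}

func main() {
	c := &sharedConn{udpSize: 512}
	fmt.Println(c.ensureSize(4096)) // first large query: read must restart
	fmt.Println(c.ensureSize(512))  // smaller queries never shrink the size
}
```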
The DNSSec state remains an issue even after solving the buffer sizing. In the spirit of solving what is reported instead of going after theoretical issues, I'd argue for ignoring it until reports of breakage come in, with a description of how DNSSec is used. Ideally, though, we avoid race detector hits if possible.
seen with 54796b085eb2f8c695db74cef95521e732effacf