aanm opened 1 month ago
This is a concurrent read/write on `Conn.UDPSize` in the dns library. It has not surfaced before because we did not previously try to read and write on a `Conn` at the same time.
The specific write above indicates that the forwarded DNS request was a DNSSec one, as on the current main branch the line `vendor/github.com/cilium/dns/client.go:242` is:
```go
func (c *Client) SendContext(ctx context.Context, m *Msg, co *Conn, t time.Time) error {
	opt := m.IsEdns0()
	// If EDNS0 is used use that for size.
	if opt != nil && opt.UDPSize() >= MinMsgSize {
		co.UDPSize = opt.UDPSize() // <<<< HERE
	}
	// Otherwise use the client's configured UDP size.
	if opt == nil && c.UDPSize >= MinMsgSize {
		co.UDPSize = c.UDPSize
	}
```
In order to fix this we may have to assume an as-large-as-possible UDP buffer size (64k, or as large as ever needed, maybe 4k) for the shared client, and change the setting of `Conn.UDPSize` to happen only when the `Conn` is first created. `Client.DialContext()` already sets `Conn.UDPSize`, so if it is set on the `Client` we could change `SendContext` to only change `Conn.UDPSize` if it is not already big enough. Besides, for the shared client, setting this value after the start of the receive loop is not going to take effect anyway.
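The "only grow" rule could be expressed roughly like this (a sketch; `growUDPSize` is a hypothetical helper, not part of the library's API):

```go
package main

import "fmt"

const MinMsgSize = 512

// growUDPSize sketches the "only grow" rule: never shrink the buffer size
// of a shared Conn, so it keeps the largest size any query has advertised.
// (Hypothetical helper, not part of the library.)
func growUDPSize(current, requested uint16) uint16 {
	if requested >= MinMsgSize && requested > current {
		return requested
	}
	return current
}

func main() {
	size := uint16(MinMsgSize)
	size = growUDPSize(size, 4096) // EDNS0 query advertising 4k
	size = growUDPSize(size, 512)  // a later plain query must not shrink it
	fmt.Println(size)
}
```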
I've taken a look and written down what I've understood.
A `dns.Conn` holds state related to a single DNS exchange (query/response). Specifically, it contains a `UDPSize` field, which determines the size of the buffer allocated to read the response from the OS. If this buffer is too small, `conn.Read(buf)` will silently truncate the message to fit the buffer. The `conn.UDPSize` field is set when sending the query. DNS messages are normally limited to 512 bytes, but DNS has an extension mechanism (called EDNS0) which allows increasing this limit to 65K. Implementation-wise, this sets `conn.UDPSize` to the specified maximal supported buffer size, so that a large enough buffer is allocated when receiving the response message.
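The truncation behaviour can be illustrated with a small simulation (`readInto` is a stand-in for a UDP `conn.Read`, not real library code):

```go
package main

import "fmt"

// readInto simulates conn.Read(buf) on a UDP socket: at most len(buf)
// bytes of the datagram are copied and the remainder is silently dropped.
func readInto(buf, datagram []byte) int {
	return copy(buf, datagram)
}

func main() {
	resp := make([]byte, 1024) // an EDNS0 response larger than 512 bytes
	buf := make([]byte, 512)   // buffer sized from conn.UDPSize
	n := readInto(buf, resp)
	fmt.Printf("read %d of %d bytes\n", n, len(resp)) // the tail is lost
}
```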
Our shared client shares a single `dns.Conn` between multiple goroutines, each responsible for a single DNS exchange. When sharing the `dns.Conn` object, the `UDPSize` field is written by multiple goroutines. The race detector has flagged this unsynchronised read/write of the `UDPSize` field. Given the right sequence of events, the shared client's reading goroutine reads from the network with a buffer that is too small for the incoming message, leading to truncation.
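The shape the race detector flags is essentially the following (a toy reproduction, not the library code; running this pattern under `go run -race` typically reports a data race):

```go
package main

import (
	"fmt"
	"sync"
)

type conn struct{ UDPSize uint16 }

// raceyUpdate reproduces the flagged shape: one goroutine writes UDPSize
// while another reads it to size a receive buffer, with no synchronisation
// between the two accesses.
func raceyUpdate() uint16 {
	c := &conn{UDPSize: 512}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { defer wg.Done(); c.UDPSize = 4096 }()            // sending goroutine
	go func() { defer wg.Done(); _ = make([]byte, c.UDPSize) }() // reading goroutine
	wg.Wait()
	return c.UDPSize
}

func main() {
	fmt.Println(raceyUpdate())
}
```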
Furthermore, `dns.Conn` also contains DNSSec state, the so-called `Tsig` state. This state is also written to when sending a query. I don't know DNSSec, but it seems unlikely that sharing this state works as designed.
I don't know whether mixing message sizes is common (or even allowed), or how common usage of EDNS0 is. It seems to be used in DNSSec, however. On the other hand, I don't know of people using DNSSec inside k8s clusters, and I'm not sure CoreDNS would handle forwarding DNSSec requests either.
To make the race detector happy, it would AFAICT be enough to change `UDPSize` to be an atomic uint. However, this does not solve the actual problem, since it is still possible for an undersized buffer to be allocated. Consider the following scenario:
1. A normal query is sent, setting UDPSize to 512
2. The client allocates a 512 byte buffer to read the response, and blocks on reading
3. A query is sent specifying large buffers, setting UDPSize to 4096 (since we are sharing the conn, no other read buffer allocation occurs until a message is received)
4. The server sends back a large response (say 1024 bytes) to the second query.
5. Reading unblocks, and attempts to read 1024 bytes into the 512 byte buffer, truncating the message.
Since the buffer allocation and blocking read occur before the second query is even sent, the atomicity of `UDPSize` is not sufficient to prevent the issue, even though the race detector would AFAIK be okay with reads/writes to an atomic.
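A compressed replay of the five steps, with `UDPSize` made atomic, shows why (a sketch with a hypothetical `conn` type, the copy standing in for the blocked read completing):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type conn struct{ udpSize atomic.Uint32 }

// simulate replays the scenario above with an atomic UDPSize: the race
// detector is satisfied, but the buffer was already allocated at 512 bytes
// before the larger size was stored, so the 1024-byte response still truncates.
func simulate() (got, want int) {
	c := &conn{}
	c.udpSize.Store(512)                  // step 1: normal query sent
	buf := make([]byte, c.udpSize.Load()) // step 2: 512-byte buffer, read blocks
	c.udpSize.Store(4096)                 // step 3: second query advertises 4096
	resp := make([]byte, 1024)            // step 4: large response arrives
	return copy(buf, resp), len(resp)     // step 5: truncated read
}

func main() {
	got, want := simulate()
	fmt.Printf("read %d of %d bytes\n", got, want)
}
```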
As mentioned above, one valid approach is to always allocate 65K buffers. This is pretty wasteful. A memory/CPU tradeoff can be made by having a single 65K buffer per conn and copying the actual data once the actual length is known.
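That tradeoff could look something like this (a sketch; the `conn` type and `receive` helper are hypothetical, and the scratch buffer is assumed to be owned by the single receive loop):

```go
package main

import "fmt"

const maxUDPSize = 65535

// conn owns a single 65K scratch buffer; every datagram is read into it
// and then copied out into a right-sized message, trading one copy for
// not allocating 65K per in-flight exchange.
type conn struct{ scratch [maxUDPSize]byte }

func (c *conn) receive(read func([]byte) int) []byte {
	n := read(c.scratch[:]) // always enough room, so never truncates
	msg := make([]byte, n)  // allocate only what was actually received
	copy(msg, c.scratch[:n])
	return msg
}

func main() {
	c := &conn{}
	fromNetwork := func(buf []byte) int { return copy(buf, make([]byte, 1024)) }
	fmt.Println(len(c.receive(fromNetwork)))
}
```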
A second approach: as far as I understand the DNS RFCs, as a proxy we would also be allowed to clamp the advertised max buffer size to what the proxy can handle. I don't like semantically changing the data we're proxying, though.
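The clamping itself is trivial; in the forked library it would presumably mean rewriting the size on the forwarded message's OPT record before sending. A sketch (`clampUDPSize` and `maxProxySize` are hypothetical):

```go
package main

import "fmt"

const maxProxySize = 4096 // hypothetical limit this proxy can receive

// clampUDPSize sketches the clamping approach: before forwarding, lower
// the EDNS0-advertised buffer size to the proxy's own limit, leaving
// smaller advertisements untouched.
func clampUDPSize(advertised, limit uint16) uint16 {
	if advertised > limit {
		return limit
	}
	return advertised
}

func main() {
	fmt.Println(clampUDPSize(65535, maxProxySize)) // oversized advert is lowered
	fmt.Println(clampUDPSize(1232, maxProxySize))  // smaller advert passes through
}
```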
A third solution: large responses can only come from the server after we have sent a query which advertises a large buffer. Hence, we can locally delay sending such a query, abort a potential pending read and restart it with a large enough buffer, and only then send out the query. As long as we ensure that the buffer size only grows per conn, I believe this is sufficient. It's not possible to implement this using the existing interfaces of the DNS library, but given we have a fork already, we can adapt the interfaces to fit.
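The core of that approach might be a grow-before-send check like the following (a sketch with a hypothetical `sharedConn`; in a real implementation the `restartRead` signal would abort the pending read, e.g. via a deadline, before the query goes out):

```go
package main

import (
	"fmt"
	"sync"
)

// sharedConn sketches the third approach: the buffer size only ever grows,
// and a sender must wait for the receive loop to restart its read with a
// bigger buffer before sending a query that advertises the larger size.
type sharedConn struct {
	mu      sync.Mutex
	udpSize uint16
}

// ensureSize grows the buffer size monotonically and reports whether the
// receive loop must restart its pending read with a bigger buffer.
func (c *sharedConn) ensureSize(want uint16) (restartRead bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if want > c.udpSize {
		c.udpSize = want
		return true
	}
	return false
}

func main() {
	c := &sharedConn{udpSize: 512}
	fmt.Println(c.ensureSize(4096)) // first large query: read must restart
	fmt.Println(c.ensureSize(512))  // smaller queries never shrink the size
}
```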
The DNSSec state remains an issue even after solving the buffer sizing. In the spirit of solving what is reported instead of going after theoretical issues, I'd argue for ignoring it until reports of breakage come in, with a description of how DNSSec is used. Ideally, though, we avoid race detector hits if possible.
seen with 54796b085eb2f8c695db74cef95521e732effacf