DataDog / datadog-agent

Main repository for Datadog Agent
https://docs.datadoghq.com/
Apache License 2.0
2.84k stars 1.2k forks source link

Data races due to append calls in pkg/trace/info/stats.go #11565

Open deadok22 opened 2 years ago

deadok22 commented 2 years ago

There are data races due to these append() calls in pkg/trace/info/stats.go.

The append(tags, ...) calls linked to above can modify the array backing the tags slice, if the array's capacity is larger than the slice's size, while the code backing metrics.Count reads the same array. Note that the code backing metrics.Count may access the len(tags)-th array element while it's being written to by another goroutine. This can lead to all sorts of trouble (reading arbitrary memory, segfaults, etc.).

deadok22 commented 2 years ago

Here's a Go playground link with a simplified example of the same flavor of a data race.

package main

const (
    allowedThirdValue1 = "a-third-value"
    allowedThirdValue2 = "another-third-value"
)

func checkTheThird(tags chan []string) {
    for {
        v := (<-tags)[2]
        switch v {
        case allowedThirdValue1, allowedThirdValue2:
        default:
            panic("read something unexpected: " + v)
        }
    }
}

func main() {
    tags := make([]string, 0, 3)
    tags = append(tags, "one", "two")

    tagsCh := make(chan []string, 42)
    go checkTheThird(tagsCh)
    for {
        tagsCh <- append(tags, allowedThirdValue1)
        tagsCh <- append(tags, allowedThirdValue2)
    }
}
knusbaum commented 2 years ago

@deadok22 Yes, we see the issue with this. Thanks for the report. We'll put up a fix shortly and release with the next version.

deadok22 commented 2 years ago

We upgraded our fleet from 7.33.1 to 7.36.1. The races described here broke the tagging of the datadog.trace_agent.normalizer.spans_malformed and datadog.trace_agent.normalizer.traces_dropped metrics (they now fairly reliably get no reason tag set, but do get the priority tags added for the datadog.trace_agent.receiver.traces_priority metrics).

na

deadok22 commented 2 years ago

I dug a bit deeper and found the races manifested themselves due to the upgrade of datadog-go to v5.1.0 in https://github.com/DataDog/datadog-agent/pull/10547. v5 enables client-side aggregation by default and up until v5.1.1 it retained the tags slice - this is fixed in https://github.com/DataDog/datadog-go/commit/2179b317ef1a338aa64f57a3b5042c387e894b19.

Upgrading datadog-go to v5.1.1 will fix the data races described here.