diranged opened 2 weeks ago
So... reverting to the 1.7.0 operator and the 1.8.6 chart didn't resolve the issue. After over an hour of debugging, including fully deleting the DatadogAgent resource, it still didn't resolve. Then it magically resolved on its own:
I can't fathom what happened. I spent time digging through the code at https://github.com/DataDog/datadog-operator/blob/v1.7.0/controllers/datadogagent/controller_reconcile_v2_common.go#L155-L207, and I can only think that something odd is happening with the setting of the `needsUpdate` variable at https://github.com/DataDog/datadog-operator/blob/v1.7.0/controllers/datadogagent/controller_reconcile_v2_common.go#L194. Unfortunately, the operator emits no logs that would indicate what difference it was seeing.
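For context, operators in this family commonly decide whether a child resource needs an update by hashing the desired spec and comparing it against a hash stored on the live object. Below is a minimal sketch of that general pattern, not the operator's actual code; the annotation key `specHashAnnotation` and both function names are hypothetical:

```go
package main

import (
	"crypto/md5"
	"encoding/json"
	"fmt"
)

// specHashAnnotation is a hypothetical annotation key; the real operator
// uses its own key to record the spec hash on the live object.
const specHashAnnotation = "example.com/spec-hash"

// hashSpec serializes the desired spec and returns its MD5 digest.
func hashSpec(spec any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return fmt.Sprintf("%x", md5.Sum(raw)), nil
}

// needsUpdate compares the hash of the desired spec against the hash
// recorded in the live object's annotations. Any drift between the two,
// e.g. fields defaulted or mutated server-side that feed back into the
// desired spec, keeps this returning true on every pass, which is one
// way a reconciliation loop like the one described here can arise.
func needsUpdate(liveAnnotations map[string]string, desiredSpec any) (bool, error) {
	desiredHash, err := hashSpec(desiredSpec)
	if err != nil {
		return false, err
	}
	return liveAnnotations[specHashAnnotation] != desiredHash, nil
}

func main() {
	live := map[string]string{specHashAnnotation: "stale-hash"}
	update, _ := needsUpdate(live, map[string]string{"image": "agent:7"})
	fmt.Println("needsUpdate:", update) // true: stored hash differs from desired
}
```

If something in that comparison never converges, the controller would keep "updating" without any visible change on the resource, matching the behavior described above.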
OK... so I checked out our other clusters (which didn't have this upgrade done), and apparently this pattern just happens periodically throughout the day!
Describe what happened: We're testing the `datadog-operator` helm chart upgrade (`datadog-operator` 1.8.6...2.0.0), and we're seeing a new behavior in the operator pods: they are in a reconciliation loop on the Datadog Agent DaemonSet.

I have tried monitoring the `datadog-agent` DaemonSet using `kubectl get daemonset datadog-agent -o json -w`, and I see zero changes being made to the resource... so this appears to be an internal reconciliation loop that isn't actually making changes via the API.

When we look at the audit logs, we see a dramatic increase in the number of requests per second being made (though its absolute value isn't insane, the change is significant):
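For anyone wanting to verify the same thing programmatically, a client-go watch like the sketch below prints every event on the DaemonSet along with its resourceVersion and generation; if the operator were really writing to the API, MODIFIED events would stream here. This is a minimal sketch under assumptions (namespace `default`, kubeconfig in the default home location), not part of the operator:

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumption: kubeconfig in the default location; adjust as needed.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Watch only the datadog-agent DaemonSet (the namespace is an assumption).
	watcher, err := clientset.AppsV1().DaemonSets("default").Watch(context.Background(),
		metav1.ListOptions{FieldSelector: "metadata.name=datadog-agent"})
	if err != nil {
		panic(err)
	}
	defer watcher.Stop()

	// A bumped resourceVersion on a MODIFIED event means a real API write;
	// silence here means the loop is purely internal to the operator.
	for event := range watcher.ResultChan() {
		ds, ok := event.Object.(*appsv1.DaemonSet)
		if !ok {
			continue
		}
		fmt.Printf("%s resourceVersion=%s generation=%d\n",
			event.Type, ds.ResourceVersion, ds.Generation)
	}
}
```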