Open arzarif opened 3 years ago
Additional details provided from a recurrence of the same issue:
# kubectl get pods -n datadog
NAME                                            READY   STATUS     RESTARTS   AGE
datadog-agent-4jxcr                             3/3     Running    0          21m
datadog-agent-4nddm                             3/3     Running    0          21m
datadog-agent-4rq9p                             3/3     Running    0          21m
datadog-agent-6q4fj                             3/3     Running    0          21m
datadog-agent-9dm7p                             0/3     Init:0/3   0          23m
datadog-agent-9lrq2                             3/3     Running    0          21m
datadog-agent-crvgb                             3/3     Running    0          21m
datadog-agent-m6vjb                             3/3     Running    0          21m
datadog-agent-n4dqf                             3/3     Running    0          21m
datadog-agent-nkr2m                             3/3     Running    0          21m
datadog-agent-svwcl                             3/3     Running    0          21m
datadog-agent-w46db                             3/3     Running    0          21m
datadog-agent-wqt7l                             3/3     Running    0          21m
datadog-agent-wsl2t                             3/3     Running    0          21m
datadog-agent-x67tl                             3/3     Running    0          21m
datadog-cluster-agent-9f75b64bd-5f98p           1/1     Running    0          23m
datadog-cluster-agent-9f75b64bd-m94n4           1/1     Running    0          23m
datadog-cluster-checks-runner-f5f4bfbc5-mng6c   0/1     Init:0/2   0          23m
datadog-cluster-checks-runner-f5f4bfbc5-wtwdb   1/1     Running    0          23m
datadog-operator-65cc54d98b-2wlgf               1/1     Running    1          35h
# kubectl get pod -n datadog datadog-agent-9dm7p -o yaml | grep creationTimestamp:
creationTimestamp: "2021-08-04T04:28:25Z"
# kubectl get secret datadog-agent-token-mqc5g -n datadog -o yaml | grep creationTimestamp:
creationTimestamp: "2021-08-04T04:28:46Z"
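The two creationTimestamps above can be diffed directly to show that the stuck pod was created before the current token secret even existed. A small self-contained sketch (GNU `date` assumed; the timestamps are copied verbatim from the output above):

```shell
# Timestamps from the kubectl output above.
pod_ts="2021-08-04T04:28:25Z"      # stuck pod datadog-agent-9dm7p
secret_ts="2021-08-04T04:28:46Z"   # current secret datadog-agent-token-mqc5g

# Seconds between the two (GNU date; on macOS, coreutils' gdate works the same way).
delta=$(( $(date -u -d "$secret_ts" +%s) - $(date -u -d "$pod_ts" +%s) ))
echo "pod predates the current token secret by ${delta}s"
```

So the pod was scheduled roughly 21 seconds before the secret it would have needed, which is consistent with it mounting an older, since-replaced token.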
# kubectl get pod -n datadog datadog-agent-9dm7p -o yaml | grep secretName:
secretName: datadog-agent-token-mgqpt
# kubectl get pod -n datadog datadog-agent-4jxcr -o yaml | grep secretName:
secretName: datadog-agent-token-mqc5g
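The stale reference can be checked mechanically rather than by eyeballing YAML. A minimal sketch; the spec fragment below is reproduced from this report (in a live cluster it would come from `kubectl get pod -o yaml`), and the comparison value is the secret the healthy pod datadog-agent-4jxcr mounts:

```shell
# Pod spec fragment for the stuck pod, as shown above.
pod_yaml='    secret:
      secretName: datadog-agent-token-mgqpt'

# The token secret the healthy pods mount (from datadog-agent-4jxcr above).
current_secret="datadog-agent-token-mqc5g"

# Extract the secretName the stuck pod references and compare.
referenced=$(printf '%s\n' "$pod_yaml" | awk '/secretName:/ {print $2}')
if [ "$referenced" != "$current_secret" ]; then
  echo "stale token: pod mounts $referenced, current secret is $current_secret"
fi
```

The same comparison could be looped over every agent pod to find all pods still pointing at a deleted token secret.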
Hey @arzarif !
Thanks for opening this issue and sorry for the trouble. We're looking into it, but it would be super useful for us if you could share any interesting operator logs corresponding to the timeframe when this happened.
In any case, recreating the agent after an operator restart is definitely not the intended behavior and shouldn't happen (simply deleting the operator pod does not trigger it), but there might be an issue somewhere else in the reconciliation logic.
Thank you!
Describe what happened: I received a bunch of "no data" alerts from a cluster with agents deployed using this operator (v0.6.0). I checked the agents and several were stuck in an init phase:
It was clear that the operator had restarted 3 hours prior, which triggered a redeployment of all child resources associated with our DatadogAgent object during reconciliation.
The error from the operator logs appears to indicate some type of temporary connectivity problem with the Kubernetes API:
Several of the newly-created pods were still referencing an old service account token, which had also been redeployed:
It appears that datadog-agent-token-xhnfh was the old token, whereas the current one is actually datadog-agent-token-2jdhr.
Describe what you expected: The behavior we're seeing deviates from what's expected in a couple of ways:
First, I wouldn't expect a redeployment of all resources in the event of a temporary problem accessing the Kubernetes API (if I'm assessing that as the root cause of this event correctly). Second, I wouldn't expect newly-created pods to have specs that reference the old service account secret.
Steps to reproduce the issue:
Additional environment details (Operating System, Cloud provider, etc): Ubuntu 18.04 Azure AKS (1.19.7) Datadog Operator v0.6.0