DataDog / datadog-operator

Kubernetes Operator for Datadog Resources

Possible race condition involving service account token during reconciliation #347

Open arzarif opened 3 years ago

arzarif commented 3 years ago

Describe what happened: I received a bunch of "no data" alerts from a cluster with agents deployed using this operator (v0.6.0). I checked the agents and several were stuck in an init phase:

# kubectl get pods -n datadog
NAME                                             READY   STATUS     RESTARTS   AGE
datadog-agent-4xncr                              3/3     Running    0          3h12m
datadog-agent-6nkd4                              3/3     Running    0          3h12m
datadog-agent-8pfkn                              3/3     Running    0          3h12m
datadog-agent-cggv9                              0/3     Init:0/3   0          3h13m
datadog-agent-ckxhp                              3/3     Running    0          3h12m
datadog-agent-ddlmq                              0/3     Init:0/3   0          3h13m
datadog-agent-dw2gc                              0/3     Init:0/3   0          3h13m
datadog-agent-hc8tk                              0/3     Init:0/3   0          3h13m
datadog-agent-hj8jv                              0/3     Init:0/3   0          3h13m
datadog-agent-kg9tw                              0/3     Pending    0          3h12m
datadog-agent-l8p5f                              0/3     Init:0/3   0          3h13m
datadog-agent-v6bvt                              3/3     Running    0          3h12m
datadog-agent-w59gz                              3/3     Running    0          3h12m
datadog-agent-x7ssf                              0/3     Init:0/3   0          3h13m
datadog-agent-x8wgx                              3/3     Running    0          3h12m
datadog-cluster-agent-b6b54c6c8-4jpzh            1/1     Running    0          3h13m
datadog-cluster-agent-b6b54c6c8-7k2ps            1/1     Running    0          3h13m
datadog-cluster-checks-runner-57689b4869-7qddl   1/1     Running    0          3h13m
datadog-cluster-checks-runner-57689b4869-q8ldw   1/1     Running    0          3h13m
datadog-operator-65cc54d98b-6w2bl                1/1     Running    9          80d

The operator pod's restart count and the agents' ages made it clear that the operator had restarted roughly 3 hours earlier, which triggered a redeployment of all child resources associated with our DatadogAgent object during reconciliation.
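
For reference, the recreation can be confirmed by comparing the creation timestamps of the child resources against the time of the operator restart. A rough check, assuming the default resource names used in our cluster (datadog-agent for both the DaemonSet and the ServiceAccount):

# kubectl get daemonset datadog-agent -n datadog -o jsonpath='{.metadata.creationTimestamp}'
# kubectl get serviceaccount datadog-agent -n datadog -o jsonpath='{.metadata.creationTimestamp}'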

The error from the operator logs appears to indicate some type of temporary connectivity problem with the Kubernetes API:

# kubectl logs  datadog-operator-65cc54d98b-6w2bl -n datadog -p
 ...
 ...
 ...
{"level":"ERROR","ts":"2021-08-03T20:11:51Z","logger":"klog","msg":"Failed to update lock: Put \"https://10.239.224.1:443/api/v1/namespaces/datadog/configmaps/datadog-operator-lock\": context deadline exceeded\n"}
{"level":"INFO","ts":"2021-08-03T20:11:51Z","logger":"klog","msg":"failed to renew lease datadog/datadog-operator-lock: timed out waiting for the condition\n"}
{"level":"ERROR","ts":"2021-08-03T20:11:51Z","logger":"setup","msg":"Problem running manager","error":"leader election lost"}

Several of the newly-created pods were still referencing an old service account token, which had also been recreated:

# kubectl describe pod datadog-agent-cggv9 -n datadog
...
...
...

  Events:
  Type     Reason       Age                     From     Message
  ----     ------       ----                    ----     -------
  Warning  FailedMount  9m11s (x94 over 3h14m)  kubelet  MountVolume.SetUp failed for volume "datadog-agent-token-xhnfh" : secret "datadog-agent-token-xhnfh" not found
  Warning  FailedMount  4m3s (x83 over 178m)    kubelet  (combined from similar events): Unable to attach or mount volumes: unmounted volumes=[datadog-agent-token-xhnfh], unattached volumes=[src dsdsocket seccomp-root checksd cgroups config debugfs host-osrelease procdir installinfo datadog-agent-token-xhnfh passwd modules confd datadog-agent-auth datadog-agent-security logdatadog sysprobe-socket-dir system-probe-config runtimesocketdir]: timed out waiting for the condition
# kubectl get secret -n datadog | grep datadog-agent-token-
datadog-agent-token-2jdhr                   kubernetes.io/service-account-token   3      3h14m

It appears that datadog-agent-token-xhnfh was the old token, whereas the current one is actually datadog-agent-token-2jdhr.
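
To see how widespread the stale reference is, each agent pod's mounted token secret can be listed and compared against the secrets that actually exist. A quick sketch (the jsonpath wildcard skips volumes that aren't secret-backed):

for pod in $(kubectl get pods -n datadog -o name | grep datadog-agent-); do
  printf '%s -> ' "$pod"
  kubectl get "$pod" -n datadog -o jsonpath='{.spec.volumes[*].secret.secretName}'
  echo
done
kubectl get secrets -n datadog | grep service-account-token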

Describe what you expected: The behavior we're seeing deviates from what's expected in a couple of ways:

First, I wouldn't expect a redeployment of all resources in the event of a temporary problem accessing the Kubernetes API (if I'm assessing the root cause of this event correctly). Second, I wouldn't expect newly-generated pods to have specs that reference the old service account secret.
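
For context on the second point: on this cluster version, the token volume is injected into a pod spec at pod-creation time from the service account's secrets list, so whatever secret the datadog-agent ServiceAccount referenced at that moment is what the pods pin. The current token can be checked with:

# kubectl get serviceaccount datadog-agent -n datadog -o jsonpath='{.secrets[*].name}'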

Steps to reproduce the issue:

Additional environment details (Operating System, Cloud provider, etc.): Ubuntu 18.04, Azure AKS (1.19.7), Datadog Operator v0.6.0

arzarif commented 3 years ago

Additional details provided from a recurrence of the same issue:

# kubectl get pods -n datadog
NAME                                            READY   STATUS     RESTARTS   AGE
datadog-agent-4jxcr                             3/3     Running    0          21m
datadog-agent-4nddm                             3/3     Running    0          21m
datadog-agent-4rq9p                             3/3     Running    0          21m
datadog-agent-6q4fj                             3/3     Running    0          21m
datadog-agent-9dm7p                             0/3     Init:0/3   0          23m
datadog-agent-9lrq2                             3/3     Running    0          21m
datadog-agent-crvgb                             3/3     Running    0          21m
datadog-agent-m6vjb                             3/3     Running    0          21m
datadog-agent-n4dqf                             3/3     Running    0          21m
datadog-agent-nkr2m                             3/3     Running    0          21m
datadog-agent-svwcl                             3/3     Running    0          21m
datadog-agent-w46db                             3/3     Running    0          21m
datadog-agent-wqt7l                             3/3     Running    0          21m
datadog-agent-wsl2t                             3/3     Running    0          21m
datadog-agent-x67tl                             3/3     Running    0          21m
datadog-cluster-agent-9f75b64bd-5f98p           1/1     Running    0          23m
datadog-cluster-agent-9f75b64bd-m94n4           1/1     Running    0          23m
datadog-cluster-checks-runner-f5f4bfbc5-mng6c   0/1     Init:0/2   0          23m
datadog-cluster-checks-runner-f5f4bfbc5-wtwdb   1/1     Running    0          23m
datadog-operator-65cc54d98b-2wlgf               1/1     Running    1          35h

# kubectl get pod -n datadog datadog-agent-9dm7p -o yaml | grep creationTimestamp:
  creationTimestamp: "2021-08-04T04:28:25Z"

# kubectl get secret datadog-agent-token-mqc5g -n datadog -o yaml | grep creationTimestamp:
  creationTimestamp: "2021-08-04T04:28:46Z"

# kubectl get pod -n datadog datadog-agent-9dm7p -o yaml | grep secretName:
      secretName: datadog-agent-token-mgqpt

# kubectl get pod -n datadog datadog-agent-4jxcr -o yaml | grep secretName:
      secretName: datadog-agent-token-mqc5g
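
The timestamps above are consistent with a race: the stuck pod was created about 20 seconds before the new token secret, and it references a token that no longer exists. A NotFound error from the following would confirm the referenced secret is gone:

# kubectl get secret datadog-agent-token-mgqpt -n datadog
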
ahmed-mez commented 3 years ago

Hey @arzarif !

Thanks for opening this issue and sorry for the trouble. We're looking into it, but it would be super useful for us if you could share any interesting operator logs corresponding to the timeframe when this happened.
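
If the previous operator container's logs are still available, something along these lines should capture the relevant window (standard kubectl flags; substitute the actual pod name and time):

# kubectl logs <operator-pod> -n datadog --previous --timestamps
# kubectl logs <operator-pod> -n datadog --since-time=2021-08-04T04:25:00Z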

In any case, recreating the agents after an operator restart is definitely not the intended behavior and shouldn't happen (if you delete the operator pod, this will not happen), but there might be an issue somewhere else in the reconciliation logic.

Thank you!