DataDog / helm-charts

Helm charts for Datadog products

DaemonSet keeps replicas as Pending on GKE Autopilot #409

Open tnovau opened 3 years ago

tnovau commented 3 years ago

Describe what happened: I'm trying to deploy Datadog on a GKE Autopilot cluster. I followed this tutorial.

When I install the Helm chart, the DaemonSet ends up with fewer ready replicas than expected.

Describe what you expected: I expect all the deployed workloads to be green 🟢

Steps to reproduce the issue:
1- Create a GKE Autopilot cluster (Region: us-west2)
2- Open Cloud Shell
3- Follow the tutorial
4- See the error in the attached screenshot (the DaemonSet pods stay Pending)

Additional environment details (Operating System, Cloud provider, etc):
- Cloud provider: Google (GCP)
- Region: us-west2
- Release channel: Regular channel
- Version: 1.20.10-gke.301
- Chart we're using: datadog/datadog

Command and output (helm install)

helm install datadog-autopilot-test-qa \
>     --set datadog.apiKey=<OMITTED_API_KEY> \
>     --set datadog.appKey=<OMITTED_APP_KEY> \
>     --set clusterAgent.enabled=true \
>     --set clusterAgent.metricsProvider.enabled=true \
>     --set providers.gke.autopilot=true \
>     --set datadog.logs.enabled=true \
>     --set datadog.apm.enabled=true \
>     --set datadog.kubeStateMetricsEnabled=false \
>     --set datadog.kubeStateMetricsCore.enabled=true \
>     datadog/datadog
W1006 08:46:46.419177     675 warnings.go:70] Autopilot set default resource requests for DaemonSet default/datadog-autopilot-test-qa, as resource requests were not specified. See http://g.co/gke/autopilot-defaults.
W1006 08:46:46.670791     675 warnings.go:70] Autopilot set default resource requests for Deployment default/datadog-autopilot-test-qa-cluster-agent, as resource requests were not specified. See http://g.co/gke/autopilot-defaults.
NAME: datadog-autopilot-test-qa
LAST DEPLOYED: Wed Oct  6 08:46:37 2021
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
Datadog agents are spinning up on each node in your cluster. After a few
minutes, you should see your agents starting in your event stream:
    https://app.datadoghq.com/event/stream

The Datadog Agent is listening on port 8126 for APM service.

#################################################################
####               WARNING: Deprecation notice               ####
#################################################################

The option `datadog.apm.enabled` is deprecated, please use `datadog.apm.portEnabled` to enable TCP communication to the trace-agent.
The option `datadog.apm.socketEnabled` is enabled by default and can be used to rely on unix socket or name-pipe communication.

###################################################################################
####   WARNING: dogstatsd with Unix socket is not supported on GKE Autopilot   ####
###################################################################################

##############################################################################
####   WARNING: APM with Unix socket is not supported on GKE Autopilot   ####
##############################################################################
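
(Side note on the deprecation warning above: the `--set datadog.apm.enabled=true` flag in my command is deprecated in favor of `datadog.apm.portEnabled`. In values form that would look roughly like the sketch below; only the option names come from the notice, the rest of the layout is my assumption.)

datadog:
  apm:
    # Replaces the deprecated datadog.apm.enabled option; enables TCP trace intake on port 8126
    portEnabled: true
    # Unix-socket intake is on by default but flagged as unsupported on GKE Autopilot above;
    # disabling it here is an assumption, the chart may already handle this on Autopilot
    socketEnabled: false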

helm version output:

version.BuildInfo{Version:"v3.5.0", GitCommit:"32c22239423b3b4ba6706d450bd044baffdcf9e6", GitTreeState:"clean", GoVersion:"go1.15.6"}

kubectl get nodes output:

NAME                                                  STATUS   ROLES    AGE   VERSION
gk3-autopilot-datadog-te-default-pool-c7c1cafe-5x2n   Ready    <none>   15h   v1.20.10-gke.301
gk3-autopilot-datadog-te-default-pool-e75bacb2-lzcc   Ready    <none>   15h   v1.20.10-gke.301

kubectl get pods -o wide output:

NAME                                                       READY   STATUS    RESTARTS   AGE     IP           NODE                                                  NOMINATED NODE   READINESS GATES
datadog-autopilot-test-qa-5nts6                            0/3     Pending   0          6m47s   <none>       <none>                                                <none>           <none>
datadog-autopilot-test-qa-ckfkd                            0/3     Pending   0          6m47s   <none>       <none>                                                <none>           <none>
datadog-autopilot-test-qa-cluster-agent-778569c98d-9wc47   1/1     Running   0          6m47s   10.114.0.6   gk3-autopilot-datadog-te-default-pool-c7c1cafe-5x2n   <none>           <none>
tnovau commented 3 years ago

I worked around it using this solution, thanks for your help and your patience @arapulido

arapulido commented 3 years ago

> I worked around it using this solution, thanks for your help and your patience @arapulido

You're welcome! Happy to help! ❤️ Very happy to hear this worked for you as well

Scalahansolo commented 2 years ago

Jumping in on this issue: I recently ran into the same problem. Unfortunately, trying the solution from @arapulido didn't seem to alleviate it for me either.

CharlyF commented 2 years ago

Hello,

Using the latest Helm chart in a brand-new GKE Autopilot cluster, I was able to reproduce this by deploying application pods (Redis) from a Deployment before deploying the Agent. Since I did not specify resource requirements, Autopilot set the requests/limits to 500m CPU and 2Gi of memory for each pod. However, with the default node pool's nodes only offering

Allocatable:
  cpu:                        940m
  memory:                     2885836Ki

you will quickly be overcommitted: kube-dns, fluentbit, gke-metadata, gke-metrics, kube-proxy and the metrics-server already request up to 62% of the allocatable CPU. (Allocatable excludes resources reserved for system processes such as the CRI and the kubelet; the nodes in the default node pool have 2 CPUs and 4Gi of memory.)
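
(Back-of-the-envelope with the numbers above: 62% of the 940m allocatable is roughly 580m already requested by system pods, leaving about 360m, which is less than the 500m CPU Autopilot assigns to a pod with no explicit request, so a defaulted Agent pod cannot fit on one of these nodes.)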

However, after creating a PriorityClass so that the Agent pods are scheduled ahead of any application pod, Autopilot had the Cluster Autoscaler scale up a node pool whose nodes can accommodate both the application and the system resources. In my case the new node pool was named gk3-charly-autopilot-nap-XXX, and a single node in it accommodated:

Non-terminated Pods:          (15 in total)
  Namespace                   Name                                                          CPU Requests  CPU Limits  Memory Requests  Memory Limits  Age
  ---------                   ----                                                          ------------  ----------  ---------------  -------------  ---
  default                     datadog-autopilot-cluster-agent-787f47b998-d8tm2              500m (12%)    500m (12%)  512Mi (3%)       512Mi (3%)     5h26m
  default                     datadog-autopilot-clusterchecks-6d6cc4f868-t95h7              500m (12%)    500m (12%)  2Gi (15%)        2Gi (15%)      5h26m
  default                     datadog-autopilot-clusterchecks-6d6cc4f868-z9nhr              500m (12%)    500m (12%)  2Gi (15%)        2Gi (15%)      5h26m
  default                     datadog-autopilot-znvwq                                       150m (3%)     150m (3%)   300Mi (2%)       300Mi (2%)     5h26m
  default                     redis-7f6fbc85b5-ddpb6                                        500m (12%)    500m (12%)  2Gi (15%)        2Gi (15%)      9h
  default                     redis-7f6fbc85b5-wqh8m                                        500m (12%)    500m (12%)  2Gi (15%)        2Gi (15%)      5h26m
  default                     redis-7f6fbc85b5-zvsvs                                        500m (12%)    500m (12%)  2Gi (15%)        2Gi (15%)      9h
  kube-system                 filestore-node-xpbtc                                          0 (0%)        0 (0%)      0 (0%)           0 (0%)         9h
  kube-system                 fluentbit-gke-dvc9s                                           100m (2%)     0 (0%)      200Mi (1%)       500Mi (3%)     9h
  kube-system                 gke-metadata-server-nftt5                                     100m (2%)     100m (2%)   100Mi (0%)       100Mi (0%)     9h
  kube-system                 gke-metrics-agent-mzv5z                                       3m (0%)       0 (0%)      50Mi (0%)        50Mi (0%)      9h
  kube-system                 kube-proxy-gk3-charly-autopilot-nap-e4f2eggn-80c802db-cnfk    100m (2%)     0 (0%)      0 (0%)           0 (0%)         9h
  kube-system                 netd-57bb8                                                    0 (0%)        0 (0%)      0 (0%)           0 (0%)         9h
  kube-system                 node-local-dns-jn99g                                          25m (0%)      0 (0%)      5Mi (0%)         0 (0%)         9h
  kube-system                 pdcsi-node-7ngk8                                              10m (0%)      0 (0%)      20Mi (0%)        100Mi (0%)     9h

NB: I tweaked the Cluster Agent resources for testing purposes, but the point is that by using a PriorityClass on the Agent DaemonSet we get a proper scale-up of nodes that can accommodate application deployments as well as the rest of the core resources of Datadog's monitoring stack, given that the default node pool is too small to fit all of them. As we QA and merge the change in this PR, I would love for you to confirm whether this solves your problem; if not, please share more details about what is not working for you.
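
For reference, a minimal sketch of the two ways to attach a priority class to the Agent DaemonSet through the chart; the `agents.priorityClassCreate` and `agents.priorityClassName` option names are the ones used elsewhere in this thread, so treat the exact layout as an assumption rather than final documentation:

agents:
  # Option A: have the chart create a PriorityClass for the Agent DaemonSet
  priorityClassCreate: true
  # Option B: reference a PriorityClass you create and manage yourself
  # priorityClassName: datadog-agent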

PS: I am going to update the documentation to specify resource reqs/limits that are more appropriate.

Best, .C

kylegalbraith commented 2 years ago

@CharlyF we just tried the new priorityClassCreate in our GKE Autopilot cluster, and there are still pods stuck in Pending that are trying to be placed on the default node pool. We have updated our resources as shown below to set matching requests and limits per the GCP docs, but guidance on what those values should be would help alleviate the problem.

agents:
  priorityClassCreate: true
  containers:
    # Set resource requests and limits for the agent container
    agent:
      resources:
        limits:
          cpu: 200m
          memory: 256Mi
        requests:
          cpu: 200m
          memory: 256Mi
    # Set resource requests and limits for the process agent container
    processAgent:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
    # Set resource requests and limits for the trace agent container
    traceAgent:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
    # Set resource requests and limits for the system probe container
    systemProbe:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
    # Set resource requests and limits for init containers
    initContainers:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi

orlandothoeny commented 2 years ago

I also ran into this issue. GCP support recommended this solution to me; while not great, it works for our use case. We're also using GKE Autopilot. The support engineer said that it isn't possible to create a PriorityClass with a higher priority than the GKE system ones (Kubernetes reserves values above one billion for the built-in system-critical classes, so 1000000000 is the ceiling for a user-defined PriorityClass):

kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
datadog-agent             1000000000   false            57d
system-cluster-critical   2000000000   false            74d
system-node-critical      2000001000   false            74d

The datadog-agent PriorityClass we created:

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: datadog-agent
value: 1000000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Ensure that DataDog Agent Pods are always scheduled onto Nodes, by evicting other non-essential workloads."

values.yaml:

agents:
  # Ensure that Agent Pods are always scheduled onto Nodes, by evicting other non-essential workloads
  priorityClassName: datadog-agent

A little trick (not great :neutral_face:): manually delete a Pod that's taking up resources on a particular node, so that the Datadog Pod can be scheduled on it.

cloudgk commented 1 year ago

Still facing this issue on Autopilot when APM is enabled.

Below are my helm values:

targetSystem: linux
datadog:
  apiKey: ***
  appKey: ***
  site: datadoghq.eu
  clusterName: ***
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true

  kubeStateMetricsCore:
    enabled: true
  # Avoid deploying kube-state-metrics chart.
  kubeStateMetricsEnabled: false
clusterChecksRunner:
  enabled: true
clusterAgent:
  enabled: true
  confd:
    postgres.yaml: |-
      cluster_check: true
      init_config:
      instances:
        - dbm: true
          host: 'xxx'
          port: 5432
          username: datadog
          password: 'xxx'
          gcp:
            project_id: 'xxx'
            instance_id: 'xxx'

agents:
  priorityClassName: datadog-agent
  containers:
    agent:
      # resources for the Agent container
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 200m
          memory: 256Mi

    traceAgent:
      # resources for the Trace Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi

    processAgent:
      # resources for the Process Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi

providers:
  gke:
    autopilot: true

Can someone please help? The DaemonSet is stuck in Pending:

Unschedulable and 1 more issue Cannot schedule pods: Insufficient cpu.

vszal commented 1 year ago

I was getting a crash loop on the "agent" container in the DS.

Solution: set resources.requests for the three containers in the datadog-agent DaemonSet, per the docs here.

I'm a little puzzled why Datadog doesn't set resources.requests by default, though. It seems like something you'd want set no matter what, to keep resource utilization predictable.