tnovau opened this issue 3 years ago
I worked around it using this solution. Thanks for your help and your patience, @arapulido!
You're welcome! Happy to help! ❤️ Very happy to hear this worked for you as well
Jumping in here: I recently ran into the same issue. Unfortunately, trying the solution from @arapulido didn't seem to resolve it for me either.
Hello,
Using the latest Helm chart on a brand-new GKE Autopilot cluster, I was able to reproduce this by adding application pods (redis) from a deployment prior to deploying the agent.
As I did not specify resource requirements, Autopilot defaulted the requests/limits to 500m CPU and 2Gi of memory for each pod. However, on a node from the default node group, with only
Allocatable:
cpu: 940m
memory: 2885836Ki
you will quickly be overcommitted if you add an application prior to the agent, as kube-dns, fluentbit, gke-metadata, gke-metrics, kube-proxy, and the metrics-server already request up to 62% of the allocatable CPU. (Allocatable already accounts for system processes such as the CRI and the kubelet; the nodes in the default node group have 2 CPUs and 4Gi of memory.)
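(For reference, you can see how much of a node's allocatable capacity is already requested with a one-liner like the one below; the node name is a placeholder, pick one from kubectl get nodes.)
kubectl describe node <node-name> | grep -A 8 'Allocated resources'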
However, after creating a PriorityClass so that the agent pods are scheduled before any application pod, Autopilot has the Cluster Autoscaler scale up a node group whose nodes can accommodate both the application and the system resources. In my case the node group was named gk3-charly-autopilot-nap-XXX, and a single node accommodated:
Non-terminated Pods: (15 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
default datadog-autopilot-cluster-agent-787f47b998-d8tm2 500m (12%) 500m (12%) 512Mi (3%) 512Mi (3%) 5h26m
default datadog-autopilot-clusterchecks-6d6cc4f868-t95h7 500m (12%) 500m (12%) 2Gi (15%) 2Gi (15%) 5h26m
default datadog-autopilot-clusterchecks-6d6cc4f868-z9nhr 500m (12%) 500m (12%) 2Gi (15%) 2Gi (15%) 5h26m
default datadog-autopilot-znvwq 150m (3%) 150m (3%) 300Mi (2%) 300Mi (2%) 5h26m
default redis-7f6fbc85b5-ddpb6 500m (12%) 500m (12%) 2Gi (15%) 2Gi (15%) 9h
default redis-7f6fbc85b5-wqh8m 500m (12%) 500m (12%) 2Gi (15%) 2Gi (15%) 5h26m
default redis-7f6fbc85b5-zvsvs 500m (12%) 500m (12%) 2Gi (15%) 2Gi (15%) 9h
kube-system filestore-node-xpbtc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system fluentbit-gke-dvc9s 100m (2%) 0 (0%) 200Mi (1%) 500Mi (3%) 9h
kube-system gke-metadata-server-nftt5 100m (2%) 100m (2%) 100Mi (0%) 100Mi (0%) 9h
kube-system gke-metrics-agent-mzv5z 3m (0%) 0 (0%) 50Mi (0%) 50Mi (0%) 9h
kube-system kube-proxy-gk3-charly-autopilot-nap-e4f2eggn-80c802db-cnfk 100m (2%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system netd-57bb8 0 (0%) 0 (0%) 0 (0%) 0 (0%) 9h
kube-system node-local-dns-jn99g 25m (0%) 0 (0%) 5Mi (0%) 0 (0%) 9h
kube-system pdcsi-node-7ngk8 10m (0%) 0 (0%) 20Mi (0%) 100Mi (0%) 9h
NB: I tweaked the Cluster Agent resources for testing purposes, but the point is that by using a PriorityClass on the agent DaemonSet, we get a proper scale-up to nodes that can accommodate application deployments as well as the remaining core resources of Datadog's monitoring stack, given that the default node pool is too small to fit all of them. Once we QA and merge the change in this PR, I would love for you to confirm whether it solves your problem; if not, please share more details about what is not working for you.
PS: I am going to update the documentation to specify resource requests/limits that are more appropriate.
Best, .C
@CharlyF we just tried the new priorityClassCreate in our GKE Autopilot cluster, and there are still pods stuck in Pending trying to schedule onto the default node pool. We have updated our resources as shown below to match the requests and limits in the GCP docs, but guidance on what those values should be might help alleviate the problem.
agents:
  priorityClassCreate: true
  containers:
    # Set resource limits for agent container
    agent:
      resources:
        limits:
          cpu: 200m
          memory: 256Mi
        requests:
          cpu: 200m
          memory: 256Mi
    # Set resource limits for process agent container
    processAgent:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
    # Set resource limits for trace agent container
    traceAgent:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
    # Set resource limits for system probe container
    systemProbe:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
    # Set resource limits for init containers
    initContainers:
      resources:
        limits:
          cpu: 100m
          memory: 200Mi
        requests:
          cpu: 100m
          memory: 200Mi
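For what it's worth, we applied these values the usual way; the release name here is just an example, assuming the datadog Helm repo is already added:
helm upgrade --install datadog-agent datadog/datadog -f values.yaml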
I also ran into this issue. GCP support recommended this solution to me; while not great, it works for our use case.
We're also using GKE Autopilot. The support engineer said that it isn't possible to have a PriorityClass with a higher priority than the GKE ones:
kubectl get pc
NAME                      VALUE        GLOBAL-DEFAULT   AGE
datadog-agent             1000000000   false            57d
system-cluster-critical   2000000000   false            74d
system-node-critical      2000001000   false            74d
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: datadog-agent
value: 1000000000
preemptionPolicy: PreemptLowerPriority
globalDefault: false
description: "Ensure that DataDog Agent Pods are always scheduled onto Nodes, by evicting other non-essential workloads."
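We apply it before installing the chart; the filename is just what we happened to call the manifest:
kubectl apply -f datadog-priorityclass.yaml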
values.yaml:
agents:
  # Ensure that Agent Pods are always scheduled onto Nodes, by evicting other non-essential workloads
  priorityClassName: datadog-agent
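To double-check that the class is actually attached to the scheduled pods, something like this works (the label selector is an assumption; adjust it to however your release labels the agent pods):
kubectl get pods -l app=datadog-agent -o custom-columns='NAME:.metadata.name,PRIORITY:.spec.priority'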
A little trick (not great :neutral_face:): manually delete a Pod that's taking up resources on a particular node, to allow the DataDog Pod to be scheduled there.
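For example (pod and namespace are placeholders for whatever non-essential workload is hogging the node):
kubectl delete pod <non-essential-pod> -n <its-namespace>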
Still facing this issue on Autopilot when APM is enabled.
Below are my helm values:
targetSystem: linux
datadog:
  apiKey: ***
  appKey: ***
  site: datadoghq.eu
  clusterName: ***
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true
  kubeStateMetricsCore:
    enabled: true
  # Avoid deploying kube-state-metrics chart.
  kubeStateMetricsEnabled: false
clusterChecksRunner:
  enabled: true
clusterAgent:
  enabled: true
  confd:
    postgres.yaml: |-
      cluster_check: true
      init_config:
      instances:
        - dbm: true
          host: 'xxx'
          port: 5432
          username: datadog
          password: 'xxx'
          gcp:
            project_id: 'xxx'
            instance_id: 'xxx'
agents:
  priorityClassName: datadog-agent
  containers:
    agent:
      # resources for the Agent container
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          cpu: 200m
          memory: 256Mi
    traceAgent:
      # resources for the Trace Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
    processAgent:
      # resources for the Process Agent container
      resources:
        requests:
          cpu: 100m
          memory: 200Mi
        limits:
          cpu: 100m
          memory: 200Mi
providers:
  gke:
    autopilot: true
Can someone please help? The DaemonSet is stuck in a Pending state:
Unschedulable and 1 more issue Cannot schedule pods: Insufficient cpu.
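If it helps diagnose, the scheduler's full reasoning shows up in the FailedScheduling events; the pod name below is a placeholder:
kubectl describe pod <pending-datadog-pod> | grep -A 5 Events:
kubectl get events --field-selector reason=FailedScheduling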
I was getting a crash loop on the "agent" container in the DaemonSet.
Solution: set resources.requests for the 3 containers in the datadog-agent DaemonSet per the docs here.
I'm a little puzzled why Datadog doesn't set resources.requests by default, though. It seems like something you'd want set no matter what, to tune resource utilization.
Describe what happened: I'm trying to deploy Datadog on a GKE Autopilot cluster. I have followed this tutorial.
When I install the Helm chart, the DaemonSet ends up with fewer replicas than expected.
Describe what you expected: I expect all the deployed workloads to be green 🟢
Steps to reproduce the issue:
1. Create a K8S Autopilot cluster in GKE (Region: us-west2)
2. Open CloudShell
3. Follow the tutorial
4. Get this error:
Additional environment details (Operating System, Cloud provider, etc.): Cloud provider: Google (GCP); Region: us-west2; Release channel: Regular; Version: 1.20.10-gke.301. Chart we're using:
Command and output (helm install):
helm version
output:
kubectl get nodes
output:
kubectl get pods -o wide
output: