Open michaelst opened 2 years ago
Hi @michaelst,
Could you give us the version of the datadog-agent installed with your chart? You can check the image tag, or run the command `kubectl exec -it <agent pod name> -c agent -- agent version`.
The issue seems to be related more to an update of the agent version than of the chart version: by updating the chart, you might pull in a different agent version.
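If you prefer not to exec into the pod, the image tag can also be read from the pod spec. This is only a sketch; the pod name `datadog-agent-abc12` and namespace `datadog` are placeholders for your own:

```shell
# Read the agent container image (and thus its tag) from the pod spec.
# Pod name and namespace below are placeholders; adjust to your cluster.
kubectl -n datadog get pod datadog-agent-abc12 \
  -o jsonpath='{.spec.containers[?(@.name=="agent")].image}'
```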
Based on the changelogs between 2.23.1 and 2.27.1, the default agent version moved from 7.31.0 to 7.32.1.
For now, what I can propose is to pin the previous agent version with the parameter `agents.image.tag: 7.31.1` to override the default value,
or in your values.yaml:

```yaml
agents:
  image:
    tag: "7.31.1"
```
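For reference, the same override can also be applied from the command line at upgrade time. This is a sketch; the release name `datadog`, the namespace, and the chart reference `datadog/datadog` are assumptions and may differ in your setup:

```shell
# Pin the node agent image tag without editing values.yaml.
# Release name "datadog" and namespace "datadog" are placeholders.
helm upgrade datadog datadog/datadog \
  --namespace datadog \
  --reuse-values \
  --set agents.image.tag=7.31.1
```

`--reuse-values` keeps the values from the previous release and only layers the new `agents.image.tag` on top.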
We weren't overriding the datadog-agent version, so whatever was part of the chart is what was installed. I will try the upgrade again tomorrow with the agent overridden to the old version.
@michaelst
To better understand your configuration, could you share the values.yaml used to deploy/update your Datadog chart deployment?
Was datadog.securityAgent.runtime.enabled set to true?
```yaml
datadog:
  containerExclude: 'image:gke.gcr.io/* image:k8s.gcr.io/* image:gcr.io/stackdriver-agents/* image:gcr.io/datadoghq/* image:docker.io/istio/pilot'
  dogstatsd:
    useHostPort: true
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    portEnabled: true
  orchestratorExplorer:
    container_scrubbing:
      enabled: true
  kubeStateMetricsNetworkPolicy:
    create: true
  networkPolicy:
    create: true
  processAgent:
    processCollection: true
  # https://docs.datadoghq.com/integrations/kubernetes_state_core/?tab=helm
  kubeStateMetricsCore:
    enabled: true
  kubeStateMetricsEnabled: false
  securityAgent:
    compliance:
      enabled: true
    runtime:
      enabled: true
clusterAgent:
  podSecurity:
    podSecurityPolicy:
      create: true
agents:
  podSecurity:
    podSecurityPolicy:
      create: true
  containers:
    agent:
      env:
        - name: DD_LEADER_ELECTION
          value: "true"
```
Hello @michaelst Are you running Container-Optimized OS? We just pushed and merged a PR that should make CWS start on kernels missing SELinux support (which is the case here), and it will be part of the next 7.33 release. That being said, we do not support Container-Optimized OS yet, and some features may not work correctly. Proper support is on the way, though.
Ya, we are running Container-Optimized OS with Containerd (cos_containerd); we haven't had any issues with it until this, though.
Is this still an issue? Why has this been open for years now?
I seem to be having a similar issue. Only one of the DD agent pods is having the issue, with the system-probe container crashing.
```
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module tcp_queue_length_tracer disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module oom_kill_probe disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module security_runtime disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | INFO | (cmd/system-probe/api/module/loader.go:50 in Register) | module process disabled
2023-05-11 20:25:26 UTC | SYS-PROBE | CRITICAL | (comp/core/log/logger.go:92 in Criticalf) | error while starting api server, exiting: failed to create system probe: no module could be loaded
Error: error while starting api server, exiting: failed to create system probe: no module could be loaded
Stream closed EOF for datadog/datadog-agent-gkd5m (system-probe)
```
any ideas?
This is with datadog-3.29.0
Hi @lodotek,
Thanks for commenting on this issue, but I think the issue you are facing is not related to the initially reported problem.
Unfortunately, the investigation will require more information, like the helm values.yaml file, the agent version, and the configuration.
Could you please contact Datadog support and provide an Agent and Cluster-Agent flare?
For users that encounter system-probe container start issues on COS (Container-Optimized OS), we recently added a new option in the chart to configure the agent deployment specifically for COS.
If you are using COS, please add this option to the values.yaml chart deployment configuration:

```yaml
providers:
  gke:
    cos: true
```
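After redeploying with the new option, it is worth confirming that system-probe actually starts cleanly. A sketch of one way to check, where the pod name and the `datadog` namespace are placeholders for your own:

```shell
# Tail the system-probe container logs of one agent pod to confirm the
# module-loading errors are gone. Pod name and namespace are placeholders.
kubectl -n datadog logs datadog-agent-abc12 -c system-probe --tail=20

# Also confirm the container is no longer in CrashLoopBackOff.
kubectl -n datadog get pod datadog-agent-abc12
```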
Describe what happened:
I upgraded from 2.23.1 to 2.27.3 and on the daemonset the system-probe container ends up in a crash loop backoff.
Additional environment details (Operating System, Cloud provider, etc):
GKE