DataDog / helm-charts

Helm charts for Datadog products
Apache License 2.0

datadog-lib-java-init container dies with OOMKilled #1519

Closed jwtrim closed 1 month ago

jwtrim commented 2 months ago

Describe what happened:

Java cronjob fails due to the datadog-lib-java-init container terminating with OOMKilled.

Describe what you expected:

The DD init container does not terminate with OOMKilled.

Steps to reproduce the issue:

The issue does seem to be intermittent, but adding auto-instrumentation on a Java cronjob and continually triggering the job does eventually reproduce the issue.

Init Containers:
  datadog-lib-java-init:
    Container ID:  containerd://e638d5686e7d5a4ddf5ed2ecd4ddec6c84ad875a8dcd34d1ac068a8f4ab991b4
    Image:         gcr.io/datadoghq/dd-lib-java-init:v1.38.1
    Image ID:      gcr.io/datadoghq/dd-lib-java-init@sha256:789df6113ee8afaccc4b8a397a4e8dee934a4867fe95b551ec632ae859e00ddd
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      copy-lib.sh
      /datadog-lib
    State:          Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Thu, 05 Sep 2024 11:15:01 -0400
      Finished:     Thu, 05 Sep 2024 11:15:04 -0400
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     50m
      memory:  20971520
    Requests:
      cpu:     50m
      memory:  20971520
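
For anyone trying to reproduce this, a minimal CronJob along the following lines should trigger library injection on every run. This is only a sketch: the job name, schedule, and application image are hypothetical placeholders, and it assumes the Datadog Admission Controller is enabled in the cluster and picks up pods through the `admission.datadoghq.com/enabled` label. The Java library version matches the one in the output above.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: java-oom-repro            # hypothetical name
spec:
  schedule: "*/1 * * * *"         # run every minute to hit the intermittent failure sooner
  jobTemplate:
    spec:
      template:
        metadata:
          labels:
            # Assumed opt-in label for admission controller mutation
            admission.datadoghq.com/enabled: "true"
          annotations:
            # Requests injection of the Java tracing library at this version
            admission.datadoghq.com/java-lib.version: v1.38.1
        spec:
          restartPolicy: Never
          containers:
            - name: app
              image: eclipse-temurin:17-jre   # placeholder image, not from the original report
              command: ["java", "-version"]
```

The memory limit of 20971520 bytes reported above is 20Mi, so any init container that briefly needs more than that while copying the library is killed with exit code 137.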

Additional environment details (Operating System, Cloud provider, etc):

BenjaminMichel commented 2 months ago

Hello,

Same issue here with the "datadog-lib-python-init" container: we get an OOMKilled as described by @jwtrim.

Datadog Agent and Cluster Agent version 7.57.0

sumeetgajjar commented 1 month ago

Hi, we are facing the same issue here with the Java, .NET, Python, and JS init containers.

Agent info

$ agent version
Cluster Agent 7.56.0 - Commit: f7e1780 - Serialization version: v5.0.124 - Go version: go1.22.5

Init containers version info

admission.datadoghq.com/java-lib.version: v1.38.1
admission.datadoghq.com/js-lib.version: v5.21.0
admission.datadoghq.com/python-lib.version: v2.11.0
admission.datadoghq.com/dotnet-lib.version: v2.56.0
admission.datadoghq.com/ruby-lib.version: v2.2.0

It would be great to have the option to customize the memory limits

david-bour commented 1 month ago

Seeing this issue as well. Would like an option to customize the memory limits

arafatkhan-optimizely commented 1 month ago

This happened with all of our Python injections. For us, it was related to upgrading the GKE cluster from 1.29.xxx to 1.30.xxx. It took us days to figure out that the issue was related to the GKE update. We rolled back the node pool and the init container OOMs stopped immediately.

fanny-jiang commented 1 month ago

Hello all, to avoid the OOMs, we have for now increased the default init container requests/limits to 100Mi in agent 7.57.2, in line with the Alpine recommended base values: https://github.com/DataDog/datadog-agent/blob/main/CHANGELOG.rst#bug-fixes.

The initContainer resources can also be configured manually using the DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_RESOURCES_MEMORY environment variable on the cluster agent https://github.com/DataDog/datadog-agent/blob/596053e0d87db92237f887e9302c088650698893/pkg/clusteragent/admission/mutate/autoinstrumentation/auto_instrumentation.go#L666:

env:
- name: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_RESOURCES_MEMORY
  value: "50Mi"

While the feature is in beta, we are working on optimizing the memory usage before general availability.
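
For those deploying through this Helm chart, one way to pass that variable to the cluster agent is through the chart values. The fragment below is a sketch that assumes the chart's `clusterAgent.env` list is used for extra cluster agent environment variables; the value shown is illustrative and simply mirrors the new 100Mi default.

```yaml
# values.yaml fragment for the datadog Helm chart (sketch, adjust to your setup)
clusterAgent:
  env:
    # Memory request/limit applied to the injected dd-lib-*-init containers
    - name: DD_ADMISSION_CONTROLLER_AUTO_INSTRUMENTATION_INIT_RESOURCES_MEMORY
      value: "100Mi"   # illustrative; the example above uses "50Mi"
```

Because the admission controller sets these resources when a pod is created, pods that already exist keep their old limits; only pods created after the cluster agent restarts with the new value should pick it up.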

david-bour commented 1 month ago

Can confirm that upgrading the agent worked.

sumeetgajjar commented 1 month ago

Deploying the 7.57.2 agent resolved the issue. Thanks!

tbavelier commented 1 month ago

Will be closing this issue as it's not related to the Helm chart, but rather to the Agent and this beta feature. As Fanny mentioned, the team is aware and working on the memory optimisation, and the default resources have been increased in the meantime.