DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/
Other
542 stars 411 forks

High CPU utilization causing kubernetes pod scaling with ddtrace > 2.3.0 #9447

Open hemantgir opened 4 months ago

hemantgir commented 4 months ago

Summary of problem

We have noticed that upgrading ddtrace to any version above 2.3.0 results in a significant increase in CPU utilization, which leads to the maximum number of replicas being deployed.

For instance, our Kubernetes application is configured with an auto-scaling limit of 36 maximum replicas. Prior to the upgrade, our stage environment would typically use only 6-8 pods while idle. However, post-upgrade, we are reaching the upper limit of 36 replicas.

This unexpected behavior suggests that there may be a spike in resource usage introduced in versions above 2.3.0. We would like to understand the cause of this increased resource consumption and seek a solution to optimize it.

Additionally, we updated datadog_lambda to 5.83.0 to be compatible with ddtrace==2.3.0.

(Maybe a red herring: we also noticed that calls to POST /telemetry/proxy/api/v2/apmtelemetry increase on versions above 2.3.0.)

Datadog screenshots (Kubernetes pods are in idle state):

On ddtrace 2.7.5:

sum:kubernetes_state.deployment.replicas_available{env:... ,service:...} [image]

APM POST /telemetry/proxy/api/v2/apmtelemetry [image]

On ddtrace 2.3.0:

sum:kubernetes_state.deployment.replicas_available{env:... ,service:...} [image]

APM POST /telemetry/proxy/api/v2/apmtelemetry [image]

Which version of dd-trace-py are you using?

Originally had bumped to 2.7.5, but now downgraded to 2.3.0. Have also tried with latest 2.8.5.

Which version of pip are you using?

pip 24.0

Spike with:

Any version above ddtrace 2.3.0

pip freeze

aioboto3==9.5.0
aiobotocore==2.2.0
aiodns==3.0.0
aiohttp==3.9.5
aiohttp-retry==2.4.5
aioitertools==0.8.0
aioredis==1.3.1
aioredis-cluster==1.5.2
aiosignal==1.2.0
ansible==9.1.0
ansible-core==2.16.4
asgiref==3.8.0
asn1crypto==1.5.1
async-kinesis==1.1.5
async-timeout==4.0.2
asyncio-throttle==1.0.2
atomicwrites==1.4.0
attrs==20.3.0
aws-kinesis-agg==1.1.3
aws-xray-sdk==2.6.0
awscli==1.22.76
bcrypt==3.2.0
black==24.4.2
blinker==1.7.0
boto==2.45.0
boto3==1.21.21
botocore==1.24.21
Brotli==1.0.9
brotlipy==0.7.0
bytecode==0.15.1
CacheControl==0.12.6
cachetools==4.1.1
cattrs==22.2.0
certifi==2023.7.22
cffi==1.16.0
chardet==3.0.4
charset-normalizer==2.0.8
cityhash==0.4.7
click==8.1.7
colorama==0.4.1
coverage==7.0.4
cryptography==42.0.5
dal-admin-filters==1.1.0
datadog==0.41.0
datadog_lambda==5.91.0
ddsketch==2.0.4
ddtrace==2.7.4
decorator==4.4.2
defusedxml==0.7.1
Deprecated==1.2.14
deprecation==2.1.0
Django==4.2.11
django-auditlog==3.0.0
django-autocomplete-light==3.11.0
django-cleanup==6.0.0
django-cors-headers==3.7.0
django-csp==3.7
django-discover-runner==1.0
django-extensions==3.1.5
django-filter==2.4.0
django-health-check==3.18.1
django-hosts==5.1
django-json-widget==2.0.1
django-nested-admin==3.4.0
django-redis==4.11.0
django-rest-serializer-field-permissions==4.1.0
django-role-permissions==2.2.0
django-rq==2.10.2
django-ses==3.5.0
django-snowflake==4.2.2
django-storages==1.12.3
django-webpack-loader==0.5.0
django_reverse_admin==2.9.6
djangorestframework==3.14.0
djangorestframework-csv==2.1.0
djangorestframework-gis==0.18
dnspython==2.6.1
docutils==0.15.2
dogslow==1.2
drf-flex-fields==0.9.8
drf-jwt==1.19.2
elementpath==2.2.3
envier==0.5.1
et-xmlfile==1.1.0
execnet==1.9.0
fakeredis==2.7.1
filelock==3.12.2
frozenlist==1.4.1
future==0.18.3
geojson==2.4.1
googleapis-common-protos==1.53.0
grpcio==1.62.0
grpcio-health-checking==1.62.0
grpcio-reflection==1.62.0
grpcio-status==1.62.0
gunicorn==22.0.0
hiredis==2.3.2
httplib2==0.19.0
idna==3.7
importlib-metadata==6.11.0
importlib-resources==5.8.0
iniconfig==2.0.0
intervaltree==3.1.0
isort==5.13.2
Jinja2==3.1.3
jmespath==0.10.0
json-stream==2.3.2
json-stream-rs-tokenizer==0.4.25
jsonpickle==3.0.3
jsonschema==4.5.1
magicattr==0.1.5
MarkupSafe==2.1.1
more-itertools==8.6.0
msgpack==1.0.0
multidict==5.1.0
mypy-extensions==1.0.0
nplusone==1.0.0
openpyxl==3.0.7
opentelemetry-api==1.23.0
orjson==3.9.15
packaging==24.0
paramiko==3.4.0
pathspec==0.12.1
pillow==10.3.0
platformdirs==3.8.1
pluggy==1.0.0
protobuf==4.21.7
psycopg2==2.9.9
psycopg2-binary==2.9.9
py-dateutil==2.2
pyasn1==0.4.8
pycares==4.2.0
pycodestyle==2.5.0
pycountry==22.3.5
pycparser==2.20
PyJWT==2.4.0
PyNaCl==1.5.0
pyOpenSSL==24.0.0
pyparsing==2.4.7
pyrsistent==0.18.1
pytest==7.2.0
pytest-cov==4.0.0
pytest-django==4.5.2
pytest-shard==0.1.2
pytest-xdist==3.1.0
python-dateutil==2.8.0
python-json-logger==0.1.8
python-memcached==1.59
python-monkey-business==1.0.0
pytz==2020.4
PyYAML==5.3.1
redis==3.5.3
redis-py-cluster==2.1.3
requests==2.31.0
resolvelib==0.5.4
rq==1.14.0
rsa==4.7
s3transfer==0.5.0
setproctitle==1.1.10
Shapely==1.6.4
simplejson==3.14.0
six==1.16.0
snowflake-connector-python==3.7.1
sortedcontainers==2.4.0
splunk-handler==2.0.7
sqlparse==0.5.0
tenacity==6.2.0
tomlkit==0.12.1
typing_extensions==4.7.1
unicodecsv==0.14.1
urllib3==1.26.18
Werkzeug==3.0.1
whitenoise==6.0.0
wrapt==1.14.0
xmlschema==1.2.5
xmltodict==0.13.0
yarl==1.9.4
zipp==3.18.1

How can we reproduce your problem?

I'm not sure how you can replicate the issue on your end. We use Datadog tools, and we have established metrics that continuously monitor the application and report results whether it is idle or running.
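One rough way to compare versions outside Kubernetes is to measure how much CPU an otherwise idle Python process consumes. The sketch below is only an illustration, not a reproduction of the Django workload described here; the function name `idle_cpu_fraction` and the script name `idle_cpu.py` are hypothetical. It relies on `time.process_time()`, which counts CPU time for the whole process, including background threads.

```python
import time


def idle_cpu_fraction(duration_s=5.0):
    """Sleep for duration_s and return CPU seconds used per wall-clock second.

    time.process_time() covers all threads in the process, so CPU used by
    any background threads started by an instrumentation library is included.
    """
    cpu_start = time.process_time()
    wall_start = time.monotonic()
    time.sleep(duration_s)  # the main thread does no work here
    cpu_used = time.process_time() - cpu_start
    wall_elapsed = time.monotonic() - wall_start
    return cpu_used / wall_elapsed


if __name__ == "__main__":
    print(f"idle CPU fraction: {idle_cpu_fraction():.4f}")
```

Running the script as `python idle_cpu.py` and then as `ddtrace-run python idle_cpu.py` under ddtrace 2.3.0 and a later version would give a first indication of whether the idle overhead differs. A longer `duration_s` is likely needed for a stable number.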

What is the result that you get?

High CPU utilization causing Kubernetes pod scaling up to the maximum number of replicas, even when idle, on ddtrace > 2.3.0.

What is the result that you expected?

CPU utilization and Kubernetes pod scaling that stay at the level the workload requires on ddtrace > 2.3.0.

emmettbutler commented 4 months ago

Thank you for reporting this, @hemantgir. Could you share all relevant environment variables set in the app environment? This will help us understand what bits of Datadog functionality are enabled and disabled in this case.

hemantgir commented 4 months ago

Thank you for your response. Please find the list of environment variables below:

DD_DBM_PROPAGATION_MODE : disabled
DD_DJANGO_USE_HANDLER_RESOURCE_FORMAT : True
DD_ENV : stage
DD_LOGS_INJECTION : True
DD_SERVICE : Django
DD_TRACE_SAMPLE_RATE : 1
DD_TRACE_SAMPLING_RULES : [{"sample_rate": 1}]
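For anyone gathering the same information, a list like the one above can be produced from inside the running application. This is a minimal sketch using only the standard library; the helper name `collect_dd_env`, the `DD_`/`DATADOG_` prefixes, and the redaction rule for names containing `KEY` or `TOKEN` are assumptions chosen for illustration, not part of ddtrace.

```python
import os


def collect_dd_env(environ=None, prefixes=("DD_", "DATADOG_")):
    """Return Datadog-related environment variables, sorted by name.

    Values of variables whose names suggest a secret are redacted so the
    output can be pasted into a public issue.
    """
    env = os.environ if environ is None else environ
    result = {}
    for name in sorted(env):
        if not name.startswith(prefixes):
            continue
        is_secret = "KEY" in name or "TOKEN" in name
        result[name] = "<redacted>" if is_secret else env[name]
    return result


if __name__ == "__main__":
    for name, value in collect_dd_env().items():
        print(f"{name} : {value}")
```

Review the output before sharing it, since the redaction rule only catches names that contain `KEY` or `TOKEN`.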