DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/
Other
532 stars 408 forks source link

Mongoengine/PyMongo instrumentation caused extreme slowness(nearly causing an outage). #2474

Closed dmarkey closed 3 years ago

dmarkey commented 3 years ago

Which version of dd-trace-py are you using?

ddtrace==0.48.4

Which version of pip are you using?

pip 20.0.1

ddtrace requires pip>=18 to install one of our pre-built wheels

Which version of the libraries are you using?

amqp==2.6.0
argh==0.26.2
asn1crypto==0.24.0
atomicwrites==1.3.0
attrs==19.3.0
aws-requests-auth==0.4.2
backoff==1.4.3
bcrypt==3.1.6
billiard==3.6.3.0
blackbox==0.0.12
blinker==1.4
boto==2.42.0
boto3==1.12.7
botocore==1.15.7
cairocffi==1.1.0
CairoSVG==2.5.1
celery==4.4.6
certifi==2019.3.9
cffi==1.12.2
chardet==3.0.4
cleanco==1.3
Click==7.0
clickclick==1.2.2
compare-requirements==0.2.2
connexion==2.7.0
coverage==4.5.4
cryptography==3.3.2
cssselect2==0.3.0
CurrencyConverter==0.14.2
ddtrace==0.48.4
decorator==4.4.2
defusedxml==0.6.0
Django==2.2.20
django-angular==0.7.15
django-appconf==1.0.2
django-compressor==2.2
django-configurations==2.1
django-cursor-pagination==0.1.4
django-debug-toolbar==2.2.1
django-extensions==1.9.9
django-filter==2.1.0
django-js-asset==1.2.2
django-log-request-id==1.3.2
django-mptt==0.9.0
django-nose==1.4.6
django-oauth-toolkit==1.0.0
django-partial-index==0.6.0
django-redis==4.10.0
django-reversion==2.0.12
djangorestframework==3.9.4
djangorestframework-expander==0.2.3
docutils==0.15.2
elasticsearch==6.4.0
entrypoints==0.3
factory-boy==2.12.0
Faker==2.0.3
flake8==3.7.7
Flask==1.0.4
Flask-Cors==3.0.9
flask-talisman==0.7.0
freezegun==0.3.12
future==0.18.2
gevent==1.4.0
graphviz==0.10.1
greenlet==0.4.15
hashids==1.1.0
hiredis==1.0.0
html5lib==1.0.1
idna==2.7
importlib-metadata==1.7.0
inflection==0.3.1
iniconfig==1.0.1
intervaltree==3.1.0
itsdangerous==1.1.0
jellyfish==0.5.6
Jinja2==2.11.3
jmespath==0.9.5
jsonfield==2.0.2
jsonschema==3.1.1
kombu==4.6.11
limits==1.5.1
markdown2==2.3.9
MarkupSafe==1.1.1
marshmallow==3.2.1
mccabe==0.6.1
model-mommy==1.6.0
mongoengine==0.17.0
more-itertools==8.4.0
mrq==0.9.10
nose==1.3.7
numpy==1.17.3
oauthlib==3.0.1
objgraph==3.4.0
openapi-spec-validator==0.2.8
packaging==20.4
pandas==1.1.2
parameterized==0.7.1
pathtools==0.1.2
permission==0.4.1
petl==1.6.8
Pillow==8.1.1
pluggy==0.13.1
prance==0.19.0
progressbar2==3.38.0
prometheus-client==0.6.0
protobuf==3.14.0
psutil==5.6.6
psycopg2==2.8.6
py==1.10.0
pycodestyle==2.5.0
pycountry==18.5.26
pycparser==2.19
pyflakes==2.1.1
pymongo==3.7.2
pyparsing==2.4.7
Pyphen==0.9.5
pyrsistent==0.15.4
pytest==6.0.1
pytest-cov==2.8.1
pytest-cover==3.0.0
pytest-coverage==0.0
pytest-django==3.9.0
pytest-env==0.6.2
pytest-mock==1.10.0
pytest-subtests==0.4.0
pytest-test-groups==1.0.3
python-dateutil==2.8.0
python-intercom==3.1.0
python-json-logger==0.1.10
python-utils==2.3.0
pytz==2018.5
PyYAML==5.4
ratings==0.0.10
rcssmin==1.0.6
redis==3.3.11
requests==2.22.0
requests-mock==1.7.0
responses==0.11.0
rjsmin==1.0.12
s3transfer==0.3.3
semver==2.10.2
sentry-sdk==0.10.2
sh==1.12.14
simplejson==3.16.0
six==1.12.0
sortedcontainers==2.3.0
sqlparse==0.3.0
structlog==17.2.0
swagger-ui-bundle==0.0.5
tenacity==6.2.0
termcolor==1.1.0
testfixtures==5.3.1
text-unidecode==1.3
tinycss2==1.0.2
toml==0.10.1
typing==3.6.6
ujson==1.35
Unidecode==1.1.1
url-normalize==1.4.1
urllib3==1.24.3
us==1.0.0
uWSGI==2.0.18
validators==0.18.2
vine==1.3.0
watchdog==0.9.0
WeasyPrint==51
webencodings==0.5.1
Werkzeug==0.15.5
zipp==0.6.0

How can we reproduce your problem?

When using patch_all() and using Mongoengine, connection times were very erratic and sometime taking 15+ seconds. This consumed nearly all our UWSGI workers.

What is the result that you get?

Endpoints became extremely slow.

What is the result that you expected?

Performance is not effected.

We were about to track down and isolate the problem with:

- name: DD_TRACE_MONGOENGINE_ENABLED          
   value: "False"        
- name: DD_TRACE_PYMONGO_ENABLED          
   value: "False"
brettlangdon commented 3 years ago

@dmarkey I am having some trouble replicating this issue:

I am just doing:

# run.py
import timeit

import pymongo

def test():
    client = pymongo.MongoClient()
    assert client.server_info()

t = timeit.Timer(test)
results = t.repeat(200, 1)
print(f"min: {min(results)}")
print(f"max: {max(results)}")
print(f"avg: {sum(results) / len(results)}")

and then:

$ ddtrace-run python run.py
min: 0.0066359740000000444
max: 0.03534934000000001
avg: 0.007919707545000005

Even with patching disabled the numbers are basically the same. Same for if I do ddtrace.patch_all() instead of ddtrace-run.

I mostly wanted to quickly check and see if there was anything obvious that is causing one specific connection to take longer than others.

Is there any further information you can provide to help me replicate the issue?

Does the issue occur only when using uWSGI? If so, would you be able to share your uWSGI configuration with me?

Do you have specific connection setup that I should try to replicate?

What version of Python are you using?

What version of Mongo are you using?

dmarkey commented 3 years ago

I think i've seen the issue also occurring using manage.py runserver but here is the uwsgi configuration anyway.

[uwsgi]
plugins = python3
need-app = True
uid = django
gid = django
log-5xx = true
disable-logging = true
enable-threads = True
workers = 3
harakiri = 55
# Enforce limits to requests to avoid memory leaks
max-requests = 5000
max-requests-delta = 1000
wsgi-disable-file-wrapper = true
vacuum = True
chdir = /opt/app/src
env = DJANGO_SETTINGS_MODULE=project.docker_settings
env = prometheus_multiproc_dir=/tmp/metrics/
manage-script-name = true
mount = /=project.wsgi:application
# the socket
socket = 127.0.0.1:8001
reload-on-as = 768
reload-on-rss = 256
harakiri-verbose = true
reload-mercy = 300
lazy-apps=true
# Export uWSGI stats
stats=127.0.0.1:1717

safe-pidfile=/opt/app/uwsgi.pid

lazy_apps=true may be important here.

alpine:3.10
Python 3.7.10 (default, Mar  2 2021, 09:06:08)

Other things of note patch_all() is happening early in app initialisation (at the start of the wsgi file)

We have the following in the docker_settings.py file to initialise the connection to mongoengine:

from mongoengine import connect

connect(MONGODB_DATABASE, host=MONGODB_SERVER, port=MONGODB_PORT, connect=False)

This was the exact code path that was slowing down(it's simply to check connectivity - wrapped in a simple view):

        conn = mongoengine.connect(
            settings.MONGODB_DATABASE,
            host=settings.MONGODB_SERVER,
            port=settings.MONGODB_PORT, connect=True)
        conn.admin.command('ismaster')        
        conn.close()

Using this mongodb image:

mvertes/alpine-mongo:3.4.9-0

Hope this helps!

brettlangdon commented 3 years ago

Thanks! This is great. It might take me a minute to reproduce, but I'll let you know if I need anything else!

brettlangdon commented 3 years ago

oh wow... yup... definitely can confirm this

$ ddtrace-run python run.py
count: 25
min: 0.013789596000000015
p50: 0.5316218780000002
p75: 4.768647105
p95: 173.10879761329997
max: 204.309137153

Here are the results running with the patching disabled:

$ DD_TRACE_MONGOENGINE_ENABLED=false DD_TRACE_PYMONGO_ENABLED=false ddtrace-run python run.py
count: 25
min: 0.012196068000000004
p50: 0.5111844989999987
p75: 0.5154976094999997
p95: 0.5215658128000004
max: 0.5227140460000008
# run.py
import statistics
import timeit

import ddtrace

ddtrace.patch_all()
from mongoengine import connect

def test():
    client = connect(connect=False)
    client.admin.command("ismaster")
    client.close()

t = timeit.Timer(test)
results = t.repeat(25, 1)
p = statistics.quantiles(results, n=100)
print(f"count: {len(results)}")
print(f"min: {min(results)}")
print(f"p50: {p[49]}")
print(f"p75: {p[74]}")
print(f"p95: {p[94]}")
print(f"max: {max(results)}")

I do not have a great sense of why this is happening, but I was able to reproduce. Next steps we'll need to dig into this further.

dmarkey commented 3 years ago

Great that you can repro, if you need anything else let me know.

brettlangdon commented 3 years ago

@dmarkey fix provided in #2475

brettlangdon commented 3 years ago

@dmarkey this has been released to PyPI as version 0.49.2, let us know if the issue persists!