DataDog / dd-trace-py

Datadog Python APM Client
https://ddtrace.readthedocs.io/

ddtrace versions 2.2.0 and above increase the duration of "django.request" #8388

Open sebyul2 opened 8 months ago

sebyul2 commented 8 months ago

Summary of problem

An overall increase in duration has been observed with ddtrace versions 2.2.0 and above. In our production environment this translates into a significant increase in latency, causing service delays and failures. Downgrading to ddtrace 2.1.9 or below resolves the issue.

[Screenshots] API latency increased during the patch period that included the ddtrace 2.2.0 upgrade (log scale).

[Screenshots] Comparison of duration logs between ddtrace 2.1.9 and 2.2.0.

Which version of dd-trace-py are you using?

2.2.0

Which version of pip are you using?

22.2.2

Which libraries and their versions are you using?

`pip freeze` amqp==2.6.0 ansible==6.3.0 ansible-core==2.13.3 asgiref==3.7.2 astroid==1.6.0 attrs==20.1.0 awscli==1.25.67 bcrypt==4.0.0 billiard==3.6.3.0 boto3==1.14.19 botocore==1.17.19 bytecode==0.15.1 cattrs==23.1.2 celery==4.4.0 certifi==2018.8.24 cffi==1.15.1 chardet==3.0.4 charset-normalizer==2.1.1 colorama==0.4.4 coreapi==2.3.3 coreschema==0.0.4 coverage==4.4.2 cryptography==3.2 ddsketch==2.0.4 ddtrace==1.9.0 Deprecated==1.2.14 distlib==0.3.6 distro==1.7.0 Django==3.0.11 django-cacheops==5.1 django-cors-headers==2.1.0 django-redis==4.12.1 django-storages==1.10.1 django-utils-six==2.0 djangoql==0.13.1 djangorestframework==3.12.0 docker==6.0.0 docker-compose==1.29.2 dockerpty==0.4.1 docopt==0.6.2 docutils==0.14 drf-spectacular==0.17.2 elasticsearch==6.8.0 envier==0.5.1 exceptiongroup==1.2.0 facebook-sdk==3.0.0 filelock==3.8.0 freezegun==1.0.0 funcy==1.10 future==0.16.0 hashids==1.2.0 idna==2.7 inflection==0.5.1 isort==4.2.15 itypes==1.1.0 Jinja2==2.10 jmespath==0.9.3 jsonschema==3.2.0 jwcrypto==1.4.2 kombu==4.6.10 lazy-object-proxy==1.3.1 MarkupSafe==1.1.0 mccabe==0.6.1 msgpack-python==0.5.1 oauthlib==2.0.7 openapi-codec==1.3.2 packaging==21.3 paramiko==2.11.0 paypal-checkout-serversdk==1.0.1 paypalhttp==1.0.0 phonenumbers==8.10.22 platformdirs==2.5.2 ply==3.11 protobuf==4.25.2 psycopg2-binary==2.8.6 py==1.9.0 pyasn1==0.4.8 pybreaker==0.6.0 pycparser==2.21 pycryptodome==3.9.8 PyJWT==1.7.1 pymongo==3.9.0 PyNaCl==1.5.0 pyOpenSSL==20.0.1 pyparsing==3.0.9 pyrsistent==0.18.1 python-dateutil==2.7.3 python-dotenv==0.21.0 python-jwt==3.3.4 python-twitter==3.4 pytz==2017.3 PyYAML==5.3.1 redis==3.5.3 requests==2.19.1 requests-oauthlib==0.8.0 resolvelib==0.8.1 rsa==4.7.2 s3transfer==0.3.7 sentry-sdk==0.12.2 simplejson==3.13.2 six==1.16.0 sqlparse==0.3.0 tenacity==8.2.3 texttable==1.6.4 typing_extensions==4.9.0 ua-parser==0.18.0 uritemplate==3.0.0 urllib3==1.23 user-agents==1.1.0 uWSGI==2.0.23 vine==1.3.0 virtualenv==20.16.4 websocket-client==1.4.1 wrapt==1.10.11 xmltodict==0.13.0

How can we reproduce your problem?

Using ddtrace version >= 2.2.0

What is the result that you get?

An overall increase in latency (duration) is observed.

What is the result that you expected?

A similar level of duration as versions 2.1.9 and below.

emmettbutler commented 7 months ago

This seems likely to be related to the abstraction layer added to the Django integration in https://github.com/DataDog/dd-trace-py/pull/7011. It's worth looking into to understand what optimizations to that layer are available.
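For intuition, here's a minimal, self-contained sketch (hypothetical stand-ins, not ddtrace's actual internals) of how a couple of extra wrapper and dispatch layers per call show up in a micro-benchmark:

# Hypothetical sketch: measure the cost of calling a view directly vs. through
# wrapper layers resembling an integration plus an event-dispatch abstraction.
import timeit

def view():
    return "Hello!"

def dispatch(event, args):
    # stand-in for an event-dispatch hop in an abstraction layer
    return args

def traced(func):
    def wrapper(*args, **kwargs):
        dispatch("request.started", args)
        try:
            return func(*args, **kwargs)
        finally:
            dispatch("request.finished", args)
    return wrapper

wrapped_view = traced(traced(view))  # two layers of indirection

direct = timeit.timeit(view, number=1_000_000)
layered = timeit.timeit(wrapped_view, number=1_000_000)
per_call_ns = (layered - direct) / 1_000_000 * 1e9
print(f"direct: {direct:.3f}s  layered: {layered:.3f}s  ~{per_call_ns:.0f} ns added per call")

A few hundred nanoseconds per hop is negligible on its own; what matters is how many such hops each request crosses.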

emmettbutler commented 7 months ago

Here's my first simple attempt at replicating this behavior. It makes a bunch of requests to a basic django ASGI app and averages the time they take.

from contextlib import contextmanager
import os
import subprocess
import time

from tests.webclient import Client

SERVER_PORT = 8008
ITERATIONS = 1000

@contextmanager
def daphne_client():
    env = os.environ.copy()
    env.update(
        {
            "DJANGO_SETTINGS_MODULE": "tests.contrib.django.django_app.settings",
        }
    )

    cmd = ["ddtrace-run", "daphne", "-p", str(SERVER_PORT), "tests.contrib.django.asgi:application"]
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        close_fds=True,
        env=env,
    )

    client = Client("http://localhost:%d" % SERVER_PORT)

    client.wait()

    try:
        yield (client, proc)
    finally:
        resp = client.get_ignored("/shutdown-tracer")
        assert resp.status_code == 200
        proc.terminate()

def main():
    request_durations = []
    with daphne_client() as tup:
        client, _ = tup
        for _ in range(ITERATIONS):
            start_time = time.time()
            client.get("/")
            request_durations.append(time.time() - start_time)
    avg = sum(request_durations) / len(request_durations)
    print(f"Avg {avg}s per request over {ITERATIONS} requests")

if __name__ == "__main__":
    main()
$ cd ddtrace
$ git checkout v2.1.9
$ python -m venv venv && source venv/bin/activate && pip install requests daphne django_q channels -e .
$ python django_benchmark.py
Avg 0.00252020263671875s per request over 1000 requests
$ git checkout v2.2.1
$ python -m venv venv && source venv/bin/activate && pip install requests daphne django_q channels -e .
$ python django_benchmark.py
Avg 0.00283501672744751s per request over 1000 requests

Running this test on my laptop shows a 12% difference between the two versions, which is far from the large difference you're seeing, @sebyul2.
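Since averages can hide tail behavior, a small addition (a sketch reusing the request_durations list the script already collects) could also report percentiles:

# Sketch: summarize request_durations with percentiles instead of only the mean.
import statistics

def summarize(durations):
    durations = sorted(durations)
    cuts = statistics.quantiles(durations, n=100)  # 99 cut points
    return {
        "avg": sum(durations) / len(durations),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
        "max": durations[-1],
    }

# e.g. at the end of main():
# print(summarize(request_durations))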

Do you have any ideas on how I might adjust my benchmark case to make it more closely match the situation in your environment?

sebyul2 commented 7 months ago

Thanks @emmettbutler! However, my environment runs uwsgi, not ASGI, so the issue is likely caused by a uwsgi-related change in ddtrace.

sebyul2 commented 7 months ago

I closed the issue by mistake and reopened it.

emmettbutler commented 7 months ago

Here's a test that uses uwsgi, with similar performance results.

from contextlib import contextmanager
import os
import subprocess
import sys
import time

from django.core.wsgi import get_wsgi_application
from django.http import HttpResponse
from django.urls import path

from tests.webclient import Client

filepath, extension = os.path.splitext(__file__)
ROOT_URLCONF = os.path.basename(filepath)
WSGI_APPLICATION = os.path.basename(filepath) + ".app"
SERVER_PORT = 8008
ITERATIONS = 1000

def handler(_):
    return HttpResponse("Hello!")

os.environ["DJANGO_SETTINGS_MODULE"] = "django_benchmark"

urlpatterns = [path("", handler)]
app = get_wsgi_application()

@contextmanager
def django_client():
    cmd = [
        "uwsgi",
        "--master",
        "--enable-threads",
        "--lazy-apps",
        "--module=django_benchmark:app",
        "--http",
        f":{str(SERVER_PORT)}",
        "--import=ddtrace.bootstrap.sitecustomize",
    ]
    proc = subprocess.Popen(
        cmd,
        stdout=sys.stdout,
        stderr=sys.stderr,
        close_fds=True,
        env=os.environ.copy(),
    )

    client = Client("http://localhost:%d" % SERVER_PORT)

    client.wait()

    try:
        yield (client, proc)
    finally:
        proc.terminate()
        proc.kill()

def main():
    request_durations = []
    with django_client() as (client, _):
        for _ in range(ITERATIONS):
            start_time = time.time()
            client.get("/")
            request_durations.append(time.time() - start_time)
    avg = sum(request_durations) / len(request_durations)
    print(f"Avg {avg}s per request over {ITERATIONS} requests")

if __name__ == "__main__":
    main()
$ cd ddtrace
$ git checkout v2.1.9
$ python -m venv venv && source venv/bin/activate && pip install requests django_q channels pytest -e .
$ python django_benchmark.py
Avg 0.0013474271297454835s per request over 1000 requests
$ git checkout v2.2.1
$ python -m venv venv && source venv/bin/activate && pip install requests django_q channels pytest -e .
$ python django_benchmark.py
Avg 0.0017592124938964843s per request over 1000 requests

sebyul2 commented 7 months ago

It's not as big a difference as I've experienced, but the benchmark code does show a performance gap. Although the absolute numbers are small, there is a clear latency increase in services that handle a lot of traffic, along with many more timeout errors than usual. To compensate we are scaling up, which is driving up instance costs. Is there anything else I should look into?

[Screenshot]

emmettbutler commented 7 months ago

@sebyul2 I think the next step here is for our team to profile the benchmark code and understand the origin of the latency difference.
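One possible approach (a sketch of my own, assuming ddtrace-run's Django patching also applies to a standalone script configured via settings.configure) is to drive the same trivial view in-process with Django's test client under cProfile, so the tracing frames can be compared deterministically between versions without uwsgi or network noise:

# profile_django.py -- hedged sketch; run with: ddtrace-run python profile_django.py
import cProfile
import pstats

import django
from django.conf import settings

settings.configure(DEBUG=False, ALLOWED_HOSTS=["*"], ROOT_URLCONF=__name__, MIDDLEWARE=[])
django.setup()

from django.http import HttpResponse
from django.test import Client
from django.urls import path

def handler(_):
    return HttpResponse("Hello!")

urlpatterns = [path("", handler)]

def run(iterations=1000):
    client = Client()
    for _ in range(iterations):
        client.get("/")

if __name__ == "__main__":
    profiler = cProfile.Profile()
    profiler.enable()
    run()
    profiler.disable()
    # Compare the cumulative time of the tracing frames between 2.1.9 and 2.2.x.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)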

emmettbutler commented 7 months ago

Here's a script that generates flamegraphs of request servicing in a traced Django uwsgi server under each version:

import argparse
from contextlib import contextmanager
from multiprocessing import Pipe
import os
import subprocess
import sys
import time

from django.core.wsgi import get_wsgi_application
from django.http import HttpResponse
from django.urls import path

from tests.webclient import Client

filepath, extension = os.path.splitext(__file__)
ROOT_URLCONF = os.path.basename(filepath)
WSGI_APPLICATION = os.path.basename(filepath) + ".app"
SERVER_PORT = 8008
ITERATIONS = 1000

def handler(_):
    return HttpResponse("Hello!")

os.environ["DJANGO_SETTINGS_MODULE"] = "django_benchmark"

urlpatterns = [path("", handler)]
app = get_wsgi_application()

@contextmanager
def django_client(flame_filename="flamegraph.svg"):
    flame_file = open(flame_filename, "w")
    uwsgi_cmd = [
        "uwsgi",
        "--master",
        "--enable-threads",
        "--lazy-apps",
        "--exit-on-reload",
        "--die-on-term",
        "--module=django_benchmark:app",
        "--http",
        f":{str(SERVER_PORT)}",
        "--import=ddtrace.bootstrap.sitecustomize",
    ]
    uwsgi_proc = subprocess.Popen(
        uwsgi_cmd,
        stderr=sys.stderr,
        stdout=sys.stdout,
        close_fds=True,
        env=os.environ.copy(),
    )
    austin_cmd = ["austin", "-C", "-P", f"--pid={uwsgi_proc.pid}"]
    austin_proc = subprocess.Popen(
        austin_cmd,
        stderr=sys.stderr,
        stdout=subprocess.PIPE,
        close_fds=True,
        env=os.environ.copy(),
    )
    graph_proc = subprocess.Popen(
        ["/Users/emmett.butler/git/FlameGraph/flamegraph.pl"],
        stdin=austin_proc.stdout,
        stdout=flame_file,
        stderr=sys.stderr,
    )

    client = Client("http://localhost:%d" % SERVER_PORT)

    client.wait()

    try:
        yield (client, uwsgi_proc)
    finally:
        print("killing austin")
        austin_proc.terminate()
        austin_proc.wait()
        print("terminating uwsgi")
        uwsgi_proc.terminate()
        print("waiting")
        time.sleep(2)
        print("killing uwsgi")
        uwsgi_proc.kill()
        print("closing flame file")
        flame_file.close()
        print("killing graph proc")
        graph_proc.terminate()
        graph_proc.wait()

def get_args():
    parser = argparse.ArgumentParser(
        description="Generate a flamegraph from a ddtrace'd django uwsgi server serving HTTP requests"
    )
    parser.add_argument(
        "-f", "--flame-filename", type=str, help="The file to which to write the flamegraph", default=""
    )
    return parser.parse_args()

def main():
    args = get_args()
    request_durations = []
    with django_client(flame_filename=args.flame_filename) as (client, _):
        for count in range(ITERATIONS):
            start_time = time.time()
            print(f"Doing request {count}")
            client.get("/")
            request_durations.append(time.time() - start_time)
    avg = sum(request_durations) / len(request_durations)
    print(f"Avg {avg}s per request over {ITERATIONS} requests")

if __name__ == "__main__":
    main()
$ brew install austin
$ git clone git@github.com:brendangregg/FlameGraph.git
$ cd ddtrace
$ git checkout v2.1.9
$ python -m venv venv && source venv/bin/activate && pip install requests django django_q channels pytest -e .
$ sudo python django_benchmark.py --flame-filename=flamegraph-2.1.9.svg
$ git checkout v2.2.1
$ python -m venv venv && source venv/bin/activate && pip install requests django django_q channels pytest -e .
$ sudo python django_benchmark.py --flame-filename=flamegraph-2.2.1.svg

The two resulting flamegraphs from my laptop are attached. I haven't dug into them yet.

[Attachments: flamegraph-2.1.9.svg and flamegraph-2.2.1.svg]

emmettbutler commented 7 months ago

Here's some more detail on the performance comparison from the above script. The top graph is 2.1.9 and the bottom graph is 2.2.1.

[flamecompzoom1] Here you can see that a bit more time is spent in trace_utils.py:wrapper, but no single function call is primarily responsible. The clearest difference I can see is that the 2.2.1 flamegraph shows slightly more time and a deeper call stack around _start_span, which is circled and which the following images zoom into.

[flamecompzoom2] This shows a detail of that deeper call stack, with small amounts of overhead added by the various calls in core.py. Some overhead is expected from a change like the one we're examining, which added a new abstraction layer.

[flamecompzoom3] This most zoomed-in screenshot shows that within _start_span itself there is hardly any difference in time spent between the two versions.

This investigation paints a picture of an execution stack with some added layers and increased overhead in amounts basically consistent with what I'd expect from the addition of the Core API abstraction layer that happened in https://github.com/DataDog/dd-trace-py/pull/7011. The most readily apparent way to decrease the time spent in this new layer is to rewrite the slowest sections in Cython, which has been on the mid-term backlog since the Core API was introduced and might be more urgent now that we've migrated the Django integration to the Core API.
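To put the uwsgi benchmark numbers above side by side (just arithmetic on the averages already printed, no new measurement):

# Back-of-envelope comparison of the averages from the uwsgi benchmark above.
baseline = 0.0013474271297454835  # avg s/request on v2.1.9
upgraded = 0.0017592124938964843  # avg s/request on v2.2.1
added_ms = (upgraded - baseline) * 1e3
print(f"~{added_ms:.2f} ms added per request ({upgraded / baseline - 1:.0%} relative)")
# roughly 0.4 ms (~30%) added per request -- noticeable, but far smaller than
# the gap reported in the original post.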

@sebyul2 in my estimation, these data don't completely account for the large latency disparity you noted in the original post. My screenshots definitely show a performance degradation, but it's barely on the same order of magnitude as what you reported. I can't think of a situation in which these deeper call stacks and small bits of added overhead would compound into a difference of that size. Given that, I think going deeper on your particular case will require adjusting the benchmarking script above to more closely match the code that produced your original numbers. Do you have any suggestions on how we might do that? Thanks again.

sebyul2 commented 7 months ago

There was no significant difference with the benchmark code you wrote, but upgrading the ddtrace version on our servers makes a big one: latency increases from 3ms to 9~15ms, which is a very big difference for a commercial service. The screenshot below compares the two at the same scale, and the gap is clearly visible. (Of course, the relative difference is smaller for APIs with higher database costs, but it is still significant.)

[Screenshot: latency comparison at the same scale]

This difference cannot be reproduced with the benchmark code you provided, and even with debug logging we cannot pinpoint exactly what the problem is, but what is clear is that ddtrace's uwsgi and Django code changes are causing the difference above.

Since ddtrace 1.9.0 does not support the latest versions of the psycopg, pymongo, and redis-py libraries, upgrading those libraries is not possible for us either. How can we help you create benchmark code that reflects our commercial environment and identifies the problem?
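One way to move the benchmark above closer to a commercial setup (a sketch with assumptions: an in-memory sqlite database standing in for the real database, and threaded clients standing in for production traffic) is to have the view do traced database work so each request produces several spans, and to issue requests concurrently:

# Sketch of adjustments to the uwsgi benchmark module (which doubles as the
# Django settings module in that script).
from concurrent.futures import ThreadPoolExecutor
import time

from django.db import connection
from django.http import HttpResponse

# Added to the settings section of the benchmark module:
DATABASES = {"default": {"ENGINE": "django.db.backends.sqlite3", "NAME": ":memory:"}}

def handler(_):
    # Per-request database work -> more spans per trace and more
    # integration/abstraction hops, closer to a real API endpoint.
    with connection.cursor() as cursor:
        cursor.execute("SELECT 1")
        cursor.fetchone()
    return HttpResponse("Hello!")

def timed_get(client):
    start = time.time()
    client.get("/")
    return time.time() - start

def load(client, iterations=1000, concurrency=16):
    # Issue requests from several threads to approximate concurrent traffic.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda _: timed_get(client), range(iterations)))

This assumes the shared test client is thread-safe; if tests.webclient.Client isn't, use one client per worker, and uwsgi would also need more than one process/thread (--processes/--threads) for the concurrency to matter.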

mccutchen commented 3 months ago

As another unhelpfully vague data point, we just upgraded a large, complex, monolithic Django service from ddtrace==1.20.19 to ddtrace==2.6.12 and saw significant latency spikes that forced us to downgrade back to 1.20.x.

Here's what p90/95/99/max latency looked like yesterday, using ddtrace==2.6.12, where each vertical bar indicates the completion of a rolling deploy: [image]

Here's what it looks like today, after reverting back to ddtrace==1.20.19 in the second of the three deploys: [image]

Additional context: