Open sebyul2 opened 8 months ago
This seems likely to be related to the abstraction layer added to the Django integration in https://github.com/DataDog/dd-trace-py/pull/7011. It's worth investigating to understand which optimizations to that layer are available.
Here's my first simple attempt at replicating this behavior. It makes a bunch of requests to a basic django ASGI app and averages the time they take.
```python
from contextlib import contextmanager
import os
import subprocess
import time

from tests.webclient import Client

SERVER_PORT = 8008
ITERATIONS = 1000


@contextmanager
def daphne_client():
    env = os.environ.copy()
    env.update(
        {
            "DJANGO_SETTINGS_MODULE": "tests.contrib.django.django_app.settings",
        }
    )
    cmd = ["ddtrace-run", "daphne", "-p", str(SERVER_PORT), "tests.contrib.django.asgi:application"]
    proc = subprocess.Popen(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        close_fds=True,
        env=env,
    )
    client = Client("http://localhost:%d" % SERVER_PORT)
    client.wait()
    try:
        yield (client, proc)
    finally:
        resp = client.get_ignored("/shutdown-tracer")
        assert resp.status_code == 200
        proc.terminate()


def main():
    request_durations = []
    with daphne_client() as tup:
        client, _ = tup
        for _ in range(ITERATIONS):
            start_time = time.time()
            client.get("/")
            request_durations.append(time.time() - start_time)
    avg = sum(request_durations) / len(request_durations)
    print(f"Avg {avg}s per request over {ITERATIONS} requests")


if __name__ == "__main__":
    main()
```
```console
$ cd ddtrace
$ git checkout v2.1.9
$ python -m venv venv && source venv/bin/activate && pip install requests daphne django_q channels -e .
$ python django_benchmark.py
Avg 0.00252020263671875s per request over 1000 requests
$ git checkout v2.2.1
$ python -m venv venv && source venv/bin/activate && pip install requests daphne django_q channels -e .
$ python django_benchmark.py
Avg 0.00283501672744751s per request over 1000 requests
```
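As an aside, averaging wall-clock `time.time()` deltas can hide tail behavior, which matters when comparing tracer overhead. Here's a small standalone sketch (the `time_calls` helper and its names are illustrative, not part of the benchmark above) that uses the monotonic `time.perf_counter()` and reports percentiles alongside the mean with only the standard library:

```python
import statistics
import time


def time_calls(fn, iterations=1000):
    """Time repeated calls to fn and return the mean plus tail percentiles."""
    durations = []
    for _ in range(iterations):
        start = time.perf_counter()  # monotonic, higher resolution than time.time()
        fn()
        durations.append(time.perf_counter() - start)
    # quantiles(n=100) returns the 1st..99th percentile cut points
    percentiles = statistics.quantiles(durations, n=100)
    return {
        "mean": statistics.fmean(durations),
        "p50": percentiles[49],
        "p95": percentiles[94],
        "p99": percentiles[98],
    }


if __name__ == "__main__":
    summary = time_calls(lambda: sum(range(1000)))
    print({k: f"{v:.6f}s" for k, v in summary.items()})
```

Swapping this in for the plain average in the benchmark would show whether the regression lives in the median or only in the tail.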
Running this test on my laptop shows a 12% difference between the two versions, which is far from the large difference you're seeing, @sebyul2.
Do you have any ideas on how I might adjust my benchmark case to make it more closely match the situation in your environment?
Thanks @emmettbutler! However, my environment uses uWSGI, not ASGI, so the slowdown is likely caused by a uWSGI-related change in ddtrace.
I closed the issue by mistake and reopened it.
Here's a test that uses uwsgi, with similar performance results.
```python
from contextlib import contextmanager
import os
import subprocess
import sys
import time

from django.core.wsgi import get_wsgi_application
from django.http import HttpResponse
from django.urls import path

from tests.webclient import Client

filepath, extension = os.path.splitext(__file__)
ROOT_URLCONF = os.path.basename(filepath)
WSGI_APPLICATION = os.path.basename(filepath) + ".app"
SERVER_PORT = 8008
ITERATIONS = 1000


def handler(_):
    return HttpResponse("Hello!")


os.environ["DJANGO_SETTINGS_MODULE"] = "django_benchmark"
urlpatterns = [path("", handler)]
app = get_wsgi_application()


@contextmanager
def django_client():
    cmd = [
        "uwsgi",
        "--master",
        "--enable-threads",
        "--lazy-apps",
        "--module=django_benchmark:app",
        "--http",
        f":{SERVER_PORT}",
        "--import=ddtrace.bootstrap.sitecustomize",
    ]
    proc = subprocess.Popen(
        cmd,
        stdout=sys.stdout,
        stderr=sys.stderr,
        close_fds=True,
        env=os.environ.copy(),
    )
    client = Client("http://localhost:%d" % SERVER_PORT)
    client.wait()
    try:
        yield (client, proc)
    finally:
        proc.terminate()
        proc.kill()


def main():
    request_durations = []
    with django_client() as (client, _):
        for _ in range(ITERATIONS):
            start_time = time.time()
            client.get("/")
            request_durations.append(time.time() - start_time)
    avg = sum(request_durations) / len(request_durations)
    print(f"Avg {avg}s per request over {ITERATIONS} requests")


if __name__ == "__main__":
    main()
```
```console
$ cd ddtrace
$ git checkout v2.1.9
$ python -m venv venv && source venv/bin/activate && pip install requests django_q channels pytest -e .
$ python django_benchmark.py
Avg 0.0013474271297454835s per request over 1000 requests
$ git checkout v2.2.1
$ python -m venv venv && source venv/bin/activate && pip install requests django_q channels pytest -e .
$ python django_benchmark.py
Avg 0.0017592124938964843s per request over 1000 requests
```
It's not as big a difference as I've experienced, but the benchmark does show a performance gap. Although the number is small in absolute terms, the latency difference is clearly visible in services that handle a lot of traffic, and we see many more timeout errors than usual. Compensating for this is driving up our instance costs. Is there anything else I should look into?
@sebyul2 I think the next step here is for our team to profile the benchmark code and understand the origin of the latency difference.
Here's a script that generates flamegraphs showing request servicing of a traced Django uWSGI server under each version:
```python
import argparse
from contextlib import contextmanager
import os
import subprocess
import sys
import time

from django.core.wsgi import get_wsgi_application
from django.http import HttpResponse
from django.urls import path

from tests.webclient import Client

filepath, extension = os.path.splitext(__file__)
ROOT_URLCONF = os.path.basename(filepath)
WSGI_APPLICATION = os.path.basename(filepath) + ".app"
SERVER_PORT = 8008
ITERATIONS = 1000


def handler(_):
    return HttpResponse("Hello!")


os.environ["DJANGO_SETTINGS_MODULE"] = "django_benchmark"
urlpatterns = [path("", handler)]
app = get_wsgi_application()


@contextmanager
def django_client(flame_filename="flamegraph.svg"):
    flame_file = open(flame_filename, "w")
    uwsgi_cmd = [
        "uwsgi",
        "--master",
        "--enable-threads",
        "--lazy-apps",
        "--exit-on-reload",
        "--die-on-term",
        "--module=django_benchmark:app",
        "--http",
        f":{SERVER_PORT}",
        "--import=ddtrace.bootstrap.sitecustomize",
    ]
    uwsgi_proc = subprocess.Popen(
        uwsgi_cmd,
        stderr=sys.stderr,
        stdout=sys.stdout,
        close_fds=True,
        env=os.environ.copy(),
    )
    austin_cmd = ["austin", "-C", "-P", f"--pid={uwsgi_proc.pid}"]
    austin_proc = subprocess.Popen(
        austin_cmd,
        stderr=sys.stderr,
        stdout=subprocess.PIPE,
        close_fds=True,
        env=os.environ.copy(),
    )
    graph_proc = subprocess.Popen(
        ["/Users/emmett.butler/git/FlameGraph/flamegraph.pl"],
        stdin=austin_proc.stdout,
        stdout=flame_file,
        stderr=sys.stderr,
    )
    client = Client("http://localhost:%d" % SERVER_PORT)
    client.wait()
    try:
        yield (client, uwsgi_proc)
    finally:
        print("killing austin")
        austin_proc.terminate()
        austin_proc.wait()
        print("terminating uwsgi")
        uwsgi_proc.terminate()
        print("waiting")
        time.sleep(2)
        print("killing uwsgi")
        uwsgi_proc.kill()
        print("closing flame file")
        flame_file.close()
        print("killing graph proc")
        graph_proc.terminate()
        graph_proc.wait()


def get_args():
    parser = argparse.ArgumentParser(
        description="Generate a flamegraph from a ddtrace'd django uwsgi server serving HTTP requests"
    )
    parser.add_argument(
        "-f", "--flame-filename", type=str, help="The file to which to write the flamegraph", default=""
    )
    return parser.parse_args()


def main():
    args = get_args()
    request_durations = []
    with django_client(flame_filename=args.flame_filename) as (client, _):
        for count in range(ITERATIONS):
            start_time = time.time()
            print(f"Doing request {count}")
            client.get("/")
            request_durations.append(time.time() - start_time)
    avg = sum(request_durations) / len(request_durations)
    print(f"Avg {avg}s per request over {ITERATIONS} requests")


if __name__ == "__main__":
    main()
```
```console
$ brew install austin
$ git clone git@github.com:brendangregg/FlameGraph.git
$ cd ddtrace
$ git checkout v2.1.9
$ python -m venv venv && source venv/bin/activate && pip install requests django django_q channels pytest -e .
$ sudo python django_benchmark.py --flame-filename=flamegraph-2.1.9.svg
$ git checkout v2.2.1
$ python -m venv venv && source venv/bin/activate && pip install requests django django_q channels pytest -e .
$ sudo python django_benchmark.py --flame-filename=flamegraph-2.2.1.svg
```
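If austin and FlameGraph aren't available, a rough first pass at the same question can come from the standard library's `cProfile`. This sketch profiles a hypothetical stand-in for the request path (the `handle_request` function is illustrative, not the actual uWSGI worker) and sorts by cumulative time, which approximates what the flamegraph surfaces:

```python
import cProfile
import io
import pstats


def handle_request():
    """Stand-in for the traced request-handling path being profiled."""
    return sum(i * i for i in range(10_000))


profiler = cProfile.Profile()
profiler.enable()
for _ in range(100):
    handle_request()
profiler.disable()

# Print the hottest entries by cumulative time, mirroring a flamegraph's widest frames
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
print(buf.getvalue())
```

For an in-process server this is easy to attach around the WSGI callable, though unlike austin it adds its own tracing overhead and can't attach to an already-running PID.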
The two resulting flamegraphs from my laptop are attached. I haven't dug into them yet.
Here's some more detail on the performance comparison from the above script. The top graph is 2.1.9 and the bottom graph is 2.2.1.

Here you can see that a bit more time is spent in `trace_utils.py:wrapper`, but there's no single function call that's primarily responsible. The clearest difference I can see is that the 2.2.1 flamegraph shows slightly increased time spent and a deeper call stack around `_start_span`, which is circled and which the following images go deeper into.

This shows a detail of the deeper call stack, with bits of overhead being added by the various calls in `core.py`. Some amount of overhead is expected from a code change like the one we're examining, given that a new abstraction layer was added.

This most zoomed-in screenshot shows that within `_start_span` itself there's hardly any difference in time spent between the two versions.
This investigation paints a picture of an execution stack with some added layers and increased overhead in amounts basically consistent with what I'd expect from the addition of the Core API abstraction layer that happened in https://github.com/DataDog/dd-trace-py/pull/7011. The most readily apparent way to decrease the time spent in this new layer is to rewrite the slowest sections in Cython, which has been on the mid-term backlog since the Core API was introduced and might be more urgent now that we've migrated the Django integration to the Core API.
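As a toy illustration of why stacked wrapper layers show up as a small but real per-call cost even when no single frame is slow (this is generic Python, not ddtrace's actual Core API code):

```python
import timeit


def handler():
    return "ok"


def wrap(fn):
    # Each abstraction layer adds one extra stack frame and one extra call per request
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)

    return wrapper


bare = handler
layered = wrap(wrap(wrap(handler)))  # three stacked layers, like the deeper call stack above

t_bare = timeit.timeit(bare, number=100_000)
t_layered = timeit.timeit(layered, number=100_000)
print(f"bare: {t_bare:.4f}s, 3 layers: {t_layered:.4f}s over 100k calls")
```

Per call the difference is sub-microsecond, which is why moving the hottest layers to Cython (or flattening them) is the usual remedy rather than optimizing any one function.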
@sebyul2 in my estimation, these data don't completely account for the large latency disparity you noted in the original post. Even though my screenshots definitely show a performance degradation, it's only barely on the order of magnitude of your original post. I can't really think of a situation in which these deeper call stacks and little bits of added overhead would compound to result in a difference of that size. Given that, I think that going deeper on your particular case will require adjusting the benchmarking script above to be more similar to the code from which you generated the original post. Do you have any suggestions on how we might do that? Thanks again.
There was no significant difference with the benchmark code you wrote. But in practice, upgrading the ddtrace version on our server makes a big difference: latency increasing from 3ms to 9~15ms is a very big deal in a commercial service. The screenshots are compared at the same scale, and you can see there is a very large difference. (Of course, the difference is smaller for APIs dominated by database costs, but it is still significant.)

This difference cannot be reproduced with the benchmark code you provided, and even with debug logs we cannot confirm exactly what the problem is, but what is clear is that ddtrace's code changes around uWSGI and Django are causing the difference above.

Since ddtrace version 1.9.0 does not support the latest versions of the psycopg, pymongo, and redis-py libraries, upgrading those library versions is also not possible. How can we help you create benchmark code that reflects our commercial environment and identifies the problem?
As another unhelpfully vague data point, we just upgraded a large, complex, monolithic Django service from `ddtrace==1.20.19` to `ddtrace==2.6.12` and saw significant latency spikes that forced us to downgrade back to 1.20.x.

Here's what p90/95/99/max latency looked like yesterday, using `ddtrace==2.6.12`, where each vertical bar indicates the completion of a rolling deploy:

Here's what it looks like today, after reverting back to `ddtrace==1.20.19` in the second of the three deploys:
Additional context:
Summary of problem
An overall increase in duration has been observed in ddtrace versions 2.2.0 and above. This has led to a significant increase in latency in actual user service environments, resulting in service delays and failures. The issue has been resolved by using ddtrace versions 2.1.9 and below.
API latency increased during the patch period that included the ddtrace 2.2.0 upgrade (log scale)
Compare duration logs between ddtrace 2.1.9 and 2.2.0
ddtrace==2.1.9
ddtrace==2.2.0
Which version of dd-trace-py are you using?
2.2.0
Which version of pip are you using?
22.2.2
Which libraries and their versions are you using?
`pip freeze`
```
amqp==2.6.0 ansible==6.3.0 ansible-core==2.13.3 asgiref==3.7.2 astroid==1.6.0 attrs==20.1.0 awscli==1.25.67 bcrypt==4.0.0 billiard==3.6.3.0 boto3==1.14.19 botocore==1.17.19 bytecode==0.15.1 cattrs==23.1.2 celery==4.4.0 certifi==2018.8.24 cffi==1.15.1 chardet==3.0.4 charset-normalizer==2.1.1 colorama==0.4.4 coreapi==2.3.3 coreschema==0.0.4 coverage==4.4.2 cryptography==3.2 ddsketch==2.0.4 ddtrace==1.9.0 Deprecated==1.2.14 distlib==0.3.6 distro==1.7.0 Django==3.0.11 django-cacheops==5.1 django-cors-headers==2.1.0 django-redis==4.12.1 django-storages==1.10.1 django-utils-six==2.0 djangoql==0.13.1 djangorestframework==3.12.0 docker==6.0.0 docker-compose==1.29.2 dockerpty==0.4.1 docopt==0.6.2 docutils==0.14 drf-spectacular==0.17.2 elasticsearch==6.8.0 envier==0.5.1 exceptiongroup==1.2.0 facebook-sdk==3.0.0 filelock==3.8.0 freezegun==1.0.0 funcy==1.10 future==0.16.0 hashids==1.2.0 idna==2.7 inflection==0.5.1 isort==4.2.15 itypes==1.1.0 Jinja2==2.10 jmespath==0.9.3 jsonschema==3.2.0 jwcrypto==1.4.2 kombu==4.6.10 lazy-object-proxy==1.3.1 MarkupSafe==1.1.0 mccabe==0.6.1 msgpack-python==0.5.1 oauthlib==2.0.7 openapi-codec==1.3.2 packaging==21.3 paramiko==2.11.0 paypal-checkout-serversdk==1.0.1 paypalhttp==1.0.0 phonenumbers==8.10.22 platformdirs==2.5.2 ply==3.11 protobuf==4.25.2 psycopg2-binary==2.8.6 py==1.9.0 pyasn1==0.4.8 pybreaker==0.6.0 pycparser==2.21 pycryptodome==3.9.8 PyJWT==1.7.1 pymongo==3.9.0 PyNaCl==1.5.0 pyOpenSSL==20.0.1 pyparsing==3.0.9 pyrsistent==0.18.1 python-dateutil==2.7.3 python-dotenv==0.21.0 python-jwt==3.3.4 python-twitter==3.4 pytz==2017.3 PyYAML==5.3.1 redis==3.5.3 requests==2.19.1 requests-oauthlib==0.8.0 resolvelib==0.8.1 rsa==4.7.2 s3transfer==0.3.7 sentry-sdk==0.12.2 simplejson==3.13.2 six==1.16.0 sqlparse==0.3.0 tenacity==8.2.3 texttable==1.6.4 typing_extensions==4.9.0 ua-parser==0.18.0 uritemplate==3.0.0 urllib3==1.23 user-agents==1.1.0 uWSGI==2.0.23 vine==1.3.0 virtualenv==20.16.4 websocket-client==1.4.1 wrapt==1.10.11 xmltodict==0.13.0
```

How can we reproduce your problem?
Using ddtrace version >= 2.2.0
What is the result that you get?
An overall increase in latency (duration) is observed.
What is the result that you expected?
A similar level of duration as versions 2.1.9 and below.