DataDog / dd-trace-php

Datadog PHP Clients
https://docs.datadoghq.com/tracing/setup/php
Other
493 stars 152 forks source link

[FEATURE] All integrations enabled by default causes increased CPU usage #1199

Open tomcant opened 3 years ago

tomcant commented 3 years ago

We recently added APM to our stack and it's mostly working great, however we noticed a significant increase in CPU usage when the PHP tracer is enabled. Looking at the constructor in the IntegrationsLoader class it appears to add all the built-in integrations by default, and proceeds to load/initialise them all from dd_init.php.

We wondered if this could be the cause of our increased CPU usage, and in any case thought it might be more efficient to have an option for enabling specific integrations on a one-by-one basis with the default being false. Technically it does look to be possible using a combination of the DD_INTEGRATIONS_DISABLED and DD_TRACE_<INTEGRATION>_ENABLED environment variables, but it feels wrong to have to specify the names of all integrations you don't want enabled first.

I understand it might be more helpful in some circumstances to enable all integrations by default so that teams can just install the PHP extension and go, but maybe we could introduce something whereby if an integration has been explicitly enabled (e.g. DD_TRACE_SYMFONY_ENABLED=true) then all other integrations are disabled. It looks like this might be possible with a few tweaks to the ddtrace_config_integration_enabled() function.

One other thing worth mentioning is that we tried disabling the root span with DD_TRACE_GENERATE_ROOT_SPAN=false for some of our long-running processes, effectively reducing the footprint down to only having the extension enabled (i.e. no traces sent to DD), and we still saw the same impact on CPU usage, so it doesn't seem to be anything to do with the actual tracing.

It would be interesting to hear more thoughts on this. Perhaps we're missing something...

labbati commented 3 years ago

Hi @tomcant, and thanks for the report. I am curious about which version of PHP you are using and what is the CPU overhead you are measuring.

A few thoughts on your proposal.

We are progressively introducing lazy integration loading (starting from 7 and for libraries integrations, soon to be extended to all our integrations). With this mechanism an integration is loaded when it is used the very first time. This should be enough to address your (valid) concerns about 'wasted' CPU cycles for integrations you do not need.

About the CPU increase it really depends on the app, but just to give you some numbers for what you can expect, we are around 25% CPU overhead for a wordpress 5 home page with a few posts under heavy load. We are currently working to move ALL our code from PHP to C, this will reduce CPU consumption, among other things. This is in progress and part of our Q2 planning. We are currently reductant to work on optimizations related to performance until we have our new 'baseline' (aka no php code at all, only C) unless this, of course, implies stability issues.

tomcant commented 3 years ago

Hi @labbati, thanks for the speedy response and apologies for the delay in getting back to you.

We're running PHP 7.4 and the Symfony framework. We ran some experiments around our long running processes today, and here's what happened to CPU usage:

Screenshot 2021-04-15 at 18 05 12

Prior to 14:15, the PHP extension was enabled but DD_TRACE_CLI_ENABLED had not been set, so this was essentially a no-op and CPU usage looks roughly "normal" for that time of day. From 14:15 to 17:00, we set DD_TRACE_CLI_ENABLED=true with DD_TRACE_GENERATE_ROOT_SPAN=false, and we weren't creating any custom spans, so this was also a no-op. At 17:00 we additionally set the following to explicitly disable all built-in integrations (again with no root/custom spans):

DD_INTEGRATIONS_DISABLED=cakephp,codeigniter,curl,elasticsearch,eloquent,guzzle,laravel,lumen,memcached,mongo,mysqli,pdo,phpredis,predis,slim,symfony,wordpress,yii,zendframework

It looks like the integrations being enabled is what causes the additional load, and although difficult to quantify exactly, it looks to be by more than 25% on average.

The developments you mentioned around lazily loading the integrations and moving more of the work into C sound great. For now we'll experiment with disabling the plugins we're not interested in, and we'll keep an eye out for developments on the project.

eddmann commented 3 years ago

Just to add to what @tomcant has mentioned,

Experimenting with enabling certain desired integrations today it looks like the one that causes the performance hit is PDO. With enabling this integration in the configuration @tomcant explained above we see the same CPU usage increase (without any spans ever being created, and simply loading the extension into our environment).

We are attempting to trace a sample of Supervisor-based long-running process 'loops', these do communicate a lot with the database. It would be great if we could only incur this integration 'work cost' if we are in a Span and wished to be traced. At this time it seems like dd_trace_tracer_is_limited is called to short-circuit the work the integration wishes to do.