elastic / apm-agent-php

Apache License 2.0
252 stars 69 forks source link

High CPU usage #1006

Open nightmarebug opened 1 year ago

nightmarebug commented 1 year ago

I installed version 1.9.0 of the elastic APM agent into production server which runs php-7.4 on ubuntu 20.04 . The application itself is a large mysql backed Zend(1.1) application.

After installing APM withing few hours CPU is usage over 95% and application got crashed

After disabling the APM it returned to very low CPU usage and everything works fine

Congifuration:-

elastic_apm.enabled = true elastic_apm.environment = "production" elastic_apm.server_timeout = "30s" elastic_apm.async_backend_command = true elastic_apm.transaction_max_spans = 500 elastic_apm.transaction_sample_rate = 1.0

alehostert commented 1 year ago

I'm investigating a similar situation with my servers (PHP 7.4, CentOS 7).

I noticed that when elastic apm agent php is installed and enabled, several connections keep open on MySQL with the Sleep state. Sometimes, the connections ends with 0 (zero) time. But, sometimes, the connections keeps open until wait_timeout is reached.

If I uninstall the elastic apm agent, none Sleeping connections will appear, as usual.

Of course, we can tune "wait_timeout" to workaround this, but I'm curious if is expected to have such Sleeping connections on MySQL when apm agent is enabled, or not.

bck01215 commented 1 year ago

Also experiencing on php 8.1 and alpine base image.

CPU seems to be from an increase in spans from wordpress, but it might be something else. Spans have gone way up, but limiting spans and stack traces hasn't resolved the issue.

bck01215 commented 1 year ago
image

Adding an image from using phpspy and a minimalist wordpress setup, you can see under load testing APM is responsible for almost all CPU usage

When using a comparison between 1.9.1 and 1.8.4, Elastic\Apm\Impl\AutoInstrument\PhpPartFacade::callAndSwallowThrowable showed a massive increase in CPU usage. 1.9.1 used 92% /1.88 CPU hours in the same amount of time that 1.8.4 used 11.11% of the CPU/.22 CPU hours.

nightmarebug commented 1 year ago

Means 1.8.4 have less cpu usage

6fears7 commented 10 months ago

My team has reported this issue as well when moving to 1.9.0 (php 8.1 and alpine base image). They have regressed to 1.8.4 until this is resolved.

SergeyKleyman commented 8 months ago

@nightmarebug @6fears7 @bck01215 Could you please try the latest release v1.12.0 and if the agent still cause CPU increase please try setting ast_process_enabled configuration option (ELASTIC_APM_AST_PROCESS_ENABLED environment variable or elastic_apm.ast_process_enabled php.ini option) to false.

6fears7 commented 8 months ago

Setting the ELASTIC_APM_AST_PROCESS_ENABLED to false did reduce the CPU usage notably.

Before: Total of 61% usage in a CPU frame image

After: Total of 34% usage in a CPU frame. image

SergeyKleyman commented 8 months ago

Hi @6fears7 Thank you testing the last suggestion. Regarding the remaining CPU overhead it seems to be related to spans created by MySQLi instrumentation. Could you please check the effect on CPU if:

6fears7 commented 8 months ago

@SergeyKleyman

Roughly 20% CPU utilization in the frame.

image

SergeyKleyman commented 7 months ago

@6fears7 Thank you for the feedback. Could you please provide for comparison the flame graph without the agent (for example with elastic_apm extension commented out in php.ini)? Also is 20% CPU utilization you mentioned above inline with your expectations?

6fears7 commented 7 months ago

image

With agent disabled.

After seeing 20% utilization, I decided to regress back to 1.8.4 with all the same configs set that we've discussed above. 10% utilization.

Then I went back and did 1.12.0 again and extended my test duration by 5x the amount just to smooth the data out. The test is just curling the wordpress endpoint every two seconds until the duration is up.

There were frames that hit 30%, but overall the average remained the same as 1.8.4: 10%.

image

So in summary and to answer your question @SergeyKleyman , these options brought the CPU usage down to the 1.8.4 levels that I'd expected the agent to utilize:

elastic_apm.ast_process_enabled=false
elastic_apm.span_stack_trace_min_duration=100ms
elastic_apm.disable_instrumentations=MySQLi

References: span_stack_trace_min_duration option disable_instrumentations option

I don't have an upper bound in mind of acceptable thresholds, but if the average is 10% and the upper is 30%, that works for us.