DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
580 stars 288 forks source link

CI test cycles slow to a crawl with 1.18.0+ #5696

Closed internetstaff closed 1 year ago

internetstaff commented 1 year ago

Upgrading past 1.17.0 causes our test cycles to slow down so much they time out (~7 minutes to 60++).

We're on Java 17, Spring Boot 2.7, Junit 5. We haven't dug too deeply yet. What would help?

nikita-tkachenko-datadog commented 1 year ago

Hi @internetstaff,

As a quick fix I can suggest disabling code coverage in the tracer. This can be done by specifying the system property (-Ddd.civisibility.code.coverage.enabled=false) or the environment variable (export DD_CIVISIBILITY_CODE_COVERAGE_ENABLED=false). This should fix the issue.

On the other hand, I suspect that you might be using an older way of injecting the tracer: manually configuring Maven Surefire plugin or Gradle test task. If that is the case, my other suggestion would be for you to try the newer configuration way: recent versions of the tracer work best when the build system itself is instrumented, which requires a slightly different way of attaching the tracer. Instructions on how to do it can be found in CI Visibility docs (in short, it boils down to specifying MAVEN_OPTS/org.gradle.jvmargs instead of modifying the project build files).

Let me know if this helps. Thanks!

internetstaff commented 1 year ago

Disabling code coverage does appear to resolve the issue.

You're also correct in that we are still configuring surefire argLine.

Attempting the "new" way of integration we first ran into the requirement to enumerate all our annotation processors, which was frustrating enough that we eventually discovered dd.civisibility.compiler.plugin.auto.configuration.enabled=false.

However, getting past the compile stage, our tests then began crashing.

18:31:01,596 [ERROR] The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
18:31:01,596 [ERROR] Error occurred in starting fork, check output in log
18:31:01,596 [ERROR] Process Exit Code: 134
18:31:01,596 [ERROR] org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?

There's not really much else to go on. We are on bellsoft/liberica-openjdk-alpine:17.0.8-7-x86_64 which is a slightly unusual JDK container build (alpine + glibc), but ...

Thanks.

nikita-tkachenko-datadog commented 1 year ago

Regarding the "The forked VM terminated without properly saying goodbye." could you please share the command that Maven Surefire was using to start the JVM? It's usually written to console along with the error (should be right before the error message, as far as I recall)

internetstaff commented 1 year ago

Here's a slightly redacted version:

Command was /bin/sh -c cd '/builds/redacted/javarepo/common/common' && '/usr/lib/jvm/jdk-17.0.8-bellsoft-x86_64/bin/java' '-javaagent:/builds/redacted/javarepo/.m2/repository/org/jacoco/org.jacoco.agent/0.8.10/org.jacoco.agent-0.8.10-runtime.jar=destfile=/builds/redacted/javarepo/common/common/target/jacoco-it.exec,excludes=**/generated/**/*' '-Ddd.civisibility.compiler.plugin.auto.configuration.enabled=false' '-Ddd.iast.enabled=false' '-Ddd.appsec.enabled=false' '-Ddd.civisibility.agent.jar.uri=file:/datadog/dd-java-agent.jar' '-Ddd.remote_config.enabled=false' '-Ddd.trace.enabled=false' '-Ddd.jmxfetch.enabled=false' '-Ddd.profiling.enabled=false' '-Ddd.prioritization.type=ENSURE_TRACE' '-Ddd.service=java' '-Ddd.civisibility.code.coverage.enabled=false' '-Ddd.civisibility.enabled=true' '-Ddd.cws.enabled=false' '-javaagent:/datadog/dd-java-agent.jar' '-javaagent:/builds/redacted/javarepo/.m2/repository/org/jacoco/org.jacoco.agent/0.8.10/org.jacoco.agent-0.8.10-runtime.jar=destfile=/builds/redacted/javarepo/common/common/target/jacoco-it.exec,excludes=**/generated/**/*' '-Duser.timezone=UTC' '-Duser.language=en' '-Duser.region=US' '-Dspring.cloud.bootstrap.enabled=false' '-XX:-TieredCompilation' '-XX:TieredStopAtLevel=1' '-jar' '/builds/redacted/javarepo/common/common/target/surefire/surefirebooter-20230811183016149_6.jar' '/builds/redacted/javarepo/common/common/target/surefire' '2023-08-11T18-30-14_790-jvmRun2' 'surefire-20230811183016149_4tmp' 'surefire_1-20230811183016149_5tmp'

nikita-tkachenko-datadog commented 1 year ago

I believe I have identified and fixed the issue that was causing this failure. Could you please try the newer instrumentation way once more with release v1.19.1?

I also think I've solved the problem that was causing your annotation processors to "disappear", so it'd be very nice if you could give it a try without the dd.civisibility.compiler.plugin.auto.configuration.enabled=false property (i.e. with the compiler plugin configuration enabled).

Thanks!

internetstaff commented 1 year ago

We've tried 1.19.2.

It seems like we can enable code coverage again without the slow down, and we did not have to do any finagling of annotation processors nor disable the compiler plugin.

However, test runs fail in a different way now:


    at datadog.trace.civisibility.events.BuildEventsHandlerImpl.onTestModuleFinish (BuildEventsHandlerImpl.java:125)
    at datadog.trace.instrumentation.maven3.MavenExecutionListener.mojoSucceeded (MavenExecutionListener.java:157)
    at jdk.internal.reflect.GeneratedMethodAccessor85.invoke (Unknown Source)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:568)
    at datadog.trace.instrumentation.maven3.MavenLifecycleParticipant.lambda$afterSessionStart$0 (MavenLifecycleParticipant.java:70)
    at jdk.proxy3.$Proxy48.mojoSucceeded (Unknown Source)
    at org.apache.maven.lifecycle.internal.DefaultExecutionEventCatapult.fire (DefaultExecutionEventCatapult.java:80)
    at org.apache.maven.lifecycle.internal.DefaultExecutionEventCatapult.fire (DefaultExecutionEventCatapult.java:38)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute2 (MojoExecutor.java:336)
    at org.apache.maven.lifecycle.internal.MojoExecutor.doExecute (MojoExecutor.java:316)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:212)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:174)
    at org.apache.maven.lifecycle.internal.MojoExecutor.access$000 (MojoExecutor.java:75)
    at org.apache.maven.lifecycle.internal.MojoExecutor$1.run (MojoExecutor.java:162)
    at org.apache.maven.plugin.DefaultMojosExecutionStrategy.execute (DefaultMojosExecutionStrategy.java:39)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:159)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:105)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:193)
    at org.apache.maven.lifecycle.internal.builder.multithreaded.MultiThreadedBuilder$1.call (MultiThreadedBuilder.java:180)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.Executors$RunnableAdapter.call (Executors.java:539)
    at java.util.concurrent.FutureTask.run (FutureTask.java:264)
    at java.util.concurrent.ThreadPoolExecutor.runWorker (ThreadPoolExecutor.java:1136)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run (ThreadPoolExecutor.java:635)
    at java.lang.Thread.run (Thread.java:833)```
nikita-tkachenko-datadog commented 1 year ago

I suspect you might have a module in your project that runs multiple test executions bound to the same lifecycle phase. If that is the case, could you please try running the build with v1.19.3? The tracer was not handling this scenario correctly, but the issue is fixed in the newest release.

And if that doesn't help, could you please provide some details wrt the setup in the module that is failing? Namely the Surefire and Failsafe plugins configs (anything non-standard would be of interest, such as having multiple <execution> or <goal> declarations, binding to a non-default lifecycle phase, etc).

Thank you!

internetstaff commented 1 year ago

1.19.3 did not fix the session not found problem. However, I noticed it's caused by having two modules with the exact same <name> configured. In our case, this was a copy-and-paste error we were happy to fix, but I'm wondering if that couldn't be a problem for others.

We seem to be able to compile and test now with this integration method. Thanks!

nikita-tkachenko-datadog commented 1 year ago

Great, I'm glad it works for you now! And thanks for highlighting the identical <name> issue, I'll see if I can make the tracer respond in a more obvious way to cases like this.