DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
580 stars 288 forks source link

Bytecode manipulation of TestNG code skips test suite result collection. #5268

Closed kylelyk closed 1 year ago

kylelyk commented 1 year ago

Tested Versions: Java: 11, 17 TestNG: 7.0.0, 7.50, 7.6.1, 7.8.0 dd-trace-java.jar: 1.13.0, 1.14.0

Description of Problem: When running tests in parallel, it seems that the dd-trace-java.jar java agent changes the bytecode of org.testng.SuiteRunner such that it no longer runs the second half of SuiteRunner.runTest(TestRunner) which is in charge of adding the suite results to the runner’s suiteResults member field. The resulting generated testng-results.xml file then says that 0 tests were run and contains no test result output.

nikita-tkachenko-datadog commented 1 year ago

Hi @kylelyk, thank you for reporting this.

Do I understand correctly, that you are running the tracer with CI Visibility enabled? Also, is it true that the issue is only observed when the tests are run in parallel? Lastly, are there any exceptions or error messages in the logs?

Thanks!

kylelyk commented 1 year ago

We always run our tests in parallel, but for some reason the problem stopped when I disabled all but 2 of our test classes, so I believe that if TestNG doesn’t go down the parallel path, it doesn’t hit the issue. There are no exceptions in the logs that I found. This is how DD starts up:

[dd.trace 2023-05-24 19:25:45:523 +0000] [main] INFO datadog.trace.agent.common.writer.WriterFactory - Using 'EnsureTrace' prioritization type. (Do not use this type if your application is running in production mode)
error   24-May-2023 19:25:46    [dd.trace 2023-05-24 19:25:46:234 +0000] [dd-task-scheduler] INFO datadog.trace.agent.core.StatusLogger - DATADOG TRACER CONFIGURATION {"version":"1.14.0~0db43eba51","os_name":"Linux","os_version":"5.10.179-166.674.amzn2.x86_64","architecture":"amd64","lang":"jvm","lang_version":"11.0.19","jvm_vendor":"Amazon.com Inc.","jvm_version":"11.0.19+7-LTS","java_class_version":"55.0","http_nonProxyHosts":"null","http_proxyHost":"null","enabled":true,"service":"warehouseapitest","agent_url":"http://localhost:8126","agent_error":false,"debug":false,"trace_propagation_style_extract":["datadog"],"trace_propagation_style_inject":["datadog"],"analytics_enabled":false,"sampling_rules":[{},{}],"priority_sampling_enabled":true,"logs_correlation_enabled":true,"profiling_enabled":false,"remote_config_enabled":false,"debugger_enabled":false,"appsec_enabled":"FULLY_DISABLED","telemetry_enabled":true,"dd_version":"","health_checks_enabled":true,"configuration_file":"no config file present","runtime_id":"6da7d06a-cf3b-4aea-af46-f1b80d238325","logging_settings":{"levelInBrackets":false,"dateTimeFormat":"'[dd.trace 'yyyy-MM-dd HH:mm:ss:SSS Z']'","logFile":"System.err","configurationFile":"simplelogger.properties","showShortLogName":false,"showDateTime":true,"showLogName":true,"showThreadName":true,"defaultLogLevel":"INFO","warnLevelString":"WARN","embedException":false},"cws_enabled":false,"cws_tls_refresh":5000,"datadog_profiler_enabled":true,"datadog_profiler_safe":true}
error   24-May-2023 19:27:55    [dd.trace 2023-05-24 19:27:55:634 +0000] [TestNG-tests-1] WARN datadog.trace.civisibility.source.RepoIndexSourcePathResolver - Building index of source files in /home/ec2-user/bamboo-agent-home/xml-data/build-dir/WAREHOUSE-WPST7-JOB1/warehouseplatform. This operation can be slow, please consider using Datadog Java compiler plugin to avoid indexing
error   24-May-2023 19:27:56    [dd.trace 2023-05-24 19:27:56:456 +0000] [TestNG-tests-1] INFO datadog.trace.civisibility.source.RepoIndexSourcePathResolver - Indexing took 801 ms. Files visited: 3266, source files visited: 1716, source roots found: 5
nikita-tkachenko-datadog commented 1 year ago

@kylelyk, I fixed an issue related to instrumenting TestNG parallel runs, but I am not 100% sure if it was the problem that you observed, since I could not reproduce the exact symptoms that you described.

The fix will be included in the upcoming 1.15.0 release. Please test your runs with the new version and let me know if the problem persists. Thanks!

nikita-tkachenko-datadog commented 1 year ago

@kylelyk 1.15.0 has been released today, so feel free to try out the fix.

kylelyk commented 1 year ago

It doesn’t appear that the fix worked. I found that the issue is caused by an exception:

Active scope does not correspond to the finished suite, it is possible that end() was called multiple times or an operation that was started by the suite is still in progress; active scope span is: DDSpan [ t_id=2442811663709701659, s_id=5644848576612162877, p_id=3813780349465646619 ] trace=* tags={ci.workspace_path=*, component=testng, git.branch=*, git.commit.author.date=2023-05-25T16:41:30.000Z, git.commit.author.email=*, git.commit.author.name=*, git.commit.committer.date=2023-05-25T16:41:30.000Z, git.commit.committer.email=*, git.commit.committer.name=*, git.commit.message=*.
, git.commit.sha=*, git.repository_url=*, os.architecture=x86_64, os.platform=Mac OS X, os.version=12.6, runtime.name=OpenJDK Runtime Environment, runtime.vendor=Amazon.com Inc., runtime.version=17.0.1, span.kind=test_suite_end, test.framework=testng, test.framework_version=7.8.0, test.module=*, test.source.file=*, test.suite=*, test.traits={"category":["Smoke"]}, test.type=test, test_module_id=*, test_suite_id=*, thread.id=47, thread.name=TestNG-tests-1}, duration_ns=0, forceKeep=false
datadog.trace.civisibility.DDTestSuiteImpl.end(DDTestSuiteImpl.java:132)
datadog.trace.civisibility.events.TestEventsHandlerImpl.onTestSuiteFinish(TestEventsHandlerImpl.java:146)
datadog.trace.instrumentation.testng.TracingListener.onAfterClass(TracingListener.java:60)
datadog.trace.instrumentation.testng.TestNGClassListener.invokeAfterClass(TestNGClassListener.java:58)
org.testng.internal.invokers.TestMethodWorker.invokeAfterClassMethods(TestMethodWorker.java:217)
org.testng.internal.invokers.TestMethodWorker.run(TestMethodWorker.java:131)
java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
org.testng.TestRunner.privateRun(TestRunner.java:848)
org.testng.TestRunner.run(TestRunner.java:621)
org.testng.SuiteRunner.runTest(SuiteRunner.java:443)
org.testng.SuiteRunner$SuiteWorker.run(SuiteRunner.java:481)
org.testng.internal.thread.ThreadUtil.lambda$execute$0(ThreadUtil.java:58)
java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
java.base/java.lang.Thread.run(Thread.java:833)
nikita-tkachenko-datadog commented 1 year ago

Interestingly, this is the exact error that was fixed. The thing is that the fix is applied conditionally, if it is determined that the test suite is executed in parallel mode. Apparently the logic failed to determine that your tests are configured to run in parallel.

Could you please share the configuration that you use for parallel tests execution?

kylelyk commented 1 year ago

I’ve confirmed through debugging that the parallelized flag you added is indeed false. Looks like it’s an issue in TestNGClassListenerInstrumentation on line 69:

      boolean parallelized = parallel == XmlSuite.ParallelMode.METHODS;

testng_smoke.xml

<!DOCTYPE suite SYSTEM "https://testng.org/testng-1.0.dtd" >

<suite name="API Test Suite" parallel="tests" configfailurepolicy="continue">
    <listeners>
        <!-- Listeners added here should also be added to the gradle file. -->
        <listener class-name="..."/>
    </listeners>

    <test name="Smoke Suite">
        <groups>
            <run>
                <include name="Smoke"/>
            </run>
        </groups>
        <packages>
            <package name="com.warehouse"/>
            <package name="com.warehouse.test"/>
            <package name="..."/>
        </packages>
    </test>
</suite>
github-actions[bot] commented 1 year ago

:robot: This issue has been addressed in the latest release. See full details in the Release Notes.

nikita-tkachenko-datadog commented 1 year ago

@kylelyk, I tried to make my local setup as close to yours as I could, and fixed one more issue in the TestNG integration. Please try once more with the latest 1.15.1 release

kylelyk commented 1 year ago

It works! Thanks so much for your quick work on this.

dwilliams-gs commented 1 year ago

I've come across a similar issue and reverting back to 1.15.1 appears to fix it. We were previously using 1.20.0. Has the fix for this potentially regressed?

nikita-tkachenko-datadog commented 1 year ago

Hi @dwilliams-gs, could you please provide the full error message that you're seeing?

dwilliams-gs commented 1 year ago

Sure. This is the error message we were seeing

Caused by: org.apache.maven.surefire.booter.SurefireBooterForkException:There was an error in the forked processjava.lang.NullPointerExceptionat datadog.trace.instrumentation.testng.TestNGUtils.toSkippableTest(TestNGUtils.java:157)at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:89)at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:84)at org.testng.internal.invokers.MethodInvocationHelper.invokeMethodNoCheckedException(MethodInvocationHelper.java:53)at org.testng.internal.invokers.MethodInvocationHelper.invokeDataProvider(MethodInvocationHelper.java:157)at org.testng.internal.Parameters.handleParameters(Parameters.java:803)at org.testng.internal.FactoryMethod.invoke(FactoryMethod.java:160)at org.testng.internal.TestNGClassFinder.processFactory(TestNGClassFinder.java:174)at org.testng.internal.TestNGClassFinder.processMethod(TestNGClassFinder.java:138)at org.testng.internal.TestNGClassFinder.processClass(TestNGClassFinder.java:129)at org.testng.internal.TestNGClassFinder.(TestNGClassFinder.java:67)at org.testng.TestRunner.initMethods(TestRunner.java:461)at org.testng.TestRunner.init(TestRunner.java:356)at org.testng.TestRunner.init(TestRunner.java:309)at org.testng.TestRunner.(TestRunner.java:184)at org.testng.SuiteRunner$DefaultTestRunnerFactory.newTestRunner(SuiteRunner.java:652)at org.testng.SuiteRunner.init(SuiteRunner.java:224)at org.testng.SuiteRunner.(SuiteRunner.java:116)at org.testng.TestNG.createSuiteRunner(TestNG.java:1375)at org.testng.TestNG.createSuiteRunners(TestNG.java:1349)at org.testng.TestNG.runSuitesLocally(TestNG.java:1191)at org.testng.TestNG.runSuites(TestNG.java:1114)at org.testng.TestNG.run(TestNG.java:1082)at org.apache.maven.surefire.testng.TestNGExecutor.run(TestNGExecutor.java:155)at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.executeSingleClass(TestNGDirectoryTestSuite.java:102)at org.apache.maven.surefire.testng.TestNGDirectoryTestSuite.execute(TestNGDirectoryTestSuite.java:91)at org.apache.maven.surefire.testng.TestNGProvider.invoke(TestNGProvider.java:137)at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385)at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507)at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495)

nikita-tkachenko-datadog commented 1 year ago

Thanks for the details! I am preparing a fix for your issue, but in the meantime I think there is something you could do to work around this problem.

There are two ways to instrument your tests with the tracer:

I assume you are using the first option. We currently consider it legacy, and encourage the users to choose the second option: it provides some additional features, since the build system itself is instrumented; it is also less intrusive, since it does not require modifying your build files. It also happens to not suffer from the problem you're experiencing.

I would therefore recommend you to give the second option a try. The details on how to set it up can be found in the documentation (in short, you just download the tracer JAR and augment your MAVEN_OPTS env variable with the path to it).

If you don't feel like doing this, you can of course just wait for the patch release that fixes the issue for the legacy instrumentation mode. I am going to roll it out in 1-2 days.

dwilliams-gs commented 1 year ago

Thank you for the quick turn around @nikita-tkachenko-datadog 👍

We've switched out and are now using the MAVEN_OPTS approach detailed above and can confirm this is working as expected. Thanks again!