DataDog / dd-trace-java

Datadog APM client for Java
https://docs.datadoghq.com/tracing/languages/java
Apache License 2.0
564 stars 282 forks source link

Quarkus 3.12.1 + Kafka + DD + Native = BOOM #7281

Closed luneo7 closed 4 days ago

luneo7 commented 2 months ago

Using Quarku 3.12.1 + Kafka + DD when building native we randomly get errors as:

Fatal error: com.oracle.graal.pointsto.util.AnalysisError$ParsingError: Error encountered while parsing org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.$get$__datadogContext$(KafkaAdminClient.java) 
Parsing context:
   at datadog.trace.bootstrap.FieldBackedContextStore.get(FieldBackedContextStore.java:17)
   at datadog.trace.bootstrap.instrumentation.java.concurrent.AdviceUtils.capture(AdviceUtils.java:65)
   at java.util.concurrent.ForkJoinTask.fork(ForkJoinTask.java:626)
   at java.math.BigInteger$RecursiveOp.forkOrInvoke(BigInteger.java:1918)
   at java.math.BigInteger$RecursiveOp.square(BigInteger.java:1960)
   at java.math.BigInteger.squareToomCook3(BigInteger.java:2383)
   at java.math.BigInteger.square(BigInteger.java:2227)
   at java.math.BigInteger.multiply(BigInteger.java:1637)
   at java.math.BigInteger.multiply(BigInteger.java:1594)
   at java.math.MutableBigInteger.divide3n2n(MutableBigInteger.java:1370)
   at java.math.MutableBigInteger.divide2n1n(MutableBigInteger.java:1335)
   at java.math.MutableBigInteger.divideAndRemainderBurnikelZiegler(MutableBigInteger.java:1290)
   at java.math.MutableBigInteger.divide(MutableBigInteger.java:1160)
   at java.math.BigInteger.smallToString(BigInteger.java:4164)
   at java.math.BigInteger.toString(BigInteger.java:4217)
   at java.math.BigInteger.toString(BigInteger.java:4112)
   at java.math.BigInteger.toString(BigInteger.java:4286)
   at java.util.Formatter$FormatSpecifier.print(Formatter.java:3451)
   at java.util.Formatter$FormatSpecifier.printInteger(Formatter.java:3064)
   at java.util.Formatter$FormatSpecifier.print(Formatter.java:3021)
   at java.util.Formatter.format(Formatter.java:2780)
   at java.util.Formatter.format(Formatter.java:2728)
   at java.lang.String.format(String.java:4390)
   at jdk.internal.util.Preconditions.outOfBoundsMessage(Preconditions.java:242)
   at jdk.internal.util.Preconditions$4.apply(Preconditions.java:213)
   at jdk.internal.util.Preconditions$4.apply(Preconditions.java:210)
   at jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:98)
   at jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:106)
   at jdk.internal.util.Preconditions.checkIndex(Preconditions.java:302)
   at java.lang.StringUTF16.checkIndex(StringUTF16.java:1613)
   at java.lang.StringUTF16.charAt(StringUTF16.java:1418)
   at java.lang.String.charAt(String.java:1557)
   at java.lang.Long.parseLong(Long.java:689)
   at java.lang.Long.valueOf(Long.java:1157)
   at static root method.(Unknown Source)

        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.AnalysisError.parsingError(AnalysisError.java:149)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.createFlowsGraph(MethodTypeFlow.java:184)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.ensureFlowsGraphCreated(MethodTypeFlow.java:153)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.getOrCreateMethodFlowsGraphInfo(MethodTypeFlow.java:111)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.typestate.DefaultVirtualInvokeTypeFlow.onObservedUpdate(DefaultVirtualInvokeTypeFlow.java:114)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.TypeFlow.update(TypeFlow.java:620)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.PointsToAnalysis$1.run(PointsToAnalysis.java:491)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.CompletionExecutor.executeCommand(CompletionExecutor.java:187)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.CompletionExecutor.lambda$executeService$0(CompletionExecutor.java:171)
        at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1423)
        at java.base/java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:387)
        at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(ForkJoinPool.java:1312)
        at java.base/java.util.concurrent.ForkJoinPool.scan(ForkJoinPool.java:1843)
        at java.base/java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1808)
        at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:188)
Caused by: com.oracle.graal.pointsto.util.AnalysisError: parsing had failed in another thread
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.util.AnalysisError.shouldNotReachHere(AnalysisError.java:173)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisMethod.ensureGraphParsedHelper(AnalysisMethod.java:871)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisMethod.ensureGraphParsed(AnalysisMethod.java:843)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.parse(MethodTypeFlowBuilder.java:186)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.apply(MethodTypeFlowBuilder.java:621)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.createFlowsGraph(MethodTypeFlow.java:167)
        ... 13 more
Caused by: com.oracle.graal.pointsto.constraints.UnresolvedElementException: Discovered unresolved field during parsing: org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable.__datadogContext$10. This error is reported at image build time because class org.apache.kafka.clients.admin.KafkaAdminClient$AdminClientRunnable is registered for linking at image build time by command line and command line.
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.reportUnresolvedElement(SharedGraphBuilderPhase.java:550)
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.reportUnresolvedElement(SharedGraphBuilderPhase.java:544)
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.handleUnresolvedField(SharedGraphBuilderPhase.java:453)
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.handleUnresolvedLoadField(SharedGraphBuilderPhase.java:363)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.genGetField(BytecodeParser.java:4776)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.genGetField(BytecodeParser.java:4767)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.processBytecode(BytecodeParser.java:5412)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.iterateBytecodesForBlock(BytecodeParser.java:3426)
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.iterateBytecodesForBlock(SharedGraphBuilderPhase.java:743)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.handleBytecodeBlock(BytecodeParser.java:3386)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.processBlock(BytecodeParser.java:3228)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.build(BytecodeParser.java:1137)
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase$SharedBytecodeParser.build(SharedGraphBuilderPhase.java:162)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.BytecodeParser.buildRootMethod(BytecodeParser.java:1029)
        at jdk.internal.vm.compiler/org.graalvm.compiler.java.GraphBuilderPhase$Instance.run(GraphBuilderPhase.java:101)
        at org.graalvm.nativeimage.builder/com.oracle.svm.hosted.phases.SharedGraphBuilderPhase.run(SharedGraphBuilderPhase.java:116)
        at jdk.internal.vm.compiler/org.graalvm.compiler.phases.Phase.run(Phase.java:49)
        at jdk.internal.vm.compiler/org.graalvm.compiler.phases.BasePhase.apply(BasePhase.java:434)
        at jdk.internal.vm.compiler/org.graalvm.compiler.phases.Phase.apply(Phase.java:42)
        at jdk.internal.vm.compiler/org.graalvm.compiler.phases.Phase.apply(Phase.java:38)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.AnalysisParsedGraph.parseBytecode(AnalysisParsedGraph.java:146)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisMethod.parseGraph(AnalysisMethod.java:895)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisMethod.ensureGraphParsedHelper(AnalysisMethod.java:860)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.meta.AnalysisMethod.ensureGraphParsed(AnalysisMethod.java:843)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.phases.InlineBeforeAnalysisGraphDecoder.lookupEncodedGraph(InlineBeforeAnalysisGraphDecoder.java:198)
        at jdk.internal.vm.compiler/org.graalvm.compiler.replacements.PEGraphDecoder.doInline(PEGraphDecoder.java:1211)
        at jdk.internal.vm.compiler/org.graalvm.compiler.replacements.PEGraphDecoder.tryInline(PEGraphDecoder.java:1194)
        at jdk.internal.vm.compiler/org.graalvm.compiler.replacements.PEGraphDecoder.trySimplifyInvoke(PEGraphDecoder.java:1049)
        at jdk.internal.vm.compiler/org.graalvm.compiler.replacements.PEGraphDecoder.handleInvokeWithCallTarget(PEGraphDecoder.java:1001)
        at jdk.internal.vm.compiler/org.graalvm.compiler.replacements.PEGraphDecoder.handleInvoke(PEGraphDecoder.java:987)
        at jdk.internal.vm.compiler/org.graalvm.compiler.nodes.GraphDecoder.processNextNode(GraphDecoder.java:921)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.phases.InlineBeforeAnalysisGraphDecoder.processNextNode(InlineBeforeAnalysisGraphDecoder.java:378)
        at jdk.internal.vm.compiler/org.graalvm.compiler.nodes.GraphDecoder.decode(GraphDecoder.java:650)
        at jdk.internal.vm.compiler/org.graalvm.compiler.replacements.PEGraphDecoder.decode(PEGraphDecoder.java:892)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.phases.InlineBeforeAnalysis.decodeGraph(InlineBeforeAnalysis.java:76)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.parse(MethodTypeFlowBuilder.java:195)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlowBuilder.apply(MethodTypeFlowBuilder.java:621)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.createFlowsGraph(MethodTypeFlow.java:167)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.ensureFlowsGraphCreated(MethodTypeFlow.java:153)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.MethodTypeFlow.getOrCreateMethodFlowsGraphInfo(MethodTypeFlow.java:111)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.typestate.DefaultVirtualInvokeTypeFlow.onObservedUpdate(DefaultVirtualInvokeTypeFlow.java:114)
        at org.graalvm.nativeimage.pointsto/com.oracle.graal.pointsto.flow.TypeFlow.lambda$addObserver$0(TypeFlow.java:475)
        ... 8 more

The error keeps changing class.. but it is always related to Discovered unresolved field during parsing ... __datadogContext

It seems that the agent transformation might be adding methods before the field? Funny thing is that this doesn't occur with Quarkus 3.8...

I've made a reproducer: https://github.com/luneo7/kafka-with-quarkus You will need to change the application.properties quarkus.native.container-runtime-options config so the path to dd-java-agent-1.35.2.jar is the correct one... Later build with mvn clean package -Dquarkus.native.container-build=true -Pnative -DskipTests=true Sometimes the build passes, so you might need to kick it a few times... funny thing is that in the CI it fails (almost all the time) more than locally...

mcculls commented 2 months ago

Hi @luneo7 - thanks for reporting this, we'll add an update if we find anything

luneo7 commented 1 month ago

So, I've tracked that to the bump of kafka libs from 3.6.1 to 3.7.x... If you do something like: 1 - Change the pom.xml to force kafka 3.6.1

    <properties>
        <kafka3.version>3.6.1</kafka3.version>
    </properties>
...

   <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka-clients</artifactId>
                <version>${kafka3.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka-streams</artifactId>
                <version>${kafka3.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka-streams-test-utils</artifactId>
                <version>${kafka3.version}</version>
            </dependency>
            <dependency>
                <groupId>org.apache.kafka</groupId>
                <artifactId>kafka_2.13</artifactId>
                <version>${kafka3.version}</version>
                <exclusions>
                    <exclusion>
                        <groupId>com.google.code.findbugs</groupId>
                        <artifactId>jsr305</artifactId>
                    </exclusion>
                </exclusions>
            </dependency>
           ...
    </dependencyManagement>

2 - Tell Quarkus to remove the unsafe substitution that isn't needed to be done in the old kafka lib version so the build doesn't fail... change application.properties adding:

quarkus.class-loading.removed-resources."io.quarkus\:quarkus-kafka-client"=io/quarkus/kafka/client/runtime/graal/Target_org_apache_kafka_shaded_com_google_protobuf_UnsafeUtil.class

Then the build will always pass... so the incompatibility lies with kafka libs >= 3.7 ... not Quarkus changes per see...

luneo7 commented 1 month ago

@mcculls don't know if it ^ helps... so some change in the Kafka libs broke the DD agent instrumentation during native build... at least it removed Quarkus code changes from the equation :lol:

vandonr commented 1 month ago

@luneo7 what leads you to say that the incompatibility lies with kafka libs >= 3.7 ... not Quarkus changes per se ? You said earlier that this doesn't occur with Quarkus 3.8, so we cannot say if the issue if coming from Kafka or Quarkus, right ? (most likely it's coming from an interaction between the two)

luneo7 commented 1 month ago

@vandonr it is not related to Quarkus, since: 1 - Downgrading the lib, even if it is hacky, it makes the issue go away. 2 - If you check the changes in the Quarkus Kafka extension they are related to the version bump.

luneo7 commented 1 month ago

@vandonr it seems that somehow the methods are injected before the variables, or something so the Graal analyzer sees the method but it doesn't see the variable, or that variable naming, which adds somewhat of a counter to it is non deterministic so it is producing wrong var names

mcculls commented 1 month ago

@luneo7 - yes, I believe it's a timing window that Kafka 3.7 happens to increase the probability of it happening.

I'm trying out various options to avoid this timing window.

luneo7 commented 1 month ago

@mcculls I've managed to check what is happening... and did a draft for you to take a look: https://github.com/DataDog/dd-trace-java/pull/7402

FieldBackedContextInjector injects methods before all store fields are injected, so GraalVM might see the methods before the fields, a solution for that is to make the method injection happen only in the last injector and not in the first one.

One thing that I noted is that for the keyClassName: org.apache.kafka.clients.consumer.KafkaConsumer , contextClassName: datadog.trace.instrumentation.kafka_clients.KafkaConsumerInfo the visitEnd method is never triggered, so the thread local there for INJECTED_STORE_IDS is not removed (cleaned) and when the thread goes to another class it is being reused so the injected stores doesn't match... it will inject methods for store that won't exist there since it was a store for KafkaConsumer... so the instrumentation for KafkaConsumer might be failing somewhere... didn't dig to see what is failing...

Because of that in that draft I've added a thread local with the current class name, so it can detect when it moved to another class and visitEnd wasn't called, and it will do a reinit to the thread locals...

Does that help with your finding? That draft makes all our compilation pass, and fixes GraalVM, but doesn't fix what might be happening in the KafkaConsumer instrumentation

mcculls commented 1 month ago

Thanks @luneo7, I've been working on the same area of code. I'll look into that visitEnd issue because it should be called as part of the ASM contract, but maybe something is interrupting the processing.

It's odd about the ordering affecting GraalVM, given that the class-file is only written out after all the transformations have been applied. But maybe it expects class-files to follow a specific ordering/layout - even though that's not required by the usual JVM. I'll try to dig into why this makes a difference to GraalVM.

luneo7 commented 1 month ago

Yeah, the main issue might be related to visitEnd not being called, and the thread being used to handle another class and carrying on the thread local store ids and generating methods for stores that doesn't exist. Ideally that class visitor should be atomic in its operations, as it is writing all stores shared by a thread, each instance of it should write code related only to the context store it was created for.

mcculls commented 1 month ago

I tried a smaller patch to avoid the leak caused when visitEnd isn't called and it appears to fix the issue. I'm now digging into why we don't get that call, as it seems to only happen once - specifically just for that Kafka class. But regardless why it's happening I'll still add that check because we'll always want to reset things when processing a new class.

mcculls commented 1 month ago

BTW, the reason why we need to have these injectors collaborate is because there is only one getter/setter per class, which then handles all the different context stores at that level of the class hierarchy. This approach means we can have a single interface vs. having to generate a new interface for every single context store.

mcculls commented 1 month ago

I found the reason why we don't get a call to visitEnd:

java.lang.IllegalStateException: Cannot locate field named metadata for class org.apache.kafka.clients.consumer.KafkaConsumer
        at net.bytebuddy.asm.Advice$OffsetMapping$ForField$Unresolved.resolve(Advice.java:2556)
        at net.bytebuddy.asm.Advice$OffsetMapping$ForField.resolve(Advice.java:2481)
        at net.bytebuddy.asm.Advice$Dispatcher$Inlining$Resolved$ForMethodExit.doApply(Advice.java:9873)
        at net.bytebuddy.asm.Advice$Dispatcher$Inlining$Resolved$ForMethodExit.apply(Advice.java:9831)
        at net.bytebuddy.asm.Advice$Dispatcher$Inlining$Resolved$AdviceMethodInliner.visitMethod(Advice.java:9237)
        at net.bytebuddy.jar.asm.ClassReader.readMethod(ClassReader.java:1354)
        at net.bytebuddy.jar.asm.ClassReader.accept(ClassReader.java:745)
        ....

ASM has trouble reading that class and exits early - I imagine this parsing error didn't happen with previous versions of that class, which is why it only shows up recently. I'm going to put in a fix to reset the state of the injector when the class changes (as mentioned earlier these injectors need to work together because only one can create the shared getter/setter per-class)

luneo7 commented 1 month ago

Yeah I've got that, but I was thinking that maybe CombiningTransformerBuilder could call applyContextStoreInjection with the store set/list for a specific class instead of getting the map and doing a for each creating multiple FieldBackedContextInjector, and then the context injector would inject all variables and all methods necessary.

luneo7 commented 1 month ago

Could we also add the fix to FieldBackedContextStores? As it is using computeIfAbsent instead of putIfAbsent so we don't call create store without need, if the store was created before (during the locking mechanism)

luneo7 commented 1 month ago

Also I checked the changes there in the Kafka class, and it doesn't have that metadata field anymore, so it is not that ASM is having an issue there, the instrumentation is not valid anymore, so should It silently fail? Or should it just skip what it didn't found and not try to apply that...

mcculls commented 1 month ago

I'd prefer just to make the simple fix because it keeps everything related to injection local to that file and avoids having to maintain logic between CombiningTransformerBuilder and the various FieldBackedContextInjectors. The fix is simple and understandable, whereas trying to get CombiningTransformerBuilder to orchestrate the injection from outside could be tricky (since the different FieldBackedContextInjector will be operating several steps removed inside byte-buddy)

( the main issue is that some stores match on an interface, so you don't have an exact match from class to stores )

Regarding the other change to FieldBackedContextStores - it's intentional that if there is a collision (which will be rare) we leave the unused store in place to be used by the next request. That way there would only ever be at most one unused store, which is reasonable since each store is very lightweight. Was there a reason you need to avoid this?

Also re, the metadata field - instrumentations that rely on fields are supposed to have structure checks, so that's a bug in the instrumentation then. Either way we don't want those kind of issues to cause leaks in the field injection.

luneo7 commented 1 month ago

Yeah, the keep it simple approach is a good one. The only thing is that CombiningTransformerBuilder already has logic for context stores, and not stating that we need to change it there to get this fixed, just that maybe if it was to create the context field store only once there per class, instead of the for each, the field injection visitor would be simpler and easier to understand and troubleshoot later

mcculls commented 1 month ago

Except some stores have an interface key - so you can't immediately tell which stores will need injecting into a given class until you've run the matchers.

luneo7 commented 1 month ago

Gotcha... the logic to do the matching would also move to the class visitor... indeed that would complicate a little... I've updated that draft with what I've also tested locally, just cleaning the thread local... dunno if that is what you had there =)

mcculls commented 1 month ago

I used Pair to avoid introducing another thread-local: https://github.com/DataDog/dd-trace-java/pull/7403

luneo7 commented 1 month ago

Wonderful! Whenever this is merged I will do a backport of that final fix to the current released version so we can use a patched version internally until we have an official release, as that usually takes a while 😁

mcculls commented 1 month ago

FYI, https://github.com/DataDog/dd-trace-java/pull/7404 will let you remove

--initialize-at-run-time=org.apache.kafka.common.compress.KafkaLZ4BlockInputStream

from the reproducer

github-actions[bot] commented 4 days ago

:robot: This issue has been addressed in the latest release. See full details in the Release Notes.