factorhouse / kpow

Kpow for Apache Kafka
https://factorhouse.io/kpow

java.lang.NoSuchMethodError When deserializing message with Protobuf schema from AWS Glue #15

Closed gabistoenescu closed 1 year ago

gabistoenescu commented 1 year ago

Version of Kpow Latest (as of August 3rd 2023)

Describe the bug When trying to inspect messages using a Protobuf Value Deserializer kpow throws a java.lang.NoSuchMethodError.

The application starts without exceptions using the following command (in a MacOS environment): docker run --pull=always -p 3000:3000 --env-file ~/kpow-config.env -m 2G -v ~/.aws:/root/.aws factorhouse/kpow-ce:latest

The given env configuration allows Kpow to successfully connect to an MSK cluster and an AWS Glue registry. The configuration file looks like:

BOOTSTRAP=<our-bootstrap-servers>

<Our license properties>

ENVIRONMENT_NAME="Our-Environment"
ALLOW_TOPIC_CREATE=true
ALLOW_TOPIC_DELETE=true
ALLOW_TOPIC_EDIT=true
ALLOW_TOPIC_INSPECT=true
ALLOW_TOPIC_PRODUCE=true

SECURITY_PROTOCOL=SASL_SSL
SASL_MECHANISM=AWS_MSK_IAM
SASL_JAAS_CONFIG=software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::<our-account>:role/<our-role>";
SASL_CLIENT_CALLBACK_HANDLER_CLASS=software.amazon.msk.auth.iam.IAMClientCallbackHandler

SCHEMA_REGISTRY_NAME=op-schema-registry
SCHEMA_REGISTRY_ARN=arn:aws:glue:us-east-1:<our-account>:registry/op-schema-registry

The UI seems to function correctly, including the data inspection features (without Protobuf deserialization). Connectivity to AWS Glue also seems correct: the "Schema" feature displays the target Subject, and the "Edit Subject" action retrieves the schema specification. Currently the specification is empty and looks like this:

syntax = "proto3";

option java_package = "com.<our.package>.message";

message Message {
  // TODO
}

The source topic has a couple of messages, all successfully encoded (by a running application) using the Protobuf schema. Using Data Inspection with the String deserializer renders messages with funny values, as expected:

topic: target-topic
partition: 5
offset: 2
timestamp: 1691033148359
age: 19h 47m 46s
headers:
{}
value: Z��pu�C� �́�[
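
The garbled value is expected: Protobuf's binary wire format is not valid UTF-8, so decoding it as a string yields control characters and replacement characters. A minimal sketch of the effect (the byte array below is a hypothetical Protobuf encoding, not a message from the topic above):

```java
import java.nio.charset.StandardCharsets;

public class BytesAsString {
    // Decode raw bytes as UTF-8, the way a String deserializer would.
    static String decodeAsUtf8(byte[] wireBytes) {
        return new String(wireBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Hypothetical Protobuf encoding of field 1 = varint 150:
        // tag byte 0x08, then varint bytes 0x96 0x01.
        byte[] encoded = {0x08, (byte) 0x96, 0x01};
        // 0x96 is not valid UTF-8 on its own, so it decodes to U+FFFD,
        // the replacement character; 0x08 and 0x01 are control characters.
        System.out.println(decodeAsUtf8(encoded));
    }
}
```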

Using Data Inspection with the corresponding Protobuf deserializer results in errors: [screenshot attached]

The application logs the following errors:

22:32:19.228 INFO  [pool-3-thread-6] operatr.kafka.data.sampler – [80874731-6065-40e5-98c8-4baf7aafb27b] scheduling job
22:32:19.234 INFO  [pool-5-thread-1] operatr.kafka.data.sampler – [80874731-6065-40e5-98c8-4baf7aafb27b] query cluster [FrwuhJStS3OshvW7LosoCw] topics #{"target-topic"}: commmencing job.
22:32:21.306 ERROR [sampler_consumer_thread166] operatr.kafka.data.sampler – failed to process assignments [{:topic "target-topic", :partition 5, :start 0, :end 3, :offset 0, :timestamp 1691032173641}]
com.google.common.util.concurrent.ExecutionError: java.lang.NoSuchMethodError: 'boolean com.squareup.wire.schema.CoreLoader.isWireRuntimeProto(com.squareup.wire.schema.Location)'
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2083)
    at com.google.common.cache.LocalCache.get(LocalCache.java:4011)
    at com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:4034)
    at com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:5010)
    at com.amazonaws.services.schemaregistry.deserializers.protobuf.ProtobufDeserializer.deserialize(ProtobufDeserializer.java:73)
    at com.amazonaws.services.schemaregistry.deserializers.GlueSchemaRegistryDeserializationFacade.deserialize(GlueSchemaRegistryDeserializationFacade.java:172)
    at com.amazonaws.services.schemaregistry.deserializers.GlueSchemaRegistryKafkaDeserializer.deserializeByHeaderVersionByte(GlueSchemaRegistryKafkaDeserializer.java:160)
    at com.amazonaws.services.schemaregistry.deserializers.GlueSchemaRegistryKafkaDeserializer.deserialize(GlueSchemaRegistryKafkaDeserializer.java:116)
    at operatr.kafka.serdes$fn__3825.invokeStatic(serdes.clj:95)
    at operatr.kafka.serdes$fn__3825.invoke(serdes.clj:87)
    at operatr.kafka.serdes$fn__3804$G__3799__3817.invoke(serdes.clj:84)
    at operatr.kafka.serdes$deserialize.invokeStatic(serdes.clj:647)
    at operatr.kafka.serdes$deserialize.invoke(serdes.clj:642)
    at operatr.kafka.serdes$deserialize_record$fn__4139.invoke(serdes.clj:674)
    at operatr.kafka.serdes$deserialize_record.invokeStatic(serdes.clj:674)
    at operatr.kafka.serdes$deserialize_record.invoke(serdes.clj:666)
    at clojure.core$partial$fn__5908.invoke(core.clj:2641)
    at operatr.kafka.data.sampler$record_xf$fn__16513.invoke(sampler.clj:454)
    at operatr.kafka.data.sampler$record_xf$fn__16516.invoke(sampler.clj:456)
    at operatr.kafka.data.sampler$process_item.invokeStatic(sampler.clj:205)
    at operatr.kafka.data.sampler$process_item.invoke(sampler.clj:201)
    at operatr.kafka.data.sampler$process_assignment.invokeStatic(sampler.clj:218)
    at operatr.kafka.data.sampler$process_assignment.invoke(sampler.clj:211)
    at operatr.kafka.data.sampler$process_assignments$iter__16325__16329$fn__16330.invoke(sampler.clj:234)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:51)
    at clojure.lang.RT.seq(RT.java:535)
    at clojure.core$seq__5467.invokeStatic(core.clj:139)
    at clojure.core$filter$fn__5962.invoke(core.clj:2826)
    at clojure.lang.LazySeq.sval(LazySeq.java:42)
    at clojure.lang.LazySeq.seq(LazySeq.java:51)
    at clojure.lang.RT.seq(RT.java:535)
    at clojure.core$seq__5467.invokeStatic(core.clj:139)
    at clojure.core$dorun.invokeStatic(core.clj:3134)
    at clojure.core$doall.invokeStatic(core.clj:3149)
    at clojure.core$doall.invoke(core.clj:3149)
    at operatr.kafka.data.sampler$process_assignments.invokeStatic(sampler.clj:233)
    at operatr.kafka.data.sampler$process_assignments.invoke(sampler.clj:228)
    at operatr.kafka.data.sampler$consumer_runnable$fn__16343.invoke(sampler.clj:252)
    at clojure.lang.AFn.run(AFn.java:22)
    at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.NoSuchMethodError: 'boolean com.squareup.wire.schema.CoreLoader.isWireRuntimeProto(com.squareup.wire.schema.Location)'
    at com.squareup.wire.schema.RootKt.roots(Root.kt:62)
    at com.squareup.wire.schema.SchemaLoader.allRoots(SchemaLoader.kt:172)
    at com.squareup.wire.schema.SchemaLoader.initRoots(SchemaLoader.kt:84)
    at com.amazonaws.services.schemaregistry.utils.apicurio.ProtobufSchemaLoader.loadSchema(ProtobufSchemaLoader.java:164)
    at com.amazonaws.services.schemaregistry.utils.apicurio.FileDescriptorUtils.toFileDescriptorProto(FileDescriptorUtils.java:158)
    at com.amazonaws.services.schemaregistry.utils.apicurio.FileDescriptorUtils.protoFileToFileDescriptor(FileDescriptorUtils.java:151)
    at com.amazonaws.services.schemaregistry.utils.apicurio.FileDescriptorUtils.protoFileToFileDescriptor(FileDescriptorUtils.java:142)
    at com.amazonaws.services.schemaregistry.deserializers.protobuf.ProtobufSchemaParser.parse(ProtobufSchemaParser.java:15)
    at com.amazonaws.services.schemaregistry.deserializers.protobuf.ProtobufDeserializer$ProtobufSchemaParserCache.load(ProtobufDeserializer.java:124)
    at com.amazonaws.services.schemaregistry.deserializers.protobuf.ProtobufDeserializer$ProtobufSchemaParserCache.load(ProtobufDeserializer.java:120)
    at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3570)
    at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2312)
    at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2189)
    at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2079)
    ... 40 common frames omitted
22:32:26.254 INFO  [pool-5-thread-1] operatr.kafka.data.sampler – [80874731-6065-40e5-98c8-4baf7aafb27b] query cluster [FrwuhJStS3OshvW7LosoCw] topics #{"target-topic"}: matched 0, limit 100, elapsed 7013ms.

Some help would be appreciated, as the stack trace appears to indicate that an incompatible dependency is loaded at runtime when the Protobuf deserialization is attempted.

d-t-w commented 1 year ago

Hi @gabistoenescu, thanks for raising this ticket, and for your precise and detailed issue description; much appreciated.


One Line Answer

We can probably ship a tactical solution to our Community edition fairly quickly, with a longer-term strategic solution following over the course of our next few public releases.

Short Answer

This issue is due to dependency conflicts between the Confluent and AWS Glue serdes libraries. We have known of these conflicts for some time but have not resolved them in our public builds, because AWS Glue + Protobuf support is relatively recent (2022) and demand from our users has been low.

Resolving this issue strategically requires us to publish an aws-glue-compatibility (or similarly named) version of each Kpow JAR/container across each of our product lines (Community, Standard, Enterprise, plus each AWS Marketplace artefact); to document why we do that; to maintain ongoing details of when a specific release requires an aws-glue-compatibility release; and finally to document the trade-offs of using the compatibility-mode version.

We had been hoping the universe would align and this issue would be resolved by the upstream teams in their libraries, but it appears it will be an ongoing issue that ebbs and flows as those libraries advance.

Your ticket requires a general, public resolution so we will now commit to that work.

Detailed Answer

Kpow is an enterprise-grade solution that meets our users' needs in a broad market by supporting a variety of providers and their offerings in a single artefact. This includes support for:

  1. Apache Kafka 1.0+, including Apache Kafka Connect
  2. Confluent Platform, Confluent Cloud (with extended integrations for both)
  3. Confluent Schema Registry, Confluent Managed Connect, Confluent ksqlDB
  4. MSK, MSK Serverless (with extended integrations for both)
  5. AWS Glue Registry, AWS Managed Connect
  6. Redpanda
  7. Aiven, Instaclustr, CloudKarafka, etc.

Kpow is built from a variety of dependencies to meet those requirements. We invest considerable time in understanding our dependencies, and we take a disciplined approach to dependency management where we:

  1. Automatically scan for CVEs via the NVD on each commit/release and fail the build on detection.
  2. Favour advancing stable dependencies to latest releases as/when available to mitigate (1).
  3. Favour advancing core Kafka dependencies (which are also very stable) to gain new feature capabilities.
  4. Favour Confluent over Glue dependencies for quality and cadence reasons.

Historic Library Transitive Dependency Conflict

Kpow has supported AWS Glue since 2021. When Glue introduced Protobuf support in 2022, all was initially well, but in late 2022 we realised that an advance in the Confluent Protobuf library version had introduced a conflict on a shared transitive dependency that broke Protobuf in Glue in exactly the manner you have identified.

This PR is the crux of the issue: https://github.com/awslabs/aws-glue-schema-registry/pull/230

That ticket was resolved last month, and while the AWS team have updated the wire-schema dependency, Confluent have since moved on to a later version and exactly the same problem persists.

Current Latest Library Transitive Dependency Conflict

As of:

  1. io.confluent/kafka-protobuf-serializer "7.4.1" and
  2. software.amazon.glue/schema-registry-serde "1.1.16"

Confluent 7.4.1 requires wire 4.4.3

 [io.confluent/kafka-protobuf-serializer "7.4.1" :exclusions [[org.yaml/snakeyaml]]]
   [io.confluent/kafka-protobuf-provider "7.4.1"]
     [com.squareup.okio/okio-jvm "3.0.0"]
     [com.squareup.wire/wire-runtime-jvm "4.4.3" :exclusions [[org.jetbrains.kotlin/kotlin-stdlib]]]
     [com.squareup.wire/wire-schema-jvm "4.4.3" :exclusions [[org.jetbrains.kotlin/kotlin-stdlib]]]

AWS Glue 1.1.16 requires wire 4.3.0

   [com.squareup.wire/wire-compiler "4.3.0" :exclusions [[com.squareup.wire/wire-grpc-client] [com.charleskorn.kaml/kaml]]]
     [com.squareup.wire/wire-java-generator "4.3.0" :scope "runtime"]
     [com.squareup.wire/wire-kotlin-generator "4.3.0" :scope "runtime"]
       [com.squareup.wire/wire-grpc-client-jvm "4.3.0" :scope "runtime"]
         [com.squareup.okhttp3/okhttp "4.9.3" :scope "runtime"]
         [org.jetbrains.kotlinx/kotlinx-coroutines-core-jvm "1.5.2" :scope "runtime"]
       [com.squareup.wire/wire-grpc-server-generator "4.3.0" :scope "runtime"]
     [com.squareup.wire/wire-profiles "4.3.0" :scope "runtime"]
     [com.squareup.wire/wire-swift-generator "4.3.0" :scope "runtime"]
       [io.outfoxx/swiftpoet "1.3.1" :scope "runtime"]
   [com.squareup.wire/wire-schema "4.3.0"]
     [com.squareup.wire/wire-runtime "4.3.0" :scope "runtime"]

These two are not compatible. In the case where the Square wire dependencies are excluded from Glue we get:

(dev/send-glue-proto "glue_proto" "5")
Execution error (NoSuchMethodError) at com.amazonaws.services.schemaregistry.utils.apicurio.FileDescriptorUtils/toMessage (FileDescriptorUtils.java:881).
'void com.squareup.wire.schema.internal.parser.MessageElement.<init>(com.squareup.wire.schema.Location, java.lang.String, java.lang.String, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List)'

This is the current normal case where we cannot read/write Glue Protobuf.

If we switch the deps around Glue can now produce protobuf, but Confluent cannot:

(dev/send-proto "proto_tx" "7")
Execution error (NoSuchMethodError) at io.confluent.kafka.schemaregistry.protobuf.ProtobufSchema/toMessage (ProtobufSchema.java:994).
'void com.squareup.wire.schema.internal.parser.MessageElement.<init>(com.squareup.wire.schema.Location, java.lang.String, java.lang.String, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List, java.util.List)'
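
One way to confirm which version of a conflicting class actually won the classpath race is to ask the loaded class where it came from. A diagnostic sketch (in this case com.squareup.wire.schema.CoreLoader would be the class to check; the demo below uses JDK classes so it runs standalone):

```java
public class WhichJar {
    // Report the code source (jar or path) that supplied a class at runtime.
    static String sourceOf(String className) {
        try {
            Class<?> cls = Class.forName(className);
            java.security.CodeSource src =
                    cls.getProtectionDomain().getCodeSource();
            // JDK platform classes may report no code source.
            return (src == null || src.getLocation() == null)
                    ? "jdk/bootstrap"
                    : src.getLocation().toString();
        } catch (ClassNotFoundException e) {
            return "not on classpath";
        }
    }

    public static void main(String[] args) {
        // In the Kpow container one would query
        // "com.squareup.wire.schema.CoreLoader" instead.
        System.out.println(sourceOf("java.lang.String"));
        System.out.println(sourceOf("com.example.NoSuchClass"));
    }
}
```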

Current Compatibility Matrix

This leaves Kpow with these capabilities currently, which you have identified.

[Three capability tables followed here, comparing Kpow 91.6 with Confluent serdes against 91.6 with Glue serdes across Consume AVRO / JSON Schema / Proto, Produce AVRO / JSON Schema / Proto, and Edit AVRO / JSON Schema / Proto Schema; the per-cell support indicators were rendered as icons and are not preserved in this text capture.]

Tactical Solution

An interim solution is to select the most recent Confluent and Glue library versions that share a common wire-schema dependency, and build an aws-glue-baseline release from those dependencies.

This solution is sub-optimal due to AWS Glue's glacial cadence and low-quality dependency management. We are required to effectively pin Confluent, Kafka, and other shared dependencies to old versions for the initial compatibility mode in a manner that does not sit well with our internal discipline for dependency hygiene.

This is the action we will investigate now, initially just for Community Edition, to resolve this ticket without building out the build pipelines necessary to support it across our entire range of deliverables.

Strategic Solution

The resolution of the AWS Glue ticket did not fix the transitive dependency issue due to other libraries moving on.

Moving forward, in each release we will:

  1. Check AWS Glue compatibility with favoured Confluent dependencies.
  2. If incompatible, produce an aws-glue-compatibility release containing AWS Glue serdes only.
  3. If incompatible, indicate to users which is the most recent release with combined Confluent+Glue protobuf support.

This means users with Glue+Protobuf requirements can choose either (2) or remain on an older release (3), depending on whether they need mixed-schema support.

Thanks again, Derek

d-t-w commented 1 year ago

Hey @gabistoenescu just an update that we will publish the tactical fix to factorhouse/kpow-ce:91.5-aws-glue-baseline early next week. I'll ack/close this ticket when it's available.

gabistoenescu commented 1 year ago

Hi @d-t-w Thank you very much for your prompt response and very valuable explanations. They are very much appreciated.

d-t-w commented 1 year ago

Hi @gabistoenescu, please update your container reference to the following:

factorhouse/kpow-ce:91.5.1-aws-glue-baseline

That is the new baseline image that supports both Glue and Confluent serdes.

I will close this ticket now, please just reach out if you need any further support. If you would like a POC license to evaluate authz/multi-cluster/etc just let me know.

Derek

gabistoenescu commented 1 year ago

Hi @d-t-w Thank you very much for the fix. I tested it out and confirmed that the Data Inspect functionality worked as expected.

topic: target-topic
partition: 5
offset: 0
timestamp: 1691032173641
age: 4d 17h 33m 01s
headers: {}
value: {}