Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Missing `woodstox-core` transitive dependency results in `ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper` in kafka connector distribution artifact #11489

Closed: josepanguera closed this issue 2 weeks ago

josepanguera commented 2 weeks ago

Apache Iceberg version

main (development)

Query engine

Kafka Connect

Catalog

Glue

Please describe the bug 🐞

After commit 7ac617a5a8b0dedbaaa6e19caedfd846968c7cac, the dependency `woodstox-core-6.7.0.jar` is no longer included in `kafka-connect/kafka-connect-runtime/build/distributions/iceberg-kafka-connect-runtime-X.Y.Z-SNAPSHOT.zip`. When the connector is deployed to AWS MSK Connect, it fails at startup with:

ERROR WorkerSinkTask{id=REDACTED-0} Task threw an uncaught and unrecoverable exception. Task is being killed and will not recover until manually restarted (org.apache.kafka.connect.runtime.WorkerTask:193)
java.lang.NoClassDefFoundError: com/ctc/wstx/io/InputBootstrapper
    at java.base/java.lang.Class.forName0(Native Method)
    at java.base/java.lang.Class.forName(Class.java:398)
    at org.apache.iceberg.common.DynClasses$Builder.impl(DynClasses.java:68)
    at org.apache.iceberg.connect.CatalogUtils.loadHadoopConfig(CatalogUtils.java:53)
    at org.apache.iceberg.connect.CatalogUtils.loadCatalog(CatalogUtils.java:45)
    at org.apache.iceberg.connect.IcebergSinkTask.open(IcebergSinkTask.java:56)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.openPartitions(WorkerSinkTask.java:641)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.access$1100(WorkerSinkTask.java:71)
    at org.apache.kafka.connect.runtime.WorkerSinkTask$HandleRebalance.onPartitionsAssigned(WorkerSinkTask.java:706)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.invokePartitionsAssigned(ConsumerCoordinator.java:293)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.onJoinComplete(ConsumerCoordinator.java:430)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.joinGroupIfNeeded(AbstractCoordinator.java:449)
    at org.apache.kafka.clients.consumer.internals.AbstractCoordinator.ensureActiveGroup(AbstractCoordinator.java:365)
    at org.apache.kafka.clients.consumer.internals.ConsumerCoordinator.poll(ConsumerCoordinator.java:508)
    at org.apache.kafka.clients.consumer.KafkaConsumer.updateAssignmentMetadataIfNeeded(KafkaConsumer.java:1257)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1226)
    at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1206)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.pollConsumer(WorkerSinkTask.java:458)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:325)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:232)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:201)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:191)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:240)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: java.lang.ClassNotFoundException: com.ctc.wstx.io.InputBootstrapper
    at java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:594)
    at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
    at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:527)
    ... 28 more

This doesn't happen in the integration tests (and most likely not in other environments such as Confluent Cloud either) because the `confluentinc/cp-kafka-connect` Docker image already ships this dependency:

$ docker run -ti --rm --user root confluentinc/cp-kafka-connect find /usr/share/ -name woodstox-core-6.5.1.jar
/usr/share/java/kafka-serde-tools/woodstox-core-6.5.1.jar
/usr/share/java/confluent-control-center/woodstox-core-6.5.1.jar

The Hive variant of the distribution artifact contains an older version of the dependency (`woodstox-core-5.4.0.jar`), but I don't think switching to that variant should be the solution, as it is meant for Iceberg installations that use a Hive catalog.
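
To check whether a given distribution archive bundles the jar, something like the following minimal sketch could be used. The function name `find_woodstox_jars` is hypothetical, and the script assumes only that the jar's file name starts with `woodstox-core-`, wherever it sits inside the zip.

```python
import re
import zipfile

# Matches entries whose file name is woodstox-core-<version>.jar, at any depth.
# Assumption: the jar keeps its upstream file name inside the archive.
WOODSTOX_JAR = re.compile(r"(^|/)woodstox-core-[^/]*\.jar$")


def find_woodstox_jars(zip_file):
    """Return the sorted archive entries that look like a woodstox-core jar.

    zip_file may be a path or a binary file-like object; both are
    accepted by zipfile.ZipFile.
    """
    with zipfile.ZipFile(zip_file) as archive:
        return sorted(
            name for name in archive.namelist() if WOODSTOX_JAR.search(name)
        )
```

Calling `find_woodstox_jars` on the zip under `kafka-connect/kafka-connect-runtime/build/distributions/` should return an empty list for a build affected by this issue, and at least one entry once the dependency is bundled again.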

josepanguera commented 2 weeks ago

Fixed with #11516