apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.49k stars 2.24k forks source link

Kafka Connect: fix Hadoop dependency exclusion #11516

Closed bryanck closed 1 week ago

bryanck commented 1 week ago

This PR removes the exclusion of the woodstox libraries from the Hadoop transitive dependencies when building the Kafka Connect distribution, as they are needed to load Hadoop's Configuration. Previously the required libraries were brought in by the Azure dependencies until that was upgraded in this commit, so the connector did not have any issues.

This addresses https://github.com/apache/iceberg/issues/11489.

bryanck commented 1 week ago

I asked the user to test this PR on MSK (where the issue happens), so just waiting on that.

josepanguera commented 1 week ago

I asked the user to test this PR on MSK (where the issue happens), so just waiting on that.

I'm deploying it since MSK Connect it's a black box completely. I'll know in a minute.

josepanguera commented 1 week ago

I'm including here my build scripts to make it work for now if it's of any usefulness.

Makefile

SHELL=/bin/bash

SCHEMA_REGISTRY_CONVERTER_VERSION := 1.1.20
SCHEMA_REGISTRY_CONVERTER_ARTIFACT := software.amazon.glue:schema-registry-kafkaconnect-converter:${SCHEMA_REGISTRY_CONVERTER_VERSION}
SCHEMA_REGISTRY_CONVERTER_JAR := schema-registry-kafkaconnect-converter-${SCHEMA_REGISTRY_CONVERTER_VERSION}.jar

# This JAR and its dependency won't be necessary when this issue is fixed
# https://github.com/apache/iceberg/issues/11489
WOODSTOX_VERSION := 6.7.0
WOODSTOX_ARTIFACT := com.fasterxml.woodstox:woodstox-core:${WOODSTOX_VERSION}
WOODSTOX_JAR := woodstox-core-${WOODSTOX_VERSION}.jar

STAX_VERSION := 4.2.2
STAX_ARTIFACT := org.codehaus.woodstox:stax2-api:${STAX_VERSION}
STAX_JAR := stax2-api-${STAX_VERSION}.jar

CONNECTOR_ZIP := iceberg-sink-*-*.zip

.DEFAULT_GOAL := all
.PHONY: all create-aws-plugin clean

all: $(CONNECTOR_ZIP) $(SCHEMA_REGISTRY_CONVERTER_JAR) $(WOODSTOX_JAR) $(STAX_JAR)
    zip $(CONNECTOR_ZIP) $(SCHEMA_REGISTRY_CONVERTER_JAR) $(WOODSTOX_JAR) $(STAX_JAR)

$(CONNECTOR_ZIP):
    ./build.sh

$(SCHEMA_REGISTRY_CONVERTER_JAR):
    # You can run this inside a `maven` docker image if you don't have `mvn` locally
    mvn dependency:copy -Dartifact=$(SCHEMA_REGISTRY_CONVERTER_ARTIFACT) -DoutputDirectory=$(pwd)

$(WOODSTOX_JAR):
    # You can run this inside a `maven` docker image if you don't have `mvn` locally
    mvn dependency:copy -Dartifact=$(WOODSTOX_ARTIFACT) -DoutputDirectory=$(pwd)

$(STAX_JAR):
    # You can run this inside a `maven` docker image if you don't have `mvn` locally
    mvn dependency:copy -Dartifact=$(STAX_ARTIFACT) -DoutputDirectory=$(pwd)

clean:
    @rm -f $(CONNECTOR_ZIP) $(SCHEMA_REGISTRY_CONVERTER_JAR) $(WOODSTOX_JAR) $(STAX_JAR)

build.sh

#!/usr/bin/env bash

set -eo pipefail

ICEBERG_VERSION=1.8.0

[[ -n $ICEBERG_LOCATION ]] || {
  echo "ICEBERG_LOCATION environment variable not defined"
  exit 1
}

[[ -d $ICEBERG_LOCATION ]] || {
  echo "No directory exists at $ICEBERG_LOCATION"
  exit 1
}

pushd "$ICEBERG_LOCATION"
./gradlew -x test -x integrationTest build
GIT_COMMIT=$(git rev-parse --short HEAD)
popd

ARTIFACT_FOLDER="kafka-connect/kafka-connect-runtime/build/distributions"
ARTIFACT_PATH="$ICEBERG_LOCATION/$ARTIFACT_FOLDER/iceberg-kafka-connect-runtime-$ICEBERG_VERSION-SNAPSHOT.zip"
cp "$ARTIFACT_PATH" "iceberg-sink-$ICEBERG_VERSION-$GIT_COMMIT.zip"
bryanck commented 1 week ago

Thanks for testing @josepanguera , any chance you could test one more time? I removed another exclude. If that doesn't work I'll debug on MSK myself.

bryanck commented 1 week ago

The unfortunate fact is many Hadoop/Hive dependencies have security vulnerabilities, which is why we have 2 connector distributions (one with Hive, one without). I pushed an update to force the version to 6.7.0, if you want to try again...

josepanguera commented 1 week ago

The unfortunate fact is many Hadoop/Hive dependencies have security vulnerabilities, which is why we have 2 connector distributions (one with Hive, one without). I pushed an update to force the version to 6.7.0, if you want to try again...

Oh, I see.

I cannot try it until later, I'll update the issue as soon as I check it.

RussellSpitzer commented 1 week ago

I'm a little lost on this, why is this difficult on MSK? If we don't know the classpath is there ever a way we can really test things?

bryanck commented 1 week ago

The full classpath depends on the Connect framework being used to run the connector (MSK, Strimzi, Confluent, etc), so in this case we need to deploy to MSK to really test this. Josep was kind enough to do that. We use the Confluent image for the integration tests, and that already has this dependency so the tests were passing.

RussellSpitzer commented 1 week ago

The full classpath depends on the Connect framework being used to run the connector (MSK, Strimzi, Confluent, etc), so in this case we need to deploy to MSK to really test this. Josep was kind enough to do that. We use the Confluent image for the integration tests, and that already has this dependency so the tests were passing.

If this is the case, can ever actually stay compatible? Aren't we at the mercy of these other builds?

bryanck commented 1 week ago

To some extent we are at their mercy. But in this case, it is really our bug, because we were relying on a library to be on the classpath that we shouldn't have relied on. We got "lucky" before on MSK because the Azure dependency happened to also require this library so it was being included in the runtime. When Azure was updated, it was no longer included and caused the error. The library is actually required by our Hadoop dependency, not Connect.

bryanck commented 1 week ago

Awesome, thanks @josepanguera for reporting this and testing the fix, and thanks @nastra @Fokko @RussellSpitzer @ajantha-bhat and @singhpk234 for the review!