awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Latest AWSGlueETL-4.0.0 is not compatible with latest AWSGlueDynamicSchema-0.9.0 #207

Closed alexrashed closed 3 months ago

alexrashed commented 4 months ago

Issue

We have recently been hitting an unexpected NoSuchMethodError when testing Glue ETL jobs:

java.lang.NoSuchMethodError: 'void com.amazonaws.services.glue.schema.types.Field.<init>(java.lang.String, com.amazonaws.services.glue.schema.types.DataType, com.amazonaws.services.glue.schema.SchemaProperties, java.lang.String)'

This issue started appearing around the 28th of March. After quite some investigation, I think it is caused by incompatible updates of the JAR files in the Maven repository: https://aws-glue-etl-artifacts.s3.amazonaws.com/
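
A NoSuchMethodError of this kind means the caller was compiled against a constructor that the class found at runtime does not have. One way to check this is to list, via reflection, the constructors that the class on the classpath actually exposes and compare them with the signature in the error message. The following is only a minimal sketch: `ConstructorProbe` and `constructorSignatures` are hypothetical names, and the default class name is taken from the error above. The class is looked up by name, so the sketch compiles without the Glue JARs.

```java
import java.lang.reflect.Constructor;
import java.util.ArrayList;
import java.util.List;

public class ConstructorProbe {

    // Returns one signature string per constructor declared by the named class.
    static List<String> constructorSignatures(String className) throws ClassNotFoundException {
        // initialize=false: load the class without running its static initializers.
        Class<?> cls = Class.forName(className, false, ConstructorProbe.class.getClassLoader());
        List<String> signatures = new ArrayList<>();
        for (Constructor<?> constructor : cls.getDeclaredConstructors()) {
            signatures.add(constructor.toString());
        }
        return signatures;
    }

    public static void main(String[] args) {
        // Default taken from the NoSuchMethodError; pass another class name as the first argument.
        String name = args.length > 0 ? args[0] : "com.amazonaws.services.glue.schema.types.Field";
        try {
            for (String signature : constructorSignatures(name)) {
                System.out.println(signature);
            }
        } catch (ClassNotFoundException | LinkageError e) {
            // The class itself, or one of its parameter types, is missing from the classpath.
            System.out.println("Could not inspect " + name + ": " + e);
        }
    }
}
```

Run with the same classpath as the failing job. If no printed signature has the four parameter types from the error message, the JAR providing `Field` and the JAR calling it were built against different versions.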

Hypothesis

My hypothesis is that the following events caused this issue (based on certain observations and the timestamps in the Maven repo index at https://aws-glue-etl-artifacts.s3.amazonaws.com/):

Reproducer

Slight modification of the POM file of this repo to fix another incompatibility (log4j upgrade to 2.17.2):

...
  <dependencies>
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>AWSGlueETL</artifactId>
      <version>${project.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-api</artifactId>
      <version>2.17.2</version>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j-core</artifactId>
      <version>2.17.2</version>
    </dependency>

    <dependency>
      <groupId>org.apache.logging.log4j</groupId>
      <artifactId>log4j</artifactId>
      <version>2.17.0</version>
      <type>pom</type>
    </dependency>
  </dependencies>
  <repositories>
    <repository>
      <id>aws-glue-etl-artifacts</id>
      <url>https://aws-glue-etl-artifacts.s3.amazonaws.com/release/</url>
    </repository>
  </repositories>
...

Main class to reproduce the issue:

import com.amazonaws.services.glue.AWSGlue;
import com.amazonaws.services.glue.AWSGlueClientBuilder;
import com.amazonaws.services.glue.model.Column;
import com.amazonaws.services.glue.util.DataCatalogWrapper;
import scala.collection.immutable.List;
import scala.collection.immutable.Nil$;

class Scratch {
    public static void main(String[] args) {
        Column column = new Column();
        column.setName("Test");
        column.setType("boolean");
        List nil = Nil$.MODULE$; // the empty list
        List cols = nil.$colon$colon(column); // column::nil
        AWSGlue client = AWSGlueClientBuilder.standard().withRegion("us-east-1").build();
        DataCatalogWrapper wrapper = new DataCatalogWrapper(client);
        wrapper.getFieldsFromColumns(cols);
    }
}

Output:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/localstack/.m2/repository/org/apache/logging/log4j/log4j-slf4j-impl/2.17.2/log4j-slf4j-impl-2.17.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/localstack/.m2/repository/org/slf4j/slf4j-reload4j/1.7.36/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.7.2
ANTLR Tool version 4.3 used for code generation does not match the current runtime version 4.7.2
Exception in thread "main" java.lang.NoSuchMethodError: 'void com.amazonaws.services.glue.schema.types.Field.<init>(java.lang.String, com.amazonaws.services.glue.schema.types.DataType, com.amazonaws.services.glue.schema.SchemaProperties, java.lang.String)'
    at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.$anonfun$getFieldsFromColumns$1(DataCatalogWrapper.scala:536)
    at scala.collection.immutable.List.map(List.scala:282)
    at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.getFieldsFromColumns(DataCatalogWrapper.scala:535)
    at com.amazonaws.services.glue.util.DataCatalogWrapperUtils.getFieldsFromColumns$(DataCatalogWrapper.scala:535)
    at com.amazonaws.services.glue.util.DataCatalogWrapper.getFieldsFromColumns(DataCatalogWrapper.scala:166)
    at Scratch.main(scratch_2.java:23)
alexrashed commented 4 months ago

I created a reproducer that makes it easy to reproduce this issue: https://github.com/alexrashed/reproducer-aws-glue-etl-4-incompatibility. It just declares com.amazonaws.AWSGlueETL-4.0.0 as a Maven dependency, and the main function runs into a NoSuchMethodError.

My ask is to update the latest version of either AWSGlueETL-4.0.0 or AWSGlueDynamicSchema-0.9.0 so that they are compatible with each other again.

anthonybastidas49 commented 3 months ago

I have the same problem. As @alexrashed describes, there is an incompatibility between the versions. I obtained a later jar that is compatible (it has the expected number of parameters and works correctly), so perhaps the necessary update has already been made? Another question is why this problem only occurs locally: when I upload the job to AWS, it works fine. I have created other jobs with Glue 4. Does AWS keep some kind of cache for the jars, and is that why jobs uploaded to that account pick up the jars with the correct versions?
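
One way to compare the local environment with the job running on AWS is to print, in both places, the location the JVM actually loaded the relevant classes from. The sketch below does not answer whether AWS caches the jars. It only shows which JAR is in use, under these assumptions: `JarLocator` and `locationOf` are hypothetical names, and the default class names are the ones from the stack trace above.

```java
import java.security.CodeSource;

public class JarLocator {

    // Returns the URL of the JAR or directory a class was loaded from, or a note if it is unknown.
    static String locationOf(Class<?> cls) {
        CodeSource source = cls.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return "(no code source; typically a JDK class)";
        }
        return source.getLocation().toString();
    }

    public static void main(String[] args) {
        // Defaults taken from the stack trace; pass other class names as arguments to override.
        String[] names = args.length > 0 ? args : new String[] {
            "com.amazonaws.services.glue.schema.types.Field",
            "com.amazonaws.services.glue.util.DataCatalogWrapper"
        };
        for (String name : names) {
            try {
                // initialize=false: load the class without running its static initializers.
                Class<?> cls = Class.forName(name, false, JarLocator.class.getClassLoader());
                System.out.println(name + " -> " + locationOf(cls));
            } catch (ClassNotFoundException | LinkageError e) {
                System.out.println(name + " -> not loadable: " + e);
            }
        }
    }
}
```

If the printed locations differ between the local run and the job on AWS, the two environments are resolving different jar files, which would be consistent with the error appearing in only one of them.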

alexrashed commented 3 months ago

Thanks for the pointer, @anthonybastidas49! I can confirm that the Maven repository has been updated and that my test cases are working again with the latest jar files currently available in the Maven repo. I'll close this issue for now, but the question from @anthonybastidas49 is still open...