awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

NoClassDefFoundError when calling dropFields - dependency issue with commons-collections? #77

Open taylorbarstow opened 3 years ago

taylorbarstow commented 3 years ago

I've recently started getting the following error when using drop_fields with aws-glue-libs via the aws_glue_libs docker image:

An error was encountered:
An error occurred while calling o48.dropFields.
: java.lang.NoClassDefFoundError: org/apache/commons/collections4/IteratorUtils
    at com.amazonaws.services.glue.schema.types.ChoiceType.iterator(ChoiceType.java:91)
    at java.lang.Iterable.forEach(Iterable.java:74)
    at com.amazonaws.services.glue.schema.Schema.applyPreorderInternal(Schema.java:455)
    at com.amazonaws.services.glue.schema.Schema.lambda$applyPreorderInternal$7(Schema.java:442)
    at java.lang.Iterable.forEach(Iterable.java:75)
    at com.amazonaws.services.glue.schema.Schema.applyPreorderInternal(Schema.java:441)
    at com.amazonaws.services.glue.schema.Schema.applyPreorder(Schema.java:471)
    at com.amazonaws.services.glue.DynamicFrame.findEmptyStruct(DynamicFrame.scala:947)
    at com.amazonaws.services.glue.DynamicFrame.dropFieldsInternal(DynamicFrame.scala:867)
    at com.amazonaws.services.glue.DynamicFrame.dropFields(DynamicFrame.scala:857)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.ClassNotFoundException: org.apache.commons.collections4.IteratorUtils
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 21 more

Any hints or pointers on how to dig into this? Nothing has changed with the docker image, so my hunch is that the issue stems from an upstream change in the glue ETL jars. I've tried adding commons-collections4 as a dependency in pom.xml and then running mvn package, but that didn't solve it.

Any help or directional advice would be appreciated!
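
One way to narrow this down is to check whether any jar in the Spark jars directory actually ships the class named in the stack trace. The following is a minimal sketch, not part of aws-glue-libs: the function name `jars_containing` is hypothetical, and it assumes all relevant jars sit in a single directory.

```python
import zipfile
from pathlib import Path

# Class named in the stack trace, in the path form used inside a jar.
MISSING_CLASS = "org/apache/commons/collections4/IteratorUtils.class"


def jars_containing(jars_dir, class_entry=MISSING_CLASS):
    """Return the names of jars in jars_dir that contain class_entry."""
    matches = []
    for jar in sorted(Path(jars_dir).glob("*.jar")):
        try:
            with zipfile.ZipFile(jar) as archive:
                if class_entry in archive.namelist():
                    matches.append(jar.name)
        except zipfile.BadZipFile:
            # Skip files that are not valid archives rather than abort the scan.
            continue
    return matches
```

Calling `jars_containing` on the jars directory of the Spark install and getting an empty list back would be consistent with the ClassNotFoundException above: nothing in that directory provides the class.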

taylorbarstow commented 3 years ago

UPDATE

I was able to successfully work around this by:

  1. Adding commons-collections4 as a dependency in pom.xml
  2. Running mvn install
  3. Running mvn -f ${GLUE_ROOT}/pom.xml -DoutputDirectory=${SPARK_ROOT}/jars dependency:copy-dependencies where GLUE_ROOT is the root of this project, and SPARK_ROOT is the root of my spark install

If the maintainers think a fix within aws-glue-libs is warranted, I'd be happy to submit a PR. However I have a hunch that this is due to broken dependencies in the glue ETL jars, in which case this issue may simply go away once the upstream dependency issues are resolved.

PPFilip commented 3 years ago

Oh wow, thanks for the hint. The same issue is actually present in the AWS Glue Docker image (safe to assume it is built from this repository), and I've been banging my head over it. I fixed it just by downloading Apache Commons Collections 4.4 from the Apache website, unpacking it, and copying the jar into Spark's jars directory.

# wget https://downloads.apache.org//commons/collections/binaries/commons-collections4-4.4-bin.zip
# unzip commons-collections4-4.4-bin.zip
# cp commons-collections4-4.4/commons-collections4-4.4.jar /home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars/ 

Restart the Docker container and voilà, it works and I can finally work with my data.
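
The unpack-and-copy step can also be scripted. This is a minimal Python sketch under some assumptions: the zip has already been downloaded, and the helper name `install_jar_from_zip` is hypothetical, not something the image or aws-glue-libs provides.

```python
import shutil
import zipfile
from pathlib import Path


def install_jar_from_zip(zip_path, member, jars_dir):
    """Extract one jar from a downloaded archive and place it in jars_dir."""
    target = Path(jars_dir) / Path(member).name
    with zipfile.ZipFile(zip_path) as archive:
        # Stream the single member instead of unpacking the whole archive.
        with archive.open(member) as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return target
```

With the paths from the commands above, the call would be `install_jar_from_zip("commons-collections4-4.4-bin.zip", "commons-collections4-4.4/commons-collections4-4.4.jar", "/home/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8/jars")`. A restart is still needed afterwards so that the JVM picks up the new jar.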

matiassciencenow commented 3 years ago

I had a very similar issue using the AWS Glue docker container (glue 1.0). I couldn't load data from XML files using glueContext.create_dynamic_frame_from_options. I fixed it by following @PPFilip's steps to include Apache Commons Collections 4.4 in the jars. I restarted the docker image and it worked.

Thanks a lot @PPFilip and @taylorbarstow, you guys made my day.

danielpazeto commented 2 years ago

I'm having the same issue, but when using the resolveChoice method of DynamicFrame.

df.resolveChoice(choice = "cast:string")

I'm trying to understand where to put that jar.