awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

glue-3.0 amzn spark tarball does not include pyspark #94

Closed. jesusch closed this issue 2 years ago.

jesusch commented 3 years ago

From README.md: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3

It does not contain a pyspark binary/executable.

castillo-luis commented 3 years ago

Don't bother with the glue-3.0 branch. Even with the pyspark "binary" (it's really just a script), which I copied from the EMR 6.3 base image, it's still not working.

Spark starts, then this gets thrown as soon as it hits glueContext = GlueContext(sc):

Exception in thread "Thread-6" java.lang.NoClassDefFoundError: com/amazonaws/services/lakeformation/model/CommitTransactionResult
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetPublicMethods(Class.java:2902)
    at java.lang.Class.getMethods(Class.java:1615)
    at py4j.reflection.ReflectionEngine.getMethodsByNameAndLength(ReflectionEngine.java:345)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:305)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: com.amazonaws.services.lakeformation.model.CommitTransactionResult
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 12 more
ERROR:root:Exception while sending command.
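
For reference, a minimal script that reproduces this would look something like the following (a sketch assuming the standard aws-glue-libs entry points; per the trace, SparkContext itself comes up and the error fires on the GlueContext line):

from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)  # the NoClassDefFoundError above is thrown here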

jesusch commented 3 years ago

While trying to submit a Python job via gluesparksubmit test.py --JOB_NAME test, I tried both OpenJDK 8 and OpenJDK 11.

It would be good to get an idea of which Java version should be used.

Traceback (most recent call last):
  File "/Users/jesusch/git/aws-glue-libs/test.py", line 11, in <module>
    sc = SparkContext()
  File "/Users/jesusch/Downloads/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/pyspark.zip/pyspark/context.py", line 146, in __init__
  File "/Users/jesusch/Downloads/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/pyspark.zip/pyspark/context.py", line 209, in _do_init
  File "/Users/jesusch/Downloads/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/pyspark.zip/pyspark/context.py", line 329, in _initialize_context
  File "/Users/jesusch/Downloads/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1568, in __call__
  File "/Users/jesusch/Downloads/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: 'void io.netty.util.concurrent.SingleThreadEventExecutor.<init>(io.netty.util.concurrent.EventExecutorGroup, java.util.concurrent.Executor, boolean, java.util.Queue, io.netty.util.concurrent.RejectedExecutionHandler)'
    at io.netty.channel.SingleThreadEventLoop.<init>(SingleThreadEventLoop.java:65)
    at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:138)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:146)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:37)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
    at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:59)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:86)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:81)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:68)
    at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:66)
    at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:106)
    at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:142)
    at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:77)
    at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:493)
    at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:829)
castillo-luis commented 3 years ago

Well, if you dump the Java version in a Glue 3.0 script, this is what it spits out:

b'openjdk version "1.8.0_282"\nOpenJDK Runtime Environment (build 1.8.0_282-b08)\nOpenJDK 64-Bit Server VM (build 25.282-b08, mixed mode)\n'

So OpenJDK 8 seems to be what they are using on the workers.
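
For anyone who wants to repeat that check, a minimal sketch of the approach (assumed, not the exact script used above; note that java -version writes to stderr):

import subprocess

# `java -version` prints to stderr, so fold stderr into the captured output
print(subprocess.check_output(["java", "-version"], stderr=subprocess.STDOUT))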

cvwjensen commented 3 years ago

When I click the link to either of the referenced S3 objects, I get:

<Error>
<Code>NoSuchKey</Code>
<Message>The specified key does not exist.</Message>
<Key>glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3</Key>
<RequestId>TZ094KCPBAHFN4W0</RequestId>
<HostId>w9M8wLBNkGI7lZ7oSHdaXJE0uUr1Z8mTHDnClPi0hOxZOvG6ckS3m20ccKMzKOZaha8xlFLf0ZQ=</HostId>
</Error>

Eg: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3

iRemjeyX commented 3 years ago

> When I click the link to either of the referenced S3 objects, I get the same NoSuchKey error as above.
> Eg: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3

Append .tgz, i.e. https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz

moomindani commented 3 years ago

Thank you for reporting the issue.

We have updated the following tarball to include the pyspark package: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz

We also fixed the README.md with the correct URL for the package.

Please let us know if you still see issues. For any new issue that is different from the pyspark packaging, it would be great if you could create a separate issue.

moomindani commented 3 years ago

Let us keep this issue open for a while to see if there are any additional issues.

moomindani commented 3 years ago

I double-checked that the original issue has been resolved. Closing.

cbishop commented 2 years ago

I need Glue 3 to investigate some Governed Tables operations.

I followed the instructions for Developing Locally with Python: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html

But when I run ./bin/gluepyspark, it fails with the error below. I see someone in this thread had a similar issue.

I'm using Oracle Java 1.8.0_311; would OpenJDK be a better choice? I got the latest Maven and Spark files from the link above, and I also tried the Glue ETL tarball uploaded by moomindani.

Any suggestions on how to resolve this problem?

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3.041 s
[INFO] Finished at: 2021-11-09T17:59:53-05:00
[INFO] ------------------------------------------------------------------------
mkdir: /Users/cbishop/dev/aws-glue-libs/conf: File exists
/Users/cbishop/dev/aws-glue-libs
Python 3.9.7 (default, Nov 9 2021, 08:38:13)
[Clang 13.0.0 (clang-1300.0.29.3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/cbishop/dev/aws-glue-libs/jarsv1/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/11/09 17:59:57 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/11/09 17:59:57 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext should be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/shell.py:42: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/shell.py", line 38, in <module>
    spark = SparkSession._create_shell_session()  # type: ignore
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/sql/session.py", line 553, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/sql/session.py", line 228, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 392, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 146, in __init__
    self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 209, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 329, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1568, in __call__
    return_value = get_return_value(
  File "/Users/cbishop/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
    raise Py4JJavaError(
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: io.netty.util.concurrent.SingleThreadEventExecutor.<init>(Lio/netty/util/concurrent/EventExecutorGroup;Ljava/util/concurrent/Executor;ZLjava/util/Queue;Lio/netty/util/concurrent/RejectedExecutionHandler;)V
    at io.netty.channel.SingleThreadEventLoop.<init>(SingleThreadEventLoop.java:65)
    at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:138)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:146)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:37)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
    at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:59)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:86)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:81)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:68)
    at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:66)
    at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:106)
    at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:142)
    at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:77)
    at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:493)
    at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

skycmoon commented 2 years ago

Could we reopen this ticket? This issue does not seem to be resolved.

I'm still experiencing the same issue as https://github.com/awslabs/aws-glue-libs/issues/94 with JDK 1.8.0_292.

[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  2.386 s
[INFO] Finished at: 2021-12-17T02:32:44-08:00
[INFO] ------------------------------------------------------------------------
mkdir: /Users/skym/dev/workspaces/aws-glue-libs/conf: File exists
/Users/skym/dev/workspaces/volta-etl
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStoreType=KeychainStore
Python 3.7.12 (default, Dec 17 2021, 02:24:21) 
[Clang 12.0.0 (clang-1200.0.32.29)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStoreType=KeychainStore
Picked up JAVA_TOOL_OPTIONS: -Djavax.net.ssl.trustStoreType=KeychainStore
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/skym/dev/workspaces/aws-glue-libs/jarsv1/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/skym/dev/workspaces/aws-glue-libs/jarsv1/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21/12/17 02:32:47 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
21/12/17 02:32:47 WARN SparkContext: Another SparkContext is being constructed (or threw an exception in its constructor). This may indicate an error, since only one SparkContext should be running in this JVM (see SPARK-2243). The other SparkContext was created at:
org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.lang.reflect.Constructor.newInstance(Constructor.java:423)
py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
py4j.Gateway.invoke(Gateway.java:238)
py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
py4j.GatewayConnection.run(GatewayConnection.java:238)
java.lang.Thread.run(Thread.java:748)
/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/shell.py:42: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/shell.py", line 38, in <module>
    spark = SparkSession._create_shell_session()  # type: ignore
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/sql/session.py", line 553, in _create_shell_session
    return SparkSession.builder.getOrCreate()
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/sql/session.py", line 228, in getOrCreate
    sc = SparkContext.getOrCreate(sparkConf)
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 392, in getOrCreate
    SparkContext(conf=conf or SparkConf())
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 147, in __init__
    conf, jsc, profiler_cls)
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 209, in _do_init
    self._jsc = jsc or self._initialize_context(self._conf._jconf)
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/pyspark/context.py", line 329, in _initialize_context
    return self._jvm.JavaSparkContext(jconf)
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1569, in __call__
    answer, self._gateway_client, None, self._fqn)
  File "/Users/skym/dev/tools/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.NoSuchMethodError: io.netty.util.concurrent.SingleThreadEventExecutor.<init>(Lio/netty/util/concurrent/EventExecutorGroup;Ljava/util/concurrent/Executor;ZLjava/util/Queue;Lio/netty/util/concurrent/RejectedExecutionHandler;)V
    at io.netty.channel.SingleThreadEventLoop.<init>(SingleThreadEventLoop.java:65)
    at io.netty.channel.nio.NioEventLoop.<init>(NioEventLoop.java:138)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:146)
    at io.netty.channel.nio.NioEventLoopGroup.newChild(NioEventLoopGroup.java:37)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:84)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:58)
    at io.netty.util.concurrent.MultithreadEventExecutorGroup.<init>(MultithreadEventExecutorGroup.java:47)
    at io.netty.channel.MultithreadEventLoopGroup.<init>(MultithreadEventLoopGroup.java:59)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:86)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:81)
    at io.netty.channel.nio.NioEventLoopGroup.<init>(NioEventLoopGroup.java:68)
    at org.apache.spark.network.util.NettyUtils.createEventLoop(NettyUtils.java:66)
    at org.apache.spark.network.client.TransportClientFactory.<init>(TransportClientFactory.java:106)
    at org.apache.spark.network.TransportContext.createClientFactory(TransportContext.java:142)
    at org.apache.spark.rpc.netty.NettyRpcEnv.<init>(NettyRpcEnv.scala:77)
    at org.apache.spark.rpc.netty.NettyRpcEnvFactory.create(NettyRpcEnv.scala:493)
    at org.apache.spark.rpc.RpcEnv$.create(RpcEnv.scala:57)
    at org.apache.spark.SparkEnv$.create(SparkEnv.scala:266)
    at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:189)
    at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:277)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:458)
    at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:238)
    at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
    at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
moomindani commented 2 years ago

Reopening.

skycmoon commented 2 years ago

After some research and testing, it seems related to a netty dependency version problem introduced by com.amazonaws.AWSGlueETL. After adding the two rm lines below to glue-setup.sh, the problem disappeared for me. Please make a proper update to the dependencies in com.amazonaws.AWSGlueETL.

# Run mvn copy-dependencies target to get the Glue dependencies locally
mvn -f $ROOT_DIR/pom.xml -DoutputDirectory=$ROOT_DIR/jarsv1 dependency:copy-dependencies
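# Added: remove the conflicting javax.servlet and netty jars just copied, so Spark's own versions are picked up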
rm $GLUE_JARS_DIR/javax.servlet-3.*
rm $GLUE_JARS_DIR/netty-*
moomindani commented 2 years ago

For customers who are facing netty-related errors, please try adding the following settings to spark-defaults.conf to avoid the netty dependency issue.

spark-defaults.conf

spark.driver.extraClassPath /path_to_spark/jars/*:/path-to-aws-glue-libs/jars/*
spark.executor.extraClassPath /path_to_spark/jars/*:/path-to-aws-glue-libs/jars/*
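
(Presumably the point is classpath ordering: putting the Spark jars directory ahead of the aws-glue-libs jars lets the JVM resolve Spark's newer netty classes before the older copies pulled in with the Glue dependencies.)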
skycmoon commented 2 years ago

Hmm, I placed spark-defaults.conf under $SPARK_HOME/conf, but it did not solve the problem. What are you trying to achieve with that configuration? Loading the libraries found under $SPARK_HOME/jars first, then ${aws-glue-libs project root}/jars?

If that's what you wanted, the proper place is not each individual's Spark config file but this repo; you already have a script in this repo that creates spark-defaults.conf. Check this PR as an example: https://github.com/awslabs/aws-glue-libs/pull/115 (you don't need to approve it, though).

moomindani commented 2 years ago

@skycmoon, thanks for the correction; you are right, the change needs to live in glue-setup.sh. I confirmed that this PR solved the NoSuchMethodError in the gluepyspark command.

skycmoon commented 2 years ago

@moomindani, Happy to contribute! Could you merge my PR then?

moomindani commented 2 years ago

We have now completed the review and merged your pull request. We really appreciate your contribution!