awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

PySpark library working in Glue Spark version 3.0 not working anymore in Glue Spark 4.0 #204

Open lorenzo-necto opened 5 months ago

lorenzo-necto commented 5 months ago

from awsglueml.transform import EntityDetector no longer works in Glue version 4.0 - what is the replacement? The AWS docs only cover Scala for PII detection outside of Glue Studio (i.e. via libraries), not PySpark.
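
For reference, on Glue 3.0 the PySpark usage looked roughly like the sketch below. The import is the one quoted above, and the detect() call shape mirrors what Glue Studio generates for its Detect PII node, so treat the method name and arguments as assumptions rather than a documented PySpark API:

# Rough sketch of the Glue 3.0 PySpark usage; the detect() signature below is
# assumed from Glue Studio-generated scripts, not official PySpark documentation.
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglueml.transform import EntityDetector  # imports on Glue 3.0, fails on Glue 4.0

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)

# hypothetical catalog source, just to have a DynamicFrame to scan
frame = glueContext.create_dynamic_frame.from_catalog(database="mydb", table_name="mytable")

entity_detector = EntityDetector()
detected = entity_detector.detect(frame, ["EMAIL", "CREDIT_CARD"], "DetectedEntities")  # assumed call shape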

awongCM commented 5 months ago

In addition to this, I followed the setup guide here - https://github.com/awslabs/aws-glue-libs?tab=readme-ov-file#setup-guide - using the latest master branch, and when I tried to run a simple Glue script as below:

gluesparksubmit main.py --JOB_NAME=test1
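
For context, main.py is essentially the stock Glue job boilerplate; the sketch below is an approximation (the exact script isn't reproduced here - only line 8, sc = SparkContext.getOrCreate(), shows up in the traceback that follows):

# Minimal sketch of the kind of script being submitted; standard Glue job boilerplate.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext.getOrCreate()   # this is the call that blows up with the Py4JJavaError below
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# ... ETL logic would go here ...

job.commit()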

Traceback (most recent call last):
  File "/Users/andywongcheeming/Projects/poc/local-aws-glue-jobs/aws-glue-local/main.py", line 8, in <module>
    sc = SparkContext.getOrCreate()
  File "/Users/andywongcheeming/Projects/spark/python/lib/pyspark.zip/pyspark/context.py", line 491, in getOrCreate
  File "/Users/andywongcheeming/Projects/spark/python/lib/pyspark.zip/pyspark/context.py", line 197, in __init__
  File "/Users/andywongcheeming/Projects/spark/python/lib/pyspark.zip/pyspark/context.py", line 282, in _do_init
  File "/Users/andywongcheeming/Projects/spark/python/lib/pyspark.zip/pyspark/context.py", line 410, in _initialize_context
  File "/Users/andywongcheeming/Projects/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1585, in __call__
  File "/Users/andywongcheeming/Projects/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.ExceptionInInitializerError
        at org.apache.spark.unsafe.array.ByteArrayMethods.<clinit>(ByteArrayMethods.java:56)
        at org.apache.spark.memory.MemoryManager$.getPageSizeBytes(MemoryManager.scala:287)
        at org.apache.spark.memory.MemoryManager.<init>(MemoryManager.scala:250)
        at org.apache.spark.memory.UnifiedMemoryManager.<init>(UnifiedMemoryManager.scala:58)
        at org.apache.spark.memory.UnifiedMemoryManager$.apply(UnifiedMemoryManager.scala:207)
        at org.apache.spark.SparkEnv$.create(SparkEnv.scala:324)
        at org.apache.spark.SparkEnv$.createDriverEnv(SparkEnv.scala:198)
        at org.apache.spark.SparkContext.createSparkEnv(SparkContext.scala:280)
        at org.apache.spark.SparkContext.<init>(SparkContext.scala:465)
        at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
        at java.base/jdk.internal.reflect.DirectConstructorHandleAccessor.newInstance(DirectConstructorHandleAccessor.java:62)
        at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:502)
        at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:486)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
        at py4j.Gateway.invoke(Gateway.java:238)
        at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
        at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
        at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
        at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
        at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.lang.IllegalStateException: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)
        at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:113)
        ... 21 more
Caused by: java.lang.NoSuchMethodException: java.nio.DirectByteBuffer.<init>(long,int)
        at java.base/java.lang.Class.getConstructor0(Class.java:3761)
        at java.base/java.lang.Class.getDeclaredConstructor(Class.java:2930)
        at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:71)
        ... 21 more

It's saying the sc = SparkContext.getOrCreate() method does not exist. I'm confused because that function has been there for the last two major AWS Glue lib versions.

I'm running on an Apple MacBook M2 Pro using Spark version spark-3.3.0-amzn-1-bin-3.3.3-amzn-0, by the way.

So I'm wondering whether it's not working as expected because I'm running on an arm64-based machine, as opposed to a non-arm64 machine.

PS: I'm using the binary distribution of the library, not the Docker-based image version. Just want to clarify this upfront.
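
For completeness, here's the kind of quick check I'd run to confirm which JVM and Spark distribution gluesparksubmit ends up picking up locally (a generic environment dump, nothing Glue-specific):

import os
import subprocess

# `java -version` writes its output to stderr
print(subprocess.run(["java", "-version"], capture_output=True, text=True).stderr.strip())

# Show which Spark/Java the local setup points at
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_PYTHON"):
    print(f"{var}={os.environ.get(var)}")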