awslabs / aws-glue-libs

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

docker image amazon/aws-glue-libs:glue_libs_3.0.0_image_01 only works for very reduced examples #128

Closed SiddyP closed 2 years ago

SiddyP commented 2 years ago

I've been experimenting with this for a few days and it only seems to work for very reduced "toy" examples.

Example: reading a few Parquet files from S3 (~30 MB on disk) into a PySpark DataFrame and deduplicating:

df = spark.read.format("parquet").load("/somePath")
df.dropDuplicates(['id']).show()

yields errors like

org.apache.spark.SparkException: Job aborted due to stage failure: ShuffleMapStage 89 (count at NativeMethodAccessorImpl.java:0) has failed the maximum allowable number of times: 4. Most recent failure reason: org.apache.spark.shuffle.FetchFailedException: Stream is corrupted

and

java.lang.OutOfMemoryError: Java heap space
INFO LineBufferedStream: at org.xerial.snappy.SnappyInputStream.hasNextChunk(SnappyInputStream.java:422)

A few things I've experimented with that didn't seem to do much

Related: Distinct with more than 37 distinct values fails for some reason: https://github.com/awslabs/aws-glue-libs/issues/127

svajiraya commented 2 years ago

@SiddyP This looks like it might be related to SPARK-34790. Can you please try adding the Spark conf --conf spark.io.encryption.enabled=false?

For example, you can launch a Glue v3 (Spark 3.1.1) REPL shell using:

docker run -it --rm -p 4040:4040 -e CUSTOM_SPARK_HISTORY_KEYSTORE=s3://BUCKET/Prefix/glue_key.aws.internal.jks -e CUSTOM_SPARK_HISTORY_KEYSTOREPW=XXXXXXXXXXXX --name glue_v3 public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark --conf spark.io.encryption.enabled=false

SiddyP commented 2 years ago

@svajiraya Thanks. Disabling encryption does work.

So it seems the spark-defaults.conf baked into the Docker image will not work as-is. One either has to mount a replacement over /home/glue_user/spark/conf/spark-defaults.conf or supply --conf on the command line (which I assume takes precedence over the config baked into the container).
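For the mount-overwrite route, a minimal sketch of producing a patched copy of the config might look like the following. The helper name override_spark_conf and the sample file contents are hypothetical; I have not checked what the image's spark-defaults.conf actually contains, only that the format is one property per line, with the name separated from the value by whitespace (or "=" / ":").

```python
import re


def override_spark_conf(conf_text: str, key: str, value: str) -> str:
    """Return conf_text with `key` set to `value` (hypothetical helper).

    An existing entry for `key` is replaced in place, duplicates are dropped,
    and the entry is appended if the key was not present at all.
    """
    out = []
    replaced = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        # Keep blank lines and comments untouched.
        if not stripped or stripped.startswith("#"):
            out.append(line)
            continue
        # The property name is the first token before whitespace, "=" or ":".
        name = re.split(r"[\s=:]+", stripped, maxsplit=1)[0]
        if name == key:
            if not replaced:
                out.append(f"{key} {value}")
                replaced = True
            continue  # drop any later duplicate of the same key
        out.append(line)
    if not replaced:
        out.append(f"{key} {value}")
    return "\n".join(out) + "\n"


# Hypothetical sample contents, NOT the real file from the image.
sample = (
    "# sample spark-defaults.conf\n"
    "spark.master local\n"
    "spark.io.encryption.enabled true\n"
)
patched = override_spark_conf(sample, "spark.io.encryption.enabled", "false")
print(patched)
```

The patched text could then be written to a local file and bind-mounted over /home/glue_user/spark/conf/spark-defaults.conf with docker run -v. Passing --conf on the command line, as in the command above, avoids the extra file entirely.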

Is it realistic to suggest bumping the glue-libs image to Spark 3.1.2, or changing the default baked-in conf (i.e. /home/glue_user/spark/conf/spark-defaults.conf) to set spark.io.encryption.enabled to false?

I will suggest that the reporter of #127 try this out too.