SiddyP closed this issue 2 years ago
@SiddyP This looks like it might be related to SPARK-34790. Can you please try adding this Spark conf: --conf spark.io.encryption.enabled=false
For example, you can launch a Glue v3 (Spark 3.1.1) REPL shell using:
docker run -it --rm -p 4040:4040 -e CUSTOM_SPARK_HISTORY_KEYSTORE=s3://BUCKET/Prefix/glue_key.aws.internal.jks -e CUSTOM_SPARK_HISTORY_KEYSTOREPW=XXXXXXXXXXXX --name glue_v3 public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01 pyspark --conf spark.io.encryption.enabled=false
@svajiraya Thanks. Disabling encryption does work.
So it seems the spark-defaults.conf baked into the Docker image will not work as-is. One has to either mount a replacement over /home/glue_user/spark/conf/spark-defaults.conf or supply --conf on the command line (which I assume takes precedence over the config baked into the container).
Is it realistic to suggest either bumping the Glue libs image to Spark 3.1.2, or changing the default baked-in conf (i.e. /home/glue_user/spark/conf/spark-defaults.conf) to set spark.io.encryption.enabled to false?
I will suggest trying this out in #127 too.
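For anyone comparing the two workarounds discussed above, here is a minimal sketch that assembles both docker run commands. The image name and the in-container path are taken from this thread. The helper names and the one-line override file are hypothetical, and the keystore environment variables and --name from the earlier command are left out for brevity. On the precedence question: as far as I know, Spark's configuration docs state that flags passed to spark-submit take precedence over spark-defaults.conf, which matches the assumption above.

```python
import shlex
import tempfile
from pathlib import Path

# Both values are copied from this thread.
IMAGE = "public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01"
CONTAINER_CONF = "/home/glue_user/spark/conf/spark-defaults.conf"
SETTING = "spark.io.encryption.enabled=false"


def cmd_with_conf_flag():
    """Workaround 1: pass the setting on the pyspark command line."""
    return ["docker", "run", "-it", "--rm", "-p", "4040:4040",
            IMAGE, "pyspark", "--conf", SETTING]


def cmd_with_mounted_conf(local_conf):
    """Workaround 2: bind-mount a local file over the baked-in conf."""
    mount = f"{Path(local_conf).resolve()}:{CONTAINER_CONF}:ro"
    return ["docker", "run", "-it", "--rm", "-p", "4040:4040",
            "-v", mount, IMAGE, "pyspark"]


def write_override(directory):
    """Write a hypothetical minimal override file.

    Assumption: a real override would probably start from a copy of the
    image's own spark-defaults.conf, since mounting a one-line file drops
    every other setting the image ships with.
    """
    conf = Path(directory) / "spark-defaults.conf"
    conf.write_text("spark.io.encryption.enabled false\n")
    return conf


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        conf = write_override(tmp)
        # Print copy-pasteable commands instead of running docker here.
        print(shlex.join(cmd_with_conf_flag()))
        print(shlex.join(cmd_with_mounted_conf(conf)))
```

The script only prints the commands, so it can be run without Docker installed. The --conf variant is the smaller change, since it leaves the rest of the image's defaults untouched.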
I've been experimenting with this for a few days and it only seems to work for very reduced "toy" examples.
For example, reading a few Parquet files from S3 (~30 MB on disk) into a PySpark DataFrame and deduplicating:
yields errors like
and
A few things I've experimented with that didn't seem to do much:
Run the container with more memory, CPU and writable mounts.
Override the Spark config in the container by mounting my own spark-defaults.conf:
and override the existing file in the container by adding this to the run command:
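When mounting a custom spark-defaults.conf like this, it can help to confirm what the file actually sets before starting the container. The sketch below is a simplified parser, not Spark's own loader: it assumes plain "key value" or "key=value" lines and ignores Java properties escapes and line continuations. The function name and the sample text are hypothetical.

```python
import re

# Key, then a separator ('=' or whitespace), then the rest of the line.
_LINE = re.compile(r"([^\s=]+)\s*[=\s]\s*(.*)")


def read_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict (simplified)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match:
            key, value = match.groups()
            settings[key] = value.strip()  # a later line overwrites an earlier one
    return settings


if __name__ == "__main__":
    # Hypothetical file content, for illustration only.
    sample = (
        "# local override\n"
        "spark.io.encryption.enabled false\n"
        "spark.driver.extraJavaOptions -Dfoo=bar\n"
    )
    conf = read_spark_defaults(sample)
    print(conf.get("spark.io.encryption.enabled", "<not set>"))
```

Running this against the local file you intend to mount shows whether spark.io.encryption.enabled is present and what value it has. It says nothing about what Spark resolves at runtime inside the container.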
Related: Distinct with more than 37 distinct values fails for some reason: https://github.com/awslabs/aws-glue-libs/issues/127