awslabs / amazon-s3-tagging-spark-util

Apache License 2.0

Getting error while using s3 spark tagging util #4

Closed kremrikpatel closed 1 year ago

kremrikpatel commented 2 years ago

Hi,

I am running Spark jobs on Glue 3.0 with PySpark, using the Spark tagging util jar downloaded from the release page https://github.com/awslabs/amazon-s3-tagging-spark-util/releases (amazon-s3-tagging-spark-util-assembly_2.12-1.0.jar). I pass the jar to the Glue job as an external argument: "--extra-jars" : "s3://$BUCKET/$PREFIX/amazon-s3-tagging-spark-util-assembly_2.12-1.0.jar".

Glue start job command:

$ aws glue start-job-run --job-name "CSV to CSV" --arguments='--extra-jars="s3://$BUCKET/$PREFIX/amazon-s3-tagging-spark-util-assembly_2.12-1.0.jar"'
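The quoting in the command above is easy to get wrong, since the shell does not expand `$BUCKET` and `$PREFIX` inside single quotes. Below is a minimal sketch of one way to assemble the same command from Python, assuming the JSON map form of `--arguments` is used. The helper name `build_start_job_run_command` and the bucket and prefix values are hypothetical.

```python
import json


def build_start_job_run_command(job_name, jar_uri):
    """Return the argv list for `aws glue start-job-run` with one extra jar."""
    # Glue job arguments are a string-to-string map; the CLI accepts it as JSON.
    arguments = {"--extra-jars": jar_uri}
    return [
        "aws", "glue", "start-job-run",
        "--job-name", job_name,
        "--arguments", json.dumps(arguments),
    ]


# Hypothetical bucket and prefix, for illustration only.
jar_uri = "s3://my-bucket/jars/amazon-s3-tagging-spark-util-assembly_2.12-1.0.jar"
command = build_start_job_run_command("CSV to CSV", jar_uri)
print(command)
```

Passing the list to `subprocess.run(command, check=True)` would avoid shell quoting altogether, because each element reaches the CLI as a separate argument.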

The jar registers successfully in the Glue job; I can see it in the Spark config: ('spark.glue.extra-jars', 's3://$BUCKET/$PREFIX/amazon-s3-tagging-spark-util-assembly_2.12-1.0.jar')

First I read a file from the S3 bucket, and the read succeeds:

df = spark.read.csv('s3://file', header=True, inferSchema=True)

Then I write the data back to S3:

df.write.format("s3.csv").option("tag", "{\"ProjectTeam\": \"Team-A\", \"FileType\":\"parquet\"}").save("s3://$DATA_BUCKET/$TABLE_NAME")
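The hand-escaped JSON in the `tag` option is a likely source of typos. As a minimal sketch, the same string can be produced with `json.dumps`. The variable names `tags` and `tag_option` are illustrative, and the write call is left as a comment because it needs a live Spark session with the tagging jar on the classpath.

```python
import json

# Tag keys and values taken from the write call above.
tags = {"ProjectTeam": "Team-A", "FileType": "parquet"}

# Serialize once instead of escaping quotes by hand.
tag_option = json.dumps(tags)
print(tag_option)

# With a DataFrame `df` available, the option would be passed like this:
# df.write.format("s3.csv").option("tag", tag_option).save("s3://$DATA_BUCKET/$TABLE_NAME")
```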

But I get an error while writing the file:

File "/opt/amazon/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)

py4j.protocol.Py4JJavaError: An error occurred while calling o165.save : java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/csv/CSVOptions

Can you please help me with this?
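One possible reading of the `NoClassDefFoundError`, offered as an assumption and not as a confirmed diagnosis: the class named in the error lives in the `execution.datasources.csv` package in Spark 2.x, while Spark 3.x moved `CSVOptions` to `org.apache.spark.sql.catalyst.csv`, and Glue 3.0 is documented as running Spark 3.1. A jar compiled against Spark 2.x would then look for a class that no longer exists at that path. The helper name below is hypothetical; it only encodes that mapping so the running version can be compared against the error.

```python
def csv_options_class_for(spark_version: str) -> str:
    """Return the fully qualified CSVOptions class name for a Spark version."""
    major = int(spark_version.split(".")[0])
    if major >= 3:
        # Spark 3.x keeps CSVOptions in the catalyst module.
        return "org.apache.spark.sql.catalyst.csv.CSVOptions"
    # Spark 2.x keeps it under execution.datasources, the path in the error.
    return "org.apache.spark.sql.execution.datasources.csv.CSVOptions"


# In a Glue job, `spark.version` would supply the real value; "3.1.1" is assumed here.
print(csv_options_class_for("3.1.1"))
```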

rumeshkrish commented 2 years ago

@kremrikpatel Thanks for the detailed explanation. I am currently working on it; you can expect a new release in a few weeks.

rumeshkrish commented 1 year ago

Please check the new [release V2.0](https://github.com/awslabs/amazon-s3-tagging-spark-util/releases/tag/v2.0) page and README.md. Thanks for your patience!

Closing this issue.