How to Enable Glue Hive MetaStore with EMR CLI

awslabs / amazon-emr-cli

A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs

Apache License 2.0

35 stars, 12 forks

How to Enable Glue Hive MetaStore with EMR CLI #18

Closed soumilshah1995 closed 1 year ago

soumilshah1995 commented 1 year ago

AWS has announced the AWS EMR CLI:

https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

I have tried it and the CLI works great; it simplifies submitting jobs. However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI?

Here is a sample of how we are submitting jobs

emr run --entry-point entrypoint.py --application-id --job-role <arn> --s3-code-uri s3:///emr_scripts/ --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" --build --wait

If you could kindly get back to us on this issue, that would be great 😃 @dacort

dacort commented 1 year ago

Hi @soumilshah1995 - The option to enable the Glue catalog is a spark-submit config option. You can add it using --spark-submit-opts.

The specific option is:

--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

So in your case, instead of --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer", you'll have --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
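As the list of --conf settings grows, the quoted string gets hard to read and easy to mistype. A minimal sketch of one way to assemble it, assuming you build the command from Python; the helper name build_spark_submit_opts is hypothetical and is not part of the EMR CLI:

```python
import shlex
from typing import Mapping

# Factory class named in the reply above
GLUE_FACTORY = (
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)


def build_spark_submit_opts(confs: Mapping[str, str]) -> str:
    """Join Spark settings into one '--conf key=value ...' string.

    Hypothetical helper for illustration only; the EMR CLI just
    receives the resulting string via --spark-submit-opts.
    """
    return " ".join(f"--conf {key}={value}" for key, value in confs.items())


# The two settings from the original command plus the Glue catalog option
confs = {
    "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.hadoop.hive.metastore.client.factory.class": GLUE_FACTORY,
}

opts = build_spark_submit_opts(confs)

# shlex.quote wraps the string so a POSIX shell passes it as one argument
print(f"--spark-submit-opts {shlex.quote(opts)}")
```

The printed fragment can be pasted into the emr run command in place of the hand-written --spark-submit-opts value.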

Thanks for trying out the CLI! Let me know if anything else comes up. :)

dacort commented 1 year ago

Couple other comments:

Your SparkSession will need to be created with .enableHiveSupport()

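from pyspark.sql import SparkSession

# Hive support must be enabled so Spark SQL uses the configured metastore
spark = (
    SparkSession.builder.appName("SparkSQL")
    .enableHiveSupport()
    .getOrCreate()
)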

I created a sample sql.py script in my personal cli-examples repo: https://github.com/dacort/emr-cli-examples#spark-sql--glue-data-catalog-support

soumilshah1995 commented 1 year ago

@dacort Thanks a lot, my issue has been resolved. I am also creating a crash course for the community on using EMR Serverless.
