awslabs / amazon-emr-cli

A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs
Apache License 2.0

How to Enable Glue Hive MetaStore with EMR CLI #18

Closed soumilshah1995 closed 1 year ago

soumilshah1995 commented 1 year ago

AWS has announced AWS EMR CLI

https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

I have tried it, and the CLI works great and simplifies submitting jobs. However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI?

Here is a sample of how we are submitting jobs

emr run --entry-point entrypoint.py --application-id --job-role <arn> --s3-code-uri s3:///emr_scripts/ --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" --build --wait

If you could kindly get back to us on this issue, that would be great 😃 @dacort

dacort commented 1 year ago

Hi @soumilshah1995 - The option to enable the Glue catalog is a spark-submit config option. You can add it using --spark-submit-opts.

The specific option is:

--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory

So in your case, instead of --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer", you'll have --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
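Since the quoted option string gets long, here is a minimal sketch of one way to assemble it piece by piece in a shell script before handing it to the CLI. The variable names `GLUE_FACTORY` and `SPARK_SUBMIT_OPTS` are made up for illustration and are not part of the EMR CLI; the three `--conf` values are the ones from this thread.

```shell
# Hypothetical helper variables (not defined by the EMR CLI itself).
GLUE_FACTORY="com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"

# Start with the two options from the original command...
SPARK_SUBMIT_OPTS="--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar"
SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"

# ...then append the setting that points the Hive metastore client at the Glue Data Catalog.
SPARK_SUBMIT_OPTS="${SPARK_SUBMIT_OPTS} --conf spark.hadoop.hive.metastore.client.factory.class=${GLUE_FACTORY}"

# Print the assembled string so it can be inspected before submitting.
echo "${SPARK_SUBMIT_OPTS}"
```

The resulting value can then be passed as a single quoted argument, e.g. `--spark-submit-opts "${SPARK_SUBMIT_OPTS}"`, alongside the other `emr run` flags shown in the original command.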

Thanks for trying out the CLI! Let me know if anything else comes up. :)

dacort commented 1 year ago

Couple other comments:

soumilshah1995 commented 1 year ago

@dacort Thanks a lot, my issue has been resolved. I am also creating a crash course for the community on using EMR Serverless.