Closed — soumilshah1995 closed this issue 1 year ago
Hi @soumilshah1995 - The option to enable the Glue catalog is a spark-submit config option, so you can add it using `--spark-submit-opts`. The specific option is:
```
--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```
So in your case, instead of:

```
--spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer"
```

you'll have:

```
--spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
```
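Conceptually, the final value is just the individual `--conf key=value` pairs joined with spaces, with the Glue catalog factory appended last. A minimal Python sketch of that assembly (the `build_spark_submit_opts` helper is hypothetical, not part of the emr CLI):

```python
# The Glue Data Catalog factory property, split for readability.
GLUE_CATALOG_CONF = (
    "spark.hadoop.hive.metastore.client.factory.class="
    "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"
)

def build_spark_submit_opts(confs):
    """Join key=value Spark properties into one --spark-submit-opts value."""
    return " ".join(f"--conf {c}" for c in confs)

opts = build_spark_submit_opts([
    "spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar",
    "spark.serializer=org.apache.spark.serializer.KryoSerializer",
    GLUE_CATALOG_CONF,
])
print(opts)
```

The resulting `opts` string is what you would pass, quoted, to `--spark-submit-opts`.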
Thanks for trying out the CLI! Let me know if anything else comes up. :)
A couple of other comments:
Your SparkSession will need to be created with `.enableHiveSupport()`:
```python
spark = (
    SparkSession.builder.appName("SparkSQL")
    .enableHiveSupport()
    .getOrCreate()
)
```
I created a sample `sql.py` script in my personal cli-examples repo: https://github.com/dacort/emr-cli-examples#spark-sql--glue-data-catalog-support
@dacort Thanks a lot, my issue has been resolved. I am also creating a crash course for the community on using EMR Serverless.
AWS has announced the AWS EMR CLI:
https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/
I have tried it and the CLI works great - it really simplifies submitting jobs. However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI?
Here is a sample of how we are submitting jobs (the `<application-id>` and `<bucket>` placeholders stand in for values stripped from the original post):

```
emr run /emr_scripts/ \
  --entry-point entrypoint.py \
  --application-id <application-id> \
  --job-role <arn> \
  --s3-code-uri s3://<bucket> \
  --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
  --build \
  --wait
```

If you can kindly get back to us on this issue, that would be great 😃 @dacort