aws-samples / emr-serverless-samples

Example code for running Spark and Hive jobs on EMR Serverless.
https://aws.amazon.com/emr/serverless/
MIT No Attribution

EMR Serverless: Adding Option to Boto3 for Glue Catalog #53

Closed: soumilshah1995 closed this issue 8 months ago

soumilshah1995 commented 1 year ago

AWS has announced the AWS EMR CLI:

https://aws.amazon.com/blogs/big-data/build-deploy-and-run-spark-jobs-on-amazon-emr-with-the-open-source-emr-cli-tool/

 

I have tried it, and the CLI works great; it simplifies submitting jobs.

However, could you tell us how to enable the Glue Hive metastore when submitting a job via the CLI or Boto3? I have looked at the documentation and don't see an argument for supplying the Glue Catalog option in Boto3.


Here is a sample of how we are submitting jobs with the EMR CLI:

emr run \
    --entry-point entrypoint.py \
    --application-id \
    --job-role <arn> \
    --s3-code-uri s3:///emr_scripts/ \
    --spark-submit-opts "--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar --conf spark.serializer=org.apache.spark.serializer.KryoSerializer" \
    --build \
    --wait

 

I created a GitHub issue: https://github.com/awslabs/amazon-emr-cli/issues/18

If you could kindly get back to us on this issue, that would be great 😃

Neubauer-A commented 8 months ago

I've been able to get this to work by specifying the Glue Data Catalog Hive client factory in sparkSubmitParameters. For example:

import boto3

# application_id and job_role_arn are placeholders for your own values.
client = boto3.client('emr-serverless')

job_run = client.start_job_run(
    applicationId=application_id,
    executionRoleArn=job_role_arn,
    jobDriver={
        'sparkSubmit': {
            'entryPoint': 's3://bucket/your_script.py',
            'entryPointArguments': [],
            # Points Spark's Hive metastore client at the AWS Glue Data Catalog.
            'sparkSubmitParameters': '--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory'
        },
    }
)
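
Since --spark-submit-opts is passed through as Spark submit parameters, appending the same conf there should presumably also work for the EMR CLI command from the original question. An untested sketch, with <application-id> and <bucket> as placeholders for your own values:

emr run \
    --entry-point entrypoint.py \
    --application-id <application-id> \
    --job-role <arn> \
    --s3-code-uri s3://<bucket>/emr_scripts/ \
    --spark-submit-opts "--conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory" \
    --build \
    --wait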