EMR Toolkit is a VS Code extension that makes it easier to develop Spark jobs on EMR.
The Amazon EMR Explorer allows you to browse job runs and steps across EMR on EC2, EMR on EKS, and EMR Serverless. To see the Explorer, choose the EMR icon in the Activity bar.
Note: If you do not have default AWS credentials or the `AWS_PROFILE` environment variable set, use the `EMR: Select AWS Profile` command to select your profile.
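The fallback order described above (an explicit selection, then the `AWS_PROFILE` environment variable, then the default profile) can be sketched in a few lines. This is a simplified illustration of the usual AWS convention, not the extension's actual implementation:

```python
import os

def resolve_profile(explicit_profile=None):
    """Resolve an AWS profile: explicit selection, then AWS_PROFILE, then 'default'."""
    if explicit_profile:  # e.g. chosen via the EMR: Select AWS Profile command
        return explicit_profile
    return os.environ.get("AWS_PROFILE", "default")

# An explicit selection always wins over the environment variable.
chosen = resolve_profile("my-dev-profile")
print(chosen)  # my-dev-profile
```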
The Glue Catalog Explorer displays databases and tables in the Glue Data Catalog. Right-click a table and select `View Glue Table` to see the table's columns.
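Under the hood, table columns come from the Glue `GetTable` API. A minimal sketch with boto3 follows; the database and table names are hypothetical, and the live call requires AWS credentials, so it is shown commented out:

```python
def column_names(get_table_response):
    """Extract column names from a Glue GetTable API response."""
    cols = get_table_response["Table"]["StorageDescriptor"]["Columns"]
    return [c["Name"] for c in cols]

# Live call (requires credentials); "mydb" and "mytable" are placeholders:
#   import boto3
#   resp = boto3.client("glue").get_table(DatabaseName="mydb", Name="mytable")
#   print(column_names(resp))

# Sample dict mimicking the GetTable response shape:
sample = {
    "Table": {
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "name", "Type": "string"},
            ]
        }
    }
}
print(column_names(sample))  # ['id', 'name']
```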
The toolkit provides an `EMR: Create local Spark environment` command that creates a development container based on an EMR on EKS image for the EMR version you choose. You can use this container to develop Spark and PySpark code locally that is fully compatible with your remote EMR environment.
You choose a region and EMR version you want to use, and the extension creates the relevant `Dockerfile` and `devcontainer.json`.
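As a rough sketch, the generated `Dockerfile` pulls a region-specific EMR on EKS base image. The registry below is a placeholder; the extension fills in the actual ECR registry and image tag for the region and EMR release you selected:

```dockerfile
# Placeholder registry and tag; the real values depend on the
# region and EMR release chosen in the extension.
FROM <region-specific-ecr-registry>/spark/emr-6.9.0:latest
```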
Once the container is created, follow the instructions in the `emr-local.md` file to authenticate to ECR, then use the `Dev Containers: Reopen in Container` command to build and open your local Spark environment.
You can configure AWS authentication in the container in one of three ways:

- A `.devcontainer/aws.env` file that you can populate with AWS credentials.

The EMR development container is configured to run Spark in local mode, and you can use it like any Spark-enabled environment. Inside the VS Code terminal, use the `pyspark` or `spark-shell` commands to start a local Spark session.
By default, the EMR development container also supports Jupyter. Use the `Create: New Jupyter Notebook` command to create a new Jupyter notebook. The following code snippet shows how to initialize a Spark session inside the notebook. The container environment is also configured to use the Glue Data Catalog by default, so you can run `spark.sql` commands against Glue tables.
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("EMRLocal")
    .getOrCreate()
)
```
You can deploy and run a single PySpark file on EMR Serverless with the `EMR Serverless: Deploy and run PySpark job` command, which prompts you for the required information. The following video demonstrates the flow:
https://user-images.githubusercontent.com/1512/195953681-4e7e7102-4974-45b1-a695-195e91d45124.mp4
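Conceptually, the command maps those prompts onto an EMR Serverless `StartJobRun` request. A minimal sketch of how such a request could be assembled with boto3 follows; the application ID, role ARN, and S3 path are hypothetical placeholders, and the live call (which requires credentials) is shown commented out:

```python
def build_job_run_request(application_id, execution_role_arn, entry_point):
    """Assemble StartJobRun parameters for a single PySpark script."""
    return {
        "applicationId": application_id,
        "executionRoleArn": execution_role_arn,
        "jobDriver": {"sparkSubmit": {"entryPoint": entry_point}},
    }

request = build_job_run_request(
    "00example123",                                      # hypothetical application ID
    "arn:aws:iam::123456789012:role/EMRServerlessRole",  # hypothetical role ARN
    "s3://my-bucket/jobs/main.py",                       # hypothetical script location
)
print(request["jobDriver"]["sparkSubmit"]["entryPoint"])  # s3://my-bucket/jobs/main.py

# Live call (requires credentials):
#   import boto3
#   boto3.client("emr-serverless").start_job_run(**request)
```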
I'm looking for feedback in a few different areas.
See CONTRIBUTING for more information.
This project is licensed under the Apache-2.0 License.