How to set custom EMR classification

OElesin commented 4 years ago

Is your idea related to a problem? Please describe. I have already used the library and it was super helpful. I tried setting custom EMR classifications in order to use EMR 6.0.0, but I could not find a way to set a custom classification. Is this currently possible, or does it have to be a feature request?

igorborgest commented 4 years ago

Hi @OElesin! Thanks for reaching out, really relevant topic.

Currently there is no support for "container-executor", "docker" and custom classifications.

But we will address all of them for sure.

igorborgest commented 4 years ago

Hi @OElesin!

I just added support for Docker and Custom Classification.

Docker example:

import awswrangler as wr

cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    spark_docker=True,
    spark_docker_image="{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}:{TAG}",
    ecr_credentials_step=True
)

Custom Classification example:

cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    custom_classifications=[
        {
            "Classification": "livy-conf",
            "Properties": {
                "livy.spark.master": "yarn",
                "livy.spark.deploy-mode": "cluster",
                "livy.server.session.timeout": "16h",
            },
        }
    ],
)
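
Each entry in custom_classifications above has the same shape as an EMR configuration object: a "Classification" name plus a "Properties" mapping whose values are strings. The following is a minimal sketch of a helper that builds such entries; make_classification is a hypothetical name used for illustration and is not part of awswrangler.

```python
from typing import Any, Dict


def make_classification(name: str, properties: Dict[str, Any]) -> Dict[str, Any]:
    """Build one classification entry (hypothetical helper, not an awswrangler API)."""
    normalized = {}
    for key, value in properties.items():
        # EMR expects string values; booleans are written in lowercase.
        if isinstance(value, bool):
            normalized[str(key)] = "true" if value else "false"
        else:
            normalized[str(key)] = str(value)
    return {"Classification": name, "Properties": normalized}


# Same livy-conf entry as in the example above.
livy_conf = make_classification(
    "livy-conf",
    {
        "livy.spark.master": "yarn",
        "livy.spark.deploy-mode": "cluster",
        "livy.server.session.timeout": "16h",
    },
)
custom_classifications = [livy_conf]
```

The resulting list can then be passed as the custom_classifications argument shown above.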

I also created two new tutorials about it.

To install the related branch: pip install git+https://github.com/awslabs/aws-data-wrangler.git@emr-6

Please, could you test it and give us feedback?

OElesin commented 4 years ago

This is excellent! I will give this a try.

Is there a plan to add this to the master branch?

igorborgest commented 4 years ago

@OElesin The plan is to release these features in version 1.1.0 next Friday!

It would be really nice if you could help us with some feedback. Thanks!

OElesin commented 4 years ago

@igorborgest, thanks for this. I tested it under the following conditions:

Using your example, it worked, but it only started the cluster with a master instance.

Tested with a master instance and core instances, see below:

cluster_id = wr.emr.create_cluster(
    cluster_name="my-demo-cluster-v2",
    logging_s3_path="s3://my-logs-bucket/emr-logs/",
    emr_release="emr-6.0.0",
    subnet_id="SUBNET_ID",
    emr_ec2_role="EMR_EC2_DefaultRole",
    emr_role="EMR_DefaultRole",
    instance_type_master="m5.2xlarge",
    instance_type_core="m5.2xlarge",
    instance_ebs_size_master=50,
    instance_ebs_size_core=50,
    instance_num_on_demand_master=0,
    instance_num_on_demand_core=0,
    instance_num_spot_master=1,
    instance_num_spot_core=2,
    spot_bid_percentage_of_on_demand_master=50,
    spot_bid_percentage_of_on_demand_core=50,
    spot_provisioning_timeout_master=5,
    spot_provisioning_timeout_core=5,
    spot_timeout_to_on_demand_master=False,
    spot_timeout_to_on_demand_core=False,
    python3=True,
    ecr_credentials_step=True,
    spark_docker=True,
    spark_docker_image=DOCKER_IMAGE,
    spark_glue_catalog=True,
    hive_glue_catalog=True,
    presto_glue_catalog=True,
    debugging=True,
    applications=["Hadoop", "Spark", "Hive", "Zeppelin", "Livy"],
    visible_to_all_users=True,
    maximize_resource_allocation=True,
    keep_cluster_alive_when_no_steps=True,
    termination_protected=False,
    spark_pyarrow=True
)

Error message:

/bin/bash: docker: command not found
Command exiting with ret '127'

igorborgest commented 4 years ago

Hi @OElesin, thanks a lot for the quick response!

You are right, I just figured out that EMR does not have Docker installed on the master node, only on the core ones. Because of that, we will not be able to refresh the ECR credentials programmatically without an external file on S3.
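
The "command not found" message with exit status 127 reported above is what a shell returns when a binary is missing from the PATH. As a rough sketch (not part of awswrangler, and docker_available is a hypothetical helper name), a script running on a node could check for the docker executable before trying to use it:

```python
import shutil


def docker_available() -> bool:
    """Return True if a `docker` executable is found on this node's PATH."""
    return shutil.which("docker") is not None


if __name__ == "__main__":
    # On a node without Docker, invoking `docker` from bash would fail with
    # "command not found" and exit status 127, as in the error above.
    if docker_available():
        print("docker found on PATH")
    else:
        print("docker not found on PATH")
```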

I revisited the implementation and the tutorial and now the expected usage is:

import awswrangler as wr

cluster_id = wr.emr.create_cluster(subnet_id="SUBNET_ID", docker=True)

wr.emr.submit_ecr_credentials_refresh(cluster_id, path="s3://bucket/emr/")

wr.emr.submit_spark_step(
    cluster_id,
    "s3://bucket/app.py",
    docker_image=DOCKER_IMAGE
)

What do you think?

P.S. The custom_classifications usage remains the same.
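
Since custom_classifications is unchanged, it can presumably be combined with the revised Docker flag in a single call. The sketch below only assembles the keyword arguments; it assumes that create_cluster accepts docker and custom_classifications together, which this thread suggests but does not show explicitly. build_cluster_kwargs and launch are hypothetical helper names.

```python
from typing import Any, Dict, List


def build_cluster_kwargs(
    subnet_id: str, classifications: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Assemble create_cluster keyword arguments (hypothetical helper)."""
    return {
        "subnet_id": subnet_id,
        "docker": True,  # flag name taken from the revised usage in this thread
        "custom_classifications": classifications,
    }


def launch(subnet_id: str, classifications: List[Dict[str, Any]]) -> str:
    """Create the cluster; needs awswrangler installed and AWS credentials."""
    import awswrangler as wr  # imported lazily so the helper above works offline

    return wr.emr.create_cluster(**build_cluster_kwargs(subnet_id, classifications))


kwargs = build_cluster_kwargs(
    "SUBNET_ID",
    [{"Classification": "livy-conf", "Properties": {"livy.spark.master": "yarn"}}],
)
```

Calling launch("SUBNET_ID", ...) would start a real cluster, so it is left uninvoked here.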

igorborgest commented 4 years ago

Available in version 1.1.0.

aws / aws-sdk-pandas

How to set custom EMR classification #193