Hi @OElesin! Thanks for reaching out, really relevant topic.
Currently there is no support for the "container-executor", "docker", or custom classifications.
But we will address them all for sure.
Hi @OElesin!
I just added support for Docker and Custom Classification.
Docker example:
import awswrangler as wr

cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    spark_docker=True,  # run Spark jobs inside Docker containers
    spark_docker_image="{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}:{TAG}",
    ecr_credentials_step=True,  # add a step that fetches ECR credentials
)
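If it helps, here is one way to assemble that image URI at runtime instead of hard-coding the account and region. A minimal sketch; the repository name "my-image" and tag "latest" are placeholders:

import boto3

account_id = boto3.client("sts").get_caller_identity()["Account"]
region = boto3.session.Session().region_name
# "my-image" and "latest" are hypothetical placeholders.
spark_docker_image = f"{account_id}.dkr.ecr.{region}.amazonaws.com/my-image:latest"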
Custom Classification example:
cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    custom_classifications=[
        {
            "Classification": "livy-conf",
            "Properties": {
                "livy.spark.master": "yarn",
                "livy.spark.deploy-mode": "cluster",
                "livy.server.session.timeout": "16h",
            },
        }
    ],
)
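Since "container-executor" came up above: the same custom_classifications argument should also be able to carry the nested Docker registry settings described in the EMR documentation. A hedged sketch; the registry list is a placeholder:

cluster_id = wr.emr.create_cluster(
    subnet_id="SUBNET_ID",
    custom_classifications=[
        {
            "Classification": "container-executor",
            "Configurations": [
                {
                    "Classification": "docker",
                    "Properties": {
                        # Placeholder registry list; add your own ECR registry here.
                        "docker.trusted.registries": "local,centos,{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com",
                    },
                }
            ],
        }
    ],
)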
I also created two new tutorials about it:
To install the related branch:
pip install git+https://github.com/awslabs/aws-data-wrangler.git@emr-6
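A quick sanity check after installing, assuming the package exposes __version__ (which awswrangler does):

import awswrangler as wr

print(wr.__version__)  # confirms the branch build imported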
Please, could you test it and give us feedback?
This is excellent! I will give this a try.
Is there a plan to add this to the master branch?
@OElesin The plan is to release these features in version 1.1.0
next Friday!
It would be really nice if you could help us with some feedback. Thanks!
@igorborgest, thanks for this. I tested it with the following configuration:
cluster_id = wr.emr.create_cluster(
    cluster_name="my-demo-cluster-v2",
    logging_s3_path="s3://my-logs-bucket/emr-logs/",
    emr_release="emr-6.0.0",
    subnet_id="SUBNET_ID",
    emr_ec2_role="EMR_EC2_DefaultRole",
    emr_role="EMR_DefaultRole",
    instance_type_master="m5.2xlarge",
    instance_type_core="m5.2xlarge",
    instance_ebs_size_master=50,
    instance_ebs_size_core=50,
    instance_num_on_demand_master=0,
    instance_num_on_demand_core=0,
    instance_num_spot_master=1,
    instance_num_spot_core=2,
    spot_bid_percentage_of_on_demand_master=50,
    spot_bid_percentage_of_on_demand_core=50,
    spot_provisioning_timeout_master=5,
    spot_provisioning_timeout_core=5,
    spot_timeout_to_on_demand_master=False,
    spot_timeout_to_on_demand_core=False,
    python3=True,
    ecr_credentials_step=True,
    spark_docker=True,
    spark_docker_image=DOCKER_IMAGE,
    spark_glue_catalog=True,
    hive_glue_catalog=True,
    presto_glue_catalog=True,
    debugging=True,
    applications=["Hadoop", "Spark", "Hive", "Zeppelin", "Livy"],
    visible_to_all_users=True,
    maximize_resource_allocation=True,
    keep_cluster_alive_when_no_steps=True,
    termination_protected=False,
    spark_pyarrow=True,
)
Error message:
/bin/bash: docker: command not found
Command exiting with ret '127'
Hi @OElesin, thanks a lot for the quick response!
You are right. I just figured out that EMR does not have Docker installed on the master node, only on the core nodes. Because of that, we will not be able to refresh the ECR credentials programmatically without an external file on S3.
I revisited the implementation and the tutorial, and now the expected usage is:
import awswrangler as wr

DOCKER_IMAGE = "{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/{IMAGE_NAME}:{TAG}"

cluster_id = wr.emr.create_cluster("SUBNET_ID", docker=True)

# Uploads a credentials refresh script to the given S3 path and submits it
# as a step, working around Docker being absent on the master node.
wr.emr.submit_ecr_credentials_refresh(cluster_id, path="s3://bucket/emr/")

wr.emr.submit_spark_step(
    cluster_id,
    "s3://bucket/app.py",
    docker_image=DOCKER_IMAGE,
)
What do you think?
P.S. The custom_classifications usage stays the same.
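For anyone following along, once the step is submitted you can also watch it finish. A minimal sketch continuing the snippet above (cluster_id and DOCKER_IMAGE come from there), assuming submit_spark_step returns the step ID and get_step_state reports its status:

import time

step_id = wr.emr.submit_spark_step(cluster_id, "s3://bucket/app.py", docker_image=DOCKER_IMAGE)

# Poll until the step reaches a terminal state (illustrative; 30 s interval).
while wr.emr.get_step_state(cluster_id, step_id) not in ("COMPLETED", "FAILED", "CANCELLED"):
    time.sleep(30)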
Available on version 1.1.0
Is your idea related to a problem? Please describe.
I have already made use of the library and it has been super helpful. I tried setting custom EMR classifications so as to make use of EMR 6.0.0, but I could not set custom classifications. Is it possible to do this currently, or does it have to be a feature request?