stepfunctions-tasks: EmrCreateClusterProps clusterRole should be instance profile not IRole

kevinfguo commented 4 years ago

Seems that the API for AWS CDK with AWS Step Function to create an EMR cluster is wrong.

With an EMR service object, we can provide RunJobFlowInput to runJobFlow, that specified JobFlowRole: instanceProfile.ref, with instanceProfile as a profile with a single role:

// Example role, our role is constructed with more permissions
const myRole = new iam.Role(stack, 'my-role', {
    assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
    managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonElasticMapReduceforEC2Role')],
  });

const instanceProfile = new iam.CfnInstanceProfile(stack, `emr-ec2-instance-profile`, {
    roles: [myRole.roleName],
  });

With stepfunctionsTasks.EmrCreateCluster, however, the required EmrCreateClusterProps, have renamed JobFlowRole into clusterRole, and expect clusterRole to be of type iam.IRole instead of a reference to the previous iam.CfnInstanceProfile. Attempting to pass the IAM Role that the profile was created with (e.g. myRole), results in an Invalid InstanceProfile error when running Step Function.

This seems to indicate that despite trying to pass a IRole, what the API should be requesting is the reference to a CfnInstanceProfile.

Reproduction Steps

// Example role, our role is constructed with more permissions
const myRole = new iam.Role(stack, 'my-role', {
    assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
    managedPolicies: [iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonElasticMapReduceforEC2Role')],
  });

const instanceProfile = new iam.CfnInstanceProfile(stack, `emr-ec2-instance-profile`, {
    roles: [myRole.roleName],
  });

const startClusterTask = new steps.Task(stack, 'Create Cluster', {
    task: new tasks.EmrCreateCluster({
      name: 'myCluster',
      visibleToAllUsers: true,
      releaseLabel: 'emr-6.0.0',
      instances: {
        masterInstanceType: 'm5.xlarge',
        slaveInstanceType: 'm5.xlarge',
        instanceCount: 3,
      },
      clusterRole: myRole, // aka JowFlowRole or instanceProfile.ref
      configurations: [
        {
          classification: 'spark-env',
          properties: {
            PYSPARK_PYTHON: '/usr/bin/python3',
          },
        },
      ],
      applications: [
        {
          name: 'Spark',
        },
      ],
    }),
  });

const stopClusterTask = new steps.Task(stack, 'Task', {
    task: new tasks.EmrTerminateCluster({
      clusterId: 'myCluster',
    }),
  });

const workflow = steps.Chain.start(startClusterTask).next(stopClusterTask);

const stateMachine = new steps.StateMachine(stack, 'myStateMachine', {
    stateMachineType: steps.StateMachineType.STANDARD,
    definition: workflow,
  });

Error Log

Attempting to run the step function will yield:

{
  "error": "EMR.AmazonElasticMapReduceException",
  "cause": "Invalid InstanceProfile: ${profile}. (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; Request ID: ${Request})"
}

Environment

CLI Version : aws-cli/1.16.227
Framework Version: 1.37.0
OS : MacOS Catalina 10.15.4
Language : Typescript

Other

The type of clusterRole in EmrCreateClusterProps should not be IRole, but a reference to the instance profile, as expected in the EMR specification.

This is :bug: Bug Report

kevinfguo commented 4 years ago

I was able to start a cluster, by setting the instance profile's name to the same name as the role:

const instanceProfile = new iam.CfnInstanceProfile(stack, `emr-ec2-instance-profile`, {
    roles: [myRole.roleName],
    instanceProfileName: myRole.roleName,
  });

Then redeployed my stack and executed my step function. The EMR cluster successfully was created.

Viewing the associated CloudFormation, the Physical ID used for the instance profile is set by using instanceProfileName, otherwise one is generated. It appears that by doing this, providing the IRole's roleName will coincide with the instanceProfileName.

This is neither what one would expect to happen, nor is this obvious from any documentation that it is essentially required to name the instance profile the same as a role. I am also not clear now if we can provide instance profiles with more than one role listed?

This still seems like a bug in both the CDK API and its implementation.

shivlaks commented 4 years ago

@kevinfguo thanks for the detailed bug report and reproduction steps! I'll work through it and take a closer look. stay tuned!

shivlaks commented 4 years ago

@kevinfguo We would not want to expose CfnInstanceProfile as a property type, the type should be an IRole as that's consistent with how we accept instance profiles in other areas (i.e. .an auto-scaling group). For now, your current workaround is reasonable.

I believe you're right that there's a bug in the CDK in how we handle the role that was passed in. If a role is not passed in we create the instance profile. We don't seem to be creating an instance profile if a role was passed in.

heads up that in a step towards getting this integration towards stability, we will be making the integrations constructs rather than embedding a task with integration properties inside of a Task state. A draft PR of upcoming changes to the Emr service integration to a more consistent style is #8177

For now, the EMR and SageMaker service integrations are experimental as they are not backed by a higher level construct as they don't exist for aws-emr and aws-sagemaker in the CDK just yet.

I should be able to fix this forward after we've completed the change to make the integration tasks constructs rather than classes.

Alternatively, if the CDK is getting in your way - there's also an escape hatch we recently introduced so you can directly use Amazon States Language to define a task. I have noticed a couple of other limitations in other service integrations where this might be helpful while we expand on our coverage and align our APIs. You can learn more about how the escape hatch can be used if you run into a limitation without a workaround.

I'm dropping this issue down to a p2 as creating an instance profile will allow you to work around the current limitation. Let me know if you have any other questions or concerns here and I'm happy to dig a little more.

github-actions[bot] commented 3 years ago

This issue has not received any attention in 1 year. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

kevinfguo commented 3 years ago

Any updates on this?

shivlaks commented 3 years ago

@BenChaimberg - now that the @experimental annotations have been dropped from all constructs, what's the path forward for adding support here in a stable manner?

The Emr constructs are still not backed by L2s, which is one of the reasons it was marked as experimental. Would we need to add new properties so that we remain backwards compatible?

patruff commented 3 years ago

Why not revert back or allow instance profile to be set in the construct?

BenChaimberg commented 3 years ago

Yes, I think unfortunately we'd need to add a new property here. We can go back to jobFlowRole, which is the name from the API.

If no role is provided for clusterRole or jobFlowRole, then we create a role and create an instance profile with the same name as the role.
If a role is just provided for clusterRole, we do not create a role or an instance profile.
If a role is just provided for jobFlowRole, we create an instance profile but do not create a role.
If roles are provided for clusterRole and jobFlowRole, then we throw an error.

Appropriate documentation should be added to clusterRole to explain that a separate instance profile is required to be created, but regardless the property is deprecated and jobFlowRole should be used instead.

BenChaimberg commented 3 years ago

@patruff We cannot simply create an instance profile for a role passed in through clusterRole, for the following reasons. Any customer that is currently using EmrCreateCluster and passing in a role needs to create an instance profile with the same name in order for their state machine to succeed. If we create an instance profile for them, it will fail deployment since an instance profile with the same name already exists. This would be a breaking change, obviously, which we cannot allow since the module is marked as "stable". If this were a regression of previously working behaviour, we could revert, but unfortunately this was broken when the module was marked as stable, meaning it is a (undocumented) requirement of our stable API that the user to create their own instance profile.

BenChaimberg commented 3 years ago

As Shiv mentioned above, this issue is marked as p2, meaning that we are unable to work on this immediately. I am unassigning myself, but please feel free to work on this issue yourself and request my review if you submit a PR for the changes I outlined above! See CONTRIBUTING.md for guidelines.

We use +1s to help prioritize our work, and are happy to reevaluate the priority of this issue based on community feedback. You can reach out to the cdk.dev community on Slack to solicit support for reprioritization.

github-actions[bot] commented 2 years ago

This issue has not received any attention in 1 year. If you want to keep this issue open, please leave a comment below and auto-close will be canceled.

kevinfguo commented 2 years ago

Two years here after this issue has been opened, This happened back at framework v1 and I see the project is now at v2, but based on the current docs it looks like nothing has changed about the API; not sure if this is still a problem.

If it is, and there is such an issue with making this change, as two maintainer have mentioned, I would recommend at least updating the docs to make note of these necessary constraints before closing this out.

chrisammon3000 commented 1 year ago

I'm trying to deploy a cluster with managed autoscaling and am running into an error during deployment that I think is related to this:

9:47:56 PM | CREATE_FAILED        | AWS::EMR::Cluster         | EmrCluster01806AD3
Invalid InstanceProfile: EmrClusterEMRScalingInstanceProfileC3BCA307 (Service: AmazonElasticMapReduce; Status Code: 400; Error Code: ValidationException; ...

Here is the full code:

export class EmrCluster extends Construct {

    public readonly cluster: emr.CfnCluster;

    constructor(scope: Construct, id: string, props: EmrClusterProps) {
        super(scope, id);

        const emrScalingInstanceRole = new iam.Role(this, 'EmrScalingInstanceRole', {
            assumedBy: new iam.ServicePrincipal('ec2.amazonaws.com'),
            managedPolicies: [
                iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonElasticMapReduceforEC2Role'),
            ]
        });

        // Blocked: Invalid InstanceProfile when deploying with CDK
        const emrScalingInstanceProfile = new iam.CfnInstanceProfile(this, 'EMRScalingInstanceProfile', {
            roles: [emrScalingInstanceRole.roleName],
            instanceProfileName: emrScalingInstanceRole.roleName
        });
        emrScalingInstanceProfile.node.addDependency(emrScalingInstanceRole);

        const emrScalingClusterServiceRole = new iam.Role(this, 'EmrScalingClusterServiceRole', {
            assumedBy: new iam.ServicePrincipal('elasticmapreduce.amazonaws.com'),
            managedPolicies: [
                iam.ManagedPolicy.fromAwsManagedPolicyName('service-role/AmazonElasticMapReduceRole'),
            ]
        });

        const masterInstanceGroupConfig: emr.CfnCluster.InstanceGroupConfigProperty = {
            name: 'Master',
            instanceCount: 1,
            instanceType: 'm5.xlarge',
            market: 'ON_DEMAND'
        }
        const coreInstanceGroupConfig: emr.CfnCluster.InstanceGroupConfigProperty = {
            name: 'Core',
            instanceCount: 1,
            instanceType: 'r5.xlarge',
            market: 'ON_DEMAND'
        }
        const taskInstanceGroupConfig: emr.CfnCluster.InstanceGroupConfigProperty = {
            name: 'Task',
            instanceCount: 1,
            instanceType: 'r5.xlarge',
            market: 'ON_DEMAND',
        }

        this.cluster = new emr.CfnCluster(this, 'EmrCluster', {
            name: props.clusterName,
            releaseLabel: 'emr-6.2.0',
            applications: [
                {
                    name: 'Hadoop',
                },
                {
                    name: 'Hive',
                },
                {
                    name: 'Tez',
                },
                {
                    name: 'Hue',
                },
                {
                    name: 'Spark',
                },
                {
                    name: 'Livy',
                },
            ],
            instances: {
                masterInstanceGroup: masterInstanceGroupConfig,
                coreInstanceGroup: coreInstanceGroupConfig,
                taskInstanceGroups: [taskInstanceGroupConfig],
                ec2SubnetId: props.subnetId,
                keepJobFlowAliveWhenNoSteps: true,
                terminationProtected: false,
            },
            serviceRole: emrScalingClusterServiceRole.roleName,
            jobFlowRole: emrScalingInstanceProfile.logicalId,
            visibleToAllUsers: true,
            managedScalingPolicy: {
                computeLimits: {
                    unitType: 'Instances',
                    minimumCapacityUnits: 1,
                    maximumCapacityUnits: 5,
                    maximumOnDemandCapacityUnits: 2
                },
            },
        });
    }
}

I tried @kevinfguo's workaround but no success. I also tried declaring the role as a dependency to the profile:

emrScalingInstanceProfile.node.addDependency(emrScalingInstanceRole);

But I can't get it to work, going to have to use CloudFormation for this.

aws / aws-cdk