Step Functions - support calling nested dynamic State Machine

sabarnac commented 4 years ago

Description

Note: Bug occurs with the experimental library @aws-cdk/aws-stepfunctions

When creating a state machine with a nested workflow task, whose StateMachineArn is a dynamic value set using a JSONPath query, the stepfunctions CDK library recognizes the nested workflow task ARN as dynamic and suffixes StateMachineArn with .$ correctly.

However, the library doesn't realize that the ARN provided is a dynamic value when generating default policies for the state machine. As a result, it generates a policy statement for the state machine with the action "states:StartExecution" so that the state machine can execute nested workflows (which is a good thing), but sets the resource field to the dynamic value (which is a bad thing).

Reproduction Steps

Reproduction Repo: https://github.com/sabarnac/cdk-stepfunctions-bug-repo

Steps:

Run npm run cdk synth/cdk synth
Open ./cdk.out/CdkStepfunctionsBugRepoStack.template.json
Search for SampleStateMachineRoleDefaultPolicy

Check the first policy statement

{
"Action": "states:StartExecution",
"Effect": "Allow",
"Resource": "$.dynamicArn"
}

Stack Code:

import * as sfn from "@aws-cdk/aws-stepfunctions";
import * as tasks from "@aws-cdk/aws-stepfunctions-tasks";
import * as cdk from "@aws-cdk/core";

export class CdkStepfunctionsBugRepoStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const stepFunctionsTask = new sfn.Task(this, "NestedWorkflowTask", {
      task: new tasks.StartExecution(
        sfn.StateMachine.fromStateMachineArn(this, "NestedWorkflowStateMachine", sfn.Data.stringAt("$.dynamicArn")),
        {
          input: {
            "input.$": "$.dynamicInput",
            "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id",
          },
          name: sfn.Data.stringAt("$.dynamicName"),
          integrationPattern: sfn.ServiceIntegrationPattern.SYNC,
        },
      ),
      resultPath: "$.workflowResult",
    })

    new sfn.StateMachine(this, "SampleStateMachine", {
      definition: stepFunctionsTask,
    });
  }
}

Error Log

None.

Environment

CLI Version : 1.22.0 (build 309ac1b)
Framework Version:
- @aws-cdk/aws-stepfunctions-tasks: 1.22.0
- @aws-cdk/aws-stepfunctions: 1.22.0
- @aws-cdk/core: 1.22.0
OS : Mac OS Mojave [10.14.6 (18G2022)]
Language : TypeScript

Other

No other details

This is :bug: Bug Report

sabarnac commented 4 years ago

I would consider this an edge-case to some extent, since I wouldn't think a lot of people would have such a use-case.

There is a hacky workaround for those who want it:

const UNSAFE__removeBadDefaultPolicyDocumentStatement = (stateMachine: StateMachine) => {
  const stateMachineDefaultPolicyDocument = (stateMachine.role as any).defaultPolicy.document;
  stateMachineDefaultPolicyDocument.statements = stateMachineDefaultPolicyDocument.statements.filter(
    (statement: any) => statement.action.indexOf("states:StartExecution") === -1,
  );
};

You can pass in your own role when creating the state machine with a proper policy document statement, and use this function to remove the bad default one.

I've committed the bug workaround code to the reproduction repo under branch bug-workaround

nija-at commented 4 years ago

@sabarnac - Thanks for filing the issue.

What would you expect the Resource property of the IAM policy to contain in such a case?

nija-at commented 4 years ago

Looking at your sample code once again, I think the issue is that the API, specifically StateMachine.fromStateMachineArn(), was never designed for dynamic state machine ARNs. All of CDK's import APIs (fromXXXArn()) are designed for ARNs being static at the time of CloudFormation deployment.

For your use case, I would suggest to use an InvokeFunction step function task to invoke a lambda function that calls StepFunction's StartExecution API using the AWS SDK. With this you will have control over what permissions need to be attached to the Lambda function. Alternatively, you can extend your own IStepFunctionsTask similar to the model of StartExecution and implement your logic there.

I would like to turn this into a feature request for the CDK to allow the StartExecution task to support dynamic ARNs.

sabarnac commented 4 years ago

The fact that when providing a dynamic value to StateMachine.fromStateMachineArn() resulted in the ExecutionName field being appended with .$ made me assume that dynamic ARN support was intended (to a certain extent). If this was unintended, then it makes a lot more sense why the policy statement bug exists.

The issue with having a Lambda function manage calling nested workflows is that the nested workflow needs to be executed synchronously, and can take an arbitrary amount of time to execute. Having a lambda start a new workflow, check its status, and then return a response to continue the workflow execution feels unnecessary to implement since Step Functions can do this for you instead, and can be provided the same permissions as I would have to provide to the Lambda anyways.

In my use-case, the nested workflows cannot notify the parent workflow about their progress and completion and cannot be modified in any way either. They are written to be a completely independent, and should have no idea about any orchestration that goes around it (if any workflows have started it, whether it should notify/start any workflows on completion, etc).

We've already figured out the permissions we would need to provide to let Step Functions call the nested workflows, so we can define the policies separately and attach it to the Step Functions state machine.

The suggestion for extending IStepFunctionsTask and adding functionality there is a good point, and will look into it. I'll add it to my reproduction repo if I get around to implementing it. Are there any guides or references I could use to help with this?

I think dynamic ARN support in CDK would be a pretty useful feature (especially since Step Functions supports it, or at least supports the use-case I have shown. Tested it through the AWS console), so I agree with turning this into a feature request.

As for what should be set for Resource, I think in the cases of dynamic ARNs, the states:StartExecution policy statement should not be created at all. Since the ARN is dynamic, the only possible value you can set for Resource is *, or at the very least some resource value/policy statement that restricts the ability to start executions to within the same AWS account + region as the workflow itself. I wouldn't really consider this the safest option.

The safest option is to expect the user to provide the correct policy statement instead. The user who is creating such a workflow would have the knowledge of the state machines that this workflow would end up executing, and so should provide the policy statement that provides the ability to call them.

However, if you need to have some sort of default policy statement kept in case the user doesn't provide their own policy statement, then having the resource to something like arn:aws:states:${AWS::Region}:${AWS::Account}:stateMachine:* so that permissions are limited to within the same account + region should be ok? However, it should be removed if the user does provide their own states:StartExecution policy statement. I would still prefer that this not be done to be honest.

nija-at commented 4 years ago

The suggestion for extending IStepFunctionsTask and adding functionality there is a good point, and will look into it. I'll add it to my reproduction repo if I get around to implementing it. Are there any guides or references I could use to help with this?

The aws-stepfunctions-tasks module should be a good one.

Thanks for your thoughts on the policy and your use case. I'm switching this issue from being a bug to a feature request!

milesgranger commented 2 years ago

I also experience this, but the workaround doesn't appear to work for Python, namely nothing involving default policy attribute on the state machine's role.

Furthermore, when trying to circumvent this by providing the role directly, the documentation for the role parameter says default behavior is to create a role. However, even when providing the role directly, it still creates the default policy, with the wrong resource, and attaches it to the role provided:

MyStateMachineRoleDefaultPolicy1A388074:
  Type: AWS::IAM::Policy
  Properties:
    PolicyDocument:
      Statement:
        - Action: states:StartExecution
          Effect: Allow
          Resource: $.stateMachineArn  # <- Fail.
      Version: "2012-10-17"
    PolicyName: MyStateMachineRoleDefaultPolicy1A388074

nikita-sheremet-clearscale commented 2 years ago

Any updates? Any workarounds about creating state machine which calls another one? How generate "StateMachineArn.$" ? Like that:

{  
   "Type":"Task",
   "Resource":"arn:aws:states:::states:startExecution.sync",
   "Parameters":{  
      "Input":{
        "Comment": "Hello world!",
        "AWS_STEP_FUNCTIONS_STARTED_BY_EXECUTION_ID.$": "$$.Execution.Id"
       },
      "StateMachineArn.$":"$.arn",
   },
   "End":true
}

sabarnac commented 2 years ago

I used the following code to remove the default policy statements that the Step Function CDK generates for the state machine when providing a dynamic ARN.

private readonly policyActionsToFilter = [
    "states:StartExecution",
    "states:DescribeExecution",
    "states:StopExecution",
  ]

private UNSAFE__hackStateMachineDefaultPolicyDocument = (stateMachine: StateMachine) => {
    const stateMachineDefaultPolicyDocument = (stateMachine.role as any).defaultPolicy.document;
    stateMachineDefaultPolicyDocument.statements = stateMachineDefaultPolicyDocument.statements.filter(
      (statement: any) => this.policyActionsToFilter.every((action) => !statement.action.includes(action)),
    );
  };

After this, then you can add in your own policy statement for what other state machine ARNs the current one is allowed to call. Make sure to provide policy statements for all the actions that were removed.

milesgranger commented 2 years ago

@sabarnac I don't believe that works in Python world.

hutchy2570 commented 1 year ago

Changes are needed to the workaround from @sabarnac to work with CDK 2.32.0 (action is now _action, and is of type OrderedSet so we need to access the internal array with _action.direct() to do the .includes() check).

private readonly policyActionsToFilter = [
    "states:StartExecution",
    "states:DescribeExecution",
    "states:StopExecution",
  ]

private UNSAFE__hackStateMachineDefaultPolicyDocument = (stateMachine: StateMachine) => {
    const stateMachineDefaultPolicyDocument = (stateMachine.role as any).defaultPolicy.document;
    stateMachineDefaultPolicyDocument.statements = stateMachineDefaultPolicyDocument.statements.filter(
      (statement: any) => this.policyActionsToFilter.every((action) => !statement._action.direct().includes(action)),
    );
  };

aws / aws-cdk