awslabs / generative-ai-cdk-constructs

AWS Generative AI CDK Constructs are sample implementations of AWS CDK for common generative AI patterns.
https://awslabs.github.io/generative-ai-cdk-constructs/
Apache License 2.0

(CustomSageMakerEndpoint): (node.add_dependency not working) #432

Closed: airmonitor closed this issue 1 week ago

airmonitor commented 3 months ago

Describe the bug

The generated CloudFormation stack containing the two resources below fails to deploy:

        iam_role = iam.Role(
            self,
            id="sagemaker_execution_role",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")],
            role_name=f"{config_vars.stage}-{config_vars.project}-{model_name}-execution-role",
        )

        custom_sagemaker_endpoint = CustomSageMakerEndpoint(
            self,
            id=model_name,
            container=ContainerImage.from_ecr_repository(repository=self.whisper_ecr_repo, tag=config_vars.image_tag),
            enable_operational_metric=True,
            endpoint_name=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            instance_count=1,
            instance_type=SageMakerInstanceType.ML_G5_2_XLARGE,  # type: ignore
            model_data_url=f"s3://{self.model_data_s3_bucket.bucket_name}/model.tar.gz",
            model_id=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            role=iam_role,
            environment={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": f"whisper-{model_name}"},
        )

The stack is rolled back because the iam_role is not ready by the time custom_sagemaker_endpoint tries to use it.

Adding an explicit dependency via the code below does not help either:

custom_sagemaker_endpoint.node.add_dependency(iam_role)

because the generated CloudFormation output is missing a DependsOn entry in the resource manifest for "AWS::SageMaker::EndpointConfig".

Expected Behavior

The stack deployment does not fail because of a missing or not-yet-ready IAM role.

Current Behavior

Could not access model data at s3://dev-backend-whisper-model-data/model.tar.gz. Please ensure that the role "arn:aws:iam::112233:role/dev-backend-whisper-large-v2-vanilla-execution-role" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 7570c458-14c4-4c9a-bc97-a5287e74faaa; Proxy: null)

Reproduction Steps

Using the following code

        iam_role = iam.Role(
            self,
            id="sagemaker_execution_role",
            assumed_by=iam.ServicePrincipal("sagemaker.amazonaws.com"),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AmazonSageMakerFullAccess")],
            role_name=f"{config_vars.stage}-{config_vars.project}-{model_name}-execution-role",
        )

        custom_sagemaker_endpoint = CustomSageMakerEndpoint(
            self,
            id=model_name,
            container=ContainerImage.from_ecr_repository(repository=self.whisper_ecr_repo, tag=config_vars.image_tag),
            enable_operational_metric=True,
            endpoint_name=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            instance_count=1,
            instance_type=SageMakerInstanceType.ML_G5_2_XLARGE,  # type: ignore
            model_data_url=f"s3://{self.model_data_s3_bucket.bucket_name}/model.tar.gz",
            model_id=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            role=iam_role,
            environment={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": f"whisper-{model_name}"},
        )

synthesize the template and deploy via cdk deploy.

Possible Solution

The add_dependency method should correctly populate the DependsOn section in the generated template.
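
In the meantime, a possible workaround sketch (untested on my side; the child-resource lookup below is an assumption about the construct's internals, not a documented API) would be to attach the dependency directly to the underlying L1 SageMaker resources via the CloudFormation escape hatch:

    # Walk the construct tree and add an explicit DependsOn on the role to the
    # low-level SageMaker resources synthesized by the construct.
    from aws_cdk import CfnResource

    cfn_role = iam_role.node.default_child
    assert isinstance(cfn_role, CfnResource)  # the underlying AWS::IAM::Role

    for child in custom_sagemaker_endpoint.node.find_all():
        if isinstance(child, CfnResource) and child.cfn_resource_type in (
            "AWS::SageMaker::Model",
            "AWS::SageMaker::EndpointConfig",
        ):
            # CfnResource.add_dependency emits a DependsOn entry in the template
            child.add_dependency(cfn_role)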

Additional Information/Context

No response

CDK CLI Version

2.140.0 (build 46168aa)

Framework Version

No response

Node.js Version

v18.20.2

OS

macOS Sonoma 14.4.1

Language

Python

Language Version

No response

Region experiencing the issue

us-east-1

Code modification

N/A

Other information

No response

Service quota

krokoko commented 3 months ago

Hi @airmonitor, thank you for reporting this. Did you have a look at the associated sample we have for this construct? https://github.com/aws-samples/generative-ai-cdk-constructs-samples/blob/main/samples/sagemaker_custom_endpoint/lib/sagemaker_custom_endpoint-stack.ts

airmonitor commented 3 months ago

Hello @krokoko

Thanks for reaching out.

I will try your suggestion.

Do you also have a working example for adding Application Auto Scaling?

I need scaling there so that the number of endpoint instances scales based on a metric.

I've tried Application Auto Scaling, but it failed because the add_dependency method has no effect.

In short, endpoint creation was still in progress when Application Auto Scaling tried to attach to it.

AWS Support added a DependsOn section to the synthesized template, but I was not able to achieve the same with CDK because add_dependency does not propagate it.
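
For context, this is roughly what I was attempting (a sketch only; the "AllTraffic" variant name, the capacities, and the metric target are my assumptions, not values confirmed by the construct):

    # Register the endpoint variant with Application Auto Scaling and track
    # invocations per instance. Values below are placeholders.
    from aws_cdk import aws_applicationautoscaling as appscaling

    scalable_target = appscaling.ScalableTarget(
        self,
        id="whisper-scalable-target",
        service_namespace=appscaling.ServiceNamespace.SAGEMAKER,
        resource_id=f"endpoint/{config_vars.stage}-{config_vars.project}-{model_name}/variant/AllTraffic",
        scalable_dimension="sagemaker:variant:DesiredInstanceCount",
        min_capacity=1,
        max_capacity=4,
    )
    scalable_target.scale_to_track_metric(
        "invocations-per-instance",
        target_value=60,
        predefined_metric=appscaling.PredefinedMetric.SAGEMAKER_VARIANT_INVOCATIONS_PER_INSTANCE,
    )
    # The ordering is exactly the problem: the scaling target must be created
    # only after the endpoint exists, which is what add_dependency should ensure.
    scalable_target.node.add_dependency(custom_sagemaker_endpoint)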

Regards Tom

airmonitor commented 3 months ago

@krokoko

I've tested the proposed solution of using the IAM role created by CustomSageMakerEndpoint, but apparently I'm doing something wrong here.

For the scenario below I'm getting errors from CloudFormation:

        custom_sagemaker_endpoint = CustomSageMakerEndpoint(
            self,
            id=model_name,
            container=ContainerImage.from_ecr_repository(repository=self.whisper_ecr_repo, tag=config_vars.image_tag),
            enable_operational_metric=True,
            endpoint_name=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            instance_count=1,
            instance_type=SageMakerInstanceType.ML_G5_2_XLARGE  # type: ignore
            if config_vars.stage == "prod"
            else SageMakerInstanceType.ML_G5_XLARGE,  # type: ignore
            model_data_url=f"s3://{self.model_data_s3_bucket.bucket_name}/model.tar.gz",
            model_id=f"{config_vars.stage}-{config_vars.project}-{model_name}",
            environment={"SAGEMAKER_TRITON_DEFAULT_MODEL_NAME": f"whisper-{model_name}"},
        )

        self.whisper_ecr_repo.grant_pull(custom_sagemaker_endpoint.role) # ECR Repository
        self.model_data_s3_bucket.grant_read_write(custom_sagemaker_endpoint.role) # S3 bucket

Error from CloudFormation

9:56:57 AM | CREATE_FAILED | AWS::SageMaker::Model | devbackendwhisperl...odellargev2vanilla
Could not access model data at s3://s3-bucket/model.tar.gz. Please ensure that the role "arn:aws:iam::11223344:role/dev-backend-whisper-core--largev2vanillaRoleC7CE1EE-xaSDM2XaG6fE" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 591d86b1-0e24-4d83-a7ec-f3f12b30402d; Proxy: null)

 ❌  backend-whisper-pipeline/dev-backend-whisper-core-stage/core-stack (dev-backend-whisper-core-stage-core-stack) failed: Error: The stack named dev-backend-whisper-core-stage-core-stack failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE: Could not access model data at s3://s3-bucket/model.tar.gz. Please ensure that the role "arn:aws:iam::11223344:role/dev-backend-whisper-core--largev2vanillaRoleC7CE1EE-xaSDM2XaG6fE" exists and that its trust relationship policy allows the action "sts:AssumeRole" for the service principal "sagemaker.amazonaws.com". Also ensure that the role has "s3:GetObject" permissions and that the object is located in us-east-1. If your Model uses multiple models or uncompressed models, please ensure that the role has "s3:ListBucket" permission. (Service: AmazonSageMaker; Status Code: 400; Error Code: ValidationException; Request ID: 591d86b1-0e24-4d83-a7ec-f3f12b30402d; Proxy: null)
    at FullCloudFormationDeployment.monitorDeployment (/opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:435:10568)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async Object.deployStack2 [as deployStack] (/opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:438:199515)
    at async /opt/homebrew/lib/node_modules/aws-cdk/lib/index.js:438:181237

 ❌ Deployment failed: Error: The stack named dev-backend-whisper-core-stage-core-stack failed creation, it may need to be manually deleted from the AWS console: ROLLBACK_COMPLETE (same ValidationException as above)
krokoko commented 3 months ago

Hi @airmonitor, did you provide your role with permissions to access the S3 bucket storing your artifacts? We have this demonstrated in the sample here: https://github.com/aws-samples/generative-ai-cdk-constructs-samples/blob/main/samples/sagemaker_custom_endpoint/lib/sagemaker_custom_endpoint-stack.ts#L43C5-L57C7

airmonitor commented 3 months ago

Hey @krokoko.

I'm not sure I follow your intention here.

The example for extending IAM permissions uses a non-optimal way of granting the necessary permissions:

    customEndpoint.addToRolePolicy(
      new iam.PolicyStatement({
        effect: iam.Effect.ALLOW,
        actions: [
          's3:GetObject',
          's3:GetObject*',
          's3:GetBucket*',
          's3:List*',
        ],
        resources: [
          'BUCKET_ARN',
          'BUCKET_ARN/*',
        ],
      }),
    );

In CDK we have a better, shorter way to achieve the same result:

        self.whisper_ecr_repo.grant_pull(custom_sagemaker_endpoint.role) # ECR Repository
        self.model_data_s3_bucket.grant_read_write(custom_sagemaker_endpoint.role) # S3 bucket

The grant_pull and grant_read_write methods on the ECR repository and S3 bucket should do the trick. I'm not sure why we should go back to the days when those methods weren't available in CDK.

Regards Tom

krokoko commented 3 months ago

Hi @airmonitor, sorry, I just missed the last line of your previous example; I thought permissions to access the model artifacts in S3 were missing from the role. Is the endpoint created in the same region as the bucket?

airmonitor commented 3 months ago

Hello,

Yes, all resources are deployed in us-east-1

github-actions[bot] commented 1 month ago

This issue is now marked as stale because it hasn't seen activity for a while. Add a comment or it will be closed soon. If you wish to exclude this issue from being marked as stale, add the "backlog" label.

krokoko commented 1 week ago

I am not able to reproduce the issue. I tested both with the role provided by the construct and with a role created outside of the construct and passed through the constructor:

const roleSG = new iam.Role(this, 'Role', {
      assumedBy: new iam.ServicePrincipal('sagemaker.amazonaws.com'),
      managedPolicies: [
        iam.ManagedPolicy.fromAwsManagedPolicyName('AmazonSageMakerFullAccess'),
      ],
    });

const endpoint = new genai.CustomSageMakerEndpoint(this, 'testllavaendpoint', {
...
role: roleSG,
...
});

In both cases, the construct is successfully deployed.

One thing to note: for the samples we have and the tests I did, I am using the default bucket created by the SageMaker SDK (see https://sagemaker.readthedocs.io/en/stable/api/utility/session.html, default_bucket) when preparing the model. It might be worth creating the same default session bucket on your end, comparing it with the bucket you provide to the construct, and seeing if any configuration difference stands out.
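
For example, a minimal sketch (assuming the sagemaker Python SDK is installed) to print the session's default bucket for comparison:

    # Print the default bucket the SageMaker SDK would use, to compare it with
    # the bucket passed to the construct.
    import sagemaker

    print(sagemaker.Session().default_bucket())  # e.g. sagemaker-<region>-<account-id>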

I will close the issue. If you get more details and/or can upload a sample stack to a public repo that we can use to consistently reproduce the issue, please reopen this ticket. Thank you!