aws-samples / amazon-sagemaker-studio-secure-data-science-workshop

Secure data science with Amazon SageMaker Studio workshop. This workshop creates a reference architecture with security and controls to perform machine learning tasks securely with Amazon SageMaker Studio.
MIT No Attribution

Recurring "Resource of type 'AWS::SageMaker::Domain' with identifier 'ds-studio-domain' did not stabilize." #10

Closed Analect closed 1 year ago

Analect commented 1 year ago

Describe the bug: In Lab 2, when trying to provision a Sagemaker Studio Product ... it keeps failing with Resource handler returned message: "Resource of type 'AWS::SageMaker::Domain' with identifier 'ds-studio-domain' did not stabilize.". It's unclear to me how to debug this.

Here is the full error message:

Failed to launch provisioned product
Errors from CloudFormation: [{LogicalResourceId : SC-xxx-pp-gbmpqjipskpmg, ResourceType : AWS::CloudFormation::Stack, StatusReason : The following resource(s) failed to create: [SageMakerStudioDomain]. Rollback requested by user.}, {LogicalResourceId : SageMakerStudioDomain, ResourceType : AWS::SageMaker::Domain, StatusReason : Resource handler returned message: "Resource of type 'AWS::SageMaker::Domain' with identifier 'ds-studio-domain' did not stabilize." (RequestToken: 9f7f7782-7dec-8682-afa3-6d996b05982e, HandlerErrorCode: NotStabilized)}, {LogicalResourceId : SageMakerStudioDomain, ResourceType : AWS::SageMaker::Domain, StatusReason : Resource creation Initiated}, {LogicalResourceId : KeyAlias, ResourceType : AWS::KMS::Alias, StatusReason : Resource creation Initiated}, {LogicalResourceId : DataScientistDefaultRoleArn, ResourceType : AWS::SSM::Parameter, StatusReason : Resource creation Initiated}, {LogicalResourceId : SageMakerCustomImageAppConfig, ResourceType : AWS::SageMaker::AppImageConfig, StatusReason : Resource creation Initiated}, {LogicalResourceId : SageMakerCustomImageVersion, ResourceType : AWS::SageMaker::ImageVersion, StatusReason : Resource creation Initiated}, {LogicalResourceId : SageMakerCustomImage, ResourceType : AWS::SageMaker::Image, StatusReason : Resource creation Initiated}, {LogicalResourceId : SagemakerStudioKMS, ResourceType : AWS::KMS::Key, StatusReason : Resource creation Initiated}, {LogicalResourceId : DataScientistDefaultRole, ResourceType : AWS::IAM::Role, StatusReason : Resource creation Initiated}, {LogicalResourceId : SC-xxx-pp-gbmpqjipskpmg, ResourceType : AWS::CloudFormation::Stack, StatusReason : User Initiated}]
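The flattened error list above is easier to scan once it is split into one record per resource. The following is a minimal sketch, assuming the message keeps the `{LogicalResourceId : ..., ResourceType : ..., StatusReason : ...}` layout shown here; the names `parse_sc_errors` and `failed_records` are hypothetical helpers, not part of any AWS SDK. Note that it only isolates which resource failed; the `NotStabilized` reason by itself does not say why.

```python
import re

# One record in the Service Catalog message looks like:
# {LogicalResourceId : X, ResourceType : Y, StatusReason : free text}
# The free text may contain commas, so a record is taken to end at a "}"
# that is followed by the next record or by the closing "]".
RECORD_RE = re.compile(
    r"\{LogicalResourceId : (?P<logical_id>[^,]+), "
    r"ResourceType : (?P<resource_type>[^,]+), "
    r"StatusReason : (?P<reason>.*?)\}(?=, \{|\]$)",
    re.DOTALL,
)


def parse_sc_errors(message):
    """Split the bracketed error list into a list of dicts."""
    return [m.groupdict() for m in RECORD_RE.finditer(message.strip())]


def failed_records(records):
    """Drop the routine '... Initiated' events and keep the rest."""
    return [r for r in records if not r["reason"].endswith("Initiated")]


# Shortened copy of the message from this issue, used as sample input.
SAMPLE = (
    "[{LogicalResourceId : SageMakerStudioDomain, "
    "ResourceType : AWS::SageMaker::Domain, "
    "StatusReason : Resource handler returned message: "
    "\"Resource of type 'AWS::SageMaker::Domain' with identifier "
    "'ds-studio-domain' did not stabilize.\" "
    "(RequestToken: 9f7f7782-7dec-8682-afa3-6d996b05982e, "
    "HandlerErrorCode: NotStabilized)}, "
    "{LogicalResourceId : KeyAlias, ResourceType : AWS::KMS::Alias, "
    "StatusReason : Resource creation Initiated}]"
)

if __name__ == "__main__":
    for record in failed_records(parse_sc_errors(SAMPLE)):
        print(record["logical_id"], "->", record["reason"])
```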

To Reproduce: I have followed the steps in Lab1 and Lab2 here.

These were the parameters in the launch template

image

Expected behavior: The ds-studio-domain should be provisioned without error.

Screenshots: This is the point in the CloudFormation events where the stack fails and rolls back, deleting the resources it had already created. I retried three times, to no avail.

image

Any suggestions on how I might troubleshoot this more effectively? Thanks.

Analect commented 1 year ago

The error surfaced by Service Catalog suggests that the stack name somehow fails a regex constraint.

image

It's not clear to me where this constraint is getting applied. It doesn't appear to be in this repo's code.

image

sachin-pharande commented 1 year ago

In Lab 2, when trying to provision a Sagemaker Studio Product ... it keeps failing with Resource handler returned message: "Resource of type 'AWS::SageMaker::Domain' with identifier 'ds-studio-domain' did not stabilize."

The error message is as below:

Failed to launch provisioned product Errors from CloudFormation: [{LogicalResourceId : SC-xxx-pp-gbmpqjipskpmg, ResourceType : AWS::CloudFormation::Stack, StatusReason : The following resource(s) failed to create: [SageMakerStudioDomain]. Rollback requested by user.}, {LogicalResourceId : SageMakerStudioDomain, ResourceType : AWS::SageMaker::Domain, StatusReason : Resource handler returned message: "Resource of type 'AWS::SageMaker::Domain' with identifier 'ds-studio-domain' did not stabilize." (RequestToken: 9f7f7782-7dec-8682-afa3-6d996b05982e, HandlerErrorCode: NotStabilized)}, {LogicalResourceId : SageMakerStudioDomain, ResourceType : AWS::SageMaker::Domain, StatusReason : Resource creation Initiated}, {LogicalResourceId : KeyAlias, ResourceType : AWS::KMS::Alias, StatusReason : Resource creation Initiated}, {LogicalResourceId : DataScientistDefaultRoleArn, ResourceType : AWS::SSM::Parameter, StatusReason : Resource creation Initiated}, {LogicalResourceId : SageMakerCustomImageAppConfig, ResourceType : AWS::SageMaker::AppImageConfig, StatusReason : Resource creation Initiated}, {LogicalResourceId : SageMakerCustomImageVersion, ResourceType : AWS::SageMaker::ImageVersion, StatusReason : Resource creation Initiated}, {LogicalResourceId : SageMakerCustomImage, ResourceType : AWS::SageMaker::Image, StatusReason : Resource creation Initiated}, {LogicalResourceId : SagemakerStudioKMS, ResourceType : AWS::KMS::Key, StatusReason : Resource creation Initiated}, {LogicalResourceId : DataScientistDefaultRole, ResourceType : AWS::IAM::Role, StatusReason : Resource creation Initiated}, {LogicalResourceId : SC-xxx-pp-gbmpqjipskpmg, ResourceType : AWS::CloudFormation::Stack, StatusReason : User Initiated}]

Could you provide guidance on how to resolve this?

cremich commented 1 year ago

I faced the same issue. @Analect did you find a workaround?

cremich commented 1 year ago

@Analect AWS support helped me to fix it. In my case the error was related to a missing permission:

User: arn:aws:sts::xxxxx:assumed-role/DSSharedServices-ServiceCatalogLaunchRole-xxxxx/servicecatalog is not authorized to perform: elasticfilesystem:TagResource on the specified resource

I updated the permissions in the role. After this, the stack was created.
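For anyone hitting the same message, here is a rough sketch of how the denied action could be pulled out of the error text and turned into an IAM policy statement. It is an illustration only, not the change AWS support or the maintainers made: the helper names are hypothetical, `Resource: "*"` is a placeholder that should be narrowed in practice, and the policy name in `attach_inline_policy` is whatever you choose. The only AWS call used is boto3's `put_role_policy`, imported lazily so the parsing helpers run without boto3 installed.

```python
import json
import re

# Matches the usual IAM "not authorized" wording seen in the message above.
DENIED_RE = re.compile(
    r"User: (?P<principal>\S+) is not authorized to perform: "
    r"(?P<action>[\w-]+:[\w*]+)"
)


def parse_access_denied(message):
    """Return {'principal': ..., 'action': ...} or None if no match."""
    match = DENIED_RE.search(message)
    return match.groupdict() if match else None


def role_name_from_assumed_role_arn(arn):
    """arn:aws:sts::<account>:assumed-role/<role-name>/<session> -> role name."""
    return arn.split("/")[1]


def policy_for_action(action, resource="*"):
    """Build a one-statement policy document allowing the denied action.

    Assumption: "*" is used only for illustration; scope it down if possible.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": [action], "Resource": resource}
        ],
    }


def attach_inline_policy(role_name, policy_name, document):
    """Attach the document to the role as an inline policy (needs boto3)."""
    import boto3  # imported here so the helpers above work without it

    iam = boto3.client("iam")
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=policy_name,
        PolicyDocument=json.dumps(document),
    )


SAMPLE_MESSAGE = (
    "User: arn:aws:sts::xxxxx:assumed-role/"
    "DSSharedServices-ServiceCatalogLaunchRole-xxxxx/servicecatalog "
    "is not authorized to perform: elasticfilesystem:TagResource "
    "on the specified resource"
)

if __name__ == "__main__":
    denied = parse_access_denied(SAMPLE_MESSAGE)
    print(role_name_from_assumed_role_arn(denied["principal"]))
    print(json.dumps(policy_for_action(denied["action"]), indent=2))
```

If the role is managed by a CloudFormation template, editing the role by hand will drift from the template, so adding the permission in the template itself is likely the more durable fix.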

sandwi commented 1 year ago

Thanks @cremich for helping the community by sharing the solution to the issue. I have added this permission to SCLaunchRole in commit ad2d034.

@Analect @sachin-pharande can you pull the latest changes and validate whether this also resolves your issue? I will leave this issue open until you or @cremich confirm. Thanks for the help.

sandwi commented 1 year ago

commit ad2d034 fixed the issue.