aws-samples / amazon-sagemaker-safe-deployment-pipeline

Safe blue/green deployment of Amazon SageMaker endpoints using AWS CodePipeline, CodeBuild and CodeDeploy.
https://aws.amazon.com/blogs/machine-learning/safely-deploying-and-monitoring-amazon-sagemaker-endpoints-with-aws-codepipeline-and-aws-codedeploy/
MIT No Attribution

nyctaxi-deploy-prd fails #30

Closed ehsanmok closed 3 years ago

ehsanmok commented 3 years ago

The pipeline fails to create the prod stack in SagemakerMonitoringSchedule because of

Resource handler returned message: "Error occurred during operation 'CREATE'." (RequestToken: 40af8897-76d2-abb5-6efc-ef8c6948d42b, HandlerErrorCode: GeneralServiceException)

Note that everything else succeeds, and this is running in us-east-1.

brightsparc commented 3 years ago

Hi @ehsanmok, are you using the latest code in master? The deploy role requires permissions to create the monitoring schedule. The specific errors are not visible from CFN.
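
Since CFN only surfaces the generic message, one place the underlying failure may show up is CloudTrail. The following is a minimal sketch, assuming the failed CreateMonitoringSchedule call is recorded in the CloudTrail event history of the region. `extract_errors` and `lookup_failures` are hypothetical helper names, and the sample record is made up purely to show the shape of the data; it is not the error from this issue.

```python
import json


def extract_errors(events):
    """Return (event name, error code, error message) for each failed call.

    `events` is shaped like the "Events" list returned by CloudTrail's
    LookupEvents API: each item carries a JSON string under "CloudTrailEvent".
    """
    failures = []
    for event in events:
        detail = json.loads(event.get("CloudTrailEvent", "{}"))
        # CloudTrail records carry an errorCode only when the call failed.
        if "errorCode" in detail:
            failures.append(
                (
                    detail.get("eventName"),
                    detail["errorCode"],
                    detail.get("errorMessage", ""),
                )
            )
    return failures


def lookup_failures(event_name="CreateMonitoringSchedule"):
    """Query CloudTrail for recent calls of `event_name` and keep the failures."""
    import boto3  # imported lazily so extract_errors works without AWS access

    client = boto3.client("cloudtrail")
    response = client.lookup_events(
        LookupAttributes=[
            {"AttributeKey": "EventName", "AttributeValue": event_name}
        ]
    )
    return extract_errors(response["Events"])


# Made-up records, for illustration only.
sample_events = [
    {
        "CloudTrailEvent": json.dumps(
            {
                "eventName": "CreateMonitoringSchedule",
                "errorCode": "AccessDenied",
                "errorMessage": "example message",
            }
        )
    },
    {"CloudTrailEvent": json.dumps({"eventName": "CreateMonitoringSchedule"})},
]
sample_failures = extract_errors(sample_events)
```

Calling `lookup_failures()` with credentials for the affected account and region would return only the calls that failed, which may narrow down whether the deploy role is missing a permission.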

ehsanmok commented 3 years ago

Yes, it's the latest CFT from the one-click launch button. The error is too generic, and I can't find more details about it either.

brightsparc commented 3 years ago

Hi @ehsanmok, the CFN stack in S3 was out of date with the repository's pipeline.yml. It has now been updated, but you can fix your stack by updating it with the pipeline.yml in the master branch.

This will update the DeployRole with permissions sufficient to create the monitoring schedule.

ehsanmok commented 3 years ago

Just updated with the master version, but it still failed with the same error.

brightsparc commented 3 years ago

Hi @ehsanmok, please ensure you updated the main nyctaxi stack. This will update the DeployRole, which is used by the nyctaxi-deploy-prd stack. I've re-tested this from scratch and validated that the pipeline works, so perhaps start again with a clean CFN setup to re-test if you are still having issues.

ehsanmok commented 3 years ago

Yes, updated the main CFT and released the changes.

The first attempt to delete the main stack gave this error:

mlops-nyctaxi-deploy-role is invalid or cannot be assumed

The second attempt worked, but I had to delete all the artifacts (S3 bucket, endpoint, model, etc.) manually (this could be automated with Lambda and the crhelper package). After recreating the entire stack and running the mlops notebook, the pipeline failed to create nyctaxi-workflow with

Resource handler returned message: "State Machine is being deleted: 'arn:aws:states:us-east-1:ACCOUNT:stateMachine:nyctaxi' (Service: AWSStepFunctions; Status Code: 400; Error Code: StateMachineDeleting; Request ID: 218c294f-53a2-44ba-9256-4cb227b43fa9; Proxy: null)" (RequestToken: 66428fdb-9fb6-3309-5ed8-04e7d868dbd1, HandlerErrorCode: GeneralServiceException)

On the third try, I deleted everything and recreated the stack. Now the prod deployment is successful! Thanks for the very useful design :)
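
The StateMachineDeleting error above suggests the stack was recreated while the previous nyctaxi state machine was still being deleted. A minimal sketch of waiting for that deletion to finish before recreating the stack is shown below. `wait_until_deleted` is a hypothetical helper, not part of the repository; it polls any "describe" callable until that callable raises a not-found error.

```python
import time


def wait_until_deleted(describe, is_not_found, timeout=300.0, interval=5.0,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll `describe` until it raises a not-found error or `timeout` elapses.

    describe:     zero-argument callable that looks up the resource.
    is_not_found: callable taking an exception, returning True when the
                  exception means the resource no longer exists.
    Returns True once the resource is gone, False if the timeout is reached.
    Any other exception is re-raised.
    """
    deadline = clock() + timeout
    while True:
        try:
            describe()
        except Exception as exc:
            if is_not_found(exc):
                return True
            raise
        if clock() >= deadline:
            return False
        sleep(interval)


def wait_for_state_machine_deletion(state_machine_arn):
    """Wait for a Step Functions state machine to disappear (assumes boto3)."""
    import boto3  # imported lazily so the polling helper has no AWS dependency

    sfn = boto3.client("stepfunctions")
    return wait_until_deleted(
        lambda: sfn.describe_state_machine(stateMachineArn=state_machine_arn),
        lambda exc: isinstance(exc, sfn.exceptions.StateMachineDoesNotExist),
    )
```

The polling logic takes the lookup as a parameter so it can be exercised without AWS access; `wait_for_state_machine_deletion` would be called with the ARN from the error message before re-running the pipeline.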