aws / aws-step-functions-data-science-sdk-python

Step Functions Data Science SDK for building machine learning (ML) workflows and pipelines on AWS
Apache License 2.0
285 stars 87 forks

chore: add retry to SageMaker steps in integration tests #162

Closed shivlaks closed 3 years ago

shivlaks commented 3 years ago

Summary

Currently, we observe a few different intermittent failures during integration tests, which run as part of the PR build as well as on pushes to branches.

createModel and createEndpoint in particular fail most frequently, and the failures are primarily:

This change defines a default retry strategy for the SageMaker steps: up to 5 retry attempts, with a 5-second wait before the first retry and a backoff multiplier of 2, so the wait doubles on each subsequent attempt. The methodology behind this strategy is naive and may need some calibration. It should reduce the frequency of failures in the short term.

We can adjust the retry strategy as we go and expand to something more API specific as the need arises.
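
For calibration purposes, the waits implied by these values can be worked out with a small sketch. It assumes the usual Step Functions semantics (IntervalSeconds is the wait before the first retry, and each later wait is multiplied by BackoffRate) and no cap on the delay. The helper name `retry_wait_times` is hypothetical and not part of the SDK.

```python
def retry_wait_times(interval_seconds=5, max_attempts=5, backoff_rate=2):
    """Return the wait in seconds before each retry attempt.

    Hypothetical helper for illustration; defaults mirror the strategy
    described in this PR (5 attempts, 5 second interval, backoff rate 2).
    """
    return [interval_seconds * backoff_rate ** attempt
            for attempt in range(max_attempts)]


waits = retry_wait_times()
print(waits)       # [5, 10, 20, 40, 80]
print(sum(waits))  # 155
```

Under those assumptions, a step that exhausts all retries spends roughly two and a half minutes waiting, in addition to the time taken by the API calls themselves.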

Testing

Rendered retry from the state machine definition on SageMaker steps:

"Retry": [
  {
    "ErrorEquals": [
      "SageMaker.AmazonSageMakerException"
    ],
    "IntervalSeconds": 5,
    "MaxAttempts": 5,
    "BackoffRate": 2
  }
]
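
A check like the following could confirm that the retrier shows up on every Task state in a rendered definition. This is a minimal sketch, not the check used in this PR: the helper `states_missing_retry`, the state names, and the sample definition are all hypothetical, and only top-level states are inspected (states nested in Parallel or Map are not walked).

```python
import json

# Retrier as rendered in the state machine definition above.
EXPECTED_RETRY = {
    "ErrorEquals": ["SageMaker.AmazonSageMakerException"],
    "IntervalSeconds": 5,
    "MaxAttempts": 5,
    "BackoffRate": 2,
}


def states_missing_retry(definition, expected=EXPECTED_RETRY):
    """Return names of top-level Task states that lack the expected retrier.

    Hypothetical helper for illustration only.
    """
    missing = []
    for name, state in definition.get("States", {}).items():
        if state.get("Type") != "Task":
            continue  # only Task states are checked in this sketch
        if expected not in state.get("Retry", []):
            missing.append(name)
    return missing


# Illustrative definition: state names and layout are made up for this example.
definition = json.loads("""
{
  "StartAt": "Create Model",
  "States": {
    "Create Model": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createModel",
      "Retry": [
        {
          "ErrorEquals": ["SageMaker.AmazonSageMakerException"],
          "IntervalSeconds": 5,
          "MaxAttempts": 5,
          "BackoffRate": 2
        }
      ],
      "Next": "Create Endpoint"
    },
    "Create Endpoint": {
      "Type": "Task",
      "Resource": "arn:aws:states:::sagemaker:createEndpoint",
      "End": true
    }
  }
}
""")

print(states_missing_retry(definition))  # ['Create Endpoint']
```

Comparing whole retrier dictionaries keeps the check strict; if the strategy becomes API specific later, the expected value could be passed per step instead.
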
StepFunctions-Bot commented 3 years ago

AWS CodeBuild CI Report
