Currently, we observe a few different failures during integration tests, which run as part of
the PR build as well as on pushes to branches.
createModel and createEndpoint fail most frequently, primarily with:
Rate exceeded - ThrottlingException.
This change defines a default retry strategy that makes 5 attempts, starting at an interval of
5 seconds and backing off with a multiplier of 2. The methodology behind this strategy is
naive and may need some calibration, but it should reduce the frequency of failures in the
short term.
We can adjust the retry strategy as we go and expand to something more API-specific as
the need arises.
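For reviewers who want to sanity-check the numbers, here is a minimal sketch of what such a strategy could look like as an ASL Retry block, plus the wait schedule it implies. It is not the code in this PR. It assumes "5 attempts" maps to MaxAttempts = 5 (which in ASL counts retries), that the 5 seconds is IntervalSeconds, and that the error is matched by the name "ThrottlingException"; the helper retry_delays is hypothetical and exists only for illustration.

```python
import json


def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Wait before each retry: the interval is multiplied by backoff_rate after every retry."""
    return [interval_seconds * backoff_rate ** i for i in range(max_attempts)]


# Assumed values, taken from the description above (5 attempts, 5 seconds, multiplier 2).
# The ErrorEquals entry is an assumption; the real definition may match a different error name.
retry_block = {
    "ErrorEquals": ["ThrottlingException"],
    "IntervalSeconds": 5,
    "MaxAttempts": 5,
    "BackoffRate": 2,
}

delays = retry_delays(
    retry_block["IntervalSeconds"],
    retry_block["MaxAttempts"],
    retry_block["BackoffRate"],
)
total_wait = sum(delays)

# Print the fragment as it would appear on a state in the ASL definition.
print(json.dumps({"Retry": [retry_block]}, indent=2))
print("seconds waited before each retry:", delays)
print("worst-case total wait (seconds):", total_wait)
```

Under these assumptions a throttled step waits at most a couple of minutes in total before it is reported as failed, which is the number to revisit if the strategy needs calibration.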
Testing
Ran the integ tests a few times locally and confirmed that the ASL definition included the
retry and that the executions completed successfully.
Rendered the retry from the StateMachine definition on the SageMaker steps.
Also pushed some dummy / trivial commits to this PR to trigger simultaneous builds. Haven't seen any state machine failures yet 🤞
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.