Open akselleirv opened 1 month ago
This issue can be quite painful. Because Terraform writes a state file on every attempt, keeping that state in blob storage with soft delete enabled (which is probably a good idea) leads to an excessive buildup of blobs, which degrades the performance of the state storage as a whole. We had an incident caused by this.
Currently it is only possible to configure retries with `spec.retryInterval`, which sets a static amount of time between attempts. However, when multiple Terraform resources fail at the same time, this creates a lot of noise, because each resource keeps retrying at the fixed interval. A more graceful approach would be to add exponential backoff to the retry interval.
A new field named `spec.retryStrategy` could be introduced. Its default value would be `StaticInterval` to keep it backward compatible, or the user could choose `ExponentialBackoff`. With exponential backoff, the first retry would happen after 15 seconds, the next after 30 seconds, and so on, up to a maximum requeue time.