flux-iac / tofu-controller

A GitOps OpenTofu and Terraform controller for Flux
https://flux-iac.github.io/tofu-controller/
Apache License 2.0

Exponentially backoff on reconciliation failure #1335

Open akselleirv opened 1 month ago

akselleirv commented 1 month ago

Currently, the only way to configure retries is to set spec.retryInterval to a static amount of time.

However, when multiple Terraform resources fail at the same time, this creates a lot of noise, because each one is retried constantly at the fixed interval. A more graceful approach would be to back off the retry interval exponentially.

A new field named spec.retryStrategy could be introduced. Its default value would be StaticInterval, to stay backward compatible, and the user could opt in to ExponentialBackoff. With ExponentialBackoff, the first retry would happen after 15 seconds, the next after 30 seconds, and so on, doubling each time up to a maximum requeue time.
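
To make the proposal concrete, here is a minimal Go sketch of how such a delay could be computed. It is not code from the controller: the function name backoffDelay, the constants baseDelay and maxDelay, and the 5-minute cap are all assumptions for illustration. Only the 15-second first delay and the doubling come from the description above.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values: 15s is the first delay suggested in this issue;
// the 5-minute cap is an assumption, the issue does not name a maximum.
const (
	baseDelay = 15 * time.Second
	maxDelay  = 5 * time.Minute
)

// backoffDelay returns how long to wait before the next retry, given the
// number of consecutive failed retries so far. The delay starts at base,
// doubles after every failure, and never exceeds limit.
func backoffDelay(base, limit time.Duration, retries int) time.Duration {
	delay := base
	for i := 0; i < retries; i++ {
		// Stop doubling once another doubling would reach or pass the cap.
		if delay >= limit/2 {
			return limit
		}
		delay *= 2
	}
	if delay > limit {
		return limit
	}
	return delay
}

func main() {
	// Print the requeue delay for the first few consecutive failures.
	for retries := 0; retries <= 5; retries++ {
		fmt.Println(retries, backoffDelay(baseDelay, maxDelay, retries))
	}
}
```

With these assumed values the delays are 15, 30, 60, 120 and 240 seconds, and every later retry waits the full 5 minutes. A controller using this would also need to track the failure count per resource and reset it after a successful reconciliation.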

don4of4 commented 1 month ago

This issue can cause a lot of pain: since TF writes a state file on every attempt, if you keep that state file in soft-delete blob storage (which is probably a good idea), EXCESSIVE blobs build up, which slows down the state file storage as a whole. We had an incident because of this.