Venafi / terraform-provider-venafi

HashiCorp Terraform provider that uses Venafi to streamline machine identity (certificate and key) acquisition.
https://www.terraform.io/docs/providers/venafi/
Mozilla Public License 2.0

Option to gracefully handle certificate renewal failure #104

Open danjer2 opened 1 year ago

danjer2 commented 1 year ago

BUSINESS PROBLEM We are using the Venafi provider to create and refresh certificates. We take advantage of the expiration_window parameter to automatically renew the certificate when it gets close to its expiration time.

We recently had an unfortunate chain of events. During a release, the certificate was in the expiration window, and Terraform attempted to refresh it. At the perfect time (after terraform plan executed), the Venafi API became unavailable. The terraform apply failed when it tried to create the new certificate, leaving our application in a non-functional state.

We thought we could address this in the future with

lifecycle {
   create_before_destroy = true
}

but looking at the plans generated it would not help, because Terraform first unlinks the certificate. So even if the creation stops the run before destruction happens, the app is still left without a certificate. Moreover, since the apply does not complete, it's possible that other infrastructure changes did not get applied, so the application is left in an inconsistent state.

PROPOSED SOLUTION Add a configuration parameter to the Venafi Terraform provider to ignore the failure of a preventive (i.e. prior to expiration) certificate refresh.

By adding the lifecycle block above we'd force the new certificate to be created first. If certificate creation fails and the config param is turned on, the provider could return the current (still valid) certificate and let Terraform complete all other changes. The application would be left in a functional, consistent state.

A warning that the cert refresh failed would be helpful.
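To make the proposal concrete, here is a rough sketch of how the configuration might look. The ignore_renewal_failure argument is hypothetical: it does not exist in the provider today, and the name is only a placeholder for whatever the maintainers would choose. The common_name and expiration_window values are illustrative as well.

```hcl
resource "venafi_certificate" "app" {
  common_name = "app.example.com" # placeholder name

  # Renew once the certificate is within this window of expiry
  # (hours, as far as I understand the provider docs).
  expiration_window = 720

  # HYPOTHETICAL argument, not part of the provider today.
  # Intended meaning: if a renewal inside the expiration window fails and
  # the current certificate is still valid, keep the current certificate,
  # emit a warning, and let the rest of the apply continue.
  ignore_renewal_failure = true

  lifecycle {
    # Request the new certificate before the old one is destroyed.
    create_before_destroy = true
  }
}
```

Once the certificate has actually expired, a failed renewal would presumably still need to fail the apply, since there is no valid certificate left to fall back on.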

CURRENT ALTERNATIVES The only alternative is to attempt to restore the application configuration manually. Restoring a certificate that is still valid is not too complicated, but if there are other changes that were not applied, the process of identifying and applying them is more complicated and error-prone.

VENAFI EXPERIENCE I have been using the Venafi Terraform provider for more than a year.

luispresuelVenafi commented 1 year ago

Hi @danjer2 thank you for reaching out.

We are sorry to hear that you experienced that. Which platform were you working with when it became unavailable? Was it TPP or VaaS?

danjer2 commented 1 year ago

I'm not familiar with the service setup, but based on the fact that the values for both url and trust_bundle are internal, I'm guessing we are hosting it on-prem.

danjer2 commented 1 year ago

I got additional information on the event. It wasn't a Venafi API outage. The problem was with the certificate issuer's back end. So the plan worked fine, but the apply failed when it tried to create the new certificate.

luispresuelVenafi commented 1 year ago

@danjer2 Hi, sorry for the late response.

I'm trying to better understand your situation. When you mentioned this:

but looking at the plans generated it would not help, because Terraform first unlinks the certificate. So even if the creation stops the run before destruction happens, the app is still left without a certificate. Moreover, since the apply does not complete, it's possible that other infrastructure changes did not get applied, so the application is left in an inconsistent state.

Did you try to re-run the plan at that time, and Terraform didn't allow it? I'm asking because I would have expected Terraform not to delete or modify your state at all, since the issuance didn't complete.

danjer2 commented 1 year ago

@luispresuelVenafi - I feel like you're asking multiple questions in one.

I just had a long discussion with the team that encountered this, and we concluded that we would have two ways to make certificate renewal via Terraform reliable:

  1. Some change (like the one I proposed) to make the Venafi provider handle the error gracefully
  2. Separate the certificate creation/renewal from the rest of the infrastructure. This can be either a manual process or a separate Terraform configuration. The separation would ensure that if the cert renewal fails, it does not affect the deployment of the infrastructure.
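As a rough sketch of option 2, the certificate could live in its own Terraform configuration and the infrastructure configuration could read it through remote state. The directory names, the local backend, and the output names below are all assumptions for illustration; the provider block with url, trust_bundle, and zone is omitted.

```hcl
# --- certs/main.tf (certificate-only configuration, applied on its own) ---
resource "venafi_certificate" "app" {
  common_name       = "app.example.com" # placeholder name
  expiration_window = 720               # illustrative value
}

output "certificate_pem" {
  value = venafi_certificate.app.certificate
}

output "private_key_pem" {
  value     = venafi_certificate.app.private_key_pem
  sensitive = true
}

# --- infra/main.tf (infrastructure configuration, applied separately) ---
data "terraform_remote_state" "certs" {
  backend = "local" # assumption; use whatever backend the certs state lives in
  config = {
    path = "../certs/terraform.tfstate"
  }
}

locals {
  # Last successfully issued certificate, read from the certs state.
  certificate_pem = data.terraform_remote_state.certs.outputs.certificate_pem
  private_key_pem = data.terraform_remote_state.certs.outputs.private_key_pem
}
```

With this split, a failed apply in certs should leave its state at the last successfully issued certificate, so an apply in infra would keep using that certificate and would not be blocked by the renewal failure.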