cloudfoundry-community / terraform-provider-cloudfoundry

Terraform Cloud Foundry Provider
https://registry.terraform.io/providers/cloudfoundry-community/cloudfoundry/latest
Mozilla Public License 2.0

fail_when_catalog_not_accessible behaves as if true during plan #329

Open mogul opened 3 years ago

mogul commented 3 years ago

When an app that was previously registered as a service broker is deleted outside of Terraform, it's possible to get into an unrecoverable state where running terraform plan produces:

module.broker_solr.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/management-staging"]: Refreshing state... [id=632b66d2-14d0-4c7f-be69-e1671163865b]

Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-new-weevil.app.cloud.gov') does not exist.

This error occurs even when fail_when_catalog_not_accessible is set to false. This was surprising, since the issue was supposed to be resolved by this PR: https://github.com/cloudfoundry-community/terraform-provider-cloudfoundry/pull/300

Confirmed as still happening with the latest provider version, 0.14.0.
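For readers unfamiliar with the setting, here is a minimal sketch of a space-scoped broker with the flag disabled. The broker name, URL, username, and variable names are placeholders for illustration, not the actual datagov-ssb configuration.

```hcl
# Hypothetical example: all names, variables, and the URL are placeholders.
variable "broker_password" {
  type      = string
  sensitive = true
}

variable "space_id" {
  type = string
}

resource "cloudfoundry_service_broker" "space_scoped_broker" {
  name     = "ssb-example"
  url      = "https://ssb-example.app.cloud.gov"
  username = "broker-user"
  password = var.broker_password
  space    = var.space_id

  # Expected behavior: with this set to false, a failed catalog fetch
  # (such as the 404 above) should not abort plan or apply.
  fail_when_catalog_not_accessible = false
}
```

With this configuration, the report above is that plan still fails during refresh once the broker's route is gone.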

ArthurHlt commented 3 years ago

I think you're talking about this github action: https://github.com/GSA/datagov-ssb/runs/2224255533?check_suite_focus=true

It doesn't look like this terraform plan contains any changes (we don't even see the switch between true and false for fail_when_catalog_not_accessible). Are you sure you don't have another issue here?

mogul commented 3 years ago

The GitHub Action audit sequence is hard to understand since I later had to nuke the tfstate to get out of the problem situation. Then I rebased and force-pushed the branch. These are three commits kept during the rebase.

If you look at the set of commit runs you get a better sense of the commits prior to the rebase. The series of plan runs gives you a sense of where we ran into trouble.

  1. The first run where we started seeing the 404 on the catalog causing plan to error out was right after we pushed an empty commit, just to get GitHub Actions to repopulate some things that had been removed manually.
  2. I made the commit that removed randomness from the routes, and we remained in that state at the next run.
  3. That's when we made the commit to turn fail_when_catalog_not_accessible to false. Even after that we were still encountering the problem during plan.
  4. The third commit is us making sure we were using the most recent version of the provider, and still seeing the problem.

At that point we pored over your PR on the provider to try to figure out how failNotAccessible was still ending up true on line 246, and couldn't figure it out. We finally gave up and filed this issue.

After that I started working with the environment from my local machine. I went through a bunch of different attempts to try to fix the problem: Tainting the broker in question, running plan with -target, then applying the plan output, etc. I just couldn't get the apply to work because no matter what I did the provider still ended up trying to query the catalog. If I told the apply not to refresh, then it complained about a mismatch with the plan for the obsolete ssb-new-weevil.app.cloud.gov route/catalog, which I just couldn't get it to forget about.

Then I ended up trying to edit the .tfstate to manually remove the Solr broker (no easy feat when the path for finding the tfstate for that workspace in S3 was obscure). Pretty soon I'd trashed the state and finally just deleted that workspace state, manually deleted the resources, and ran apply again. This was OK to do because the workspace referred to a staging environment, but I'm nervous about the potential to have to do this for production too if we end up in this state in the future...! 😓

mogul commented 3 years ago

After a bunch of other wrangling, I'm at the point where both the change of fail_when_catalog_not_accessible to false and the change to non-random routes are hitting our production space.

After most of the apply was done, including the replacement of the routes, we saw the apply fail before completion:

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_eks.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/prod"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-smart-garfish.app.cloud.gov"), but now
cty.StringVal("https://ssb-eks-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_eks.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/development"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-smart-garfish.app.cloud.gov"), but now
cty.StringVal("https://ssb-eks-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_eks.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/management"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-smart-garfish.app.cloud.gov"), but now
cty.StringVal("https://ssb-eks-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_aws.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/prod"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-intimate-mink.app.cloud.gov"), but now
cty.StringVal("https://ssb-aws-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_eks.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/staging"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-smart-garfish.app.cloud.gov"), but now
cty.StringVal("https://ssb-eks-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_aws.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/staging"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-intimate-mink.app.cloud.gov"), but now
cty.StringVal("https://ssb-aws-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_aws.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/management"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-intimate-mink.app.cloud.gov"), but now
cty.StringVal("https://ssb-aws-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Error: Provider produced inconsistent final plan

When expanding the plan for
module.broker_aws.cloudfoundry_service_broker.space_scoped_broker["gsa-datagov/development"]
to include new values learned so far during apply, provider
"registry.terraform.io/cloudfoundry-community/cloudfoundry" produced an
invalid new value for .url: was
cty.StringVal("https://ssb-intimate-mink.app.cloud.gov"), but now
cty.StringVal("https://ssb-aws-gsa-datagov-management.app.cloud.gov").

This is a bug in the provider, which should be reported in the provider's own
issue tracker.

Then the very next plan failed, again with catalog 404s causing the failure:

Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-intimate-mink.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-intimate-mink.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-intimate-mink.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-intimate-mink.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-smart-garfish.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-smart-garfish.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-smart-garfish.app.cloud.gov') does not exist.
Error: Error when getting catalog signature: Status code: 404 Not Found, Body: 404 Not Found: Requested route ('ssb-smart-garfish.app.cloud.gov') does not exist.

It seems that the first problem (of using the old routes rather than the new in the apply) is resulting in the TF state continuing to refer to the old routes for the brokers. The old routes were definitely removed during the apply, yet definitely still appear in the service-broker registrations:

% cf routes
Getting routes for org gsa-datagov / space management as [redacted]...

space        host                              domain          port   path   protocol   apps
management   ssb-aws-gsa-datagov-management    app.cloud.gov                 http       ssb-aws
management   ssb-solr-gsa-datagov-management   app.cloud.gov                 http       ssb-solr
management   ssb-eks-gsa-datagov-management    app.cloud.gov                 http       ssb-eks

% cf service-brokers
Getting service brokers as [redacted]...
name                                      url
ssb-ssb-aws-gsa-datagov-staging           https://ssb-intimate-mink.app.cloud.gov
ssb-ssb-aws-gsa-datagov-prod              https://ssb-intimate-mink.app.cloud.gov
ssb-ssb-aws-gsa-datagov-development       https://ssb-intimate-mink.app.cloud.gov
ssb-ssb-aws-gsa-datagov-management        https://ssb-intimate-mink.app.cloud.gov
ssb-ssb-eks-gsa-datagov-staging           https://ssb-smart-garfish.app.cloud.gov
ssb-ssb-eks-gsa-datagov-prod              https://ssb-smart-garfish.app.cloud.gov
ssb-ssb-eks-gsa-datagov-management        https://ssb-smart-garfish.app.cloud.gov
ssb-ssb-eks-gsa-datagov-development       https://ssb-smart-garfish.app.cloud.gov
ssb-solr-gsa-datagov-development          https://ssb-improved-bunny.app.cloud.gov
ssb-solr-gsa-datagov-staging              https://ssb-improved-bunny.app.cloud.gov
ssb-solr-gsa-datagov-management           https://ssb-improved-bunny.app.cloud.gov
ssb-solr-gsa-datagov-prod                 https://ssb-improved-bunny.app.cloud.gov

And again: I expected that setting fail_when_catalog_not_accessible to false would prevent the failure upon encountering the 404s (in which case everything could probably still recover at the next apply), but that's clearly not happening.

I'm leaving it in this state and am very open to debugging it interactively with you via a Slack call when our timezones overlap! (I'm @mogul in the Cloud Foundry Slack, and I'm in UTC-7.)

ArthurHlt commented 3 years ago

For the state, you could also simply set fail_when_catalog_not_accessible to false directly inside it.

I understand your frustration, so I dug deeper. It looks like I can't access the changes during read. When I made the original change I clearly could; maybe my version of Terraform allowed that.

I've tried many ways to get this information during read, and it's impossible with what the CLI gives to the provider.

So I have another proposal: I would like to add a value to the provider config, force_broker_not_fail_when_catalog_not_accessible, associated with the env var CF_FORCE_BROKER_NOT_FAIL_CATALOG. When set to true, it will force fail_when_catalog_not_accessible to be false.

I've tried this configuration and it works. What do you think about it?
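As a rough sketch of how the proposed override might be used, assuming it is implemented as described above: the API URL and variable names are placeholders, and only the force_broker_not_fail_when_catalog_not_accessible argument and the CF_FORCE_BROKER_NOT_FAIL_CATALOG env var come from the proposal.

```hcl
# Hypothetical example: api_url and the variables are placeholders.
variable "cf_user" {
  type = string
}

variable "cf_password" {
  type      = string
  sensitive = true
}

provider "cloudfoundry" {
  api_url  = "https://api.example.com"
  user     = var.cf_user
  password = var.cf_password

  # Proposed provider-level override: when true, every service broker is
  # treated as if fail_when_catalog_not_accessible were false.
  # Per the proposal, setting the env var CF_FORCE_BROKER_NOT_FAIL_CATALOG
  # to true would be the alternative to this argument.
  force_broker_not_fail_when_catalog_not_accessible = true
}
```

Putting the switch at the provider level avoids the limitation described above, since it does not depend on the provider seeing pending resource changes during read.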

ArthurHlt commented 3 years ago

Please see the pull request and give it a try.

mogul commented 3 years ago

Confirmed working in the PR.