Error: 429 error without the "X-Redlock-Status" header

erikpaasonen commented 1 year ago

Describe the bug

Terraform fails because the data sources don't return valid data.

Expected behavior

should be able to rerun terraform plan and terraform apply against a reasonable number of accounts at once, as often as we choose.

Current behavior

after several successes, we receive this error on all the others:

Error: 429 error without the "X-Redlock-Status" header - returned HTML:

since the data source fails, the Terraform stack cannot calculate a plan and exits nonzero (failure).

Possible solutions

increase capacity of the Prisma Cloud API to be able to handle many more GET API calls driven by the increase in volume of data source retrievals
provide the AWS external ID as a distinct API call, which would then be able to yield its own data source; build this new API endpoint for performance and allow a very high threshold of simultaneous GET requests (this is a public Terraform provider, after all)
unknown if there are multiple API calls per data source, but if there are >1 API calls per data source, then perhaps refactor the API to accomplish everything in only a single API call per data source? that could have an exponentially positive effect on performance at scale (depending on current situation)

Steps to reproduce

refactor a Terraform stack that was using prismacloud_cloud_account to start using prismacloud_cloud_account_v2. as part of this refactor, introduce two new data sources to the stack: prismacloud_aws_cft_generator and prismacloud_account_supported_features.
run terraform apply against approx. 50 AWS accounts in short succession.

note: for reproducing this error, it does not matter whether the prismacloud_cloud_account_v2 resource is already created, or fails to create, or what. even previously-successful terraform apply runs will fail on subsequent runs because both data sources still need to succeed their underlying API calls.

Context

I work at a big company (though I'm posting this issue here personally). we recently began using the newly-released prismacloud_cloud_account_v2 resource. we are unable to deploy this refactor to all of our accounts because of this error.

this worked for us up until recently, while we were on the v1 version of the resource. I suspect it's because we didn't have to hit a data source on every run. it was a one-and-done TF resource that did nothing (meaning, no API calls at all) when we reran Terraform against the account in the future. now with v2 requiring an external ID value which is provided only via data source, our Terraform stack needs to run that data source on every run in order to have the value ready for the resource (which it doesn't need/throws away on every subsequent run after the initial resource is created, of course).

Terraform is often used as a drift detection tool. rerunning a stack which is not expected to have any changes is one way of achieving a high confidence in the resources managed by Terraform. the more often Terraform runs and detects no changes between the code and the environment, the greater the confidence in the environment. thus, a high load can be expected on data sources, since they need to complete successfully for even a no-changes plan or apply run to not fail with an error.

Your Environment

I think I'm safe to say somewhere in the range of 100 to 2000 AWS accounts

Version used:
- Terraform: v1.2.2
- CPU arch: linux_amd64
- prismacloud Terraform provider: v1.3.7
Operating System and version (desktop or mobile): Docker image derived ultimately from the official Terraform Docker image corresponding to the version of Terraform

erikpaasonen commented 1 year ago

appears to be related to #200 ?

possibility of a "noisy neighbor" problem? i.e. some Prisma Cloud customers' volume affecting other customers in the same tenancy? not sure, but might be worth checking the aggregate API traffic stats.

erikpaasonen commented 1 year ago

I put in some calories trying to engineer a way to count the data source on/off for future/subsequent runs. (been writing Terraform code full-time since 2017 so able to dig fairly deep.) unfortunately, all of the functions pause their calculation for the placeholder value for a not-yet-known value and resume again when the value becomes known, which is during the apply. currently there is no way to evaluate against the unknown value condition directly, i.e. the "unknown value" state is not a comparable thing to be checked against like null is.

every attempt to count or for_each off the data source based on attributes of the resource yields this error:

The "count" value depends on resource attributes that cannot be determined until apply, so Terraform cannot predict how many instances will be created. To work around this, use the -target argument to first apply only the resources that the count depends on.

they're saying they need the calculated attributes of the resource in order to calculate how many of the data source to make. they don't know I'm just trying to count between 1 and 0, the functions are engineered for the generic case.

we run many Terraform stacks in an orchestrated manner which does not support the separate attempt using -target= as suggested by the error message. all other Terraform providers we use "just work" when it comes to data source lookups.

the most promising was the time_static resource:

resource "time_static" "foo" {
}

data "prismacloud_aws_cft_generator" "foo" {
    count = timecmp(time_static.foo.rfc3339, timestamp()) == 0 ? 1 : 0
    ...
}

this conditional logic keeps track of the timestamp when it first gets created and never runs again. unfortunately, there are two issues:

it's still subject to the chicken-and-egg problem trying to get that first run to create the prismacloud_cloud_account_v2 resource (being a resource itself), and
it isn't coupled at all to the prismacloud_cloud_account_v2 resource itself being successfully created, i.e. it's subject to requiring manual intervention if the first-time apply attempt of the prismacloud_cloud_account_v2 fails for any reason.

just to be clear: this wouldn't be happening in the first place if the external ID were available as a resource instead of a data source. this 429 API limiting problem is a ramification of the design decision to make the external ID available only as a data source, because of the way the Terraform ecosystem is architected and because its use as drift detection is encouraged.

other further reading: https://developer.hashicorp.com/terraform/language/expressions/function-calls#when-terraform-calls-functions https://developer.hashicorp.com/terraform/language/resources/terraform-data https://developer.hashicorp.com/terraform/language/expressions/custom-conditions#resources-and-data-sources https://github.com/hashicorp/terraform/issues/30937 https://log.martinatkins.me/2021/06/14/terraform-plan-unknown-values/ https://github.com/hashicorp/terraform/issues/26755 https://discuss.hashicorp.com/t/count-depending-on-resources-when-it-is-clearly-not-null-relates-to-https-github-com-hashicorp-terraform-issues-30816/38082/6 https://github.com/hashicorp/terraform/issues/30816 https://github.com/hashicorp/terraform/issues/17034#issuecomment-648954314

erikpaasonen commented 1 year ago

FYI there does exist a time_rotating Terraform resource (implementation here). this resource expires itself after a certain time period, causing no-op for the duration but flags Terraform to taint (i.e. destroy and re-create) after the duration. something similar could be implemented for the proposed external_id resource to keep it static for a defined rotation period.

I'm imagining the hypothetical external_id to have a configurable/optional rotation duration setting. this would allow an organization of our size to randomize the timing of the taint per account instead of having the same overloading problem every N days. it could include as its output the expected expiration date of each account. 👍

just brainstorming.

UshaPriya2001 commented 1 year ago

@erikpaasonen Thanks for opening the issue, a fix has been released with terraform version = "1.4.1" Closing the issue Thanks

PaloAltoNetworks / terraform-provider-prismacloud