IBM-Cloud / terraform-provider-ibm

https://registry.terraform.io/providers/IBM-Cloud/ibm/latest/docs
Mozilla Public License 2.0
339 stars 663 forks source link

[CBR] Provider produced inconsistent result after apply #5002

Closed Ak-sky closed 5 months ago

Ak-sky commented 9 months ago

Community Note

Terraform CLI and Terraform IBM Provider Version

Terraform Version v1.5.7 Terraform IBM Provider Version v1.60.1

Affected Resource(s)

Terraform Configuration Files

We have noticed an issue during one of the test for TF Event Notification module example which is

Provider produced inconsistent result after apply

When applying changes to module.cbr_zone.ibm_cbr_zone.cbr_zone, provider
"provider[\"registry.terraform.io/ibm-cloud/ibm\"]" produced an unexpected
new value: Root resource was present, but now absent.
This is a bug in the provider, which should be reported in the provider's
own issue tracker.

Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.

# Copy-paste your Terraform configurations here - for large Terraform configs,
# please share a link to the ZIP file.

Debug Output

Adding Logs TF_debug.log TF_stdout.txt

Panic Output

Expected Behavior

All the zones should get created/destroyed during the TF operation.

Actual Behavior

Provider produced inconsistent result after apply.

Steps to Reproduce

  1. terraform apply

Important Factoids

References

shemau commented 8 months ago

The same error is reported

│ Error: Provider produced inconsistent result after apply
│ 
│ When applying changes to module.cbr_zone_schematics.ibm_cbr_zone.cbr_zone,
│ provider "provider[\"registry.terraform.io/ibm-cloud/ibm\"]" produced an
│ unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's
│ own issue tracker.

This is an intermittent problem, a re-run in the same environment was successful. So it presents as a timing issue on the backend or a network issue to the backend that the provider might resolve with retry logic.

Ref: terraform-ibm-cos 518

zhenwan commented 8 months ago

From TF_stdout.txt, I see the error was caused by creating a new zone with a zone name which already exists:

{
   "errors":[
      {
         "code":"zone_already_exists_in_store",
         "message":"A network zone with the same name already exists. Assign each network zone a unique name.",
         "target":{
            "name":"name",
            "type":"field",
            "value":"event-notification-complete-1jbr01-VPC-network-zone"
         }
      }
   ],
   "status_code":409,
   "trace":"edbf12a0-5a4d-4353-9a35-bfc6fcf707f9"
}
shemau commented 8 months ago

@zhenwan my take is that causality is the other way around.

At 06:12:08 during an apply the error reported here is observed. At 06:12:13 a retry occurs (a new/second terraform apply is run). At 06:12:31 the terraform delta shows CBR zone and rule are missing (relative to the statefile/partial deploy) followed by a duplicate zone error.

The causality is not that the absent error reported in this issue was caused by the zone existing; The zone existing error was caused by the absent error.

IMHO we see the duplicate because Root resource was present, but now absent had in fact created the resource/zone but failed to store it in the state file because it was absent. The subsequent deploy then tried to create it again, but it already existed.

ocofaigh commented 8 months ago

We have seen this in the past with other services. The terraform provider executes a POST api to create something, and then it does a GET to check if it is there. If the GET is done against a database which may not yet have the entry for the new resources (e.g. slow replication), the GET can return nothing, and terraform can retry the POST to create. That might be what is happening here? Perhaps a retry, or sleep needs to be added to the GET before retrying the POST?

Ak-sky commented 7 months ago

Again just got hit with the same issue- Attached TF Trace Log File: cbr-fscloud_Trace_TFA_05.02.2024-12.47.50.log. TF Apply Std Output: cbr-fscloud_TFA_05.02.2024-12.47.50.log

ibm_is_subnet.testacc_subnet: Creating... ibm_is_subnet.testacc_subnet: Still creating... [10s elapsed] ibm_is_subnet.testacc_subnet: Creation complete after 13s [id=0717-e319bdc9-e139-493a-bb80-b687fbaded17] ╷ │ Error: Provider produced inconsistent result after apply │ │ When applying changes to │ module.cbr_account_level.module.cbr_zone["is"].ibm_cbr_zone.cbr_zone, │ provider "provider[\"registry.terraform.io/ibm-cloud/ibm\"]" produced an │ unexpected new value: Root resource was present, but now absent. │ │ This is a bug in the provider, which should be reported in the provider's │ own issue tracker. ╵ ╷ │ Error: Provider produced inconsistent result after apply │ │ When applying changes to │ module.cbr_account_level.module.cbr_zone["databases-for-enterprisedb"].ibm_cbr_zone.cbr_zone, │ provider "provider[\"registry.terraform.io/ibm-cloud/ibm\"]" produced an │ unexpected new value: Root resource was present, but now absent. │ │ This is a bug in the provider, which should be reported in the provider's │ own issue tracker. ╵

                                                                                                                                                                             [1m-3.0s]
shemau commented 7 months ago

2024-02-05T12:47:59.312+0530 [INFO] provider.terraform-provider-ibm_v1.56.1: 2024/02/05 12:47:59 [Debug] Request:

appears to show a retry loop following a rate limit (429) with 7 retries in 0.05 of a second, ignoring the retry-in field in the response? Showing X-Ratelimit-Remaining: 0 just before the error. I am not sure retries could be that quick, so maybe it is not.

At 12:48:00 shows the zone created (zone id: 8cbe290919a4bd58ea821e7f420f03bd) 2024-02-05T12:48:00.837+0530 [INFO] provider.terraform-provider-ibm_v1.56.1: 2024/02/05 12:48:00 [Debug] Response:

At 12:48:01 shows the zone does not exist 2024-02-05T12:48:01.439+0530 [INFO] provider.terraform-provider-ibm_v1.56.1: 2024/02/05 12:48:01 [Debug] Response:

ocofaigh commented 7 months ago

@Ak-sky please escalate internally if getting no attention here

zhenwan commented 7 months ago

@Ak-sky I tried TF Event Notification module example several times, I could not re-produce the issue that you saw.

I have noticed that you are using Terraform IBM Provider Version v1.60.1, the provider version I am using is the latest one, v1.62.0, which contains several fixes to the CBR provider.

Can you upgrade your version to the latest one, I just want to know if you still see the problem? Please upload the log files to this issue if the problem still exists.

shemau commented 7 months ago
TestRunFSCloudExample 2024-02-06T20:10:26Z logger.go:66: - Installing ibm-cloud/ibm v1.62.0...
...
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: ╷
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ Error: Provider produced inconsistent result after apply
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ 
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ When applying changes to
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ module.cos_fscloud.module.buckets.module.buckets["cos-fscloud-wcvtor-bucket"].module.bucket_cbr_rule[0].ibm_cbr_rule.cbr_rule,
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ provider "provider[\"registry.terraform.io/ibm-cloud/ibm\"]" produced an
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ unexpected new value: Root resource was present, but now absent.
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ 
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ This is a bug in the provider, which should be reported in the provider's
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: │ own issue tracker.
TestRunFSCloudExample 2024-02-06T20:11:34Z logger.go:66: ╵

This is encountered multiple times a day at different provider versions from much older providers to the latest.

@zhenwan there is IMHO a long standing race condition in the provider, as @ocofaigh notes without appropriate retry logic in the IBM provider. A resource is created and the state file is not updated which causes a cascading issue within a terraform environment.

I don't have log files this run, it was just a regular run I was performing.

Ak-sky commented 6 months ago

Ran into this issue again- TF Apply std output- cbr-fscloud_TFA_07.03.2024-23.01.57.log TF Trace Log- cbr-fscloud_Trace_TFA_07.03.2024-23.01.57.log

zhenwan commented 6 months ago

@Ak-sky Thank you for the logs. After checking logDNA, it looks the issue was caused by GET request sending to a different region immediately after the POST request.

This is a known issue caused by Eventual consistency patten. We will see how we can mitigate this issue.

zhenwan commented 4 months ago

@Ak-sky I want to let you know if you want to test the fix to this issue, you need to use v1.65.0-beta0 version. The version v1.64.2 does NOT include the fix

tyao117 commented 4 months ago

@Ak-sky, please use the https://github.com/IBM-Cloud/terraform-provider-ibm/releases/tag/v1.65.0-beta0 to test if the fix is correct because all beta releases can still be used to target the production environment. The CBR team needs to determine is the fix is working; that's why it is labeled beta release.

Ak-sky commented 2 months ago

Ran into this issue again, running with TF version - Terraform v1.5.7, IBM provider version- v1.66.0. Looks like this issue still persists in v1.66.0 also.

image

TF Trace Log - cbr-fscloud_Trace_TFA_04.07.2024-20.30.10.log TF Apply std. output - cbr-fscloud_TFA_04.07.2024-20.30.10.log