confluentinc / terraform-provider-confluentcloud

Confluent Cloud Terraform Provider is deprecated in favor of Confluent Terraform Provider
https://registry.terraform.io/providers/confluentinc/confluentcloud/latest/docs

v0.4.0 - Error: Provider produced inconsistent result after apply #40

Closed: bogdaniordache closed this issue 2 years ago

bogdaniordache commented 2 years ago

We are trying to upgrade the provider from version 0.2.0 to version 0.4.0, because we are getting Error: 429 Too Many Requests.

On apply we get the Plan: 15 to add, 0 to change, 0 to destroy.

When confirming the apply, the configuration starts to drift: all the resources are created in Confluent Cloud, but some of them are not reflected in the state file.

During the apply we get the following errors:

│ Error: Provider produced inconsistent result after apply
│
│ When applying changes to module.kafka_topics.confluentcloud_kafka_topic.kafka_topics["ingest_metrics"], provider
│ "module.sentry_kafka_topics.provider[\"registry.terraform.io/confluentinc/confluentcloud\"]" produced an unexpected new value: Root resource was present, but now absent.
│ 
│ This is a bug in the provider, which should be reported in the provider's own issue tracker.

and

│ Error: 404 Not Found: 
│ 
│   with module.kafka_topics.confluentcloud_kafka_topic.kafka_topics["events_subscription_results"],
│   on ../../../../confluent-kafka-topics/main.tf line 1, in resource "confluentcloud_kafka_topic" "kafka_topics":
│    1: resource "confluentcloud_kafka_topic" "kafka_topics" {

What can we do to get around this issue?

linouk23 commented 2 years ago

@bogdaniordache thanks for opening an issue!

We are trying to upgrade the provider from version 0.2.0 to version 0.4.0, because we are getting Error: 429 Too Many Requests.

That's a great idea 👍

Could you confirm you're following our upgrade guide?

If yes, we would be interested in taking a look at the redacted output from every command of that guide; you could send it to this email so we could investigate.

On apply we get the Plan: 15 to add, 0 to change, 0 to destroy.

That is definitely very weird. I'm wondering whether you got this message when running terraform plan using 0.2.0 or 0.4.0.

linouk23 commented 2 years ago

Until we figure this out, one easy fix could be to import these 15 topics.
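For anyone going the import route, the general shape is sketched below. The resource address matches the one from the error above, but the cluster ID is a placeholder; the exact import ID format and required credentials should be checked against the provider docs.

```shell
# Bring an already-created topic under Terraform management.
# "lkc-abc123" is a placeholder cluster ID; the <cluster-id>/<topic-name>
# ID format matches the state IDs shown elsewhere in this thread.
terraform import \
  'module.kafka_topics.confluentcloud_kafka_topic.kafka_topics["ingest_metrics"]' \
  lkc-abc123/ingest_metrics
```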

bogdaniordache commented 2 years ago

Could you confirm you're following our upgrade guide?

Yes, I am following the guide. This is related to topics only, with a fresh state at every switch of provider version. The clusters remain the same, so the issue is isolated to topic creation.

On apply we get the Plan: 15 to add, 0 to change, 0 to destroy.

That is definitely very weird. I'm wondering whether you got this message when running terraform plan using 0.2.0 or 0.4.0.

The plans were executed on 0.4.0; running plan, apply, and destroy (with -parallelism=2) works fine on 0.2.0.

Until we figure this out, one easy fix could be to import these 15 topics.

It can be a solution, but not a reliable one, especially in larger deployments.

afoley-st commented 2 years ago

I see the same issue when creating topics with the Confluent provider. It appears about 65% of the time with clean runs of terraform apply. We both create topics in a for_each block; I'm not sure if that has any impact.

afoley-st commented 2 years ago

To follow up on the above: if I add a long sleep (we create an API key with automation), then it seems to work. I think it stems from the fact that keys are not immediately active. Here's my Terraform:

# Unfortunately you need to wait a long time until the key is active
resource "time_sleep" "wait_600_seconds" {
  # Wait on the automated API key created for the new cluster
  depends_on      = [module.confluent_cluster.cluster_service_account_api_key]
  create_duration = "600s"
}

resource "confluentcloud_kafka_topic" "topics" {
  depends_on       = [time_sleep.wait_600_seconds]
  for_each         = local.topics
  kafka_cluster    = module.confluent_cluster.cluster_id
  topic_name       = each.key
  partitions_count = each.value.partitions
  http_endpoint    = module.confluent_cluster.cluster_http_endpoint
  credentials {
    key    = module.confluent_cluster.cluster_service_account_api_key
    secret = module.confluent_cluster.cluster_service_account_api_secret
  }
}

I hope this helps someone!

Edit: it also helped me to scale down the parallelism in Terraform.
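Parallelism needs no configuration change; it can be capped per run on the command line:

```shell
# Limit Terraform to 2 concurrent resource operations (default is 10)
# to reduce pressure on the Confluent Cloud API.
terraform apply -parallelism=2
```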

charlottemach commented 2 years ago

I'm having the same issue (topic creation succeeds, but the Terraform state doesn't show all of the topics). Below are some sanitized trace logs for one topic.

2022-02-23T12:02:02.642+0100 [INFO]  provider.terraform-provider-confluentcloud_0.4.0: 2022/02/23 12:02:02 [DEBUG] Created Kafka topic <CLUSTERID>/<TOPICNAME>: timestamp=2022-02-23T12:02:02.642+0100
2022-02-23T12:02:02.642+0100 [INFO]  provider.terraform-provider-confluentcloud_0.4.0: 2022/02/23 12:02:02 [INFO] Kafka topic read for <CLUSTERID>/<TOPICNAME>: timestamp=2022-02-23T12:02:02.642+0100
2022-02-23T12:02:02.642+0100 [DEBUG] provider.terraform-provider-confluentcloud_0.4.0: 2022/02/23 12:02:02 [DEBUG] GET https://<ENDPOINT>.aws.confluent.cloud:443/kafka/v3/clusters/<CLUSTERID>/topics/<TOPICNAME>
2022-02-23T12:02:02.881+0100 [INFO]  provider.terraform-provider-confluentcloud_0.4.0: 2022/02/23 12:02:02 [WARN] Kafka topic get failed for id <TOPICNAME>, &{404 Not Found 404 HTTP/2.0 2 0 map[Content-Type:[application/json] Date:[Wed, 23 Feb 2022 11:02:02 GMT]] {0x14000b63760} -1 [] false false map[] 0x14000174a00 0x140000ad3f0}, 404 Not Found: timestamp=2022-02-23T12:02:02.881+0100
2022-02-23T12:02:02.881+0100 [INFO]  provider.terraform-provider-confluentcloud_0.4.0: 2022/02/23 12:02:02 [WARN] Kafka topic with id=<CLUSTERID>/<TOPICNAME> is not found: timestamp=2022-02-23T12:02:02.881+0100
2022-02-23T12:02:02.882+0100 [TRACE] maybeTainted: module.applications_kafka_topics.confluentcloud_kafka_topic.kafka_topics["<TOPICNAME>"] encountered an error during creation, so it is now marked as tainted

I think the issue is that this part is no longer checking whether there is a response (maybe the topic is still being created at that point?) and just assumes the creation failed even though it actually succeeded.

linouk23 commented 2 years ago

@charlottemach @afoley-st @bogdaniordache thanks for doing the investigation, that's very insightful!

Looks like we need to add a timeout after POST (create a topic) and before GET (read a topic) requests.
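Until such a fix lands, the read-after-create pattern can be approximated on the client side. A minimal generic sketch in shell (the function name and timings are illustrative, not part of the provider):

```shell
#!/usr/bin/env bash
# retry_until ATTEMPTS DELAY CMD...: run CMD until it succeeds,
# sleeping DELAY seconds between attempts; fail after ATTEMPTS tries.
retry_until() {
  local attempts=$1 delay=$2
  shift 2
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    if ((i < attempts)); then
      sleep "$delay"
    fi
  done
  return 1
}
```

For example, after creating a topic one could poll the same REST endpoint the provider reads in charlottemach's trace: `retry_until 10 3 curl -sf -u "$API_KEY:$API_SECRET" "https://$ENDPOINT/kafka/v3/clusters/$CLUSTER_ID/topics/$TOPIC_NAME" -o /dev/null`, where the variables are placeholders.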

linouk23 commented 2 years ago

to follow up on the above - If i add a long sleep (we create an API key with automation), then it seems to work. I think it stems from that fact that keys are not immediately active. Heres my terraform:

Thanks for sharing your experience! FWIW, if the Kafka API key is not active you should be getting a 401, so it might be a slightly different issue.

linouk23 commented 2 years ago

Update: we're hoping to release 0.5.0 on Friday or early next week, and it will include a fix for this 🤞:

a 30-second sleep after POST (create a topic) and before GET (read a topic) requests.

That said, we found it difficult to reproduce the issue: we were able to create 200 topics for a Standard Kafka cluster on AWS (TF logs), so we were wondering if it might be connected to other factors such as cluster type / cloud provider. Could you confirm what type of cluster / cloud provider you are using?

@charlottemach @afoley-st @bogdaniordache

bogdaniordache commented 2 years ago

The issue was found while trying to set up topics on an AWS Basic Kafka cluster.

linouk23 commented 2 years ago

Update: I managed to reproduce the issue for a Basic cluster on 0.4.0 😮 (and the Confluent Cloud Console does show all 100 topics). I will now try to create 100 topics for a Basic cluster using an unreleased 0.5.0 version of the TF Provider for Confluent Cloud 🤞.

Another update, great news: the unreleased 0.5.0 version managed to create 200 topics for the same Basic cluster without printing 404 or any other errors:

➜  test2 git:(master) ✗ terraform apply
...
Apply complete! Resources: 198 added, 0 changed, 0 destroyed.
➜  test2 git:(master) ✗ terraform plan                
...
confluentcloud_kafka_topic.orders_177: Refreshing state... [id=lkc-q21x0p/orders_177]
confluentcloud_kafka_topic.orders_37: Refreshing state... [id=lkc-q21x0p/orders_37]
...
confluentcloud_kafka_topic.orders_113: Refreshing state... [id=lkc-q21x0p/orders_113]
confluentcloud_kafka_topic.orders_168: Refreshing state... [id=lkc-q21x0p/orders_168]

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

AAhmed84 commented 2 years ago

We are facing this issue with a Standard cluster, where we get either a 404 error or the Terraform "root resource was present, but now absent" error. We started with an attempt to create 200+ topics, but this never worked. We even tried 10 topics and still hit the same issue. It only works with 3 topics on a Standard cluster using provider 0.4.0.

linouk23 commented 2 years ago

@AAhmed84 could you open a separate issue for the 400 error (or is it a typo and you meant to write 404)? I think it might be a different issue.

It would also help if you could include & share the sanitized debug logs.

AAhmed84 commented 2 years ago

@linouk23 yes, indeed it was a typo and we get 404. Here is a snippet of the two errors we get when we try to create 200+ topics on a Standard cluster:

[Screenshot 2022-03-03 at 13 39 55]

linouk23 commented 2 years ago

@AAhmed84 gotcha 👍, then updating to 0.5.0 should help; the ETA is early next week most likely.

update: the ETA for releasing 0.5.0 is next Wednesday.

linouk23 commented 2 years ago

Check out our most recent release of the TF Provider for Confluent Cloud v0.5.0 where we fixed the issue!

cc @AAhmed84 @bogdaniordache @charlottemach @afoley-st