confluentinc / terraform-provider-confluent

Terraform Provider for Confluent
Apache License 2.0
118 stars 61 forks source link

Stale resource due to terraform provider upgrade #332

Open mihdih opened 7 months ago

mihdih commented 7 months ago

Hello!

First off, for context:

We've been using terraform in adding the ACLs to our Confluent cloud cluster. This has been working well until just recently when we hit some frequent rate limiting issues, similar to the one described in issue 148. Up until then, we were using the confluentinc/confluent provider version 1.36.0. As suggested, to resolve the issue we upgraded the provider to version 1.47.0 or higher. And so we did, we upgraded to latest version which is 1.55.0

The issue:

After the upgrade, it seems that records or resources that were added using the old version are not removed properly. I could perform some more tests, but in particular, the resource below are having this problem.

# Added using confluentinc/confluent version 1.36.0

resource "confluent_kafka_acl" "acl" {
  kafka_cluster {
    id = confluent_kafka_cluster.dedicated-cluster.id
  }

  # Set to '*' as this is for Confluent Cloud.
  host = "*"

  resource_type = "TOPIC"
  resource_name = "*"
  pattern_type  = "LITERAL"
  principal     = "User:<service_account_id>"
  operation     = "ALL"
  permission    = "ALLOW"
  rest_endpoint = confluent_kafka_cluster.dedicated-cluster.rest_endpoint

  credentials {
    key    = confluent_api_key.app-manager-kafka-api-key.id
    secret = confluent_api_key.app-manager-kafka-api-key.secret
  }

  depends_on = [confluent_api_key.app-manager-kafka-api-key]
}
# resource_name is changed in confluentinc/confluent version 1.55.0 to 'foobar' from '*'

resource "confluent_kafka_acl" "acl" {
  kafka_cluster {
    id = confluent_kafka_cluster.dedicated-cluster.id
  }

  # Set to '*' as this is for Confluent Cloud.
  host = "*"

  resource_type = "TOPIC"
  resource_name = "foobar"
  pattern_type  = "LITERAL"
  principal     = "User:<service_account_id>"
  operation     = "ALL"
  permission    = "ALLOW"
  rest_endpoint = confluent_kafka_cluster.dedicated-cluster.rest_endpoint

  credentials {
    key    = confluent_api_key.app-manager-kafka-api-key.id
    secret = confluent_api_key.app-manager-kafka-api-key.secret
  }

  depends_on = [confluent_api_key.app-manager-kafka-api-key]
}

Result

The ACL with resource_name foobar has been added but the old one (resource_name *) is still present, despite the terraform stating that it had been destroyed.

Am I missing something obvious here? Should I not upgrade straight to latest version right away after 1.36.0?

linouk23 commented 7 months ago

Thanks for creating this issue @mihdih!

Should I not upgrade straight to latest version right away after 1.36.0?

That definitely should be fine.

We didn't change much on client but there was a non-breaking API update.

Just to confirm my understanding, are you saying that after updating the value for resource_name, Terraform attempted to recreate the resource (by deleting the old one and creating a new one), and while the creation was successful, the deletion was a no-op (Terraform did not show any errors, but the ACL was not removed from the backend)?

mihdih commented 7 months ago

Thank you for a swift response @linouk23 .

Just to confirm my understanding, are you saying that after updating the value for resource_name, Terraform attempted to recreate the resource (by deleting the old one and creating a new one), and while the creation was successful, the deletion was a no-op (Terraform did not show any errors, but the ACL was not removed from the backend)?

That is correct. The delete + create is expected. There was no error upon the deletion, in fact terraform reported back that it was destroyed. I can confirm there after that the ACL (old) still exists. I have checked in both Confluent Cloud UI and via CLI.

mihdih commented 7 months ago

Hi @linouk23

I did a little bit of investigation and looks like the problem is when passing the User or principal.

In the old version the account id (integer) is used, while in the new one it's the Service Account resource Id. Please check the sample curl queries below.

Working (replicating the call used in version 1.36.0)

curl -u {api_key} --request DELETE --url https://{confluent_kafka_cluster.dedicated-cluster.rest_endpoint}/kafka/v3/clusters/{confluent_kafka_cluster.dedicated-cluster.id}/acls?host=%2A&operation=ALL&pattern_type=LITERAL&permission=ALLOW&principal=User%3A{service_account_id}resource_name=%2A&resource_type=TOPIC

# Where service_account_id is something like 1978961

NOT Working (replicating the call used in version 1.55.0)

curl -u {api_key} --request DELETE --url https://{confluent_kafka_cluster.dedicated-cluster.rest_endpoint}/kafka/v3/clusters/{confluent_kafka_cluster.dedicated-cluster.id}/acls?host=%2A&operation=ALL&pattern_type=LITERAL&permission=ALLOW&principal=User%3A{service_account_resource_id}resource_name=%2A&resource_type=TOPIC

# Where service_account_resource_id is something like sa-xxxxx

I appreciate that the change is deliberate basing on this commit change, which makes me wonder if the bug is in the REST Api?

Another theory is perhaps the new api call won't work on old provisioned clusters?

linouk23 commented 7 months ago

Thank you for performing this investigation!

That was my thinking as well and I’ll reach out to the backend team :saluting_face: