confluentinc / terraform-provider-confluent

Terraform Provider for Confluent

Error loading the latest Schema when re-deploying Schema Registry #296

Open xxsenix opened 1 year ago

xxsenix commented 1 year ago

We have two Terraform modules in our codebase. The first deploys a Confluent Kafka cluster as well as a Schema Registry; the second deploys topics and schemas. We have it split up this way because we've decided on a microservices approach for our architecture and would like to deploy schemas and topics independently.

Due to this approach, we're storing various secrets in a key vault, such as the cluster/Schema Registry IDs, REST endpoints, API keys, and API secrets. Then in the topics/schemas module we pull those secrets down from the key vault and include them when creating the resources.
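
For reference, each of those lookups is an azurerm_key_vault_secret data source along these lines (the secret name and vault reference are illustrative, not the exact values):

data "azurerm_key_vault_secret" "schema_registry_id" {
  # Illustrative names; the vault itself is looked up separately
  name         = "schema-registry-id"
  key_vault_id = data.azurerm_key_vault.main.id
}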

The issue we're running into is when we tear down our cluster and schema registry, and then re-deploy it. We then attempt to re-deploy schemas and we're running into this error when Terraform is refreshing state during the plan phase:

Error: error reading Schema: error reading Schema "old-schema-registry-id/subject-name/latest": error loading the latest Schema: error loading the latest Schema: 404 Not Found: Subject 'subject-name' not found.

It seems like the issue is that Terraform is attempting to refresh the state using the old schema registry ID instead of the new value that we're pulling down from the key vault. We initially ran into this with topics, but adding this to the provider block solved it:

provider "confluent" {
  kafka_id            = data.azurerm_key_vault_secret.cluster_id.value
  kafka_rest_endpoint = data.azurerm_key_vault_secret.cluster_rest_endpoint.value
  kafka_api_key       = data.azurerm_key_vault_secret.cluster_key.value
  kafka_api_secret    = data.azurerm_key_vault_secret.cluster_secret.value
}

We've attempted doing the same thing for the schema registry secrets:

provider "confluent" {
  kafka_id            = data.azurerm_key_vault_secret.cluster_id.value
  kafka_rest_endpoint = data.azurerm_key_vault_secret.cluster_rest_endpoint.value
  kafka_api_key       = data.azurerm_key_vault_secret.cluster_key.value
  kafka_api_secret    = data.azurerm_key_vault_secret.cluster_secret.value

  schema_registry_id            = data.azurerm_key_vault_secret.schema_registry_id.value
  schema_registry_rest_endpoint = data.azurerm_key_vault_secret.schema_registry_rest_endpoint.value
  schema_registry_api_key       = data.azurerm_key_vault_secret.schema_registry_api_key.value
  schema_registry_api_secret    = data.azurerm_key_vault_secret.schema_registry_api_secret.value
}

However, we're still facing the 404 "subject not found" error. Anyone have any ideas on what we can try next?

linouk23 commented 1 year ago

That's a great question!

Could you share more details about this part:

It seems like the issue is that Terraform is attempting to refresh the state using the old schema registry ID instead of the new value that we're pulling down from the key vault. We initially ran into this with topics, but adding this to the provider block solved it:

But I think in general the list of steps for redeploying SR should be the following:

  1. terraform destroy
  2. Update IDs in Key Vault.
  3. terraform apply
  4. terraform plan shouldn't show any drift.

@xxsenix could you confirm that you executed the list of steps in the order described above?

xxsenix commented 1 year ago

@linouk23 thanks for the reply! Because we have our scripts separated into different modules, we actually have two separate state files: one for the cluster/Schema Registry (called main) and one for the topics/schemas (called kafka); a rough sketch of that layout follows the steps below. Here are the steps that we are taking, starting from scratch:

  1. terraform apply main
  2. terraform apply kafka
  3. terraform destroy main
  4. terraform apply main (IDs updated in KV)
  5. terraform plan/apply kafka (this is where the error occurs, as it's referencing the old schema registry ID rather than the new one we pull in from the KV)

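For reference, a minimal sketch of that layout (the backend settings are illustrative; the remaining backend configuration is supplied elsewhere, e.g. via -backend-config):

# main/ -- root module for the cluster + Schema Registry, with its own state
terraform {
  backend "azurerm" {
    key = "main.tfstate"  # illustrative state key
  }
}

# kafka/ -- root module for the topics + schemas, with its own state
terraform {
  backend "azurerm" {
    key = "kafka.tfstate"  # illustrative state key
  }
}
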
linouk23 commented 1 year ago

Gotcha, do we know why the KV data source still references the old SR ID? I'd expect KeyVault data sources to be implemented in a way where they fetch the new value on every terraform plan.

I'm just brainstorming here: do we need to call terraform destroy kafka too?

linouk23 commented 1 year ago

cc @xxsenix

xxsenix commented 1 year ago

@linouk23 The KV data source references the new SR ID; the old SR ID is only in the state file, which is what causes the error during the refresh.

If we did a terraform destroy kafka before we destroyed main, that would certainly work as it would clear out the old SR ID. But I was just hoping there was a way around this.

linouk23 commented 1 year ago

@xxsenix given that you destroy the Kafka & SR clusters via terraform destroy main, it seems like running terraform destroy kafka to destroy their "child" resources like schemas & topics is a smart idea.
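
Concretely, the full cycle would then look like:

  1. terraform destroy kafka
  2. terraform destroy main
  3. terraform apply main (new IDs written to Key Vault)
  4. terraform apply kafka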

xxsenix commented 1 year ago

@linouk23 that approach would work, but we would like to decouple the two scripts as much as possible. Any idea why the provider configuration works for topics but not for schemas?

FYI, we also tried setting refresh to false (terraform plan -refresh=false); we got past the plan stage, but it then failed with a 404 on apply.

Noel-Jones commented 1 year ago

If you delete a schema directly in CC (just one schema, not the whole registry), the same thing happens: the Terraform refresh fails with a 404 error and you are forced to remove the resource from the state file so that Terraform can recreate it. In every other Terraform provider I have used, if a resource is not found during refresh, Terraform plans to recreate it. That is what Terraform is for: making sure an expected resource exists. The provider needs to handle the 404, taking it to mean that the resource does not exist, and then move on to create it. I believe this would take care of the case here too.
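
In the meantime, the stale entry can be removed by hand so that the next plan recreates the schema (the resource address here is illustrative):

terraform state rm confluent_schema.example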