GoogleCloudPlatform / terraform-provider-cdap

Custom Terraform Provider for CDAP
https://registry.terraform.io/providers/GoogleCloudPlatform/cdap/
Apache License 2.0
12 stars 12 forks source link

CDAP endpoint no longer available if Data Fusion instance updated. #102

Open FVCCarneiro opened 3 years ago

FVCCarneiro commented 3 years ago

So consider the scenario where I have a Data Fusion in version 6.4.1 and I wish to re-deploy it as 6.5.0 version via Terraform. In Terraform, this implies simply changing the attribute version attribute. After making this change, when performing a terraform plan the following error is obtained:

| ERROR Get "/v3/namespaces" : unsupported protocol scheme""

To solve this, I am forced to remove all cdap ressources which produce the error using the terraform state rm command. Only after this can Terraform update the instance and re-create all the CDAP resources that have to be re-created due to the fact that the Data Fusion instance needs to be destroyed and recreated.

I believe that this error comes from the fact that there is a dependency between a Data Fusion instance and all its CDAP resources that is not considered by the CDAP provider. One should never have to resort to editing the tfstate file to solve these types of issues. Is there a way to fix this?

umairidris commented 3 years ago

It seems the namespace was deleted, which should trigger all the resources within the namespace to be deleted.

There are existence checks on all the resources e.g. https://github.com/GoogleCloudPlatform/terraform-provider-cdap/blob/d4ba182013481be3a21a35c52e896eda90b45407/cdap/resource_application.go#L103

but it might not be checking if the namespace still exists. To do that we need to call https://github.com/GoogleCloudPlatform/terraform-provider-cdap/blob/d4ba182013481be3a21a35c52e896eda90b45407/cdap/resource_namespace.go#L84 before checking if the resource itself still exists.

I don't work on this project actively anymore and don't have the bandwidth to support this, but I am happy to review a PR if someone wants to add this.

Thanks!

Guimove commented 3 years ago

Hello @umairidris

In fact, this issue in not resulting of a namespace deletion, this happens to all CDAP resources when we make a change forcing the replacement of the data fusion instance.

# CDAP provider needed for each Data Fusion instance.
provider "cdap" {
  host  = "${module.datafusion_instance.wait_healthy_service_endpoint}/api/"
  token = data.google_client_config.current.access_token
}

Because the provider is configured by the output on the data fusion instance resources (in this case a script that check that the instance is really created and ready), when terraform need to recreate the resource, it reinitializes the value of this parameter module.datafusion_instance.wait_healthy_service_endpoint so the provider is configured with a null value.

Do you know if there is a way to configure terraform in order to say : If you have to recreate the instance just ignore all CDAP resources because they will be recreated.

umairidris commented 3 years ago

Thanks for the additional context.

If we remove the wait healthy module, does it work correctly? i.e. a CDF recreation correctly recreates all cdap resources? If so, then it seems we just need to work with the wait healthy module. Maybe we need to add some triggers in the null resource to correctly recreate when cdf instance is recreated?

Another best practice could be to separate cdf instance and cdap resources into separate states and apply them separately. If that doesn't work, it may require fixes in the provider existence checks that I mentioned above.

jrstarke commented 2 years ago

@umairidris do you know if anyone is actively developing on this provider anymore?

umairidris commented 2 years ago

Hi @jrstarke, no this provider is not currently being developed. If you would like to send bug fixes or feature improvements then I am happy to review and merge them on a voluntary basis.

If you would like this provider to be better maintained I would suggest sending feedback to the data fusion team: https://cloud.google.com/data-fusion/docs/support/getting-support.

FVCCarneiro commented 2 years ago

Hello,

@umairidris , to get back to this issue that still persists for me, the reason why we added the ${module.datafusion_instance.wait_healthy_service_endpoint}/api/, is because if we follow the documentation, we get an error in Terraform because the Data Fusion service endpoint take time to be up and therefore the Terraform apply crashes.

Perhaps this issue we encountered is no longer relevant ? I will check as you requested if it works when we remove the wait_healthy null_ressource.

Assuming our issue comes from our implementation of the wait_healthy null resource, how would you improve it to avoid this issue ? Could you explain in more detail your trigger idea ?

umairidris commented 1 year ago

Hi @FVCCarneiro,

My apologies, this completely slipped past me!

If you (or someone) is still dealing with this, the first step would be to see if it is an issue with the wait_healthy module. After the initial creation of the null_resource in the wait_healthy module, the CDF instance URL is saved into the TF state and it does not call the script again to check if the instance is still OK.

If they are in the same deployment, a recreation of the instance should cause any dependent resources to be recreated. But if they are not in the same deployment we would need to add a trigger to recheck the state of the instance in wait_healthy module. The null resource https://registry.terraform.io/providers/hashicorp/null/latest/docs/resources/resource has some fields we could set for triggers.

Longer term we should not rely on the wait_healthy module. Waiting for health should be done by the CDF instance itself (e.g. we should be able to configure it as an option on the instance). Please file a ticket with the CDF team to add support for this.

In the provider, we should ensure we have reliable existence checks that can detect if the cdap resource no longer exists and thus will cause a recreation.

Hope this helps!