GoogleCloudPlatform / terraform-splunk-log-export

Deploy Google Cloud log export to Splunk using Terraform
https://cloud.google.com/architecture/deploying-production-ready-log-exports-to-splunk-using-dataflow
Apache License 2.0

terraform destroy never finishes #40

Open mhite opened 1 year ago

mhite commented 1 year ago

Is there something about the Splunk Dataflow pipeline design that causes it to never be able to successfully drain?

I've gone through the full build + teardown (destroy) process at least a dozen times and have never seen it destroy successfully without intervention by manually canceling the dataflow job in the console.

google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h0m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h0m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m13s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m23s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h1m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m13s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m23s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h2m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h3m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-04_10_28_54-13121228337712679470, 6h3m13s elapsed]

ilakhtenkov commented 1 year ago

I also ran into this some time ago.

The solution here is to add the on_delete option. By default it is set to drain, which means Dataflow tries to shut the job down gracefully. We can set it to cancel instead, which could cause some loss of logs during teardown, but it will fix this issue.
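For illustration, a minimal sketch of that setting on the job resource. The job name and bucket paths are hypothetical placeholders, not values from this repo; only `on_delete` is the point here.

```hcl
# Sketch only: name and GCS paths below are hypothetical placeholders.
resource "google_dataflow_job" "dataflow_job" {
  name              = "example-splunk-export"                          # hypothetical
  template_gcs_path = "gs://example-bucket/templates/example-template" # hypothetical
  temp_gcs_location = "gs://example-bucket/tmp"                        # hypothetical

  # "drain" (the provider default) waits for in-flight data to be processed
  # before the job stops; "cancel" stops the job immediately and may drop
  # logs that are still in flight.
  on_delete = "cancel"
}
```

With `cancel`, `terraform destroy` should no longer wait on a drain that never completes, at the cost of possibly losing in-flight messages.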

I would definitely vote for it. What do you think @rarsan?

rarsan commented 1 year ago

I was actually looking at this today but wasn't able to reproduce. I have encountered this before, but very sporadically.

Forcing cancel instead of drain is a reasonable option, given a proper warning about the potential for data loss. However, I suspect a clean teardown can be ensured by enforcing a particular order of resource deletion: e.g. delete the log sink first, then the Dataflow job, so that the sink stops publishing and the Dataflow job gets the chance to drain. Or perhaps the hang is due to another dependency being deleted prematurely, like the GCS bucket?
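One way such an ordering could be expressed, as a sketch rather than a confirmed fix: Terraform destroys resources in reverse dependency order, so making the sink depend on the job means the sink is fully destroyed before the job's destroy begins. The sink name is a hypothetical placeholder, and the topic and job resources are assumed to be defined elsewhere in the configuration.

```hcl
# Sketch only: assumes google_pubsub_topic.dataflow_input_pubsub_topic and
# google_dataflow_job.dataflow_job are defined elsewhere in the module.
resource "google_logging_project_sink" "project_log_sink" {
  name        = "example-project-log-sink" # hypothetical
  destination = "pubsub.googleapis.com/${google_pubsub_topic.dataflow_input_pubsub_topic.id}"

  # Explicit dependency: the sink is created after the job and, on destroy,
  # removed before the job, so no new logs arrive while the job drains.
  depends_on = [google_dataflow_job.dataflow_job]
}
```

The trade-off is that on creation the sink only starts exporting once the job exists, which is usually what you want anyway.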

@mhite can you share the order of resources being deleted in the case where it does hang? specifically log sink, pubsub topic, pubsub subscription, gcs bucket, and dataflow job.

mhite commented 1 year ago

@rarsan -

Does this help?

Do you really want to destroy all resources?
  Terraform will destroy all your managed infrastructure, as shown above.
  There is no undo. Only 'yes' will be accepted to confirm.

  Enter a value: yes

google_pubsub_topic_iam_binding.input_sub_publisher: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-input-topic/roles/pubsub.publisher]
google_secret_manager_secret_iam_member.dataflow_worker_secret_access[0]: Destroying... [id=projects/<REDACTED>/secrets/demo-hec-token/roles/secretmanager.secretAccessor/serviceAccount:export-pipeline-worker@<REDACTED>.iam.gserviceaccount.com]
google_pubsub_subscription_iam_binding.input_sub_subscriber: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-input-subscription/roles/pubsub.subscriber]
google_pubsub_topic_iam_binding.deadletter_topic_publisher: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-deadletter-topic/roles/pubsub.publisher]
google_project_iam_binding.dataflow_worker_role[0]: Destroying... [id=<REDACTED>/roles/dataflow.worker]
google_pubsub_subscription.dataflow_deadletter_pubsub_sub: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-deadletter-subscription]
google_dns_policy.splunk_network_dns_policy[0]: Destroying... [id=projects/<REDACTED>/policies/dataflow-net-dns-policy]
google_pubsub_subscription_iam_binding.input_sub_viewer: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-input-subscription/roles/pubsub.viewer]
google_storage_bucket_iam_binding.dataflow_worker_bucket_access: Destroying... [id=b/<REDACTED>-export-pipeline-6563c6ff/roles/storage.objectAdmin]
google_dataflow_job.dataflow_job: Destroying... [id=2023-03-22_15_44_26-13366291121410881687]
google_dns_policy.splunk_network_dns_policy[0]: Destruction complete after 0s
google_monitoring_group.splunk-export-pipeline-group: Destroying... [id=projects/<REDACTED>/groups/358908245497544187]
google_monitoring_group.splunk-export-pipeline-group: Destruction complete after 1s
google_monitoring_dashboard.splunk-export-pipeline-dashboard: Destroying... [id=projects/<REDACTED>/dashboards/d532a668-79ee-4028-8f7b-374f6017ff91]
google_monitoring_dashboard.splunk-export-pipeline-dashboard: Destruction complete after 0s
google_service_account_iam_binding.terraform_caller_impersonate_dataflow_worker[0]: Destroying... [id=projects/<REDACTED>/serviceAccounts/export-pipeline-worker@<REDACTED>.iam.gserviceaccount.com/roles/iam.serviceAccountUser]
google_pubsub_subscription.dataflow_deadletter_pubsub_sub: Destruction complete after 1s
google_compute_firewall.connect_dataflow_workers[0]: Destroying... [id=projects/<REDACTED>/global/firewalls/dataflow-internal-ip-fwr]
google_secret_manager_secret_iam_member.dataflow_worker_secret_access[0]: Destruction complete after 4s
google_pubsub_topic_iam_binding.deadletter_topic_publisher: Destruction complete after 4s
google_pubsub_topic_iam_binding.input_sub_publisher: Destruction complete after 4s
google_compute_router_nat.dataflow_nat[0]: Destroying... [id=<REDACTED>/us-central1/dataflow-net-us-central1-router/dataflow-net-us-central1-router-nat]
google_logging_project_sink.project_log_sink: Destroying... [id=projects/<REDACTED>/sinks/export-pipeline-project-log-sink]
google_storage_bucket_iam_binding.dataflow_worker_bucket_access: Destruction complete after 5s
google_pubsub_subscription_iam_binding.input_sub_viewer: Destruction complete after 5s
google_service_account_iam_binding.terraform_caller_impersonate_dataflow_worker[0]: Destruction complete after 4s
google_logging_project_sink.project_log_sink: Destruction complete after 1s
google_project_iam_binding.dataflow_worker_role[0]: Destruction complete after 8s
google_pubsub_subscription_iam_binding.input_sub_subscriber: Destruction complete after 9s
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 10s elapsed]
google_compute_firewall.connect_dataflow_workers[0]: Still destroying... [id=projects/<REDACTED>/global/firewalls/dataflow-internal-ip-fwr, 10s elapsed]
google_compute_firewall.connect_dataflow_workers[0]: Destruction complete after 11s
google_compute_router_nat.dataflow_nat[0]: Still destroying... [id=<REDACTED>/us-central1/dataflow...er/dataflow-net-us-central1-router-nat, 10s elapsed]
google_compute_router_nat.dataflow_nat[0]: Destruction complete after 12s
google_compute_router.dataflow_to_splunk_router[0]: Destroying... [id=projects/<REDACTED>/regions/us-central1/routers/dataflow-net-us-central1-router]
google_compute_address.dataflow_nat_ip_address[0]: Destroying... [id=projects/<REDACTED>/regions/us-central1/addresses/dataflow-splunk-nat-ip-address]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 20s elapsed]
google_compute_address.dataflow_nat_ip_address[0]: Still destroying... [id=projects/<REDACTED>/regions/us-...dresses/dataflow-splunk-nat-ip-address, 10s elapsed]
google_compute_router.dataflow_to_splunk_router[0]: Still destroying... [id=projects/<REDACTED>/regions/us-...outers/dataflow-net-us-central1-router, 10s elapsed]
google_compute_router.dataflow_to_splunk_router[0]: Destruction complete after 10s
google_compute_address.dataflow_nat_ip_address[0]: Destruction complete after 11s
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 30s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 40s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 50s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 1m0s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 1m10s elapsed]
...continues forever...

... I go into the console and manually cancel.

google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h45m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h45m43s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h45m53s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m3s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m13s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m23s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m33s elapsed]
google_dataflow_job.dataflow_job: Still destroying... [id=2023-03-22_15_44_26-13366291121410881687, 7h46m43s elapsed]
google_dataflow_job.dataflow_job: Destruction complete after 7h46m53s
random_id.dataflow_job_instance: Destroying... [id=_YQ]
random_id.dataflow_job_instance: Destruction complete after 0s
google_storage_bucket_object.dataflow_job_temp_object: Destroying... [id=<REDACTED>-export-pipeline-6563c6ff-tmp/]
google_compute_subnetwork.splunk_subnet[0]: Destroying... [id=projects/<REDACTED>/regions/us-central1/subnetworks/dataflow-net]
google_pubsub_subscription.dataflow_input_pubsub_subscription: Destroying... [id=projects/<REDACTED>/subscriptions/export-pipeline-input-subscription]
google_service_account.dataflow_worker_service_account[0]: Destroying... [id=projects/<REDACTED>/serviceAccounts/export-pipeline-worker@<REDACTED>.iam.gserviceaccount.com]
google_pubsub_topic.dataflow_deadletter_pubsub_topic: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-deadletter-topic]
google_service_account.dataflow_worker_service_account[0]: Destruction complete after 0s
google_storage_bucket_object.dataflow_job_temp_object: Destruction complete after 0s
google_storage_bucket.dataflow_job_temp_bucket: Destroying... [id=<REDACTED>-export-pipeline-6563c6ff]
google_storage_bucket.dataflow_job_temp_bucket: Destruction complete after 1s
random_id.bucket_suffix: Destroying... [id=ZWPG_w]
random_id.bucket_suffix: Destruction complete after 0s
google_pubsub_subscription.dataflow_input_pubsub_subscription: Destruction complete after 1s
google_pubsub_topic.dataflow_input_pubsub_topic: Destroying... [id=projects/<REDACTED>/topics/export-pipeline-input-topic]
google_pubsub_topic.dataflow_deadletter_pubsub_topic: Destruction complete after 1s
google_pubsub_topic.dataflow_input_pubsub_topic: Destruction complete after 2s
google_compute_subnetwork.splunk_subnet[0]: Still destroying... [id=projects/<REDACTED>/regions/us-central1/subnetworks/dataflow-net, 10s elapsed]
google_compute_subnetwork.splunk_subnet[0]: Destruction complete after 11s
google_compute_network.splunk_export[0]: Destroying... [id=projects/<REDACTED>/global/networks/dataflow-net]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 10s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 20s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 30s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 40s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 50s elapsed]
google_compute_network.splunk_export[0]: Still destroying... [id=projects/<REDACTED>/global/networks/dataflow-net, 1m0s elapsed]
google_compute_network.splunk_export[0]: Destruction complete after 1m2s

Destroy complete! Resources: 28 destroyed.

rarsan commented 1 year ago

I neglected to share my findings from analyzing the output of your terraform destroy. I couldn't trace this to an out-of-order resource deletion issue. The log sink is deleted before the Dataflow job, as expected, and the GCS bucket and object are deleted afterwards, so my hypothesis that a GCS resource causes the Dataflow job teardown to hang is not correct.

I'm OK with adding the on_delete option, defaulting to cancel for quick prototyping, with a proper warning that it should be changed to drain for production workloads. Should we expose it as a top-level parameter?
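A possible shape for that top-level parameter, as a sketch: the variable name `dataflow_job_on_delete` is hypothetical, and the default follows the proposal above.

```hcl
# Sketch only: the variable name is a hypothetical choice.
variable "dataflow_job_on_delete" {
  type        = string
  description = "Behavior of the Dataflow job on terraform destroy: \"cancel\" (fast, may lose in-flight logs) or \"drain\" (graceful, recommended for production)."
  default     = "cancel"

  # Reject anything other than the two values the provider accepts.
  validation {
    condition     = contains(["drain", "cancel"], var.dataflow_job_on_delete)
    error_message = "The dataflow_job_on_delete value must be either \"drain\" or \"cancel\"."
  }
}
```

The job resource would then set `on_delete = var.dataflow_job_on_delete`, and production deployments could override the default with `drain` in their tfvars.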