boltops-tools / terraspace

Terraspace: The Terraform Framework
https://terraspace.cloud
Apache License 2.0
674 stars 46 forks source link

Loading and writing of statefiles for outputs is not threadsafe #330

Open sosafe-felix-bell opened 12 months ago

sosafe-felix-bell commented 12 months ago

Checklist

My Environment

Software Version
Operating System Debian 11
Terraform 1.3.9
Terraspace latest
Ruby 3.1.4

Expected Behaviour

When running stacks concurrent in a batch with terraspace all up we expect it to be able to pull all remote states used for outputs in parallel without running into errors.

Current Behavior

When pulling and reading remote states for using with terraspace outputs it sometimes runs into an error when two stacks in the same batch are dependent on the same stack. In this case you get the error message as the actual outptut. (Output <output> was not found for the <dependend_stack> tfvars file. Either <stack_with_output> stack has not been deployed yet or it does not have this output: <output>. Also, if local backend is being used and has been removed/cleaned, then it will also result zero-byte state.json with the 'terraform state pull' used to download the terraform state and output will not be found.) This has lead to unwanted deletion of resources, as the attributes for resources changed from the actual value to the error message. As there was no assertion of this value the resources were deleted and terraform only failed when the cloud API asserted the value as false. This left us with the deleted resources.

Step-by-step reproduction instructions

This seems to be a race condition of rubys subprocess handling, so we could only reproduce it of running pipelines more often to force the triggering of the issue / bug. We determined it by checking the source code of terraspace for the error message in the value and followed the program stack. From this we could validate that it had to occur somewhere between terraform (called by terraspace) loading the statefile and terraspace reading it. We quickly had the suspicion that it is caused by two or more of the subprocesses created by the terraspace all command (see: deploy_batch.rb). When executing a command like terraform state pull > /path/to/file it will empty the file as soon as the terraform command starts and flushes the output into the file when it is done. During the command the file is empty (zero-byte) when read by another process. To verify this we added log statements including timestamps before and after the command execution in the fetcher.rb file:

diff --git a/lib/terraspace/terraform/remote_state/fetcher.rb b/lib/terraspace/terraform/remote_state/fetcher.rb
index b3f9ca3..de26b98 100644
--- a/lib/terraspace/terraform/remote_state/fetcher.rb
+++ b/lib/terraspace/terraform/remote_state/fetcher.rb
@@ -72,8 +72,11 @@ module Terraspace::Terraform::RemoteState

       FileUtils.mkdir_p(File.dirname(state_path))
       command = "cd #{@child.cache_dir} && terraform state pull > #{state_path}"
-      logger.debug "=> #{command}"
+      time_before = DateTime.now.strftime "%Y-%m-%d %H:%M:%S.%L"
+      logger.debug "[#{time_before}] => #{command}"
       success = system(command)
+      time_after = DateTime.now.strftime "%Y-%m-%d %H:%M:%S.%L"
+      logger.debug "[#{time_after}] => Finished #{command}"
       # Can error if using a old terraform version and the statefile was created with a newer version of terraform
       # IE: Failed to refresh state: state snapshot was created by Terraform v0.13.2, which is newer than current v0.12.29;
       #     upgrade to Terraform v0.13.2 or greater to work with this state

The next time the issue / bug occured we were able to extract the following log:

Log Output ``` Running: terraspace up ... # batch 1 ... terraspace up stack1 # batch 6 terraspace up stack2 # batch 6 ... Batch Run 6: WARN: Could not find the pid in the logfile log/up/mq.log WARN: Could not find the pid in the logfile log/up/mq_ec2.log Created .terraspace-cache/region/account/stacks/stack2/backend.tf Created .terraspace-cache/region/account/stacks/stack1/backend.tf ... Layers for stack2: Created .terraspace-cache/region/account/stacks/stack2/1-project-base.auto.tfvars Created .terraspace-cache/region/account/stacks/stack2/2-project-account>.auto.tfvars Downloading tfstate files for dependencies defined in tfvars... Downloading tfstate for stack: dependency1 [2023-09-26 14:11:47.255] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json Layers for stack1: Created .terraspace-cache/region/account/stacks/stack1/1-project-base.auto.tfvars Created .terraspace-cache/region/account/stacks/stack1/2-project-account>.auto.tfvars Downloading tfstate files for dependencies defined in tfvars... Downloading tfstate for stack: dependency1 [2023-09-26 14:11:47.272] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json [2023-09-26 14:11:49.321] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json [2023-09-26 14:11:49.321] => Started reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json [2023-09-26 14:11:49.325] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json Downloading tfstate for stack: dependency2 [2023-09-26 14:11:49.326] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json [2023-09-26 14:11:49.328] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency1 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency1/state.json [2023-09-26 14:11:49.329] => Started reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json [2023-09-26 14:11:49.331] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency1/state.json Downloading tfstate for stack: dependency2 [2023-09-26 14:11:49.332] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json [2023-09-26 14:11:51.463] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json [2023-09-26 14:11:51.463] => Started reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json [2023-09-26 14:11:51.465] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json Downloading tfstate for stack: dependency3 [2023-09-26 14:11:51.468] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread1 starts downloading state file for dependency3 [2023-09-26 14:11:51.518] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency2 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency2/state.json [2023-09-26 14:11:51.518] => Started reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json [2023-09-26 14:11:51.519] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency2/state.json Downloading tfstate for stack: dependency4 [2023-09-26 14:11:51.520] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json [2023-09-26 14:11:53.464] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json [2023-09-26 14:11:53.465] => Started reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json [2023-09-26 14:11:53.466] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json Downloading tfstate for stack: dependency3 [2023-09-26 14:11:53.469] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread2 starts downloading the same state [2023-09-26 14:11:53.472] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread1 finishes writing the state file [2023-09-26 14:11:53.472] => Started reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json --- A thread reads the state file, but it is currently empty because thread2 is writing to it [2023-09-26 14:11:53.472] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json DEBUG: (Output output> was not found for the stack1 tfvars file. Either dependency3 stack has not been deployed yet or it does not have this output: output>. Also, if local backend is being used and has been removed/cleaned, then it will also result zero-byte state.json with the 'terraform state pull' used to download the terraform state and output will not be found.) Downloading tfstate for stack: dependency4 [2023-09-26 14:11:53.473] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json [2023-09-26 14:11:55.439] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency3 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency3/state.json --- Thread2 finishes writing the state file [2023-09-26 14:11:55.439] => Started reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json [2023-09-26 14:11:55.440] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency3/state.json Downloading tfstate for stack: dependency5 [2023-09-26 14:11:55.441] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json [2023-09-26 14:11:55.457] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency4 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency4/state.json [2023-09-26 14:11:55.457] => Started reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json [2023-09-26 14:11:55.458] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency4/state.json Downloading tfstate for stack: dependency5 [2023-09-26 14:11:55.460] => Started: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json [2023-09-26 14:11:57.376] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json [2023-09-26 14:11:57.376] => Started reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json [2023-09-26 14:11:57.376] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json Created .terraspace-cache/region/account/stacks/stack1/3-base.auto.tfvars Created .terraspace-cache/region/account/stacks/stack1/4-account>.auto.tfvars Built in .terraspace-cache/region/account/stacks/stack1 Running: terraspace up stack1 Logs: log/up/stack1.log [2023-09-26 14:11:57.388] => Finished: cd /__w/terraform/terraform/.terraspace-cache/region/account/stacks/dependency5 && terraform state pull > /tmp/terraspace/remote_state/stacks/dependency5/state.json [2023-09-26 14:11:57.388] => Started reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json [2023-09-26 14:11:57.388] => Finished reading: /tmp/terraspace/remote_state/stacks/dependency5/state.json Created .terraspace-cache/region/account/stacks/stack2/3-base.auto.tfvars Created .terraspace-cache/region/account/stacks/stack2/4-account>.auto.tfvars Built in .terraspace-cache/region/account/stacks/stack2 Running: terraspace up stack2 Logs: log/up/stack2.log Writing TFLint output to /__w/terraform/terraform/log/tflint/stack2.log Writing TFLint output to /__w/terraform/terraform/log/tflint/stack1.log WARN: Could not find the pid in the logfile log/up/stack2.log WARN: Could not find the pid in the logfile log/up/stack1.log Error running: terraspace up stack1. Fix the error above or check logs for the error. Error: Error: failed to run script step: command terminated with non-zero exit code: error executing command [sh -e /__w/_temp/bfbeddf0-5c75-11ee-bae9-b572f09d7d5f.sh], exit code 2 Error: Process completed with exit code 1. Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator. ```

There are annotations at the line where the subprocesses call what.

Code Sample

Solution Suggestion

This could probably be handled with a lock. Either with a thread safe variable which can be check by all other subprocesses that resembles the availability of the state.json. Another way could be to create a lock file before any write operation which can also be checked by other subprocesses. If whatever used locking mechanism fails the program falls into a retry loop.

Also it would be helpful to fail the terraspace operation when the output could not be loaded instead of continuing with the error message as the value.

Other Issues

We found another issue referencing the same problem: https://github.com/boltops-tools/terraspace/issues/300