boltops-tools / terraspace

Terraspace: The Terraform Framework
https://terraspace.cloud
Apache License 2.0
678 stars 46 forks

Terraspace hangs when TF_LOG=TRACE environment variable exists #97

Closed gwvandesteeg closed 3 years ago

gwvandesteeg commented 3 years ago

Checklist

My Environment

export TF_LOG=TRACE

Software Version
Operating System ubuntu 18.04
Terraform 0.14.8
Terraspace 0.6.5 (also on 0.6.4)
Ruby 2.5.1

Expected Behaviour

The terraform logs to be present in the terraspace logs.

Current Behavior

Step-by-step reproduction instructions

  1. Create a stack set that accesses AWS resources (using a GitLab HTTP backend, but it also fails with an S3 backend)
  2. export TF_LOG=TRACE
  3. terraspace all xxx (or any terraspace command really)

Code Sample

Not needed

Solution Suggestion

Ensure that both standard output and standard error of the terraform process are captured, streamed to the log file, and flushed to disk/screen frequently, so that no I/O blocks occur.
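The suggestion above could be sketched roughly like this (hypothetical helper, not Terraspace's actual code): drain stdout and stderr on separate threads and write each line through a shared logger, so neither pipe can fill its OS buffer and block the child.

```ruby
require "open3"

# Sketch: run a command and drain BOTH stdout and stderr concurrently,
# streaming every line to a logger and flushing frequently.
def run_and_stream(*cmd, logger: $stdout)
  Open3.popen3(*cmd) do |stdin, stdout, stderr, wait_thr|
    stdin.close # the child gets no input here

    # One reader thread per pipe: if only one pipe were drained, the
    # other could fill its OS buffer and deadlock the child process.
    readers = [stdout, stderr].map do |io|
      Thread.new do
        io.each_line do |line|
          logger.puts(line.chomp)
          logger.flush if logger.respond_to?(:flush) # flush to disk/screen often
        end
      end
    end

    readers.each(&:join)
    wait_thr.value # Process::Status of the child
  end
end
```

Usage would be something like `run_and_stream("terraform", "apply", logger: log_file)`; the key point is that both streams are always being read, no matter how chatty `TF_LOG=TRACE` makes terraform.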

gwvandesteeg commented 3 years ago

The same behaviour occurs when you set the environment variable via the config/args/terraform.rb on any of the apply, plan or destroy commands.

command("apply",
  args: ["-lock-timeout=22m"],
  env: { TF_LOG: "TRACE" },
)

The initial trace logs appear, but when it gets around to applying the terraform it hangs and just sits there. I had a look at the source and it seems you're using Popen3 to capture stdin, stdout, and stderr, so in theory that should be right, but there may be some peculiarities in how Ruby handles this that I'm not familiar with.
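For what it's worth, `Open3.popen3` itself is usually fine; the classic failure mode is reading the pipes sequentially. A small illustration (assumptions mine, not Terraspace's actual code), along with one possible remedy of merging the two streams so there is only one pipe to drain:

```ruby
require "open3"

# Risky pattern: if the parent blocks on out.read while the child fills
# the stderr pipe's OS buffer (~64 KB on Linux), both sides wait on each
# other forever. TF_LOG=TRACE makes terraform chatty enough to hit this.
#
#   Open3.popen3("terraform", "apply") do |i, out, err, t|
#     log << out.read   # DEADLOCK RISK: err's buffer may fill meanwhile
#     log << err.read
#   end
#
# One remedy: merge stdout and stderr into a single stream via popen2e.
def capture_merged(*cmd)
  output = +""
  status = nil
  Open3.popen2e(*cmd) do |stdin, out_and_err, wait_thr|
    stdin.close
    out_and_err.each_line { |line| output << line }
    status = wait_thr.value
  end
  [output, status]
end
```

Merging loses the stdout/stderr distinction, so reading each pipe on its own thread is the alternative when the streams must stay separate.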

justrp commented 3 years ago

Sorry if this isn't the same issue, but I've seen the very same behaviour with Popen3 in Python. If it's the same cause, this might help: https://thraxil.org/users/anders/posts/2008/03/13/Subprocess-Hanging-PIPE-is-your-enemy/

gwvandesteeg commented 3 years ago

Because terraspace cannot be run with TF_LOG set, this is causing problems during CI/CD pipelines. I've got a 5-batch sequence, and it gets to some batch in a teardown and then just stops/dies with a success exit code, yet the state file for that particular stack is left locked, which indicates a failed teardown.

$ terraspace all down --yes --exit-on-fail
Building one stack to build all stacks
Building .terraspace-cache/ap-southeast-2/test/stacks/shared_central_storage
Downloading tfstate files for dependencies defined in tfvars...
Built in .terraspace-cache/ap-southeast-2/test/stacks/shared_central_storage
Running:
    terraspace down egress_instances         # batch 1
    terraspace down voicecore_instances      # batch 1
    terraspace down shared_central_storage   # batch 2
    terraspace down kubernetes_vpc           # batch 3
    terraspace down voicecore_vpc            # batch 3
    terraspace down egress_vpc               # batch 4
    terraspace down defaults                 # batch 5
    terraspace down internal_transit_gateway # batch 5
Batch Run 1:
Running: terraspace down voicecore_instances Logs: log/down/voicecore_instances.log
Running: terraspace down egress_instances Logs: log/down/egress_instances.log
terraspace down egress_instances:  Changes to Outputs:
terraspace down egress_instances:  Destroy complete! Resources: 14 destroyed.
terraspace down voicecore_instances:  Changes to Outputs:
terraspace down voicecore_instances:  Destroy complete! Resources: 63 destroyed.
Batch Run 2:
Running: terraspace down shared_central_storage Logs: log/down/shared_central_storage.log

And when looking at the logs for this Batch 2 teardown, the log file just stops halfway through (this is a 200k log file; others at 800k complete happily). Below are the last log lines: we can see the deletion of 2 of the EFS mount targets completed, but not the third.

[2021-04-22T06:15:02 #5306 terraspace down shared_central_storage]: module.homedirs.module.homedirs.aws_efs_mount_target.mounts[2]: Still destroying... [id=fsmt-344c6f0d, 10s elapsed]
[2021-04-22T06:15:02 #5306 terraspace down shared_central_storage]: module.homedirs.module.homedirs.aws_efs_mount_target.mounts[0]: Still destroying... [id=fsmt-364c6f0f, 10s elapsed]
[2021-04-22T06:15:02 #5306 terraspace down shared_central_storage]: module.homedirs.module.homedirs.aws_efs_mount_target.mounts[1]: Still destroying... [id=fsmt-374c6f0e, 10s elapsed]
[2021-04-22T06:15:05 #5306 terraspace down shared_central_storage]: module.shared-freeswitch.module.shared.aws_efs_mount_target.mounts[1]: Still destroying... [id=fsmt-354c6f0c, 10s elapsed]
[2021-04-22T06:15:05 #5306 terraspace down shared_central_storage]: module.shared-freeswitch.module.shared.aws_efs_mount_target.mounts[0]: Still destroying... [id=fsmt-304c6f09, 10s elapsed]
[2021-04-22T06:15:05 #5306 terraspace down shared_central_storage]: module.shared-freeswitch.module.shared.aws_efs_mount_target.mounts[2]: Still destroying... [id=fsmt-324c6f0b, 10s elapsed]
[2021-04-22T06:15:07 #5306 terraspace down shared_central_storage]: module.homedirs.module.homedirs.aws_efs_mount_target.mounts[0]: Destruction complete after 14s
[2021-04-22T06:15:07 #5306 terraspace down shared_central_storage]: module.homedirs.module.homedirs.aws_efs_mount_target.mounts[1]: Destruction complete after 14s

(This is running in a pipeline on GitLab, with an execution time of around 30 minutes; the pipeline step has a 1 hr timeout and is configured to be uninterruptible.) The abrupt end of the logs and the inability to set TF_LOG mean we're unable to identify where the problem lies: terraform, terraspace, or the gitlab-runner.

tongueroo commented 3 years ago

This one was a doozy šŸ§ Fixed in #110 and released in v0.6.9.