Terraspace does not print Terraform errors

Checklist

[x] Upgrade Terraspace: Are you using the latest version of Terraspace? This allows Terraspace to fix issues fast. There's an Upgrading Guide: https://terraspace.cloud/docs/misc/upgrading/
[x] Reproducibility: Are you reporting a bug others will be able to reproduce and not asking a question. If you're unsure or want to ask a question, do so on https://community.boltops.com

My Environment

Software	Version
Operating System	Windows
Terraform	1.1.5
Terraspace	1.1.3

Expected Behaviour

See all TF output

Current Behavior

Only positive output is shown

Step-by-step reproduction instructions

In the image you see:

On the left, terraspace 1.1.3, shell.rb with and added puts line on line 94

def handle_stdout(line, newline: true)
  # Terraspace logger has special stdout method so original terraform output
  # can be piped to jq. IE:
  #   terraspace show demo --json | jq
  if logger.respond_to?(:stdout) && !@options[:log_to_stderr]
    logger.stdout(line, newline: newline)
  else
    puts line
    logger.info(line)
  end
end

On the right, some invalid TF code which crashes. As well as the logger configuration info

Without the puts line, no errors are shown. With the line, the output as in the screenshot is shown.

Perhaps I am really configuring my logger wrong? I copied it from the code examples. Or something is up with the logger itself.

It seems something fishy is going on with the buffer size for reading stdout and stderr. For some reason, if the block size is too big, it will not print any output (I guess because f.read_nonblock is waiting till the block_size is reached)

It looks like you ran into it, because you wrote this code:

    def suppress_newline(line)
      # Regular prompt
      line.include?("Enter a value:") ||
      # Edge case: When buffer is very large buffer.split("\n") only gives 8192 chars at a time
      line.size == 8192 && line[-1] != "\n" ||
      # Edge case: "value:" chopped off "Enter a" and "value" prompt
      # Very hard to reproduce. Happens 1/5 times on terraspace up autoscaling example.
      # Sometimes lines come in as:
      #   [...,"  Only 'yes' will be accepted to approve.", "", "  \e[1mEnter a"]
      #   [" value:\e[0m \e[0m"]
      line.match(/Enter a$/) || line.match(/^ value:/) # chopped off prompt
      # line.include?(" value:") && lines.last.match(/Enter a$/) # chopped off prompt
    end

Which is exactly what happens with this block_size behaviour

I wonder if there is a better way to capture stdout and stderr in ruby?

A bit of a hack, but this makes it work for me:

    def popen3(env)
      Open3.popen2e(env, @command, chdir: @mod.cache_dir) do |stdin, stdout_and_stderr, wait_thread|
        handle_streams(stdin, stdout_and_stderr)
        status = wait_thread.value.exitstatus
        exit_status(status)
      end
    end

    BLOCK_SIZE = Integer(ENV['TS_BUFFER_BLOCK_SIZE'] || 102400)
    BUFFER_TIMEOUT = Integer(ENV['TS_BUFFER_TIMEOUT'] || 3600) # 3600s = 1h
    def handle_streams(stdin, stdout_and_stderr)
      # note: t=0 and t=nil means no timeout. See: https://bit.ly/2PURlCX
      t = BUFFER_TIMEOUT.to_i unless BUFFER_TIMEOUT.nil?
      Timeout::timeout(t) do
        files = [stdout_and_stderr]
        until all_eof?(files) do
          ready = IO.select(files)
          next unless ready

          readable = ready[0]
          readable.each do |f|
            buffer = f.read_nonblock(BLOCK_SIZE, exception: false)
            next unless buffer

            lines = buffer.split("\n")
            lines.each do |line|
              # if f.fileno == stdout.fileno
                handle_stdout(line, newline: !suppress_newline(line))
                handle_input(stdin, line)
              # else
                # handle_stderr(line)
              # end
            end
          end
        end
      end
    end

Using Open3.popen2e to get stdout and stderr together gives me the TF errors

RE: not printing errors

Think you’ve figured it out 👍 Think the terraform error output is so large that it overflows the buffer that popen3 uses to print to stderr. It’s a bummer.

Ran into similar issue before https://github.com/boltops-tools/terraspace/issues/97 This is why Terraspace sets the buffer size block size larger https://github.com/boltops-tools/terraspace/blob/37fdf43d02ccae63e60e6cc3a61afbe4b62bfb5f/lib/terraspace/shell.rb#L43

Wondering if Windows uses even larger buffer block sizes. Maybe, to test the theory, also try setting env var TS_BUFFER_BLOCK_SIZE to something huge.

RE: popen2e vs popen3

Maybe popen2e is worth a revisit. Though, using popen3 to allow piping to tools like jq. IE:

terraspace show demo --json | jq

Only stdout gets piped to jq and it works. Stderr with info messages is not passed through the pipe. Unsure how to handle things and keep piping to jq working. Some thoughts:

Maybe lower the timeout for when buffer stalls. It’s currently configured to 1hr. Think terraform should report something every 10s.
When timeout happens, assume error and report it and mention the the TS_BUFFER_BLOCK_SIZE setting.

It’s not an ideal user dev workflow. And it would still require increasing the block size or popen2e combing stderr and stdout.

Sadly, unsure how to report the actual error. Couldn’t figure out how to flush and print out what’s in the buffer in the case of an overflow. Could just be lack of knowledge around popen. Haven’t had the time to dig in what’s happening with popen in the C code itself.

Or maybe it might be more pragmatic again to use a different mode with Windows. IE: Use popen2e for Windows and popen3 for Linux. Lol. It looks like it’s the nature of the beast when dealing with differences in OS. 🤣 Just tossing out some thoughts for now.

I'll have a look tomorrow at using a bigger buffer block size, not sure if it'll work. From playing with it, it looks like the buffer size is too big and not being filled up completely when TF crashes, resulting in nothing being printed. On the other hand.... I would expect TF to return an error code when it crashes, maybe I can play with that :)

I'll update you when I play with it later :)

Tested with BLOCK_SIZE 102400000 -> Same result, error is not printed, and no exit code is returned either, I guess it is just stuck somewhere :p

To test my assumption I added a silly print after a buffer is read:

            buffer = f.read_nonblock(BLOCK_SIZE, exception: false)
            puts "buffer!"
            next unless buffer

This is the result:

terraspace up backend
Building .terraspace-cache/eu-central-1/dev/stacks/backend
Built in .terraspace-cache/eu-central-1/dev/stacks/backend
Current directory: .terraspace-cache/eu-central-1/dev/stacks/backend
=> terraform apply -input=false
buffer!
random_pet.this: Refreshing state... [id=quiet-hawk]
buffer!
aws_cloudwatch_log_group.logs: Refreshing state... [id=quiet-hawk]
buffer!
module.lambda_hello2.aws_cloudwatch_log_group.lambda[0]: Refreshing state... [id=/aws/lambda/hello2]
module.api_gateway.aws_apigatewayv2_api.this[0]: Refreshing state... [id=gys1whnq72]
module.lambda_hello1.aws_cloudwatch_log_group.lambda[0]: Refreshing state... [id=/aws/lambda/hello1]
module.lambda_hello1.aws_iam_role.lambda[0]: Refreshing state... [id=hello1]
module.lambda_hello2.aws_iam_role.lambda[0]: Refreshing state... [id=hello2]
buffer!
module.lambda_hello1.aws_iam_policy.logs[0]: Refreshing state... [id=arn:aws:iam::xxx:policy/hello1-logs]
buffer!
module.api_gateway.aws_apigatewayv2_stage.default[0]: Refreshing state... [id=$default]
buffer!
module.lambda_hello2.aws_lambda_function.this[0]: Refreshing state... [id=hello2]
buffer!
module.lambda_hello1.aws_iam_role_policy_attachment.logs[0]: Refreshing state... [id=hello1-20220218215412109400000001]
module.lambda_hello1.aws_lambda_function.this[0]: Refreshing state... [id=hello1]
buffer!
module.lambda_hello1.aws_lambda_permission.unqualified_alias_triggers["AllowExecutionFromAPIGateway"]: Refreshing state... [id=AllowExecutionFromAPIGateway]
buffer!
module.lambda_hello1.aws_lambda_permission.current_version_triggers["AllowExecutionFromAPIGateway"]: Refreshing state... [id=AllowExecutionFromAPIGateway]
module.api_gateway.aws_apigatewayv2_integration.this["GET /hello"]: Refreshing state... [id=y61golh]
module.api_gateway.aws_apigatewayv2_integration.this["GET /wave"]: Refreshing state... [id=w4zvekf]
buffer!
module.api_gateway.aws_apigatewayv2_route.this["GET /hello"]: Refreshing state... [id=6a1um5p]
buffer!
module.api_gateway.aws_apigatewayv2_route.this["GET /wave"]: Refreshing state... [id=j6939ni]
buffer!

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place
+/- create replacement and then destroy
 <= read (data resources)

buffer!

Terraform will perform the following actions:

  # module.api_gateway.aws_apigatewayv2_integration.this["GET /wave"] must be replaced
+/- resource "aws_apigatewayv2_integration" "this" {
      + connection_id                             = (known after apply)
      ~ id                                        = "w4zvekf" -> (known after apply)
      + integration_response_selection_expression = (known after apply)
      ~ integration_type                          = "AWS_PROXY" -> (known after apply) # forces replacement
      ~ integration_uri                           = "arn:aws:lambda:eu-central-1:xxx:function:hello2" -> (known after apply)
      - request_parameters                        = {} -> null
      - request_templates                         = {} -> null
        # (5 unchanged attributes hidden)
    }

  # module.api_gateway.aws_apigatewayv2_route.this["GET /wave"] will be updated in-place
  ~ resource "aws_apigatewayv2_route" "this" {
        id                   = "j6939ni"
      ~ target               = "integrations/w4zvekf" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # module.lambda_hello2.data.aws_iam_policy_document.logs[0] will be read during apply
  # (config refers to values not yet known)
 <= data "aws_iam_policy_document" "logs"  {
      + id   = (known after apply)
      + json = (known after apply)

      + statement {
          + actions   = [
              + "logs:CreateLogGroup",
              + "logs:CreateLogStream",
              + "logs:PutLogEvents",
            ]
          + effect    = "Allow"
          + resources = (known after apply)
        }
    }

  # module.lambda_hello2.aws_cloudwatch_log_group.lambda[0] must be replaced
+/- resource "aws_cloudwatch_log_group" "lambda" {
      ~ arn               = "arn:aws:logs:eu-central-1:xxx:log-group:/aws/lambda/hello2" -> (known after apply)
      ~ id                = "/aws/lambda/hello2" -> (known after apply)
      ~ name              = "/aws/lambda/hello2" -> "/aws/lambda/hello1" # forces replacement
        tags              = {
            "Environment" = "Development"
        }
        # (2 unchanged attributes hidden)
    }

  # module.lambda_hello2.aws_iam_policy.logs[0] will be created
  + resource "aws_iam_policy" "logs" {
      + arn       = (known after apply)
      + id        = (known after apply)
      + name      = "hello1-logs"
      + path      = "/"
      + policy    = (known after apply)
      + policy_id = (known after apply)
      + tags      = {
          + "Environment" = "Development"
        }
      + tags_all  = {
          + "Environment" = "Development"
        }
    }

  # module.lambda_hello2.aws_iam_role.lambda[0] must be replaced
+/- resource "aws_iam_role" "lambda" {
      ~ arn                   = "arn:aws:iam::xxx:role/hello2" -> (known after apply)
      ~ assume_role_policy    = jsonencode( # whitespace changes
            {
                Statement = [
                    {
                        Action    = "sts:AssumeRole"
                        Effect    = "Allow"
                        Principal = {
                            Service = "lambda.amazonaws.com"
                        }
                        Sid       = ""
                    },
                ]
                Version   = "2012-10-17"
            }
        )
      ~ create_date           = "2022-02-18T21:33:01Z" -> (known after apply)
      ~ id                    = "hello2" -> (known after apply)
      ~ managed_policy_arns   = [] -> (known after apply)
      ~ name                  = "hello2" -> "hello1" # forces replacement
      + name_prefix           = (known after apply)
        tags                  = {
            "Environment" = "Development"
        }
      ~ unique_id             = "AROATS32B5M2Q5NDAOXQM" -> (known after apply)
        # (4 unchanged attributes hidden)

      - inline_policy {}
      + inline_policy {
          + name   = (known after apply)
          + policy = (known after apply)
        }
    }

  # module.lambda_hello2.aws_iam_role_policy_attachment.logs[0] will be created
  + resource "aws_iam_role_policy_attachment" "logs" {
      + id         = (known after apply)
      + policy_arn = (known after apply)
      + role       = "hello1"
    }

  # module.lambda_hello2.aws_lambda_function.this[0] must be replaced
+/- resource "aws_lambda_function" "this" {
      ~ arn                            = "arn:aws:lambda:eu-central-1:xxx:function:hello2" -> (known after apply)
      ~ function_name                  = "hello2" -> "hello1" # forces replacement
      ~ id                             = "hello2" -> (known after apply)
      ~ invoke_arn                     = "arn:aws:apigateway:eu-central-1:lambda:path/2015-03-31/functions/arn:aws:lambda:eu-central-1:xxx:function:hello2/invocations" -> (known after apply)
      ~ last_modified                  = "2022-02-19T00:20:03.445+0000" -> (known after apply)
      - layers                         = [] -> null
      ~ qualified_arn                  = "arn:aws:lambda:eu-central-1:xxx:function:hello2:1" -> (known after apply)
      ~ role                           = "arn:aws:iam::xxx:role/hello2" -> (known after apply)
      + signing_job_arn                = (known after apply)
      + signing_profile_version_arn    = (known after apply)
      ~ source_code_size               = 761249 -> (known after apply)
        tags                           = {
            "Environment" = "Development"
        }
      ~ version                        = "1" -> (known after apply)
        # (11 unchanged attributes hidden)

      ~ tracing_config {
          ~ mode = "PassThrough" -> (known after apply)
        }
    }

  # module.lambda_hello2.aws_lambda_permission.current_version_triggers["AllowExecutionFromAPIGateway"] will be created
  + resource "aws_lambda_permission" "current_version_triggers" {
      + action        = "lambda:InvokeFunction"
      + function_name = "hello1"
      + id            = (known after apply)
      + principal     = "apigateway.amazonaws.com"
      + qualifier     = (known after apply)
      + source_arn    = "arn:aws:execute-api:eu-central-1:xxx:gys1whnq72/*/*"
      + statement_id  = "AllowExecutionFromAPIGateway"
    }

  # module.lambda_hello2.aws_lambda_permission.unqualified_alias_triggers["AllowExecutionFromAPIGateway"] will be created
  + resource "aws_lambda_permission" "unqualified_alias_triggers" {
      + action        = "lambda:InvokeFunction"
      + function_name = "hello1"
      + id            = (known after apply)
      + principal     = "apigateway.amazonaws.com"
      + source_arn    = "arn:aws:execute-api:eu-central-1:xxx:gys1whnq72/*/*"
      + statement_id  = "AllowExecutionFromAPIGateway"
    }

Plan: 8 to add, 1 to change, 4 to destroy.
╷
│ Warning: Argument is deprecated
│
│   with module.lambda_hello2.aws_s3_bucket_object.lambda_package,
│   on ..\..\modules\lambda\main.tf line 124, in resource "aws_s3_bucket_object" "lambda_package":
│  124:   bucket        = var.s3_bucket
│
│ Use the aws_s3_object resource instead
│ 
│ (and 3 more similar warnings elsewhere)
╵

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
buffer!
buffer!
module.lambda_hello2.aws_cloudwatch_log_group.lambda[0]: Creating...
buffer!
module.lambda_hello2.aws_iam_role.lambda[0]: Creating...

As you can see, after the last line, there is no more "buffer!" printed, so the error is probably stuck inside buffer = f.read_nonblock(BLOCK_SIZE, exception: false) where the code waits for more characters.

When you get a chance, wondering if you can try

buffer = f.read_nonblock(BLOCK_SIZE, exception: true)

Think am using exception: false because an Exception signifies the end of the stream. Unsure which exception or exceptions are the ones we want in this case though. There might be some exceptions that indicate a buffer overflow. 🧐

Note: It’ll probably be a decent amount of digging ⛏ On the road the last 2 days and been on my phone 🤳🏽😆 Also, don’t have Windows set up yet. Thanks for the help. No sweat either way of course. 👍

Also, am leaning on a switch that uses popen2e for Windows for now 😂

buffer = f.read_nonblock(BLOCK_SIZE, exception: true) --> No difference

Using popen2e will need refactoring though, with my hack, this is broken:

 Error: Module not installed
│
│   on domain.tf line 13:
│   13: module "records" {
│
│ This module is not yet installed. Run "terraform init" to install all
│ modules required by this configuration.

No more automatic init

Bummer. The automatic init is based on a few things, whether or not the .terraform cache folder exists but will also automatically retry based on terraform command exit status. Guessing popen2e might suppress error exit statuses and messes up automatic init.

Wondering if you can give this a try when you get a chance

config/app.rb

Terraspace.configure do |config|
  logger = Logger.new($stdout)
  logger.level = :debug
  config.logger = logger
end

Docs: https://terraspace.cloud/docs/config/app/#custom-logger

boltops-tools / terraspace