boltops-tools / terraspace

Terraspace: The Terraform Framework
https://terraspace.cloud
Apache License 2.0
678 stars 46 forks source link

Terraspace does not print Terraform errors #211

Open CumpsD opened 2 years ago

CumpsD commented 2 years ago

Checklist

My Environment

Software Version
Operating System Windows
Terraform 1.1.5
Terraspace 1.1.3

Expected Behaviour

See all TF output

Current Behavior

Only positive output is shown

Step-by-step reproduction instructions

image

In the image you see:

Without the puts line, no errors are shown. With the line, the output as in the screenshot is shown.

Perhaps I am really configuring my logger wrong? I copied it from the code examples. Or something is up with the logger itself.

CumpsD commented 2 years ago

It seems something fishy is going on with the buffer size for reading stdout and stderr. For some reason, if the block size is too big, it will not print any output (I guess because f.read_nonblock is waiting till the block_size is reached)

It looks like you ran into it, because you wrote this code:

    def suppress_newline(line)
      # Regular prompt
      line.include?("Enter a value:") ||
      # Edge case: When buffer is very large buffer.split("\n") only gives 8192 chars at a time
      line.size == 8192 && line[-1] != "\n" ||
      # Edge case: "value:" chopped off "Enter a" and "value" prompt
      # Very hard to reproduce. Happens 1/5 times on terraspace up autoscaling example.
      # Sometimes lines come in as:
      #   [...,"  Only 'yes' will be accepted to approve.", "", "  \e[1mEnter a"]
      #   [" value:\e[0m \e[0m"]
      line.match(/Enter a$/) || line.match(/^ value:/) # chopped off prompt
      # line.include?(" value:") && lines.last.match(/Enter a$/) # chopped off prompt
    end

Which is exactly what happens with this block_size behaviour

I wonder if there is a better way to capture stdout and stderr in ruby?

CumpsD commented 2 years ago

A bit of a hack, but this makes it work for me:

    def popen3(env)
      Open3.popen2e(env, @command, chdir: @mod.cache_dir) do |stdin, stdout_and_stderr, wait_thread|
        handle_streams(stdin, stdout_and_stderr)
        status = wait_thread.value.exitstatus
        exit_status(status)
      end
    end

    BLOCK_SIZE = Integer(ENV['TS_BUFFER_BLOCK_SIZE'] || 102400)
    BUFFER_TIMEOUT = Integer(ENV['TS_BUFFER_TIMEOUT'] || 3600) # 3600s = 1h
    def handle_streams(stdin, stdout_and_stderr)
      # note: t=0 and t=nil means no timeout. See: https://bit.ly/2PURlCX
      t = BUFFER_TIMEOUT.to_i unless BUFFER_TIMEOUT.nil?
      Timeout::timeout(t) do
        files = [stdout_and_stderr]
        until all_eof?(files) do
          ready = IO.select(files)
          next unless ready

          readable = ready[0]
          readable.each do |f|
            buffer = f.read_nonblock(BLOCK_SIZE, exception: false)
            next unless buffer

            lines = buffer.split("\n")
            lines.each do |line|
              # if f.fileno == stdout.fileno
                handle_stdout(line, newline: !suppress_newline(line))
                handle_input(stdin, line)
              # else
                # handle_stderr(line)
              # end
            end
          end
        end
      end
    end

Using Open3.popen2e to get stdout and stderr together gives me the TF errors

tongueroo commented 2 years ago

RE: not printing errors

Think you’ve figured it out 👍 Think the terraform error output is so large that it overflows the buffer that popen3 uses to print to stderr. It’s a bummer.

Ran into similar issue before https://github.com/boltops-tools/terraspace/issues/97 This is why Terraspace sets the buffer size block size larger https://github.com/boltops-tools/terraspace/blob/37fdf43d02ccae63e60e6cc3a61afbe4b62bfb5f/lib/terraspace/shell.rb#L43

Wondering if Windows uses even larger buffer block sizes. Maybe, to test the theory, also try setting env var TS_BUFFER_BLOCK_SIZE to something huge.

RE: popen2e vs popen3

Maybe popen2e is worth a revisit. Though, using popen3 to allow piping to tools like jq. IE:

terraspace show demo --json | jq

Only stdout gets piped to jq and it works. Stderr with info messages is not passed through the pipe. Unsure how to handle things and keep piping to jq working. Some thoughts:

It’s not an ideal user dev workflow. And it would still require increasing the block size or popen2e combing stderr and stdout.

Sadly, unsure how to report the actual error. Couldn’t figure out how to flush and print out what’s in the buffer in the case of an overflow. Could just be lack of knowledge around popen. Haven’t had the time to dig in what’s happening with popen in the C code itself.

Or maybe it might be more pragmatic again to use a different mode with Windows. IE: Use popen2e for Windows and popen3 for Linux. Lol. It looks like it’s the nature of the beast when dealing with differences in OS. 🤣 Just tossing out some thoughts for now.

CumpsD commented 2 years ago

I'll have a look tomorrow at using a bigger buffer block size, not sure if it'll work. From playing with it, it looks like the buffer size is too big and not being filled up completely when TF crashes, resulting in nothing being printed. On the other hand.... I would expect TF to return an error code when it crashes, maybe I can play with that :)

I'll update you when I play with it later :)

CumpsD commented 2 years ago

Tested with BLOCK_SIZE 102400000 -> Same result, error is not printed, and no exit code is returned either, I guess it is just stuck somewhere :p

CumpsD commented 2 years ago

To test my assumption I added a silly print after a buffer is read:

            buffer = f.read_nonblock(BLOCK_SIZE, exception: false)
            puts "buffer!"
            next unless buffer

This is the result:

terraspace up backend
Building .terraspace-cache/eu-central-1/dev/stacks/backend
Built in .terraspace-cache/eu-central-1/dev/stacks/backend
Current directory: .terraspace-cache/eu-central-1/dev/stacks/backend
=> terraform apply -input=false
buffer!
random_pet.this: Refreshing state... [id=quiet-hawk]
buffer!
aws_cloudwatch_log_group.logs: Refreshing state... [id=quiet-hawk]
buffer!
module.lambda_hello2.aws_cloudwatch_log_group.lambda[0]: Refreshing state... [id=/aws/lambda/hello2]
module.api_gateway.aws_apigatewayv2_api.this[0]: Refreshing state... [id=gys1whnq72]
module.lambda_hello1.aws_cloudwatch_log_group.lambda[0]: Refreshing state... [id=/aws/lambda/hello1]
module.lambda_hello1.aws_iam_role.lambda[0]: Refreshing state... [id=hello1]
module.lambda_hello2.aws_iam_role.lambda[0]: Refreshing state... [id=hello2]
buffer!
module.lambda_hello1.aws_iam_policy.logs[0]: Refreshing state... [id=arn:aws:iam::xxx:policy/hello1-logs]
buffer!
module.api_gateway.aws_apigatewayv2_stage.default[0]: Refreshing state... [id=$default]
buffer!
module.lambda_hello2.aws_lambda_function.this[0]: Refreshing state... [id=hello2]
buffer!
module.lambda_hello1.aws_iam_role_policy_attachment.logs[0]: Refreshing state... [id=hello1-20220218215412109400000001]
module.lambda_hello1.aws_lambda_function.this[0]: Refreshing state... [id=hello1]
buffer!
module.lambda_hello1.aws_lambda_permission.unqualified_alias_triggers["AllowExecutionFromAPIGateway"]: Refreshing state... [id=AllowExecutionFromAPIGateway]
buffer!
module.lambda_hello1.aws_lambda_permission.current_version_triggers["AllowExecutionFromAPIGateway"]: Refreshing state... [id=AllowExecutionFromAPIGateway]
module.api_gateway.aws_apigatewayv2_integration.this["GET /hello"]: Refreshing state... [id=y61golh]
module.api_gateway.aws_apigatewayv2_integration.this["GET /wave"]: Refreshing state... [id=w4zvekf]
buffer!
module.api_gateway.aws_apigatewayv2_route.this["GET /hello"]: Refreshing state... [id=6a1um5p]
buffer!
module.api_gateway.aws_apigatewayv2_route.this["GET /wave"]: Refreshing state... [id=j6939ni]
buffer!

Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
  + create
  ~ update in-place
+/- create replacement and then destroy
 <= read (data resources)

buffer!

Terraform will perform the following actions:

  # module.api_gateway.aws_apigatewayv2_integration.this["GET /wave"] must be replaced
+/- resource "aws_apigatewayv2_integration" "this" {
      + connection_id                             = (known after apply)
      ~ id                                        = "w4zvekf" -> (known after apply)
      + integration_response_selection_expression = (known after apply)
      ~ integration_type                          = "AWS_PROXY" -> (known after apply) # forces replacement
      ~ integration_uri                           = "arn:aws:lambda:eu-central-1:xxx:function:hello2" -> (known after apply)
      - request_parameters                        = {} -> null
      - request_templates                         = {} -> null
        # (5 unchanged attributes hidden)
    }

  # module.api_gateway.aws_apigatewayv2_route.this["GET /wave"] will be updated in-place
  ~ resource "aws_apigatewayv2_route" "this" {
        id                   = "j6939ni"
      ~ target               = "integrations/w4zvekf" -> (known after apply)
        # (6 unchanged attributes hidden)
    }

  # module.lambda_hello2.data.aws_iam_policy_document.logs[0] will be read during apply
  # (config refers to values not yet known)
 <= data "aws_iam_policy_document" "logs"  {
      + id   = (known after apply)
      + json = (known after apply)

      + statement {
          + actions   = [
              + "logs:CreateLogGroup",
              + "logs:CreateLogStream",
              + "logs:PutLogEvents",
            ]
          + effect    = "Allow"
          + resources = (known after apply)
        }
    }

  # module.lambda_hello2.aws_cloudwatch_log_group.lambda[0] must be replaced
+/- resource "aws_cloudwatch_log_group" "lambda" {
      ~ arn               = "arn:aws:logs:eu-central-1:xxx:log-group:/aws/lambda/hello2" -> (known after apply)
      ~ id                = "/aws/lambda/hello2" -> (known after apply)
      ~ name              = "/aws/lambda/hello2" -> "/aws/lambda/hello1" # forces replacement
        tags              = {
            "Environment" = "Development"
        }
        # (2 unchanged attributes hidden)
    }

  # module.lambda_hello2.aws_iam_policy.logs[0] will be created
  + resource "aws_iam_policy" "logs" {
      + arn       = (known after apply)
      + id        = (known after apply)
      + name      = "hello1-logs"
      + path      = "/"
      + policy    = (known after apply)
      + policy_id = (known after apply)
      + tags      = {
          + "Environment" = "Development"
        }
      + tags_all  = {
          + "Environment" = "Development"
        }
    }

  # module.lambda_hello2.aws_iam_role.lambda[0] must be replaced
+/- resource "aws_iam_role" "lambda" {
      ~ arn                   = "arn:aws:iam::xxx:role/hello2" -> (known after apply)
      ~ assume_role_policy    = jsonencode( # whitespace changes
            {
                Statement = [
                    {
                        Action    = "sts:AssumeRole"
                        Effect    = "Allow"
                        Principal = {
                            Service = "lambda.amazonaws.com"
                        }
                        Sid       = ""
                    },
                ]
                Version   = "2012-10-17"
            }
        )
      ~ create_date           = "2022-02-18T21:33:01Z" -> (known after apply)
      ~ id                    = "hello2" -> (known after apply)
      ~ managed_policy_arns   = [] -> (known after apply)
      ~ name                  = "hello2" -> "hello1" # forces replacement
      + name_prefix           = (known after apply)
        tags                  = {
            "Environment" = "Development"
        }
      ~ unique_id             = "AROATS32B5M2Q5NDAOXQM" -> (known after apply)
        # (4 unchanged attributes hidden)

      - inline_policy {}
      + inline_policy {
          + name   = (known after apply)
          + policy = (known after apply)
        }
    }

  # module.lambda_hello2.aws_iam_role_policy_attachment.logs[0] will be created
  + resource "aws_iam_role_policy_attachment" "logs" {
      + id         = (known after apply)
      + policy_arn = (known after apply)
      + role       = "hello1"
    }

  # module.lambda_hello2.aws_lambda_function.this[0] must be replaced
+/- resource "aws_lambda_function" "this" {
      ~ arn                            = "arn:aws:lambda:eu-central-1:xxx:function:hello2" -> (known after apply)
      ~ function_name                  = "hello2" -> "hello1" # forces replacement
      ~ id                             = "hello2" -> (known after apply)
      ~ invoke_arn                     = "arn:aws:apigateway:eu-central-1:lambda:path/2015-03-31/functions/arn:aws:lambda:eu-central-1:xxx:function:hello2/invocations" -> (known after apply)
      ~ last_modified                  = "2022-02-19T00:20:03.445+0000" -> (known after apply)
      - layers                         = [] -> null
      ~ qualified_arn                  = "arn:aws:lambda:eu-central-1:xxx:function:hello2:1" -> (known after apply)
      ~ role                           = "arn:aws:iam::xxx:role/hello2" -> (known after apply)
      + signing_job_arn                = (known after apply)
      + signing_profile_version_arn    = (known after apply)
      ~ source_code_size               = 761249 -> (known after apply)
        tags                           = {
            "Environment" = "Development"
        }
      ~ version                        = "1" -> (known after apply)
        # (11 unchanged attributes hidden)

      ~ tracing_config {
          ~ mode = "PassThrough" -> (known after apply)
        }
    }

  # module.lambda_hello2.aws_lambda_permission.current_version_triggers["AllowExecutionFromAPIGateway"] will be created
  + resource "aws_lambda_permission" "current_version_triggers" {
      + action        = "lambda:InvokeFunction"
      + function_name = "hello1"
      + id            = (known after apply)
      + principal     = "apigateway.amazonaws.com"
      + qualifier     = (known after apply)
      + source_arn    = "arn:aws:execute-api:eu-central-1:xxx:gys1whnq72/*/*"
      + statement_id  = "AllowExecutionFromAPIGateway"
    }

  # module.lambda_hello2.aws_lambda_permission.unqualified_alias_triggers["AllowExecutionFromAPIGateway"] will be created
  + resource "aws_lambda_permission" "unqualified_alias_triggers" {
      + action        = "lambda:InvokeFunction"
      + function_name = "hello1"
      + id            = (known after apply)
      + principal     = "apigateway.amazonaws.com"
      + source_arn    = "arn:aws:execute-api:eu-central-1:xxx:gys1whnq72/*/*"
      + statement_id  = "AllowExecutionFromAPIGateway"
    }

Plan: 8 to add, 1 to change, 4 to destroy.
╷
│ Warning: Argument is deprecated
│
│   with module.lambda_hello2.aws_s3_bucket_object.lambda_package,
│   on ..\..\modules\lambda\main.tf line 124, in resource "aws_s3_bucket_object" "lambda_package":
│  124:   bucket        = var.s3_bucket
│
│ Use the aws_s3_object resource instead
│ 
│ (and 3 more similar warnings elsewhere)
╵

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes
buffer!
buffer!
module.lambda_hello2.aws_cloudwatch_log_group.lambda[0]: Creating...
buffer!
module.lambda_hello2.aws_iam_role.lambda[0]: Creating...

As you can see, after the last line, there is no more "buffer!" printed, so the error is probably stuck inside buffer = f.read_nonblock(BLOCK_SIZE, exception: false) where the code waits for more characters.

tongueroo commented 2 years ago

When you get a chance, wondering if you can try

buffer = f.read_nonblock(BLOCK_SIZE, exception: true)

Think am using exception: false because an Exception signifies the end of the stream. Unsure which exception or exceptions are the ones we want in this case though. There might be some exceptions that indicate a buffer overflow. 🧐

Note: It’ll probably be a decent amount of digging ⛏ On the road the last 2 days and been on my phone 🤳🏽😆 Also, don’t have Windows set up yet. Thanks for the help. No sweat either way of course. 👍

Also, am leaning on a switch that uses popen2e for Windows for now 😂

CumpsD commented 2 years ago

buffer = f.read_nonblock(BLOCK_SIZE, exception: true) --> No difference

CumpsD commented 2 years ago

Using popen2e will need refactoring though, with my hack, this is broken:

 Error: Module not installed
│
│   on domain.tf line 13:
│   13: module "records" {
│
│ This module is not yet installed. Run "terraform init" to install all
│ modules required by this configuration.

No more automatic init

tongueroo commented 2 years ago

Bummer. The automatic init is based on a few things, whether or not the .terraform cache folder exists but will also automatically retry based on terraform command exit status. Guessing popen2e might suppress error exit statuses and messes up automatic init.

Wondering if you can give this a try when you get a chance

config/app.rb

Terraspace.configure do |config|
  logger = Logger.new($stdout)
  logger.level = :debug
  config.logger = logger
end

Docs: https://terraspace.cloud/docs/config/app/#custom-logger