AndrewChubatiuk / terraform-provider-ssh

This provider enables SSH port forwarding in Terraform.
Mozilla Public License 2.0

Tunnel closes too fast #10

Open thecadams opened 1 year ago

thecadams commented 1 year ago

Hi @AndrewChubatiuk, thanks for this provider! I'm hoping to get it working over here.

Looks like the tunnel is closed from the Terraform side, about 1-3 seconds after being opened.

Logs: https://gist.github.com/thecadams/e3dc630cadadc9018946fef98aea26ca

Of particular interest in the tf log is this line:

data.ssh_tunnel.bastion_ssh_tunnel: Read complete after 1s [id=localhost:26127]

I have a config similar to this:

terraform {
  required_providers {
    ...
    grafana = {
      source  = "grafana/grafana"
      version = "~> 1.35.0"
    }
    ssh = {
      source = "AndrewChubatiuk/ssh"
    }
    ...
  }
  required_version = ">= 1.2.6"
}
data "ssh_tunnel" "bastion_ssh_tunnel" {
  user = "terraform"
  auth {
    private_key {
      content = var.bastion_ssh_private_key
    }
  }
  server {
    host = "bastion-test.revenuecat.com"
    port = 222
  }
  remote {
    host = "grafana.test.internal"
    port = 3000
  }
}

provider "grafana" {
  auth = var.grafana_auth
  url  = "http://${data.ssh_tunnel.bastion_ssh_tunnel.local.0.host}:${data.ssh_tunnel.bastion_ssh_tunnel.local.0.port}"
}

module "rc_prometheus_test" {
  source = "../../modules/rc_prometheus"
  ...
  dashboards = {"uid1": some_dashboard_json_1, "uid2": some_dashboard_json_2}
  ...
  providers = {
    grafana = grafana
  }
}

The rc_prometheus module manages one Grafana folder and several dashboards in that folder:

(in ../../modules/rc_prometheus/main.tf):
...
resource "grafana_folder" "dashboards" {
  title = "Generated: DO NOT EDIT"
}

resource "grafana_dashboard" "dashboards" {
  for_each = var.dashboards
  folder   = grafana_folder.dashboards.id
  config_json = each.value
  overwrite = true
}

Unfortunately, despite the Grafana provider receiving the correct host and port, I get connection refused errors because the tunnel shuts down too quickly. I also tried time_sleep resources and provisioners in various places, but nothing worked.

Expected Behavior

There should be a way to control when the tunnel closes.

Actual Behavior

Tunnel closes within 1-3 seconds, causing connection refused errors in the module.

Steps to Reproduce

Something like the config above should reproduce this.

Important Factoids

It looks like recent changes in this fork removed the "close connection" provider; maybe that should be reinstated to support this use case?

You'll also notice entries in the logs like this, which are unrelated; they appear because I moved the SSH tunnel out of the module since the previous apply:

2023-03-07T01:34:58.915Z [DEBUG] module.rc_prometheus_test.module.bastion_ssh_tunnel is no longer in configuration

References

thecadams commented 1 year ago

Maybe the tunnel is torn down after one usage? Just saw this on the remote side from sshd:

Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: input drain -> closed
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: rcvd adjust 9127
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug3: receive packet: type 97
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: rcvd close
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: output open -> drain
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug3: channel 0: will not send data after close
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: obuf empty
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: close_write
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: output drain -> closed
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: send close
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug3: send packet: type 97
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: is dead
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug2: channel 0: garbage collecting
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug1: channel 0: free: direct-tcpip, nchannels 8
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: debug3: channel 0: status: The following connections are open:\r\n  #0 direct-tcpip (t4 r0 i3/0 o3/0 fd 8/8 cc -1)\r\n  #1 direct-tcpip (t4 r1 i0/0 o0/0 fd 9/9 cc -1)\r\n  #2 direct-tcpip (t4 r2 i0/0 o0/0 fd 10/10 cc -1)\r\n  #3 direct-tcpip (t4 r3 i0/0 o0/0 fd 11/11 cc -1)\r\n  #4 direct-tcpip (t4 r4 i0/0 o0/0 fd 1
Mar 07 02:50:29 ip-10-1-3-170.ec2.internal sshd[4333]: Connection closed by 50.17.68.142 port 39023
Blefish commented 1 year ago

I found that if I remove the redirectStd calls, which redirect the child process's stdout/stderr back to the provider, the child process outlives the provider process. I think the redirection is intended by the provider, but for some reason it does not work.
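
For anyone following along, here is a rough sketch of the pattern being described, assuming the provider launches the tunnel as a child process via os/exec; the command string and function name are illustrative, not the provider's actual code. With the child's output piped back and scanned by the provider, the child's stdout is only usable while the provider process is alive:

// Illustrative sketch of the redirect pattern; not the provider's code.
package main

import (
    "bufio"
    "log"
    "os/exec"
)

func startWithRedirect() (*exec.Cmd, error) {
    // Placeholder command, standing in for however the provider launches the tunnel child.
    cmd := exec.Command("terraform-provider-ssh", "tunnel")

    // Pipe the child's stdout back into the provider process...
    stdout, err := cmd.StdoutPipe()
    if err != nil {
        return nil, err
    }
    if err := cmd.Start(); err != nil {
        return nil, err
    }

    // ...and scan it line by line in a goroutine. Once the provider exits,
    // the read end of this pipe disappears and the child is left writing
    // into a broken pipe.
    go func() {
        sc := bufio.NewScanner(stdout)
        for sc.Scan() {
            log.Println("[tunnel]", sc.Text())
        }
    }()
    return cmd, nil
}

func main() {
    _ = startWithRedirect // sketch only; not wired to a real child binary
}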

thecadams commented 1 year ago

@Blefish based on what you mentioned, plus the bufio Scanner.Scan() docs, I have a hypothesis:

  1. The parent process panics in one of its goroutines because there is no input from the child after a while, per this from the docs:

Scan panics if the split function returns too many empty tokens without advancing the input. This is a common error mode for scanners.

Pretty sure it's talking about this panic.

  2. The child process, writing to stdout/stderr that is now closed on the read end, either blocks once the pipe fills up or crashes (I can't investigate which in my setup, as it's TF Cloud; see the sketch below).

If this is the case, the parent's stderr must not be making it into the logs; otherwise we'd see the panic. It's also plausible for the child to die without anything in the TF logs, since the parent died first.
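
As a small, self-contained illustration of point 2 (a standalone sketch, not the provider's code): once the read end of a pipe is gone, writes to the write end fail with EPIPE, and a child whose stdout/stderr point at that pipe would get SIGPIPE instead, which terminates it under the default disposition.

// Standalone sketch of the broken-pipe failure mode; not provider code.
package main

import (
    "errors"
    "fmt"
    "os"
    "syscall"
)

func main() {
    r, w, err := os.Pipe()
    if err != nil {
        panic(err)
    }
    r.Close() // simulate the reading side (the parent provider) going away

    // Within this process the write simply returns EPIPE; a child whose
    // stdout/stderr are this pipe would receive SIGPIPE instead.
    _, err = w.Write([]byte("tunnel log line\n"))
    fmt.Println("write failed:", err, "is EPIPE:", errors.Is(err, syscall.EPIPE))
}

The "blocks once the pipe fills up" branch would correspond to the parent still being alive but no longer reading; with the read end actually closed, writes fail immediately.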

Thoughts on this?

thecadams commented 1 year ago

@Blefish you were right: ignoring the child process's stdout and stderr seems to prevent the child process from crashing. My fork has the change you described, and it fixes the issue for me. Thanks for the suggestion!
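
For reference, in os/exec terms the fix described amounts to not attaching the child's stdout/stderr to the provider at all: if Stdout and Stderr are left nil, the standard library connects them to the null device, so nothing ties the child's lifetime to the provider process. The command string below is a placeholder, not the fork's actual code.

// Illustrative sketch of the fix being described; not the fork's actual code.
package main

import (
    "os/exec"
)

func startDetached() (*exec.Cmd, error) {
    // Placeholder command, standing in for the tunnel child process.
    cmd := exec.Command("terraform-provider-ssh", "tunnel")

    // Stdout and Stderr are deliberately left nil: os/exec then connects the
    // child's stdout and stderr to the null device (os.DevNull), so the child
    // no longer depends on the provider process staying alive.
    return cmd, cmd.Start()
}

func main() {
    _ = startDetached // sketch only; not wired to a real child binary
}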

mvgijssel commented 1 year ago

@thecadams thanks for putting up the fork! I've managed to get it working when executing Terraform locally, but unfortunately Terraform Cloud with remote execution does not work. Terraform Cloud shows the same behaviour you describe even with your fork installed: the SSH tunnel stops 2 or 3 seconds after it's started.

AndrewChubatiuk commented 1 year ago

You can try release v0.2.3.