Juniper / junos-terraform

Apache License 2.0
69 stars 22 forks source link

Creating a large number of resources causes plugin to fail due to SSH connection limits #29

Open JerradGit opened 2 years ago

JerradGit commented 2 years ago

Summary

We want to use JTAF to provision 100% of device configurations which results in hundreds if not thousands of resources which causes the plugin to fail randomly.

│ Error: Request cancelled
│ 
│   with module.vlans_for_each.junos-qfx_VlansVlanVlan__Id.vlan_vid["VLAN_390"],
│   on vlans_for_each/main.tf line 22, in resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid":
│   22: resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid" {
│ 
│ The plugin.(*GRPCProvider).ValidateResourceConfig request was cancelled.
╵
╷
│ Error: Request cancelled
│ 
│   with module.vlans_for_each.junos-qfx_VlansVlanVlan__Id.vlan_vid["VLAN_270"],
│   on vlans_for_each/main.tf line 22, in resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid":
│   22: resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid" {
│ 
│ The plugin.(*GRPCProvider).ValidateResourceConfig request was cancelled.
╵
╷
│ Error: Request cancelled
│ 
│   with module.vlans_for_each.junos-qfx_VlansVlanVlan__Id.vlan_vid["VLAN_384"],
│   on vlans_for_each/main.tf line 22, in resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid":
│   22: resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid" {
│ 
│ The plugin.(*GRPCProvider).ValidateResourceConfig request was cancelled.
╵
╷
│ Error: Request cancelled
│ 
│   with module.vlans_for_each.junos-qfx_VlansVlanVlan__Id.vlan_vid["VLAN_146"],
│   on vlans_for_each/main.tf line 22, in resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid":
│   22: resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid" {
│ 
│ The plugin.(*GRPCProvider).ValidateResourceConfig request was cancelled.
╵
╷
│ Error: Request cancelled
│ 
│   with module.vlans_for_each.junos-qfx_VlansVlanVlan__Id.vlan_vid["VLAN_218"],
│   on vlans_for_each/main.tf line 22, in resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid":
│   22: resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid" {
│ 
│ The plugin.(*GRPCProvider).ValidateResourceConfig request was cancelled.

After investigation we have determined that the JTAF plugin is establishing and terminating a TCP sockets towards the device for each resource which causes Junos to enforce connection limits and refuse new connections. This issue can occur very randomly depending on the performance of your terraform client device. e.g. Fast CPU with no competition for resources you can expect this to occur more frequently.

e.g. vqfx messages output during a few seconds of an apply

vagrant@vqfx-re# 
*** messages ***
Mar  2 16:53:58  vqfx-re sshd[64282]: Accepted password for automation from 10.0.2.2 port 36788 ssh2
Mar  2 16:53:58  vqfx-re sshd[64289]: Accepted password for automation from 10.0.2.2 port 36790 ssh2
Mar  2 16:53:59  vqfx-re sshd[64297]: Accepted password for automation from 10.0.2.2 port 36792 ssh2
Mar  2 16:53:59  vqfx-re sshd[64304]: Accepted password for automation from 10.0.2.2 port 36794 ssh2
Mar  2 16:53:59  vqfx-re /usr/bin/ssh[64296]: ssh: connect to host 169.254.0.1 port 3000: Operation timed out
Mar  2 16:54:00  vqfx-re sshd[64313]: Accepted password for automation from 10.0.2.2 port 36796 ssh2
Mar  2 16:54:00  vqfx-re sshd[64320]: Accepted password for automation from 10.0.2.2 port 36798 ssh2
Mar  2 16:54:01  vqfx-re sshd[64327]: Accepted password for automation from 10.0.2.2 port 36800 ssh2
Mar  2 16:54:01  vqfx-re sshd[64334]: Accepted password for automation from 10.0.2.2 port 36802 ssh2
Mar  2 16:54:01  vqfx-re sshd[64341]: Accepted password for automation from 10.0.2.2 port 36804 ssh2
Mar  2 16:54:02  vqfx-re sshd[64348]: Accepted password for automation from 10.0.2.2 port 36806 ssh2
Mar  2 16:54:02  vqfx-re sshd[64355]: Accepted password for automation from 10.0.2.2 port 36808 ssh2
Mar  2 16:54:03  vqfx-re sshd[64362]: Accepted password for automation from 10.0.2.2 port 36810 ssh2
Mar  2 16:54:03  vqfx-re sshd[64369]: Accepted password for automation from 10.0.2.2 port 36812 ssh2
Mar  2 16:54:03  vqfx-re sshd[64376]: Accepted password for automation from 10.0.2.2 port 36814 ssh2
Mar  2 16:54:04  vqfx-re sshd[64383]: Accepted password for automation from 10.0.2.2 port 36816 ssh2
Mar  2 16:54:04  vqfx-re sshd[64390]: Accepted password for automation from 10.0.2.2 port 36818 ssh2
Mar  2 16:54:05  vqfx-re sshd[64397]: Accepted password for automation from 10.0.2.2 port 36820 ssh2
Mar  2 16:54:05  vqfx-re sshd[64404]: Accepted password for automation from 10.0.2.2 port 36822 ssh2
Mar  2 16:54:06  vqfx-re sshd[64411]: Accepted password for automation from 10.0.2.2 port 36824 ssh2
Mar  2 16:54:06  vqfx-re sshd[64418]: Accepted password for automation from 10.0.2.2 port 36826 ssh2
Mar  2 16:54:07  vqfx-re sshd[64425]: Accepted password for automation from 10.0.2.2 port 36828 ssh2
Mar  2 16:54:08  vqfx-re sshd[64432]: Accepted password for automation from 10.0.2.2 port 36832 ssh2
Mar  2 16:54:08  vqfx-re sshd[64439]: Accepted password for automation from 10.0.2.2 port 36834 ssh2
Mar  2 16:54:09  vqfx-re sshd[64446]: Accepted password for automation from 10.0.2.2 port 36836 ssh2
Mar  2 16:54:09  vqfx-re sshd[64453]: Accepted password for automation from 10.0.2.2 port 36838 ssh2
Mar  2 16:54:10  vqfx-re sshd[64461]: Accepted password for automation from 10.0.2.2 port 36840 ssh2
Mar  2 16:54:10  vqfx-re /usr/bin/ssh[64455]: ssh: connect to host 169.254.0.1 port 3000: Operation timed out
Mar  2 16:54:11  vqfx-re sshd[64470]: Accepted password for automation from 10.0.2.2 port 36842 ssh2
Mar  2 16:54:11  vqfx-re sshd[64477]: Accepted password for automation from 10.0.2.2 port 36844 ssh2
Mar  2 16:54:11  vqfx-re sshd[64484]: Accepted password for automation from 10.0.2.2 port 36846 ssh2
Mar  2 16:54:12  vqfx-re sshd[64491]: Accepted password for automation from 10.0.2.2 port 36848 ssh2
Mar  2 16:54:12  vqfx-re sshd[64498]: Accepted password for automation from 10.0.2.2 port 36850 ssh2
Mar  2 16:54:13  vqfx-re sshd[64505]: Accepted password for automation from 10.0.2.2 port 36852 ssh2
Mar  2 16:54:13  vqfx-re sshd[64512]: Accepted password for automation from 10.0.2.2 port 36854 ssh2
Mar  2 16:54:14  vqfx-re sshd[64519]: Accepted password for automation from 10.0.2.2 port 36856 ssh2
Mar  2 16:54:14  vqfx-re sshd[64526]: Accepted password for automation from 10.0.2.2 port 36858 ssh2
Mar  2 16:54:14  vqfx-re sshd[64533]: Accepted password for automation from 10.0.2.2 port 36860 ssh2
Mar  2 16:54:15  vqfx-re sshd[64540]: Accepted password for automation from 10.0.2.2 port 36862 ssh2
Mar  2 16:54:15  vqfx-re sshd[64547]: Accepted password for automation from 10.0.2.2 port 36864 ssh2
Mar  2 16:54:16  vqfx-re sshd[64554]: Accepted password for automation from 10.0.2.2 port 36868 ssh2
Mar  2 16:54:16  vqfx-re sshd[64561]: Accepted password for automation from 10.0.2.2 port 36870 ssh2
Mar  2 16:54:17  vqfx-re sshd[64568]: Accepted password for automation from 10.0.2.2 port 36872 ssh2
Mar  2 16:54:17  vqfx-re sshd[64575]: Accepted password for automation from 10.0.2.2 port 36874 ssh2
Mar  2 16:54:18  vqfx-re sshd[64582]: Accepted password for automation from 10.0.2.2 port 36876 ssh2
Mar  2 16:54:18  vqfx-re sshd[64589]: Accepted password for automation from 10.0.2.2 port 36878 ssh2
Mar  2 16:54:19  vqfx-re sshd[64596]: Accepted password for automation from 10.0.2.2 port 36880 ssh2
Mar  2 16:54:19  vqfx-re sshd[64603]: Accepted password for automation from 10.0.2.2 port 36882 ssh2
Mar  2 16:54:20  vqfx-re sshd[64610]: Accepted password for automation from 10.0.2.2 port 36884 ssh2
Mar  2 16:54:20  vqfx-re sshd[64617]: Accepted password for automation from 10.0.2.2 port 36886 ssh2
Mar  2 16:54:21  vqfx-re sshd[64625]: Accepted password for automation from 10.0.2.2 port 36888 ssh2
Mar  2 16:54:21  vqfx-re sshd[64632]: Accepted password for automation from 10.0.2.2 port 36890 ssh2

We have found the following work around for now which allows us to get through the apply, however 700+ resources can take over 10 minutes to apply against a vqfx on the same device as the terraform client.

set system services ssh connection-limit 250
set system services ssh rate-limit 250
set system services netconf ssh connection-limit 250
set system services netconf ssh rate-limit 250

You can re-produce this with this sample HCL

locals {
  vlans = flatten([
    for i in range(2, 500) : [
      {
        name    = "VLAN_${i}"
        vlan_id = i
      }
    ]
  ])
}

resource "junos-qfx_VlansVlanVlan__Id" "vlan_vid" {
  for_each = {
    for vlan in local.vlans : "${vlan.name}" => vlan
  }

  resource_name = "${each.value.name}_vlan_id_grp"
  name          = each.value.name
  vlan__id      = each.value.vlan_id
}

I don't know if the new socket per resource is by design or if it can be changed. While the limit values can be adjusted, this creates a larger DoS surface which would be nice to avoid.

davedotdev commented 2 years ago

Hi there,

Thanks for the detail and reproduction info. Super useful.

Yikes. Ok. That's not good. In all of my testing, I came across this once and it seemed to level off at about 5-10 connections. The system kind of self-regulated as parts of the tree converged etc. However, as to your test, you've found an unhappy path with the way the transport layer works. Thanks for flagging this.

I'll tag this as an enhancement and work on it. Historically, the TF transport layer was locked to a single transport instance which limited the connection to exactly one, but it took forever to complete.

Would a configurable worker pool of connections work for you? Something we could put in the provider config as input?

JerradGit commented 2 years ago

I think a worker pool would be fine, I can test locally on a dev branch when something is ready and report back.

aaronchar commented 2 years ago

@davedotdev

I have created a PR on the go-netconf library that addresses the transport part of this issue. Using it in conjunction with this provider the time it takes to deploy has been greatly reduced.

Previously we were seeing upwards of 10+ minutes for a deployment on the same scale, this is also performing a commit check before we fully commit it.

Apply complete! Resources: 721 added, 0 changed, 0 destroyed.
terraform apply -auto-approve  298.09s user 5.12s system 329% cpu 1:32.02 total

Like I mentioned in the PR on the other library, I actually went with a worker pool initially and that just ends up killing the netconf subsystem regardless of any settings I tried.