IBM-Cloud / terraform-provider-ibm

https://registry.terraform.io/providers/IBM-Cloud/ibm/latest/docs
Mozilla Public License 2.0

DNS Resource record creation is too slow to build large clusters #3867

Open gmewhinney opened 2 years ago

gmewhinney commented 2 years ago

Terraform CLI and Terraform IBM Provider Version

Terraform version: 1.0.11
Terraform IBM provider version: 1.41

Affected Resource(s)

ibm_dns_resource_record

Terraform Configuration Files

/*
    Creates specified number of IBM Cloud Virtual Server Instance(s).
*/

terraform {
  required_providers {
    ibm = {
      source = "IBM-Cloud/ibm"
    }
  }
}

variable "total_vsis" {}
variable "vsi_name_prefix" {}
variable "vpc_id" {}
variable "zones" {}
variable "dns_service_id" {}
variable "dns_zone_id" {}
variable "dns_domain" {}
variable "vsi_subnet_id" {}
variable "vsi_security_group" {}
variable "vsi_profile" {}
variable "vsi_image_id" {}
variable "vsi_user_public_key" {}
variable "vsi_meta_private_key" {}
variable "vsi_meta_public_key" {}
variable "resource_group_id" {}

data "template_file" "metadata_startup_script" {
  template = <<EOF
#!/usr/bin/env bash
if grep -q "Red Hat" /etc/os-release
then
    USER=vpcuser
    yum install -y python3 kernel-devel-$(uname -r) kernel-headers-$(uname -r)
elif grep -q "Ubuntu" /etc/os-release
then
    USER=ubuntu
fi
sed -i -e "s/^/no-port-forwarding,no-agent-forwarding,no-X11-forwarding,command=\"echo \'Please login as the user \\\\\"$USER\\\\\" rather than the user \\\\\"root\\\\\".\';echo;sleep 10; exit 142\" /" ~/.ssh/authorized_keys
echo "${var.vsi_meta_private_key}" > ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
echo "${var.vsi_meta_public_key}" >> ~/.ssh/authorized_keys
echo "StrictHostKeyChecking no" >> ~/.ssh/config
echo "DOMAIN=\"${var.dns_domain}\"" >> "/etc/sysconfig/network-scripts/ifcfg-eth0"
systemctl restart NetworkManager
systemctl stop firewalld
firewall-offline-cmd --zone=public --add-port=1191/tcp
firewall-offline-cmd --zone=public --add-port=60000-61000/tcp
firewall-offline-cmd --zone=public --add-port=47080/tcp
firewall-offline-cmd --zone=public --add-port=47080/udp
firewall-offline-cmd --zone=public --add-port=47443/tcp
firewall-offline-cmd --zone=public --add-port=47443/udp
firewall-offline-cmd --zone=public --add-port=4444/tcp
firewall-offline-cmd --zone=public --add-port=4444/udp
firewall-offline-cmd --zone=public --add-port=4739/udp
firewall-offline-cmd --zone=public --add-port=4739/tcp
firewall-offline-cmd --zone=public --add-port=9084/tcp
firewall-offline-cmd --zone=public --add-port=9085/tcp
firewall-offline-cmd --zone=public --add-service=http
firewall-offline-cmd --zone=public --add-service=https
systemctl start firewalld
EOF
}

resource "ibm_is_instance" "itself" {
  for_each = {
    # Assign a subnet ID and a zone to each instance; element() wraps
    # around, so instances cycle through the supplied lists.
    for idx, count_number in range(1, var.total_vsis + 1) : idx => {
      sequence_string = tostring(count_number)
      subnet_id       = element(var.vsi_subnet_id, idx)
      zone            = element(var.zones, idx)
    }
  }

  name    = format("%s-%s", var.vsi_name_prefix, each.value.sequence_string)
  image   = var.vsi_image_id
  profile = var.vsi_profile

  primary_network_interface {
    subnet          = each.value.subnet_id
    security_groups = var.vsi_security_group
  }

  vpc            = var.vpc_id
  zone           = each.value.zone
  resource_group = var.resource_group_id
  keys           = var.vsi_user_public_key
  user_data      = data.template_file.metadata_startup_script.rendered

  boot_volume {
    name = format("%s-boot-%s", var.vsi_name_prefix, each.value.sequence_string)
  }
}

resource "ibm_dns_resource_record" "a_itself" {
  for_each = {
    for idx, count_number in range(1, var.total_vsis + 1) : idx => {
      name       = element(tolist([for name_details in ibm_is_instance.itself : name_details.name]), idx)
      network_ip = element(tolist([for ip_details in ibm_is_instance.itself : ip_details.primary_network_interface[0]["primary_ipv4_address"]]), idx)
    }
  }

  instance_id = var.dns_service_id
  zone_id     = var.dns_zone_id
  type        = "A"
  name        = each.value.name
  rdata       = each.value.network_ip
  ttl         = 300
}

resource "ibm_dns_resource_record" "ptr_itself" {
  for_each = {
    for idx, count_number in range(1, var.total_vsis + 1) : idx => {
      name       = element(tolist([for name_details in ibm_is_instance.itself : name_details.name]), idx)
      network_ip = element(tolist([for ip_details in ibm_is_instance.itself : ip_details.primary_network_interface[0]["primary_ipv4_address"]]), idx)
    }
  }

  instance_id = var.dns_service_id
  zone_id     = var.dns_zone_id
  type        = "PTR"
  name        = each.value.network_ip
  rdata       = format("%s.%s", each.value.name, var.dns_domain)
  ttl         = 300
  depends_on  = [ibm_dns_resource_record.a_itself]
}

output "instance_ids" {
  value      = try(toset([for instance_details in ibm_is_instance.itself : instance_details.id]), [])
  depends_on = [ibm_dns_resource_record.a_itself, ibm_dns_resource_record.ptr_itself]
}

output "instance_private_ips" {
  value      = try(toset([for instance_details in ibm_is_instance.itself : instance_details.primary_network_interface[0]["primary_ipv4_address"]]), [])
  depends_on = [ibm_dns_resource_record.a_itself, ibm_dns_resource_record.ptr_itself]
}

Debug Output

https://gist.github.com/gmewhinney/6071c5f490f9e31d02c1a385c4b2c87a#file-terraform-log-txt

Expected Behavior

Outside of Terraform, 100 DNS records can be created in under 1 second. We need to get closer to that within Terraform.

Actual Behavior

Creating a single DNS 'A' record takes over 1 second, so creating the records for the 64 compute nodes on this small cluster took 74 seconds. All of the record creations start within 1 second of each other, at which point every record is in progress, but the completions trickle out at a rate of about one per second. The first record finishes after about 1 second, but the last one finishes after 74 seconds. PTR records are a little slower: creating all 64 takes 95 seconds.
This is borderline for a small cluster, but it would take over 30 minutes for a 1000-node cluster.
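
As a rough check on that extrapolation, here is a minimal Python sketch. It assumes record creation stays serialized, that the per-record cost measured on the 64-node run stays constant, and that the PTR phase runs after the A phase (as the `depends_on` in the configuration suggests). The function names are illustrative only and are not part of the provider.

```python
# Timings reported above for the 64-node cluster.
MEASURED_NODES = 64
A_TOTAL_SECONDS = 74    # time to create all 64 A records
PTR_TOTAL_SECONDS = 95  # time to create all 64 PTR records


def per_record_seconds(total_seconds: float, records: int) -> float:
    """Average interval between completions when records finish one by one."""
    return total_seconds / records


def estimate_total_seconds(nodes: int) -> float:
    """Linear extrapolation (assumption): serialized creation with a constant
    per-record cost, A records first and PTR records afterwards."""
    a_rate = per_record_seconds(A_TOTAL_SECONDS, MEASURED_NODES)
    ptr_rate = per_record_seconds(PTR_TOTAL_SECONDS, MEASURED_NODES)
    return nodes * (a_rate + ptr_rate)


if __name__ == "__main__":
    seconds = estimate_total_seconds(1000)
    print(f"{seconds / 60:.1f} minutes")  # prints "44.0 minutes"
```

Under these assumptions the A records alone would take roughly 19 minutes for 1000 nodes and the PTR records roughly 25 minutes, which is consistent with the "over 30 minutes" figure.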

Steps to Reproduce

The code will be moving to a public repository soon. Right now it resides in an internal repository at: https://github.ibm.com/IBMSpectrumScale/ibm-spectrum-scale-ibm-cloud-schematics

To reproduce, build a Scale cluster starting from Schematics, specifying the above repository.

Important Factoids

The cluster creation is a hybrid: Schematics creates part of the cluster and a service machine, then transfers control to the service machine, where the Terraform that creates the DNS records is executed. The log linked above is from the service machine.

References

This issue is a result of the need to serialize record creation, as detailed in: https://github.com/IBM-Cloud/terraform-provider-ibm/issues/1430. I have discussed this issue with @MalarvizhiK, who worked on the above issue. I think she and Vasu from the cloud DNS team have some ideas for improving this.

MalarvizhiK commented 2 years ago

PR: https://github.com/IBM-Cloud/terraform-provider-ibm/pull/3886/files

gmewhinney commented 2 years ago

The 1.43.0 prototype greatly improved DNS record creation, but there is still a delay of 40+ seconds when creating some of the PTR records. A log from the run is linked in the next comment. A good example is the record for instance 21: it starts at 2022-07-08T19:30:46.744Z and ends at 2022-07-08T19:30:46.744Z. This record took over 40 seconds, while most took 3-4 seconds.

gmewhinney commented 2 years ago

https://gist.github.com/gmewhinney/8c2e26c051ebc206cbb3a30ae9ce2114