department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Duplicate vets-api infra to mitigate evss-bgs causing latency #45570

Closed ianhundere closed 2 years ago

ianhundere commented 2 years ago

Description

As part of the #vets-api-latency-issue-aug08 effort, this ticket will duplicate much of the infra that vets-api relies on to then route problematic evss-bgs routes to the duplicated ASGs/LBs. This ticket is concerned with the duplication of infra, custom jenkins job to plan/apply duplicated resources, and revproxy changes.

Acceptance Criteria

ianhundere commented 2 years ago

complete: https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split

ianhundere commented 2 years ago

plans

dev

Acquiring state lock. This may take a few moments...
data.aws_subnet.subnet_id_a: Reading...
data.aws_subnet.subnet_id_c: Reading...
data.aws_autoscaling_group.selected: Reading...
data.aws_subnet.subnet_id_b: Reading...
data.aws_launch_template.selected: Reading...
data.aws_elb.selected: Reading...
data.aws_subnet.subnet_id_b: Read complete after 0s [id=subnet-8d3dd6e9]
data.aws_subnet.subnet_id_c: Read complete after 0s [id=subnet-2c4e176a]
data.aws_subnet.subnet_id_a: Read complete after 0s [id=subnet-d6f512a0]
data.aws_launch_template.selected: Read complete after 1s [id=lt-0dde36914873ae8b6]
data.aws_autoscaling_group.selected: Read complete after 1s [id=dsva-vagov-dev-deployment-vagov-dev-vets-api-server-20220810-194214-asg]
data.aws_elb.selected: Read complete after 1s [id=dsva-vagov-dev-vets-api-elb]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_autoscaling_group.vets-api-server will be created
  + resource "aws_autoscaling_group" "vets-api-server" {
      + arn                       = (known after apply)
      + availability_zones        = (known after apply)
      + default_cooldown          = (known after apply)
      + desired_capacity          = (known after apply)
      + force_delete              = false
      + force_delete_warm_pool    = false
      + health_check_grace_period = 120
      + health_check_type         = "ELB"
      + id                        = (known after apply)
      + max_size                  = 6
      + metrics_granularity       = "1Minute"
      + min_size                  = 3
      + name                      = "dsva-vagov-dev-deployment-vagov-dev-vets-api-server-20220810-194214-asg-evss-bgs-split"
      + name_prefix               = (known after apply)
      + protect_from_scale_in     = false
      + service_linked_role_arn   = (known after apply)
      + termination_policies      = [
          + "OldestLaunchTemplate",
        ]
      + vpc_zone_identifier       = [
          + "subnet-2c4e176a,subnet-d6f512a0,subnet-8d3dd6e9",
        ]
      + wait_for_capacity_timeout = "10m"

      + instance_refresh {
          + strategy = "Rolling"

          + preferences {
              + min_healthy_percentage = 50
              + skip_matching          = false
            }
        }

      + launch_template {
          + id      = "lt-0dde36914873ae8b6"
          + name    = (known after apply)
          + version = "$Default"
        }
    }

  # aws_elb.vets-api-server will be created
  + resource "aws_elb" "vets-api-server" {
      + arn                         = (known after apply)
      + availability_zones          = (known after apply)
      + connection_draining         = true
      + connection_draining_timeout = 30
      + cross_zone_load_balancing   = false
      + desync_mitigation_mode      = "defensive"
      + dns_name                    = (known after apply)
      + id                          = (known after apply)
      + idle_timeout                = 120
      + instances                   = (known after apply)
      + internal                    = true
      + name                        = "dsva-vagov-dev-vets-api-bgs"
      + security_groups             = [
          + "sg-9eaa28f9",
        ]
      + source_security_group       = (known after apply)
      + source_security_group_id    = (known after apply)
      + subnets                     = [
          + "subnet-2c4e176a",
          + "subnet-8d3dd6e9",
          + "subnet-d6f512a0",
        ]
      + tags_all                    = {
          + "application" = "vets-api"
          + "environment" = "dev"
          + "managed_by"  = "Terraform"
          + "purpose"     = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
          + "repo"        = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
        }
      + zone_id                     = (known after apply)

      + health_check {
          + healthy_threshold   = 3
          + interval            = 30
          + target              = "HTTP:3004/"
          + timeout             = 5
          + unhealthy_threshold = 2
        }

      + listener {
          + instance_port     = 3004
          + instance_protocol = "HTTP"
          + lb_port           = 3004
          + lb_protocol       = "HTTP"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

staging

> tf plan
Acquiring state lock. This may take a few moments...
data.aws_subnet.subnet_id_a: Reading...
data.aws_subnet.subnet_id_c: Reading...
data.aws_elb.selected: Reading...
data.aws_subnet.subnet_id_b: Reading...
data.aws_autoscaling_group.selected: Reading...
data.aws_launch_template.selected: Reading...
data.aws_subnet.subnet_id_a: Read complete after 0s [id=subnet-70f51206]
data.aws_subnet.subnet_id_c: Read complete after 0s [id=subnet-b84e17fe]
data.aws_subnet.subnet_id_b: Read complete after 0s [id=subnet-cc3cd7a8]
data.aws_autoscaling_group.selected: Read complete after 0s [id=dsva-vagov-staging-deployment-vagov-staging-vets-api-server-20220810-194219-asg]
data.aws_launch_template.selected: Read complete after 1s [id=lt-0301a333d650b33b1]
data.aws_elb.selected: Read complete after 1s [id=dsva-vagov-staging-vets-api-elb]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_autoscaling_group.vets-api-server will be created
  + resource "aws_autoscaling_group" "vets-api-server" {
      + arn                       = (known after apply)
      + availability_zones        = (known after apply)
      + default_cooldown          = (known after apply)
      + desired_capacity          = (known after apply)
      + force_delete              = false
      + force_delete_warm_pool    = false
      + health_check_grace_period = 120
      + health_check_type         = "ELB"
      + id                        = (known after apply)
      + max_size                  = 12
      + metrics_granularity       = "1Minute"
      + min_size                  = 6
      + name                      = "dsva-vagov-staging-deployment-vagov-staging-vets-api-server-20220810-194219-asg-evss-bgs-split"
      + name_prefix               = (known after apply)
      + protect_from_scale_in     = false
      + service_linked_role_arn   = (known after apply)
      + termination_policies      = [
          + "OldestLaunchTemplate",
        ]
      + vpc_zone_identifier       = [
          + "subnet-cc3cd7a8,subnet-b84e17fe,subnet-70f51206",
        ]
      + wait_for_capacity_timeout = "10m"

      + instance_refresh {
          + strategy = "Rolling"

          + preferences {
              + min_healthy_percentage = 50
              + skip_matching          = false
            }
        }

      + launch_template {
          + id      = "lt-0301a333d650b33b1"
          + name    = (known after apply)
          + version = "$Default"
        }
    }

  # aws_elb.vets-api-server will be created
  + resource "aws_elb" "vets-api-server" {
      + arn                         = (known after apply)
      + availability_zones          = (known after apply)
      + connection_draining         = true
      + connection_draining_timeout = 30
      + cross_zone_load_balancing   = false
      + desync_mitigation_mode      = "defensive"
      + dns_name                    = (known after apply)
      + id                          = (known after apply)
      + idle_timeout                = 120
      + instances                   = (known after apply)
      + internal                    = true
      + name                        = "dsva-vagov-staging-vets-api-bgs"
      + security_groups             = [
          + "sg-854cc6e2",
        ]
      + source_security_group       = (known after apply)
      + source_security_group_id    = (known after apply)
      + subnets                     = [
          + "subnet-70f51206",
          + "subnet-b84e17fe",
          + "subnet-cc3cd7a8",
        ]
      + tags_all                    = {
          + "application" = "vets-api"
          + "environment" = "staging"
          + "managed_by"  = "Terraform"
          + "purpose"     = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
          + "repo"        = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
        }
      + zone_id                     = (known after apply)

      + health_check {
          + healthy_threshold   = 3
          + interval            = 30
          + target              = "HTTP:3004/"
          + timeout             = 5
          + unhealthy_threshold = 2
        }

      + listener {
          + instance_port     = 3004
          + instance_protocol = "HTTP"
          + lb_port           = 3004
          + lb_protocol       = "HTTP"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

prod

Acquiring state lock. This may take a few moments...
data.aws_subnet.subnet_id_a: Reading...
data.aws_subnet.subnet_id_b: Reading...
data.aws_elb.selected: Reading...
data.aws_launch_template.selected: Reading...
data.aws_subnet.subnet_id_c: Reading...
data.aws_autoscaling_group.selected: Reading...
data.aws_subnet.subnet_id_a: Read complete after 0s [id=subnet-f3f31485]
data.aws_subnet.subnet_id_b: Read complete after 0s [id=subnet-a433d8c0]
data.aws_subnet.subnet_id_c: Read complete after 0s [id=subnet-66411820]
data.aws_launch_template.selected: Read complete after 0s [id=lt-06c6752970f3ee877]
data.aws_autoscaling_group.selected: Read complete after 1s [id=dsva-vagov-prod-deployment-vagov-prod-vets-api-server-20220810-190140-asg]
data.aws_elb.selected: Read complete after 1s [id=dsva-vagov-prod-vets-api-elb]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_autoscaling_group.vets-api-server will be created
  + resource "aws_autoscaling_group" "vets-api-server" {
      + arn                       = (known after apply)
      + availability_zones        = (known after apply)
      + default_cooldown          = (known after apply)
      + desired_capacity          = (known after apply)
      + force_delete              = false
      + force_delete_warm_pool    = false
      + health_check_grace_period = 120
      + health_check_type         = "ELB"
      + id                        = (known after apply)
      + max_size                  = 32
      + metrics_granularity       = "1Minute"
      + min_size                  = 16
      + name                      = "dsva-vagov-prod-deployment-vagov-prod-vets-api-server-20220810-190140-asg-evss-bgs-split"
      + name_prefix               = (known after apply)
      + protect_from_scale_in     = false
      + service_linked_role_arn   = (known after apply)
      + termination_policies      = [
          + "OldestLaunchTemplate",
        ]
      + vpc_zone_identifier       = [
          + "subnet-f3f31485,subnet-66411820,subnet-a433d8c0",
        ]
      + wait_for_capacity_timeout = "10m"

      + instance_refresh {
          + strategy = "Rolling"

          + preferences {
              + min_healthy_percentage = 50
              + skip_matching          = false
            }
        }

      + launch_template {
          + id      = "lt-06c6752970f3ee877"
          + name    = (known after apply)
          + version = "$Default"
        }
    }

  # aws_elb.vets-api-server will be created
  + resource "aws_elb" "vets-api-server" {
      + arn                         = (known after apply)
      + availability_zones          = (known after apply)
      + connection_draining         = true
      + connection_draining_timeout = 30
      + cross_zone_load_balancing   = false
      + desync_mitigation_mode      = "defensive"
      + dns_name                    = (known after apply)
      + id                          = (known after apply)
      + idle_timeout                = 120
      + instances                   = (known after apply)
      + internal                    = true
      + name                        = "dsva-vagov-prod-vets-api-bgs"
      + security_groups             = [
          + "sg-2d6cf84a",
        ]
      + source_security_group       = (known after apply)
      + source_security_group_id    = (known after apply)
      + subnets                     = [
          + "subnet-66411820",
          + "subnet-a433d8c0",
          + "subnet-f3f31485",
        ]
      + tags_all                    = {
          + "application" = "vets-api"
          + "environment" = "prod"
          + "managed_by"  = "Terraform"
          + "purpose"     = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
          + "repo"        = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
        }
      + zone_id                     = (known after apply)

      + health_check {
          + healthy_threshold   = 3
          + interval            = 30
          + target              = "HTTP:3004/"
          + timeout             = 5
          + unhealthy_threshold = 2
        }

      + listener {
          + instance_port     = 3004
          + instance_protocol = "HTTP"
          + lb_port           = 3004
          + lb_protocol       = "HTTP"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.
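
For context, a minimal sketch of Terraform config that could produce ASG plans like the ones above. This is a reconstruction from the plan output, not the contents of vsp-infra-evss-bgs-split. The variables `asg_name` and `launch_template_name` are hypothetical, and copying `min_size` / `max_size` from the existing ASG is a guess based on the per-environment values in the plans.

```hcl
# Hypothetical inputs: names of the existing ASG and launch template to copy.
variable "asg_name" {
  type = string
}

variable "launch_template_name" {
  type = string
}

# Look up the existing vets-api ASG and its launch template.
data "aws_autoscaling_group" "selected" {
  name = var.asg_name
}

data "aws_launch_template" "selected" {
  name = var.launch_template_name
}

# Duplicate ASG, sized like the original (assumption) and reusing its launch template.
resource "aws_autoscaling_group" "vets-api-server" {
  name                      = "${data.aws_autoscaling_group.selected.name}-evss-bgs-split"
  min_size                  = data.aws_autoscaling_group.selected.min_size
  max_size                  = data.aws_autoscaling_group.selected.max_size
  health_check_type         = "ELB"
  health_check_grace_period = 120
  termination_policies      = ["OldestLaunchTemplate"]

  # The data source returns the subnets as one comma-joined string.
  vpc_zone_identifier = split(",", data.aws_autoscaling_group.selected.vpc_zone_identifier)

  launch_template {
    id      = data.aws_launch_template.selected.id
    version = "$Default"
  }

  instance_refresh {
    strategy = "Rolling"

    preferences {
      min_healthy_percentage = 50
    }
  }
}
```

The plans list the subnets as a single comma-joined element, which suggests the existing ASG's `vpc_zone_identifier` string was passed through as-is. The sketch splits it so each subnet is its own element.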

ianhundere commented 2 years ago

the dns_name will still need to be added to the route53 record, but i'll do that manually since we don't wanna mess with state if we can help it.

ianhundere commented 2 years ago

ran into an issue where the asg name keeps changing due to deploys. i'm now filtering on some tags with the aws_autoscaling_groups data source to grab the latest asg name.
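
A minimal sketch of that lookup, under stated assumptions: the tag keys and values in the filters are hypothetical (the thread doesn't say which tags were used), and "latest" is taken to mean the name that sorts last, which relies on the timestamp embedded in the ASG names seen in the plans. Depending on the AWS provider version, the filter names may need to be the older `key` / `value` form rather than `tag:<key>`.

```hcl
variable "env" {
  type = string
}

# Find every ASG carrying the (hypothetical) identifying tags.
data "aws_autoscaling_groups" "candidates" {
  filter {
    name   = "tag:deployment_name" # hypothetical tag key
    values = ["vets-api-server"]
  }

  filter {
    name   = "tag:environment" # hypothetical tag key
    values = [var.env]
  }
}

locals {
  # Drop the duplicated ASG itself in case it carries the same tags.
  original_asg_names = [
    for n in data.aws_autoscaling_groups.candidates.names :
    n if length(regexall("-evss-bgs-split$", n)) == 0
  ]

  # Names embed a deploy timestamp, so the last one in sort order is the newest.
  latest_asg_name = reverse(sort(local.original_asg_names))[0]
}

# Full details of the ASG picked above.
data "aws_autoscaling_group" "selected" {
  name = local.latest_asg_name
}
```

If no ASG matches the filters, indexing `[0]` fails at plan time, which is probably the desired behaviour for a job that should not apply against a missing source ASG.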

ianhundere commented 2 years ago

currently we're at the following:

ianhundere commented 2 years ago

https://github.com/department-of-veterans-affairs/devops/pull/11782/files
^ jenkins job has been added, just needs to be tested.

ianhundere commented 2 years ago

jenkins job has been tested, just coordinating when to flip all the switches for staging.

ianhundere commented 2 years ago

the plan:

we’ll only do staging for this piece in order to test jenkins job

then y’all test staging and if all is well i’ll make the necessary changes to the rev proxy for both dev / prod

ianhundere commented 2 years ago

currently have a test jenkins job, will remove this when the jenkins pr is merged / closed: http://jenkins.vfs.va.gov/job/testing/job/testing_tf/

edit: test job removed / pr merged.

ianhundere commented 2 years ago

i'll keep this ticket open until the remaining changes for the rev proxy / jenkins job have been completed / merged.

ianhundere commented 2 years ago

manually ran the revproxy deploy job for staging as well as the seed job.

sh-4.2$ cat /usr/local/openresty/nginx/sites-enabled/api_server.conf | grep bgs
    # bgs and evss routes

revproxy is updated

ianhundere commented 2 years ago

https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split/pull/1
^ tags / asg attachment added

and nginx config fixed: https://github.com/department-of-veterans-affairs/devops/commit/280980b07dcbe9131ceb8b15a557716b7e1b9bf2
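
For readers unfamiliar with the attachment piece, a minimal sketch of registering an ASG's instances with a classic ELB. The variable names are hypothetical. In the real repo the values would more likely come from the `aws_autoscaling_group.vets-api-server` and `aws_elb.vets-api-server` resources shown in the plans, and the linked PR may do this differently.

```hcl
variable "asg_name" {
  type = string # hypothetical: name of the duplicated ASG
}

variable "elb_name" {
  type = string # hypothetical: name of the duplicated classic ELB
}

# Registers instances launched by the ASG with the ELB.
resource "aws_autoscaling_attachment" "vets-api-server" {
  autoscaling_group_name = var.asg_name
  elb                    = var.elb_name
}
```

The same result can be had with the `load_balancers` argument on the ASG itself. The provider docs warn against using both on the same ASG without ignoring changes to one of them.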

ianhundere commented 2 years ago

confirmed staging revproxy is correctly updated and that ec2s are being registered with LB:

sh-4.2$ cat api_server.conf | grep -A 4 bgs
    # bgs and evss routes
        location ~ ^/(v0|v1)(/debts|/debt_letters|/profile/ch33_bank_accounts|/profile/payment_history) {
      proxy_pass http://internal-dsva-vagov-staging-vets-api-2nd-445860677.us-gov-west-1.elb.amazonaws.com:3004$request_uri;

ianhundere commented 2 years ago

jenkins job is fixed: https://github.com/department-of-veterans-affairs/devops/pull/11803/files

ianhundere commented 2 years ago

reaching out to various parties about moving / renaming the dd forwarder before enabling logs in staging / prod.

ianhundere commented 2 years ago

jenkins job confirmed to be working.

ianhundere commented 2 years ago

it looks like traffic isn’t going thru the revproxy. Kyle was just as confused and sanity-checked it. we don’t know where the revproxy comes into play because all records point straight to the asg worker lbs. we’re gonna pull Jeremy in to see if he knows.

https://dsva.slack.com/archives/CTYQL39FE/p1660849458919659

edit: endpoints are being used, infra will investigate further.

ianhundere commented 2 years ago

pr up: https://github.com/department-of-veterans-affairs/devops/pull/11820/files

ianhundere commented 2 years ago

oops, dev/prod failed because the proper scheme / port wasn't included. monday woes

ianhundere commented 2 years ago

https://dsva.slack.com/archives/CJYRZK2HH/p1661185087672539

and we're a go!

ianhundere commented 2 years ago

our prometheus metrics rely on tags and the launch template doesn't include them, so tags (specifically deployment_name) were added to the 2nd asg's instances.

https://dsva.slack.com/archives/C03KT515C0H/p1661263063451239

provider "aws" {
  region = "us-gov-west-1"
  default_tags {
    tags = {
      Name            = "${var.name}-2nd"
      deployment_name = "vets-api-server"
      repo            = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
      managed_by      = "Terraform"
      application     = "vets-api"
      purpose         = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
      environment     = var.env
    }
  }
}

ianhundere commented 2 years ago

enabled access_logs for each respective elb:

https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split/commit/3bfe95ff88f3cb2674420a08bb9882f2ed18d97f
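
A minimal sketch of what enabling access logs on a classic ELB looks like in Terraform. It is not the contents of the commit above: the bucket variable, prefix, and interval are hypothetical, the ELB name follows the earlier plans, and the other settings are copied from the plan output.

```hcl
variable "env" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

variable "security_group_ids" {
  type = list(string)
}

variable "access_logs_bucket" {
  type = string # hypothetical: existing S3 bucket for ELB logs
}

resource "aws_elb" "vets-api-server" {
  name            = "dsva-vagov-${var.env}-vets-api-bgs"
  internal        = true
  subnets         = var.subnet_ids
  security_groups = var.security_group_ids
  idle_timeout    = 120

  listener {
    instance_port     = 3004
    instance_protocol = "HTTP"
    lb_port           = 3004
    lb_protocol       = "HTTP"
  }

  health_check {
    healthy_threshold   = 3
    unhealthy_threshold = 2
    interval            = 30
    timeout             = 5
    target              = "HTTP:3004/"
  }

  # Ships request logs to S3; interval is in minutes (5 or 60).
  access_logs {
    bucket        = var.access_logs_bucket
    bucket_prefix = "vets-api-bgs/${var.env}" # hypothetical prefix
    interval      = 60
    enabled       = true
  }
}
```

The S3 bucket needs a policy that allows ELB log delivery, otherwise the apply fails when the ELB tries to validate the bucket.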