Duplicate vets-api infra to mitigate evss-bgs causing latency

ianhundere commented 2 years ago

Description

As part of the #vets-api-latency-issue-aug08 effort, this ticket will duplicate much of the infra that vets-api relies on so that the problematic evss-bgs routes can be sent to the duplicated ASGs/LBs. It covers the duplication of infra, a custom jenkins job to plan/apply the duplicated resources, and the revproxy changes.

Acceptance Criteria

[x] duplicate tf resources for the following envs:
- [x] dev
- [x] staging
- [x] prod
[x] test with plans
- [x] dev
- [x] staging
- [x] prod
[x] create jenkins job to plan/apply split tf resources
[x] apply to necessary envs
- [x] dev
- [x] staging
- [x] prod
[x] update jenkins job
- [x] dev
- [x] staging
- [x] prod
[x] update nginx
- [x] dev
- [x] staging
- [x] prod
[x] Enable Access Logs

ianhundere commented 2 years ago

complete: https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split

ianhundere commented 2 years ago

plans

dev

Acquiring state lock. This may take a few moments...
data.aws_subnet.subnet_id_a: Reading...
data.aws_subnet.subnet_id_c: Reading...
data.aws_autoscaling_group.selected: Reading...
data.aws_subnet.subnet_id_b: Reading...
data.aws_launch_template.selected: Reading...
data.aws_elb.selected: Reading...
data.aws_subnet.subnet_id_b: Read complete after 0s [id=subnet-8d3dd6e9]
data.aws_subnet.subnet_id_c: Read complete after 0s [id=subnet-2c4e176a]
data.aws_subnet.subnet_id_a: Read complete after 0s [id=subnet-d6f512a0]
data.aws_launch_template.selected: Read complete after 1s [id=lt-0dde36914873ae8b6]
data.aws_autoscaling_group.selected: Read complete after 1s [id=dsva-vagov-dev-deployment-vagov-dev-vets-api-server-20220810-194214-asg]
data.aws_elb.selected: Read complete after 1s [id=dsva-vagov-dev-vets-api-elb]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_autoscaling_group.vets-api-server will be created
  + resource "aws_autoscaling_group" "vets-api-server" {
      + arn                       = (known after apply)
      + availability_zones        = (known after apply)
      + default_cooldown          = (known after apply)
      + desired_capacity          = (known after apply)
      + force_delete              = false
      + force_delete_warm_pool    = false
      + health_check_grace_period = 120
      + health_check_type         = "ELB"
      + id                        = (known after apply)
      + max_size                  = 6
      + metrics_granularity       = "1Minute"
      + min_size                  = 3
      + name                      = "dsva-vagov-dev-deployment-vagov-dev-vets-api-server-20220810-194214-asg-evss-bgs-split"
      + name_prefix               = (known after apply)
      + protect_from_scale_in     = false
      + service_linked_role_arn   = (known after apply)
      + termination_policies      = [
          + "OldestLaunchTemplate",
        ]
      + vpc_zone_identifier       = [
          + "subnet-2c4e176a,subnet-d6f512a0,subnet-8d3dd6e9",
        ]
      + wait_for_capacity_timeout = "10m"

      + instance_refresh {
          + strategy = "Rolling"

          + preferences {
              + min_healthy_percentage = 50
              + skip_matching          = false
            }
        }

      + launch_template {
          + id      = "lt-0dde36914873ae8b6"
          + name    = (known after apply)
          + version = "$Default"
        }
    }

  # aws_elb.vets-api-server will be created
  + resource "aws_elb" "vets-api-server" {
      + arn                         = (known after apply)
      + availability_zones          = (known after apply)
      + connection_draining         = true
      + connection_draining_timeout = 30
      + cross_zone_load_balancing   = false
      + desync_mitigation_mode      = "defensive"
      + dns_name                    = (known after apply)
      + id                          = (known after apply)
      + idle_timeout                = 120
      + instances                   = (known after apply)
      + internal                    = true
      + name                        = "dsva-vagov-dev-vets-api-bgs"
      + security_groups             = [
          + "sg-9eaa28f9",
        ]
      + source_security_group       = (known after apply)
      + source_security_group_id    = (known after apply)
      + subnets                     = [
          + "subnet-2c4e176a",
          + "subnet-8d3dd6e9",
          + "subnet-d6f512a0",
        ]
      + tags_all                    = {
          + "application" = "vets-api"
          + "environment" = "dev"
          + "managed_by"  = "Terraform"
          + "purpose"     = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
          + "repo"        = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
        }
      + zone_id                     = (known after apply)

      + health_check {
          + healthy_threshold   = 3
          + interval            = 30
          + target              = "HTTP:3004/"
          + timeout             = 5
          + unhealthy_threshold = 2
        }

      + listener {
          + instance_port     = 3004
          + instance_protocol = "HTTP"
          + lb_port           = 3004
          + lb_protocol       = "HTTP"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

staging

> tf plan
Acquiring state lock. This may take a few moments...
data.aws_subnet.subnet_id_a: Reading...
data.aws_subnet.subnet_id_c: Reading...
data.aws_elb.selected: Reading...
data.aws_subnet.subnet_id_b: Reading...
data.aws_autoscaling_group.selected: Reading...
data.aws_launch_template.selected: Reading...
data.aws_subnet.subnet_id_a: Read complete after 0s [id=subnet-70f51206]
data.aws_subnet.subnet_id_c: Read complete after 0s [id=subnet-b84e17fe]
data.aws_subnet.subnet_id_b: Read complete after 0s [id=subnet-cc3cd7a8]
data.aws_autoscaling_group.selected: Read complete after 0s [id=dsva-vagov-staging-deployment-vagov-staging-vets-api-server-20220810-194219-asg]
data.aws_launch_template.selected: Read complete after 1s [id=lt-0301a333d650b33b1]
data.aws_elb.selected: Read complete after 1s [id=dsva-vagov-staging-vets-api-elb]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_autoscaling_group.vets-api-server will be created
  + resource "aws_autoscaling_group" "vets-api-server" {
      + arn                       = (known after apply)
      + availability_zones        = (known after apply)
      + default_cooldown          = (known after apply)
      + desired_capacity          = (known after apply)
      + force_delete              = false
      + force_delete_warm_pool    = false
      + health_check_grace_period = 120
      + health_check_type         = "ELB"
      + id                        = (known after apply)
      + max_size                  = 12
      + metrics_granularity       = "1Minute"
      + min_size                  = 6
      + name                      = "dsva-vagov-staging-deployment-vagov-staging-vets-api-server-20220810-194219-asg-evss-bgs-split"
      + name_prefix               = (known after apply)
      + protect_from_scale_in     = false
      + service_linked_role_arn   = (known after apply)
      + termination_policies      = [
          + "OldestLaunchTemplate",
        ]
      + vpc_zone_identifier       = [
          + "subnet-cc3cd7a8,subnet-b84e17fe,subnet-70f51206",
        ]
      + wait_for_capacity_timeout = "10m"

      + instance_refresh {
          + strategy = "Rolling"

          + preferences {
              + min_healthy_percentage = 50
              + skip_matching          = false
            }
        }

      + launch_template {
          + id      = "lt-0301a333d650b33b1"
          + name    = (known after apply)
          + version = "$Default"
        }
    }

  # aws_elb.vets-api-server will be created
  + resource "aws_elb" "vets-api-server" {
      + arn                         = (known after apply)
      + availability_zones          = (known after apply)
      + connection_draining         = true
      + connection_draining_timeout = 30
      + cross_zone_load_balancing   = false
      + desync_mitigation_mode      = "defensive"
      + dns_name                    = (known after apply)
      + id                          = (known after apply)
      + idle_timeout                = 120
      + instances                   = (known after apply)
      + internal                    = true
      + name                        = "dsva-vagov-staging-vets-api-bgs"
      + security_groups             = [
          + "sg-854cc6e2",
        ]
      + source_security_group       = (known after apply)
      + source_security_group_id    = (known after apply)
      + subnets                     = [
          + "subnet-70f51206",
          + "subnet-b84e17fe",
          + "subnet-cc3cd7a8",
        ]
      + tags_all                    = {
          + "application" = "vets-api"
          + "environment" = "staging"
          + "managed_by"  = "Terraform"
          + "purpose"     = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
          + "repo"        = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
        }
      + zone_id                     = (known after apply)

      + health_check {
          + healthy_threshold   = 3
          + interval            = 30
          + target              = "HTTP:3004/"
          + timeout             = 5
          + unhealthy_threshold = 2
        }

      + listener {
          + instance_port     = 3004
          + instance_protocol = "HTTP"
          + lb_port           = 3004
          + lb_protocol       = "HTTP"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

prod

Acquiring state lock. This may take a few moments...
data.aws_subnet.subnet_id_a: Reading...
data.aws_subnet.subnet_id_b: Reading...
data.aws_elb.selected: Reading...
data.aws_launch_template.selected: Reading...
data.aws_subnet.subnet_id_c: Reading...
data.aws_autoscaling_group.selected: Reading...
data.aws_subnet.subnet_id_a: Read complete after 0s [id=subnet-f3f31485]
data.aws_subnet.subnet_id_b: Read complete after 0s [id=subnet-a433d8c0]
data.aws_subnet.subnet_id_c: Read complete after 0s [id=subnet-66411820]
data.aws_launch_template.selected: Read complete after 0s [id=lt-06c6752970f3ee877]
data.aws_autoscaling_group.selected: Read complete after 1s [id=dsva-vagov-prod-deployment-vagov-prod-vets-api-server-20220810-190140-asg]
data.aws_elb.selected: Read complete after 1s [id=dsva-vagov-prod-vets-api-elb]

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_autoscaling_group.vets-api-server will be created
  + resource "aws_autoscaling_group" "vets-api-server" {
      + arn                       = (known after apply)
      + availability_zones        = (known after apply)
      + default_cooldown          = (known after apply)
      + desired_capacity          = (known after apply)
      + force_delete              = false
      + force_delete_warm_pool    = false
      + health_check_grace_period = 120
      + health_check_type         = "ELB"
      + id                        = (known after apply)
      + max_size                  = 32
      + metrics_granularity       = "1Minute"
      + min_size                  = 16
      + name                      = "dsva-vagov-prod-deployment-vagov-prod-vets-api-server-20220810-190140-asg-evss-bgs-split"
      + name_prefix               = (known after apply)
      + protect_from_scale_in     = false
      + service_linked_role_arn   = (known after apply)
      + termination_policies      = [
          + "OldestLaunchTemplate",
        ]
      + vpc_zone_identifier       = [
          + "subnet-f3f31485,subnet-66411820,subnet-a433d8c0",
        ]
      + wait_for_capacity_timeout = "10m"

      + instance_refresh {
          + strategy = "Rolling"

          + preferences {
              + min_healthy_percentage = 50
              + skip_matching          = false
            }
        }

      + launch_template {
          + id      = "lt-06c6752970f3ee877"
          + name    = (known after apply)
          + version = "$Default"
        }
    }

  # aws_elb.vets-api-server will be created
  + resource "aws_elb" "vets-api-server" {
      + arn                         = (known after apply)
      + availability_zones          = (known after apply)
      + connection_draining         = true
      + connection_draining_timeout = 30
      + cross_zone_load_balancing   = false
      + desync_mitigation_mode      = "defensive"
      + dns_name                    = (known after apply)
      + id                          = (known after apply)
      + idle_timeout                = 120
      + instances                   = (known after apply)
      + internal                    = true
      + name                        = "dsva-vagov-prod-vets-api-bgs"
      + security_groups             = [
          + "sg-2d6cf84a",
        ]
      + source_security_group       = (known after apply)
      + source_security_group_id    = (known after apply)
      + subnets                     = [
          + "subnet-66411820",
          + "subnet-a433d8c0",
          + "subnet-f3f31485",
        ]
      + tags_all                    = {
          + "application" = "vets-api"
          + "environment" = "prod"
          + "managed_by"  = "Terraform"
          + "purpose"     = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
          + "repo"        = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
        }
      + zone_id                     = (known after apply)

      + health_check {
          + healthy_threshold   = 3
          + interval            = 30
          + target              = "HTTP:3004/"
          + timeout             = 5
          + unhealthy_threshold = 2
        }

      + listener {
          + instance_port     = 3004
          + instance_protocol = "HTTP"
          + lb_port           = 3004
          + lb_protocol       = "HTTP"
        }
    }

Plan: 2 to add, 0 to change, 0 to destroy.

ianhundere commented 2 years ago

the dns_name will still need to be added to the route53 record, but i'll do that manually since we don't wanna mess with state if we can help it.

ianhundere commented 2 years ago

ran into an issue where the asg name keeps changing due to deploys. i'm now filtering on tags with the aws_autoscaling_groups datasource to grab the latest asg name.
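
a minimal sketch of that kind of lookup, assuming the vets-api asgs carry `deployment_name` and `environment` tags (the tag keys, `var.env`, and the `latest_asg_name` local are assumptions for illustration, not copied from the split repo). since the asg names embed a deploy timestamp, sorting the matches and taking the last one picks the newest:

```hcl
# hypothetical variable; the split repo may name this differently
variable "env" {
  type = string
}

# match asgs by tag instead of hardcoding a name that changes on every deploy
data "aws_autoscaling_groups" "selected" {
  filter {
    name   = "tag:deployment_name"
    values = ["vets-api-server"]
  }

  filter {
    name   = "tag:environment"
    values = [var.env]
  }
}

locals {
  # names carry a YYYYMMDD-HHMMSS stamp, so a lexical sort puts the newest last
  latest_asg_name = reverse(sort(data.aws_autoscaling_groups.selected.names))[0]
}
```

one caveat with this approach: if the duplicated `-evss-bgs-split` asg ends up with the same tags, it will match too and sort after the original, so the filter needs at least one tag value that only the original asg has.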

ianhundere commented 2 years ago

currently we're at the following:

- duplicated resources are ready to be applied
- once that is done, we'll add the dns_name for the lb to the route53 record
- add the lb url for nginx_config_bgs_and_envss_split_api_url: ""
- create a jenkins job, triggered by the vets-api build job, that will plan/apply the split tf

ianhundere commented 2 years ago

https://github.com/department-of-veterans-affairs/devops/pull/11782/files
^ jenkins job has been added, just needs to be tested.

ianhundere commented 2 years ago

jenkins job has been tested, just coordinating when to flip all the switches for staging.

ianhundere commented 2 years ago

the plan:

[x] deploy all tf resources to all 3 envs
- add dns_name (internal-dsva-vagov-staging-vets-api-2nd-445860677.us-gov-west-1.elb.amazonaws.com) to
  - [x] route53 record (staging for now)
  - [x] nginx config for the PR below (staging for now)
- [x] merge the devops repo PRs
  - [x] bgs/evss routes
  - [ ] jenkins job

we’ll only do staging for this piece in order to test the jenkins job

then y’all test staging and if all is well i’ll make the necessary changes to the rev proxy for both dev / prod

ianhundere commented 2 years ago

currently have a test jenkins job, will remove this when the jenkins pr is merged / closed: http://jenkins.vfs.va.gov/job/testing/job/testing_tf/

edit: test job removed / pr merged.

ianhundere commented 2 years ago

i'll keep this ticket open until the remaining changes for the rev proxy / jenkins job have been completed / merged.

ianhundere commented 2 years ago

manually ran the revproxy deploy job for staging as well as the seed job.

sh-4.2$ cat /usr/local/openresty/nginx/sites-enabled/api_server.conf | grep bgs
    # bgs and evss routes

revproxy is updated

ianhundere commented 2 years ago

https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split/pull/1
^ tags / asg attachment added

and nginx config fixed: https://github.com/department-of-veterans-affairs/devops/commit/280980b07dcbe9131ceb8b15a557716b7e1b9bf2
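
for context on the asg attachment mentioned above, a minimal sketch of what such an attachment usually looks like. it reuses the resource addresses from the plan output (`aws_autoscaling_group.vets-api-server`, `aws_elb.vets-api-server`); the attachment's own name is an assumption, and the actual code is in the PR linked above:

```hcl
# registers instances from the duplicated asg with the duplicated classic elb;
# without this the new lb has no instances behind it
resource "aws_autoscaling_attachment" "vets-api-server" {
  autoscaling_group_name = aws_autoscaling_group.vets-api-server.name
  elb                    = aws_elb.vets-api-server.id
}
```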

ianhundere commented 2 years ago

confirmed staging revproxy is correctly updated and that ec2s are being registered with LB:

sh-4.2$ cat api_server.conf | grep -A 4 bgs
    # bgs and evss routes
        location ~ ^/(v0|v1)(/debts|/debt_letters|/profile/ch33_bank_accounts|/profile/payment_history) {
      proxy_pass http://internal-dsva-vagov-staging-vets-api-2nd-445860677.us-gov-west-1.elb.amazonaws.com:3004$request_uri;

ianhundere commented 2 years ago

jenkins job is fixed: https://github.com/department-of-veterans-affairs/devops/pull/11803/files

ianhundere commented 2 years ago

reaching out to various parties in regards to moving / renaming the dd forwarder before enabling logs in staging / prod.

ianhundere commented 2 years ago

jenkins job confirmed to be working:

ianhundere commented 2 years ago

it looks like traffic isn’t going thru the revproxy. Kyle was just as confused and sanity checked it. we can’t tell where the revproxy comes into play because all the records point straight to the asg worker lbs. we’re gonna pull Jeremy in to see if he knows.

https://dsva.slack.com/archives/CTYQL39FE/p1660849458919659

edit: endpoints are being used, infra will investigate further.

ianhundere commented 2 years ago

pr up: https://github.com/department-of-veterans-affairs/devops/pull/11820/files

ianhundere commented 2 years ago

oops, dev/prod failed because the proper scheme / port wasn't included. monday woes

ianhundere commented 2 years ago

https://dsva.slack.com/archives/CJYRZK2HH/p1661185087672539

and we're a go!

ianhundere commented 2 years ago

since our prometheus metrics rely on tags, tags were added to the 2nd asg instances since the launch template doesn't include them, specifically deployment_name.

https://dsva.slack.com/archives/C03KT515C0H/p1661263063451239

provider "aws" {
  region = "us-gov-west-1"
  default_tags {
    tags = {
      Name            = "${var.name}-2nd"
      deployment_name = "vets-api-server"
      repo            = "https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split"
      managed_by      = "Terraform"
      application     = "vets-api"
      purpose         = "mitigate latency issues as per https://dsva.slack.com/archives/C03STQZ40DQ"
      environment     = var.env
    }
  }
}

ianhundere commented 2 years ago

enabled access_logs for each respective elb:

https://github.com/department-of-veterans-affairs/vsp-infra-evss-bgs-split/commit/3bfe95ff88f3cb2674420a08bb9882f2ed18d97f
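
a rough sketch of what enabling access logs on a classic elb looks like, using the dev values from the plan output. the `access_logs_bucket` and `subnet_ids` variables and the `bucket_prefix` value are hypothetical; the real settings are in the commit linked above:

```hcl
# hypothetical inputs, not taken from the split repo
variable "access_logs_bucket" {
  type = string
}

variable "subnet_ids" {
  type = list(string)
}

resource "aws_elb" "vets-api-server" {
  name     = "dsva-vagov-dev-vets-api-bgs"
  internal = true
  subnets  = var.subnet_ids

  listener {
    instance_port     = 3004
    instance_protocol = "HTTP"
    lb_port           = 3004
    lb_protocol       = "HTTP"
  }

  # ships elb access logs to s3; interval is in minutes (5 or 60)
  access_logs {
    bucket        = var.access_logs_bucket
    bucket_prefix = "vets-api-bgs"
    interval      = 60
    enabled       = true
  }
}
```

the target bucket also needs a bucket policy that lets the elb service write to it, otherwise the apply fails when the access_logs block is added.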

department-of-veterans-affairs / va.gov-team

Duplicate vets-api infra to mitigate evss-bgs causing latency #45570