department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

vets-api autoscaling tuning #50604

Closed LindseySaari closed 1 year ago

LindseySaari commented 1 year ago

Description

The current vets-api EKS autoscaling (HPA) is based on what was defined for BRD (threads + worker count). We should keep an eye on metrics to see whether we are under- or over-utilizing pods, e.g. whether the min/max thresholds are set appropriately. See Datadog.
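For reference when reviewing the min/max thresholds, here is a minimal sketch of the replica calculation described in the Kubernetes HPA documentation: `desired = ceil(current * currentMetric / targetMetric)`, clamped to the configured bounds. The function name, parameters, and example numbers are illustrative assumptions, not values from the vets-api configuration.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas, tolerance=0.1):
    """Approximate the HPA replica calculation (illustrative helper, not vets-api code)."""
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is close enough to 1.0
    # (0.1 is the documented default tolerance).
    if abs(ratio - 1.0) <= tolerance:
        desired = current_replicas
    else:
        desired = math.ceil(current_replicas * ratio)
    # Clamp to the min/max bounds configured on the HPA.
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical numbers: 4 pods at 90% CPU against a 60% target.
    print(desired_replicas(4, 90, 60, min_replicas=3, max_replicas=12))
```

If the computed value sits at the max bound for long stretches, that suggests under-provisioning; if it sits at the min bound, the minimum may be higher than needed.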

EKS Dashboards:

laineymajor commented 1 year ago

We have currently ported over what we see in BRD, but it needs fine-tuning for EKS.

LindseySaari commented 1 year ago

These are the current BRD CPU usage percentages for Prod, Staging and Sandbox. As we scale up the weighted deployments, we will want to compare them with the EKS numbers and with the defined autoscaling defaults.

According to the AWS docs, "A good, general rule for EC2 instances is that if your maximum CPU and memory usage is less than 40% over a four-week period, you can safely cut the machine in half."

Screen Shot 2023-01-03 at 10.38.36 AM.png
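The AWS rule of thumb quoted above can be expressed as a small check. This is a sketch only: the function name and sample values are hypothetical, and it assumes the caller passes utilization percentages that already cover the four-week window (for example, exported from Datadog).

```python
def can_halve_instance(cpu_samples, mem_samples, threshold=40.0):
    """Return True when peak CPU and peak memory usage both stay under the threshold.

    cpu_samples and mem_samples are utilization percentages (0-100) assumed
    to span the four-week period mentioned in the AWS guidance.
    """
    if not cpu_samples or not mem_samples:
        # Without data there is no basis for downsizing.
        return False
    return max(cpu_samples) < threshold and max(mem_samples) < threshold


if __name__ == "__main__":
    # Hypothetical samples, not real BRD measurements.
    print(can_halve_instance([12.0, 25.5, 39.9], [20.0, 30.0]))
```

Note that the rule uses the maximum over the period, not the average, so a single spike above 40% is enough to rule out downsizing.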

laineymajor commented 1 year ago

@RachalCassity to sync with @rmtolmach on HPA autoscaling during 1:1

rmtolmach commented 1 year ago

Could Goldilocks be used for determining the limits? Here is a really old ticket where I was investigating deployment requests and limits in EKS for vets-api: https://github.com/department-of-veterans-affairs/va.gov-team/issues/39691

Edit: never mind on ☝️ that, Goldilocks is used for VERTICAL scaling, not horizontal.

laineymajor commented 1 year ago

With 100% of traffic going to dev, the numbers there should be pretty accurate. We need to start working on load testing.

laineymajor commented 1 year ago

On hold until we turn up the dial on staging + higher environments.

RachalCassity commented 1 year ago

Reminder: Attach PRs to scale up pods for future traffic.

laineymajor commented 1 year ago

Rachal is going to make a PR for staging.

laineymajor commented 1 year ago

Rachal created the PRs for SB and production (as draft currently). Removed the DB migrate job from SB and production (this will remain a manual process in Jenkins).

LindseySaari commented 1 year ago

PRs still in draft mode and ready to go when we are.

laineymajor commented 1 year ago

Now that we have increased the number of pods in staging, the pods are not all cycling through. @RachalCassity to look at the config and pull in Kyle if needed. Percentages seem to be the way to go.
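The comment does not say which setting "percentages" refers to. One possible reading, assumed here, is the Deployment rolling-update settings `maxSurge` and `maxUnavailable`, which Kubernetes accepts as percentages of the replica count. The sketch below shows how those percentages resolve to pod counts under the documented rounding rules (surge rounds up, unavailable rounds down); the function name and example values are hypothetical.

```python
def rolling_update_bounds(replicas, max_surge_pct, max_unavailable_pct):
    """Resolve percentage-based rolling-update settings to absolute pod counts.

    Follows the rounding described in the Kubernetes Deployment docs:
    maxSurge is rounded up, maxUnavailable is rounded down.
    Percentages are expected as whole numbers (e.g. 25 for 25%).
    """
    surge = (replicas * max_surge_pct + 99) // 100   # integer ceiling
    unavailable = (replicas * max_unavailable_pct) // 100  # integer floor
    return surge, unavailable


if __name__ == "__main__":
    # Hypothetical example: 10 replicas with 25% / 25%.
    surge, unavailable = rolling_update_bounds(10, 25, 25)
    print(f"extra pods allowed: {surge}, pods allowed down: {unavailable}")
```

A percentage keeps the rollout pace proportional as the replica count grows, whereas a fixed absolute count tuned for a small deployment can make a larger one roll very slowly.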

LindseySaari commented 1 year ago

@RachalCassity Do you want to update this ticket with the changes made + discussed in this thread please?

LindseySaari commented 1 year ago

Current BRD request rates: https://vagov.ddog-gov.com/dashboard/b8k-uy2-fkm?from_ts=1677259846726&to_ts=1677263446726&live=true