department-of-veterans-affairs / va.gov-team

Public resources for building on and in support of VA.gov. Visit complete Knowledge Hub:
https://depo-platform-documentation.scrollhelp.site/index.html

Load testing #54225

Closed laineymajor closed 1 year ago

laineymajor commented 1 year ago

PROBLEM STATEMENT

Load testing needs to be completed to check how systems function while a heavy volume of concurrent virtual users performs transactions over a period of time. We will run a load test to:

ACTION STEPS/TASKS

DEFINITION OF DONE

laineymajor commented 1 year ago

@laineymajor to send out communications to VFS teams about upcoming load testing.

laineymajor commented 1 year ago

@laineymajor drafting new MVP load testing plan and will include in this ticket

LindseySaari commented 1 year ago

Loadtesting dashboard in Datadog

laineymajor commented 1 year ago

@laineymajor update with MVP plan

laineymajor commented 1 year ago

Grafana (or JMeter) current request rate dashboard - Lindsey to add

laineymajor commented 1 year ago

I've updated the action steps to match the new MVP load testing plan

laineymajor commented 1 year ago

This ticket will carry over to the next sprint because we needed to give VFS teams some lead time to prep for load testing. We will be completing load testing together on March 20.

laineymajor commented 1 year ago

TEAMS TO TEST

  1. checkin team
  2. lighthouse team
  3. Platform team (internally done)

laineymajor commented 1 year ago

What needs to be done BEFORE load testing on Monday:

LindseySaari commented 1 year ago

Potential boards to pay attention to: Load testing dashboards:

LindseySaari commented 1 year ago

Load testing today at 12:30PM

LindseySaari commented 1 year ago

Board for pods in the vets-api-staging vets-api namespace

Deployment dashboard

LindseySaari commented 1 year ago

We had an issue with the check-in team tests. 500 errors were being returned, but we pushed a fix. Thread here

LindseySaari commented 1 year ago

After we pushed the fix, the load tests ran at 100% success.

@considerable Please type up a summary of the outcome and the issues we ran into. Thread here

LindseySaari commented 1 year ago

Link for search load test script

LindseySaari commented 1 year ago

Adjusted liveness probes yesterday. Requests are going to only one pod, so we need to re-run the tests. The script may be mocked, and for some reason requests are going to a single pod.

Next step: scale down and re-run the tests. @oseasmoran73, @considerable, and Kanchana will need to re-run the scripts. We will scale down to 2 pods or decrease the targetAverageValue.
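For context on why either option should make scaling easier to observe: the Kubernetes Horizontal Pod Autoscaler documentation gives its core rule as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with no change when the ratio is within a tolerance (0.1 by default). Below is a minimal Python sketch of that rule. The function name and all metric values are hypothetical and are not taken from the vets-api autoscaler configuration; min/max replica bounds and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, current_avg_value, target_avg_value, tolerance=0.1):
    """Approximate the documented HPA rule for an average-value metric.

    Simplified sketch: ignores minReplicas/maxReplicas, pod readiness,
    and stabilization windows.
    """
    ratio = current_avg_value / target_avg_value
    # The autoscaler skips scaling when the ratio is close enough to 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Hypothetical values: 2 pods, each averaging 30 units of the scaling metric.
print(desired_replicas(2, 30, 40))  # target above the current average: stays at 2
print(desired_replicas(2, 30, 20))  # lower target: scales up to 3
```

With the same observed load, a lower target value raises the ratio and therefore the desired replica count, which is the effect decreasing targetAverageValue is meant to produce.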

LindseySaari commented 1 year ago

Console commands:

git clone git@github.com:department-of-veterans-affairs/vets-api-loadtest.git

cd vets-api-loadtest

docker run --rm -v `pwd`/loadtest:/loadtest -i locustio/locust:2.14.2 \
  -u 40 -r 5 -t 30m --headless --only-summary -H https://staging-api.va.gov \
  -f /loadtest/search/search_locust.py
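
For anyone re-running with different settings, here is a small Python sketch that assembles the same docker command and annotates what each Locust flag does. The helper name build_locust_command and the example workdir are hypothetical; the image tag, host, and script path are copied from the command above. It only builds and prints the command, it does not run it.

```python
import os
import shlex


def build_locust_command(users=40, spawn_rate=5, run_time="30m",
                         host="https://staging-api.va.gov",
                         locustfile="/loadtest/search/search_locust.py",
                         workdir=None):
    """Return the docker/locust invocation as an argument list."""
    workdir = workdir or os.getcwd()
    return [
        "docker", "run", "--rm",
        "-v", f"{workdir}/loadtest:/loadtest",  # mount the cloned scripts
        "-i", "locustio/locust:2.14.2",
        "-u", str(users),        # peak number of concurrent users
        "-r", str(spawn_rate),   # users started per second
        "-t", run_time,          # stop after this much time
        "--headless",            # no web UI
        "--only-summary",        # print only the final summary stats
        "-H", host,              # host to load test
        "-f", locustfile,        # locustfile path inside the container
    ]


# Hypothetical checkout location, used only for illustration.
cmd = build_locust_command(workdir="/tmp/vets-api-loadtest")
print(shlex.join(cmd))
print(f"ramp-up: about {40 / 5:.0f} s to reach 40 users")
```

With -u 40 and -r 5, the full 40 users are reached after roughly 8 seconds, so nearly all of the 30 minutes runs at peak concurrency.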

LindseySaari commented 1 year ago

At 2:25 on 3/23, the team ran the load test script and scaling is working. Full chart for reference:

Screenshot 2023-03-23 at 2.27.46 PM.png

LindseySaari commented 1 year ago

After the load test completed, pods scaled down properly as well!

Screenshot 2023-03-23 at 2.36.09 PM.png

LindseySaari commented 1 year ago

Once the loadtest summary is updated, we can close this ticket @considerable

laineymajor commented 1 year ago

@considerable did you complete load testing on Monday with the VFS teams? Please provide a detailed summary on the work you did with the VFS teams.

laineymajor commented 1 year ago

@considerable to provide summary, then close ticket (before COB 3.27).

considerable commented 1 year ago

A. Load test for smoke-testing

Note: The Amazon blog post "Load testing your workload running on Amazon EKS with Locust" explains how load testing checks the performance and reliability of a workload by generating artificial load that mimics real-world traffic.

Reality check with the actual load test run on 3/20/2023 against staging-api.va.gov:

Lesson learned:

So, what did the load test say?
Does the EKS infrastructure have enough capacity to run the code? - YES:


B. Load test for the auto-scaling configuration

Note: Amazon recommends performing load tests to choose an automatic scaling configuration that works the way you want. See Load testing your auto-scaling configuration.

Important notes:

  1. Initially, with a load of 500 req/min to /v2/check_in for 30 minutes, vets-api scaled up by only two pods. Therefore, vets-api had to be scaled down from 16 pods to a lower number so we could see how the pods scale with load.
  2. The upgrade to Ruby 3 may also be a factor.
  3. When researching the logs, watch for entries that show instance IDs instead of pod IDs.
  4. We may have caused a breaker outage with the load.
  5. ClamAV was not able to connect. Shortly after the limit was reached, the pod was taken out of service. ClamAV memory was increased to 3Gi.
  6. Mocking must be enabled for the load tests to work.
  7. Kudos to Eric and Kanchana, without whose help we wouldn’t have been able to find those errors. We really appreciate it!
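
To put note 1 in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes requests are spread evenly across pods, which is an idealization (the thread notes that at one point requests were reaching only one pod). The helper name per_pod_rate is hypothetical; the 500 req/min and 16-pod figures come from note 1, and the 2-pod figure from the planned scale-down.

```python
def per_pod_rate(requests_per_minute, pods):
    """Requests per second each pod sees if load is spread evenly."""
    return requests_per_minute / 60 / pods


for pods in (16, 2):
    print(f"{pods} pods: {per_pod_rate(500, pods):.2f} req/s per pod")
# 16 pods: 0.52 req/s per pod
# 2 pods: 4.17 req/s per pod
```

At 16 pods, each pod sees only about half a request per second, which is consistent with the autoscaler having little reason to add pods; at 2 pods the same load is roughly eight times heavier per pod.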