API load testing setup

lbeaufort commented 4 years ago

Action item from https://github.com/fecgov/openFEC/issues/4314

Need to know what the system can handle before it "falls over". Need to know what the load was on April 15, for example.

Completion criteria:

[ ] After Aurora migration, measure "fallover point" for number/complexity of requests
[ ] Put in tickets for findings

Technical steps

Prepare the "locusts"

[X] Get API queries and load data from 4/15 (Kibana max size reached, saved 10am-12pm) link: https://logs.fr.cloud.gov/goto/8993f1282d2ab11c4aabd09330e107de
[x] Parse logs to generate locust queries (see old issue https://github.com/fecgov/openFEC/pull/4031/files)
[X] Make a branch with locust tests for API
[x] Figure out users to simulate - at peak, 1000 requests/minute = ~50/second = 500 users, (1/second?)
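A quick check of the rate arithmetic in the last item, as a minimal sketch. The rate of 1 request/second per user is the tentative assumption from the note above (it carries a question mark there), and the function name `users_needed` is made up for illustration. Note that 1000 requests/minute works out to about 17 requests/second, while ~50/second would correspond to 3000 requests/minute, so the user count depends on which peak figure the logs actually support.

```python
import math


def users_needed(requests_per_minute, requests_per_user_per_second=1.0):
    """Return (requests/second, simulated users needed to sustain that rate)."""
    per_second = requests_per_minute / 60
    users = math.ceil(per_second / requests_per_user_per_second)
    return per_second, users


# Compare the two peak figures that appear in the note above
for rpm in (1000, 3000):
    per_second, users = users_needed(rpm)
    print(f"{rpm} requests/minute = {per_second:.1f}/second -> {users} users at 1 request/second each")
```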

Set up environment

[X] Figure out what it's going to take to set up the DB environment to replicate prod now that we're using Aurora. How many clusters, PR to terraform, check with DB team on timing, etc.
Q: Stage and dev have 2 clusters of 4.8xlarge, prod has 3. Can we give stage and prod 3 each and dev one? Alternately, can we spin up a 3rd replica for one day?
Per Rohan: Currently DEV (1 master, 1 replica), STG (1 master, 1 replica), PRD (1 master, 4 replicas). Changing it manually during the test is much easier than going through terraform. Make sure to scale it back down after the test to minimize cost. Should test 2 vs. 4 clusters.
[x] Confirm access to increase cluster count
[X] Make sure everything actually works by running a sample test on stage as-is
[ ] Make a testing plan: 2 vs. 4 clusters, number of application instances and memory, CMS timeouts, API timeouts, Gunicorn workers (needs research). Application profiling? https://github.com/benfred/py-spy
Add downloads to locust tests? Will need to look at celery worker setups
First test production setup with 2 clusters
Test production setup with 4 clusters
Test more application memory

Communicate

[x] Let FEC, API umbrella, and cloud.gov teams know which day we'll be doing the testing. Cloud.gov wants an email: https://cloud.gov/docs/compliance/pentest/

lbeaufort commented 4 years ago

We should also let the team and cloud.gov know (ideally a week beforehand) which day we plan to test. Reference: https://cloud.gov/docs/compliance/pentest/

lbeaufort commented 4 years ago

Parsing script:

"""Extract API queries from a Kibana log file"""
import csv
import json

# newline="" lets the csv module handle line endings itself
with open("LB_API_RTR_requests_4-15-20.csv", "r", newline="") as file:
    reader = csv.reader(file, delimiter=',')
    # Endpoint/query lookup
    queries = {}
    for row in reader:
        # Column 2 has the queries. Throw out short rows and some bad data
        if len(row) > 1 and 'v1/' in row[1]:
            # Get the endpoint - everything after the 'v1/' to the first '?'
            endpoint = row[1].partition("v1/")[2].partition("?")[0].partition(" ")[0]
            # endswith() is safe even when the endpoint is empty
            if not endpoint.endswith("/"):
                endpoint += "/"
            # Get the query param string - everything between the ? and the ' '
            query_parameters = row[1].partition("?")[2].partition(" ")[0]
            # Parse arguments (this also covers queries with a single parameter and no '&')
            query_dict = {}
            if query_parameters:
                # Clean up some double &&
                query_parameters = query_parameters.replace("&&", "&")
                # Split each query pair out into a list
                parameter_groups = query_parameters.split("&")
                # Make a dictionary of the parameters (this is how locust needs them)
                for result in parameter_groups:
                    # Split the parameters from the values
                    if "=" in result and "api_key" not in result:
                        # maxsplit=1 keeps values that themselves contain "="
                        key, value = result.split("=", 1)
                        query_dict.setdefault(key, []).append(value)
            if query_dict:
                # Add to the endpoint/query lookup, skipping duplicate queries
                endpoint_queries = queries.setdefault(endpoint, [])
                if query_dict not in endpoint_queries:
                    endpoint_queries.append(query_dict)

print(json.dumps(queries, indent=1))
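The script prints a lookup shaped like `{endpoint: [{param: [values]}]}`. As a minimal sketch of how that structure could be turned back into request paths for a load test, the snippet below uses `urllib.parse.urlencode` with `doseq=True`, which repeats a key once per value. The sample endpoint and parameters and the helper name `build_paths` are illustrative assumptions, not taken from the logs.

```python
from urllib.parse import urlencode


def build_paths(queries, prefix="/v1/"):
    """Yield one request path per (endpoint, query dict) pair."""
    for endpoint, query_dicts in queries.items():
        for query_dict in query_dicts:
            # doseq=True expands {"cycle": ["2018", "2020"]} to cycle=2018&cycle=2020
            yield f"{prefix}{endpoint}?{urlencode(query_dict, doseq=True)}"


# Hypothetical sample in the same shape the parsing script prints
sample = {"candidates/": [{"cycle": ["2018", "2020"], "per_page": ["20"]}]}
for path in build_paths(sample):
    print(path)
```

Values parsed from the logs may already be percent-encoded; in that case they would need `urllib.parse.unquote` first to avoid double encoding.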

lbeaufort commented 4 years ago

Let the team know and set up a maintenance window in Pingdom; this could cause downtime.

In the console:

Update DB size (currently 4.8 in prod) - start with the writer.
Update parameter group ("DB parameter group", NOT "DB cluster parameter group"). It usually needs to correspond with the size: fec-aurora-master for the writer and fec-aurora-replica-5 for the reader (we will clean these up later?)
Note: FYI, the reader(s) have enhanced monitoring for autoscaling
Choose "apply immediately"
Repeat for reader instance
Add one instance (production currently at 3 instances) Steps: Select the cluster, Actions -> Add instance
Double check security groups
Set up autoscaling - we created a new policy that mirrored the policy for production docs
Run the "warmup script" in stage for the new reader - David will run this. It is a SQL package (pg_prewarm) that caches data in memory. Open questions: How do we run this? It can be triggered manually with one statement, one time after reboot. Where does the script live, and what machine runs it? DB130 and 029 on-prem servers. Could we run this with a celery task? Currently it is on demand. David can share the script with the team. Best to run the readers individually. Best practice is to run this from time to time - maybe after the election? @dzhang-fec will document and share with the team.
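For the warmup step, `pg_prewarm` is a PostgreSQL extension whose function of the same name takes a relation and loads it into the buffer cache. The actual script is not included in this thread, so the following is only a sketch of what the "one statement" could look like, generated per table. It assumes the extension is already installed; the table names and the helper `prewarm_statements` are hypothetical.

```python
def prewarm_statements(tables):
    """Build one pg_prewarm call per table name."""
    statements = []
    for table in tables:
        # Double any single quotes so the name is a valid SQL string literal
        literal = table.replace("'", "''")
        statements.append(f"SELECT pg_prewarm('{literal}');")
    return statements


# Hypothetical table names, for illustration only
for statement in prewarm_statements(["public.example_table_a", "public.example_table_b"]):
    print(statement)
```

Each statement would be run against one reader at a time, in line with the note above that the readers are best warmed individually.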

https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/Overview.DBInstance.Modifying.html

lbeaufort commented 4 years ago

There were some issues with formatting the requests, so I also did some testing with the normal locust setup.

fecgov / openFEC

API load testing setup #4327
