HHS / simpler-grants-gov

https://simpler.grants.gov

Spike: figure out approach for load testing to ensure we can hit our usage goals for search and opportunity listing (10k cumulative users and 2k cumulative users, respectively) #2322

Open mxk0 opened 3 weeks ago

mxk0 commented 3 weeks ago

Summary

Several things need to happen on the frontend and backend to prepare for load testing both the search pages and the opportunity listing pages. This task is to create a plan and tickets, and to answer any open questions. As part of this spike we should think through what peak traffic to each feature would realistically look like (we aren't likely to have 10k concurrent users for search).
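To make the cumulative vs. concurrent distinction concrete, here is a rough back-of-the-envelope sketch. Only the 10k and 2k user counts come from this ticket. The time window, requests per user, and peak-to-average ratio are placeholder assumptions that would need to be replaced with real analytics before anyone relies on the result.

```python
def peak_requests_per_second(
    cumulative_users: int,
    period_hours: float,
    requests_per_user: float,
    peak_to_average: float,
) -> float:
    """Estimate a peak request rate from a cumulative user count."""
    total_requests = cumulative_users * requests_per_user
    average_rps = total_requests / (period_hours * 3600)
    # Traffic is bursty, so scale the average by an assumed peak factor.
    return average_rps * peak_to_average


# ASSUMPTIONS: a 24 hour window, 20 or 5 requests per user, and a 10x peak
# factor are illustrative guesses, not measured values.
search_peak = peak_requests_per_second(
    10_000, period_hours=24, requests_per_user=20, peak_to_average=10
)
listing_peak = peak_requests_per_second(
    2_000, period_hours=24, requests_per_user=5, peak_to_average=10
)
print(f"search: ~{search_peak:.1f} req/s, listing: ~{listing_peak:.1f} req/s")
```

Under these guesses the peak rates come out to tens of requests per second at most, which is why the choice of window and peak factor matters far more than the headline user count.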

Some of those include:

Application specific concerns extracted to: #2330

Acceptance criteria

acouch commented 1 week ago

@coilysiren

configuring open search for production traffic (not sure what this entails ATM)?

Do we have a better sense of this yet? There are a number of relevant items in "Operational best practices for Amazon OpenSearch Service" and in the OpenSearch index cache settings.

select a CDN

Don't see a reason not to go with AWS CloudFront.

are there unanswered questions we have about a CDN?

We answered a lot of those in #2330. There are specific settings to decide on, of course. We will also need to detail a plan of action for setting it up, testing it, and rerouting DNS.

do we need to update artillery?

Yes, we would need to update the API search load test to include the API, and also add search terms, maybe using CSV exports or something similar.
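A minimal sketch of the CSV idea, assuming we have some export with a text column worth searching on. The column name `opportunity_title` and the sample rows are hypothetical. The output is a one-column CSV of unique terms, which as far as I know is a shape Artillery's CSV payload support can read.

```python
import csv
import io


def extract_search_terms(export_csv: str, column: str, limit: int = 1000) -> list[str]:
    """Return unique, non-empty values of `column` in first-seen order."""
    seen: set[str] = set()
    terms: list[str] = []
    for row in csv.DictReader(io.StringIO(export_csv)):
        term = (row.get(column) or "").strip()
        # De-duplicate case-insensitively so "Rural" and "rural" count once.
        if term and term.lower() not in seen:
            seen.add(term.lower())
            terms.append(term)
            if len(terms) >= limit:
                break
    return terms


def write_terms_csv(terms: list[str]) -> str:
    """Serialize terms as a one-column CSV, one search term per row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for term in terms:
        writer.writerow([term])
    return out.getvalue()


# HYPOTHETICAL sample export; real column names will differ.
sample_export = (
    "opportunity_title,agency\n"
    "Rural Health Grants,HHS\n"
    "rural health grants,HHS\n"
    ",DOC\n"
    "Clean Water Research,EPA\n"
)
terms = extract_search_terms(sample_export, column="opportunity_title")
payload = write_terms_csv(terms)
print(payload)
```

Real user queries would be a better source than titles if we can get them, since titles are probably longer and more specific than what people actually type.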

is it sufficient to run load tests from our local devices or do we need a hosted solution?

Not sure what a good threshold is for local vs hosted. Artillery integrates with Lambda or Fargate. Not sure how long it would take to spin up Lambdas just for this type of effort. For a load test of under 50K users, it seems like we have enough devs that a handful could run it from their local machines.

is there anything we can / should put in place before we have New Relic?

It looks like we will do a load test before setting up New Relic. We have a limited performance dashboard in CloudWatch. Maybe there is a 2 point ticket to update that and make sure that the correct logs are available?

coilysiren commented 1 week ago

@acouch

Do we have a better sense of this yet? There are a number of relevant items in "Operational best practices for Amazon OpenSearch Service" and in the OpenSearch index cache settings.

I hadn't gotten to the best practices doc yet, so thanks for that! The only things I've set up so far are the AZ redundancy and the dedicated master nodes.

coilysiren commented 1 week ago

It looks like we will do a load test before setting up New Relic. We have a limited performance dashboard in CloudWatch. Maybe there is a 2 point ticket to update that and make sure that the correct logs are available?

To the best of my knowledge, the correct logs and dashboards are already available. They are just annoying to access. I can double confirm that they are available, I suppose.

coilysiren commented 1 week ago

Not sure what a good threshold is for local vs hosted. Artillery integrates with Lambda or Fargate. Not sure how long it would take to spin up Lambdas just for this type of effort. For a load test of under 50K users, it seems like we have enough devs that a handful could run it from their local machines.

It would be rather fast to set up (less than a day), but I would still rather not. I'm fine with just running it from someone's local laptop, ideally someone on the west coast, since we are hosted on the east coast.

coilysiren commented 1 week ago

Load testing tickets:

coilysiren commented 3 days ago

@acouch mentions that we want to move to a professional-scale load test, which would be a good fit for Lambda, but that's not really necessary for this quad IMO.

coilysiren commented 3 days ago

AWS CDN option: https://aws.amazon.com/cloudfront/

coilysiren commented 3 days ago

API Gateway should be included here IMO

maybe? I dunno

@acouch: we can wait until we actually have this problem

coilysiren commented 3 days ago

we should be able to correlate client to API load to be able to track how clients are using the API

coilysiren commented 3 days ago

infra monitoring and alerts

My main thing here is to ensure the logging is in place. There are a lot of graphs and alarms, but I don't think I've logged into the frontend and backend logs. I especially haven't looked into correlating them.
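For the correlation piece, one common approach is a shared request ID that the frontend attaches to each API call and both sides write to their logs. The sketch below assumes JSON log lines and a `request_id` field. Both are hypothetical, and I have not checked what our logs actually contain.

```python
import json
from collections import defaultdict


def correlate(frontend_lines, backend_lines, key="request_id"):
    """Pair each frontend log record with backend records sharing the same key."""
    backend_by_id = defaultdict(list)
    for line in backend_lines:
        record = json.loads(line)
        if key in record:
            backend_by_id[record[key]].append(record)

    pairs = []
    for line in frontend_lines:
        record = json.loads(line)
        # Frontend records with no backend match get an empty list.
        pairs.append((record, backend_by_id.get(record.get(key), [])))
    return pairs


# HYPOTHETICAL log lines; field names are illustrative only.
frontend_logs = [
    '{"request_id": "abc", "page": "/search", "duration_ms": 420}',
    '{"request_id": "def", "page": "/opportunity/1", "duration_ms": 95}',
]
backend_logs = [
    '{"request_id": "abc", "path": "/v1/opportunities/search", "duration_ms": 310}',
    '{"request_id": "zzz", "path": "/health", "duration_ms": 2}',
]
for frontend, backend in correlate(frontend_logs, backend_logs):
    print(frontend["page"], "->", [b["path"] for b in backend])
```

If the logs turn out not to share any ID today, adding one would probably be the first ticket, since nothing else here works without it.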