jechols opened this issue 4 years ago
It's worth noting that today we ran the reindexing process on roughly 2600 assets. Timing:
tesseract ticket
Same concern here as with the profiler issue (#936): this isn't related to Tesseract. Have we done load testing beyond the reindex? If so, let's update this ticket and close it. If not, let's keep it open or explicitly state that we're not going to load test.
At c4l, a session on locust.io (Python) made me wonder if it might work for this: https://locust.io/. Wanted to park it here in case it's helpful.
I probably wouldn’t waste time finding a tool with a “pretty” UI. There are a ton of load-testing tools, but simpler ones like siege can be as effective as anything with a UI, without getting in the way of just blasting a ton of users at the front-end. The thing is, though, all these tools are just front-end tests for HTTP requests.
With OD, load testing can mean a whole lot more than just HTTP, though. User traffic is definitely one aspect, but there’s also: ingesting too many things at once, indexing the full database, overloading workers on purpose to see what happens, etc. These are important. Possibly more important than just front-end traffic. Then there are “everything” tests – what happens if we have a bulk ingest of AVI files (very heavy on workers) while the site has a lot of users? Unfortunately, generic HTTP load tests can’t automate this kind of thing, because we have to identify what’s expensive on the back-end and script something that will force that to occur.
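As a rough illustration of scripting a combined scenario: a hedged, stdlib-only Python sketch like the one below can generate concurrent front-end traffic while an expensive back-end job (a bulk ingest, a full reindex) is kicked off separately. The URL and user counts are placeholders, not measured OD2 values:

```python
# Hedged sketch (not OD2 code): generate concurrent front-end traffic
# using only the standard library, so a scripted back-end job can run
# alongside it for an "everything" test. URL and counts are placeholders.
import concurrent.futures
import urllib.request

def hit(url):
    """Issue one GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.status

def blast(url, n_users=25, requests_per_user=4):
    """Fire n_users * requests_per_user GETs, n_users in flight at once."""
    total = n_users * requests_per_user
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_users) as pool:
        futures = [pool.submit(hit, url) for _ in range(total)]
        return [f.result() for f in concurrent.futures.as_completed(futures)]
```

The interesting part is never the HTTP loop itself; it's choosing what expensive back-end work to trigger at the same time.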
My bigger concern is that we haven’t really even thought through possible scenarios, much less how we could test them.
@jechols - definitely not looking for a pretty UI! Small, Python-based, flexible user test scenarios; not just "a non-dev manager at a conference wants to do a thing the manager saw at the conference." :) And yeah, I agree re: just HTTP. We definitely need the other load/stress testing as well, and if that's a list you think the devs can come up with, we should probably consider doing it this workcycle (WC 16).
Unless there are specific things left we want to stress test, I think we've pretty well stress tested the backend! Do we need to do frontend load testing?
Fedora, Postgres, Redis, Blazegraph, Sidekiq, and Solr have been through the equivalent of the Battle of Thermopylae with the migration from OD1, and they're handling it quite well. The cluster is still able to run OSU's other production workloads just fine.
I don't think we've done "load testing" per se of the Rails frontend, but we've fixed a few scaling issues: heavy dashboard and workflow requests now go to a separate `web-admin` deployment, leaving `web` free to handle normal user traffic. This keeps long-running dashboard and workflow tasks from eating up all of the Puma threads.

We're currently running 2 replicas each for `web` and `web-admin`, and we can add more on-demand. Anyone with cluster access can scale the deployments up as needed. I can also put together autoscaling automation to scale the deployments up and down automatically.
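For the autoscaling piece, one conventional option (an assumption for illustration, not something already deployed) is a standard Kubernetes HorizontalPodAutoscaler; a sketch targeting a `web` deployment might look like:

```yaml
# Hypothetical HPA sketch: scale the web deployment between 2 and 6
# replicas based on average CPU utilization. Names and thresholds are
# placeholders to be tuned against real load-test data.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75
```

A CPU target is just a starting point; a load test would tell us whether memory or request latency is the better scaling signal here.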
Descriptive summary
We have no idea how the new stack will perform under stress. The tech has changed significantly since the days of our original CDM -> Hydra migration, we have more data, and we're planning to promote OD a lot more than we did upon the Hydra-based OD release.
We should get a good deal of data in before we do this, because the new stack seems to add a lot more overhead per item than the current stack. This could mean that a load test against 100 items works out just fine while one against 100,000 items falls over.
By the time we get 100,000 items we're probably mostly done with the migration, and we want data sooner than that. I suggest a nice middle-ground test of around 10,000 items where we simulate 20-30 simultaneous users, such as what we might see from a smallish class using the site for something.
Expected behavior
I don't know that we have any specific expectations. This is to get more data to see how the stack works and figure out what to do if it doesn't before we've gone live.
Related work
Similar ticket for OD1: https://github.com/OregonDigital/oregondigital/issues/324