Stress testing for a larger-than-usual number of participants

samuel-yeom commented 4 years ago

In https://github.com/dlareau/puzzlehunt_server/issues/11#issuecomment-431153784, you said that with 480 users the server will be "unusable". Is this still the case? It is plausible that we will have that many users for the spring hunt, so it would be great if there is a way to stress test the server.

dlareau commented 4 years ago

A number of things have improved since then, so I'll try to run a new stress test in the next few days.

dlareau commented 4 years ago

So my stress testing code needs a bit of updating/bug fixing, but the quick test that I just ran with the existing code seems to indicate that we can keep up with 400 users while keeping page load times under 800ms.

Gonna update the code to hopefully error less and see if I can get some more exact figures.
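For anyone wanting to reproduce this kind of quick test, here is a minimal sketch of a thread-based load generator. It is not the stress testing code mentioned above. The function names (`fetch_page`, `run_load_test`, `percentile`) are hypothetical, and it assumes a plain GET against a single URL is representative enough of real traffic, which it may not be for logged-in team pages.

```python
import math
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_page(url, timeout=10.0):
    """Fetch one page and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def simulate_user(fetch, requests_per_user):
    """One simulated user issuing sequential requests; returns its latencies."""
    return [fetch() for _ in range(requests_per_user)]


def run_load_test(fetch, users, requests_per_user):
    """Run `users` simulated users concurrently and collect every latency."""
    latencies = []
    with ThreadPoolExecutor(max_workers=users) as pool:
        futures = [
            pool.submit(simulate_user, fetch, requests_per_user)
            for _ in range(users)
        ]
        for future in futures:
            latencies.extend(future.result())
    return latencies


def percentile(latencies, pct):
    """Nearest-rank percentile of a non-empty list of latencies."""
    ordered = sorted(latencies)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]
```

A run against a local instance might look like `run_load_test(lambda: fetch_page("http://localhost:8000/"), users=400, requests_per_user=10)`, where the URL is an assumption. Comparing `percentile(latencies, 95)` against 0.8 would check the 800ms figure for the slowest requests rather than only the average.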

samuel-yeom commented 4 years ago

Ideally, we would like to be ready for up to 1,000 users.

dlareau commented 4 years ago

I suspect that with the numbers I'm seeing we'll be able to support around 600-700 users before we see a noticeable difference in load times. I'm hoping to conduct some more solid testing this afternoon and see if there are any quick tweaks I can make to improve things. Shooting for 1000 is something I'd love to have as a goal, but it would require either a large optimization effort or moving to multiple load-balanced hosts, neither of which is likely to happen for this hunt.

The real problem is that most modern advice in this area is "just set up caching or a CDN", but our site relies on dozens to hundreds of small groups all getting different views of our content (teams see different hunt and puzzle pages depending on which puzzles they have unlocked/solved). One thing I had on the roadmap was caching pages on a per-team basis, flushing a team's cache when they solve a puzzle, and seeing if that helped, but that is a complicated effort that likely wouldn't be bug-free in time for this hunt.
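To make the per-team caching idea concrete, here is a rough sketch of the shape it could take. This is not code from the server. `TeamPageCache`, `get`, and `flush_team` are hypothetical names, and the sketch assumes a single process holding the cache in memory. With several gunicorn workers each process would have its own copy, so a real version would need a shared cache backend.

```python
class TeamPageCache:
    """Caches rendered pages per team and drops a team's pages on a solve."""

    def __init__(self, render):
        # `render(team_id, page_key)` is a hypothetical callable that builds
        # the page for one team; it stands in for the real view logic.
        self._render = render
        self._pages = {}  # team_id -> {page_key: rendered page}

    def get(self, team_id, page_key):
        """Return the cached page for this team, rendering it on a miss."""
        team_pages = self._pages.setdefault(team_id, {})
        if page_key not in team_pages:
            team_pages[page_key] = self._render(team_id, page_key)
        return team_pages[page_key]

    def flush_team(self, team_id):
        """Discard every cached page for one team, e.g. after a solve."""
        self._pages.pop(team_id, None)
```

The solve handler would call `flush_team(team_id)` once the solve is recorded, so only that team pays the re-render cost while every other team keeps its cached pages. Events that change pages for all teams at once would need their own invalidation path, which is part of why this is harder than it looks.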

Certainly the easiest solution (money aside) is horizontal scaling on something like AWS. The server is now completely dockerized and it would be easy to just spin up 2 or 4 or 8 instances of it. What isn't ready there is the load balancing layer between them, plus testing to make sure database interactions don't have any weird race conditions.
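As one illustration of the kind of race that matters once several hosts share a database, consider two requests recording the same solve at the same moment. The sketch below is an assumption-heavy example, not the server's schema. The `solves` table and the `record_solve` function are hypothetical, and `sqlite3` is used only to keep the example self-contained. The idea is to let a uniqueness constraint in the database decide the winner instead of a check-then-insert in application code.

```python
import sqlite3


def create_schema(conn):
    """Create a hypothetical solves table with one row per (team, puzzle)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS solves ("
        "team_id INTEGER NOT NULL, "
        "puzzle_id INTEGER NOT NULL, "
        "UNIQUE (team_id, puzzle_id))"
    )
    conn.commit()


def record_solve(conn, team_id, puzzle_id):
    """Insert the solve if absent; return True only if this call inserted it."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO solves (team_id, puzzle_id) VALUES (?, ?)",
        (team_id, puzzle_id),
    )
    conn.commit()
    # rowcount is 1 when the row was inserted and 0 when the constraint
    # caused the insert to be ignored.
    return cursor.rowcount == 1
```

Only the caller that gets `True` back should run follow-up work such as unlocking new puzzles, so a duplicate submission handled by another host does not trigger it twice.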

TL;DR: More testing coming soon, I hope we are okay for this hunt, scaling concerns are in the roadmap already.

dlareau commented 4 years ago

I would like to note here for future reference that everything went great on 4/25/20 on an AWS m5.2xlarge with 9 gunicorn workers and 1400 hunt participants over 400 teams. I want to set up load testing for more than 500 users (roughly the number my computer alone can simulate), but this issue can be pretty safely closed.
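For reference, a worker count like this can be pinned in a gunicorn config file, which is plain Python. The fragment below is only a sketch. The `workers` value comes from the setup described above, while the `bind` address is an assumption and the rest of the actual deployment config is not shown here.

```python
# gunicorn.conf.py (sketch)

# Number of worker processes, matching the 4/25/20 setup described above.
workers = 9

# Address and port to listen on; this value is an assumption.
bind = "0.0.0.0:8000"
```

It would be loaded with `gunicorn -c gunicorn.conf.py <wsgi_module>`, where the WSGI module name depends on the project layout.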

dlareau / puzzlehunt_server

Stress testing for a larger-than-usual number of participants #125