geerlingguy / beast-challenge

A control system for MrBeast's 1-100 challenge
GNU General Public License v3.0

Leader App Vote API - Build and Test Scaling to millions of votes #2

Closed · geerlingguy closed this 1 year ago

geerlingguy commented 1 year ago

The voting API will be the core functionality of the leader-app—it needs to be able to accept up to thousands of votes per minute (potentially for many minutes!) and dump that information back out in real-time.

For this issue my goals are to:

geerlingguy commented 1 year ago

If we really wanted to go deep, we could implement monitoring: https://medium.com/flask-monitoringdashboard-turtorial/monitor-your-flask-web-application-automatically-with-flask-monitoring-dashboard-d8990676ce83

geerlingguy commented 1 year ago

My daughter said we should respond with 418 I'm a Teapot when voting is closed, so that's what I'm gonna do!
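
Roughly, a minimal sketch of what that check could look like in the Flask route (the voting_open flag, route, and payload shape are just placeholders based on the curl example below, not the final leader-app code):

# Hypothetical sketch: reject votes with 418 I'm a Teapot once voting closes.
from flask import Flask, jsonify, request

app = Flask(__name__)
voting_open = True  # flipped elsewhere when a round starts or ends

@app.route("/vote", methods=["POST"])
def vote():
    if not voting_open:
        return jsonify({"error": "voting is closed"}), 418  # I'm a Teapot
    data = request.get_json()
    # ...store the vote (e.g. insert {"room_id": ..., "value": ...} into the DB)...
    return jsonify({"status": "ok"}), 201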

geerlingguy commented 1 year ago

Testing with curl:

curl -X POST http://127.0.0.1:5000/vote \
   -H 'Content-Type: application/json' \
   -d '{"room_id":3,"value":0}'

geerlingguy commented 1 year ago

We're getting 23 ms for a vote right now:

 20:17:46 ~ 
$ time curl -X POST http://127.0.0.1:5000/vote \
   -H 'Content-Type: application/json' \
   -d '{"room_id":3,"value":0}'

real    0.023
user    0.006
sys 0.008

But I would like to load test this a bit better (that's a single request against one thread, in development mode). Without debug mode on, I'm getting around 20 ms, so not much difference there. Is uwsgi actually faster?

geerlingguy commented 1 year ago

Going to use wrk with a Lua script, and eventually have it send multiple randomized requests so it can generate realistic vote data: https://stackoverflow.com/a/68597094/100134

Adding a script in a load-testing folder so it is easy to reproduce (and eventually test on the NUC that will run this thing).

geerlingguy commented 1 year ago

Synthetic load test:

 20:39:56 beast-game/leader-app/load-testing 
$ wrk "http://127.0.0.1:5000/vote" -s wrk_vote.lua --latency -t 5 -c 20 -d 30s
Running 30s test @ http://127.0.0.1:5000/vote
  5 threads and 20 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency    88.77ms  224.31ms   1.99s    90.68%
    Req/Sec   226.83    134.04   810.00     66.39%
  Latency Distribution
     50%    2.57ms
     75%   47.63ms
     90%  283.28ms
     99%    1.17s 
  16326 requests in 30.11s, 2.77MB read
Requests/sec:    542.29
Transfer/sec:     94.27KB

Doing good on my M2 MacBook Air, with the built-in server (not using WSGI). I did not test any other functionality at the same time; a more realistic test would involve randomized data instead of the same request, plus another script hitting a few other endpoints for tally data or room state (lighting colors and LEDs).
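
If I want to approximate that mixed read load later, a little poller like this could run alongside wrk. The /tally and /room/<id> endpoints here are placeholders for whatever the real read endpoints end up being:

# Hypothetical read-side load: poll tally/room-state endpoints while wrk hammers /vote.
# The endpoint paths are placeholders, not the real leader-app routes.
import random
import time

import requests

BASE = "http://127.0.0.1:5000"

while True:
    requests.get(f"{BASE}/tally")  # overall vote tally
    requests.get(f"{BASE}/room/{random.randint(1, 100)}")  # room state (lights, LEDs)
    time.sleep(0.1)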

geerlingguy commented 1 year ago

So, writes actually scale quite well. The reads are fast too, but when you need to iterate through a list of 50,000 votes about 100 times (once per room) it gets a little slower, lol.

I need to optimize my code for the tally side a bit, but basically, I have it working with wrk and this lua script: https://github.com/geerlingguy/beast-game/blob/master/leader-app/load-testing/wrk_vote.lua

I tested multiple rounds at 5 threads and 10 concurrent connections, for 30 seconds each, and every time was able to sustain over 500 votes per second, with latency averaging 2-10 ms per request. If I just do 1 thread and 1 connection, I can hit 1,358 req/s with under 1ms latency for almost every request:

 15:11:44 beast-game/leader-app/load-testing 
$ wrk "http://127.0.0.1:5000/" -s wrk_vote.lua --latency -t 1 -c 1 -d 10s
Running 10s test @ http://127.0.0.1:5000/
  1 threads and 1 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   792.51us    1.46ms  21.97ms   98.37%
    Req/Sec     1.36k    83.73     1.47k    84.16%
  Latency Distribution
     50%  602.00us
     75%  644.00us
     90%  733.00us
     99%    8.45ms
  13720 requests in 10.10s, 2.33MB read
Requests/sec:   1358.43
Transfer/sec:    236.13KB

The one performance concern at this point is if we have a round where the goal is "hit the buttons as fast as possible" and then they let all 100 rooms do it for like an hour. At that point, the writes are still fast, but the tally page code (which is roughly O(n³)) starts bogging down to 200-400 ms per page load, with extra latency on the database side...
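
One likely fix on the tally side is a single pass over the vote list that buckets votes by room, instead of re-scanning all the votes once per room. A rough sketch, assuming each stored vote looks like the /vote payload:

# Single-pass tally: O(total votes) instead of re-scanning the list per room.
from collections import defaultdict

def tally(votes):
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for vote in votes:
        room = totals[vote["room_id"]]
        room["count"] += 1
        room["sum"] += vote["value"]
    return totals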