Perf drop in latest releases - unstable `postgrest-loadtest` results

steve-chavez commented 1 year ago

Problem

On https://github.com/PostgREST/postgrest/pull/2523, the loadtest reported 420.02257774518125 throughput.

On main, the loadtest reports a 338.6473662241433 throughput.

Proposal

We need to check on which commit(s) the regression happened and then find a solution.

@wolfgangwalther I recall you ran the postgrest-loadtest on a commit range before. Perhaps you could share the script so I can check? (the script would be useful as one of our nix commands too).

wolfgangwalther commented 1 year ago

@wolfgangwalther I recall you ran the postgrest-loadtest on a commit range before. Perhaps you could share the script so I can check? (the script would be useful as one of our nix commands too).

Unfortunately, I don't have that script anymore.

develop7 commented 1 year ago

I've tried to reproduce the issue with postgrest-loadtest-against0a1564ba5aaa8175dcb738271967944652cf4ce9 locally

According to the report, requests throughput does drop, along with request rate, but not 25% down:

	Rate	Throughput
0a1564	710.57	645.97
main	687.80	625.27
diff	-4%	-4%

While there is a slowdown, it's not awful lot. Might be a fluke; worst case is the slowdown is amplified by the environment Github actions are running in. I'd suggest we'd try running loadtest in a dedicated VM (in AWS for now) and see if we could reproduce the issue.

postgrest-loadtest report

```text ❯ postgrest-loadtest-against 0a1564ba5aaa8175dcb738271967944652cf4ce9 Running loadtest on "0a1564ba5aaa8175dcb738271967944652cf4ce9"... Preparing worktree (detached HEAD 0a1564ba) postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-qzo/socket' -U postgres postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-qzo/db.log delaying data to/from postgres by 0ms Building postgrest (nix)... done. Starting postgrest... done. delaying data to/from PostgREST by 0ms 2023/09/11 16:59:39 accept unix /run/user/1000/postgrest/postgrest-with-slow-postgrest-y1R/postgrest.socket: use of closed network connection 2023/09/11 16:59:39 accept unix /run/user/1000/postgrest/postgrest-with-slow-pg-SlF/socket/.s.PGSQL.5432: use of closed network connection Requests [total, rate, throughput] 42638, 710.57, 645.97 Duration [total, attack, wait] 1m0s, 1m0s, 438.81µs Latencies [min, mean, 50, 90, 95, 99, max] 72.607µs, 1.406ms, 467.721µs, 1.413ms, 10.105ms, 10.945ms, 19.807ms Bytes In [total, mean] 43138986, 1011.75 Bytes Out [total, mean] 164090460, 3848.46 Success [ratio] 90.91% Status Codes [code:count] 200:23258 201:3876 204:11628 400:3876 Error Set: 400 Bad Request Done running on "0a1564ba5aaa8175dcb738271967944652cf4ce9". Running loadtest on HEAD... postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-sOm/socket' -U postgres postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-sOm/db.log delaying data to/from postgres by 0ms Building postgrest (nix)... done. Starting postgrest... done. delaying data to/from PostgREST by 0ms 2023/09/11 17:04:39 accept unix /run/user/1000/postgrest/postgrest-with-slow-postgrest-ojG/postgrest.socket: use of closed network connection 2023/09/11 17:04:39 accept unix /run/user/1000/postgrest/postgrest-with-slow-pg-ktm/socket/.s.PGSQL.5432: use of closed network connection Requests [total, rate, throughput] 41274, 687.80, 625.27 Duration [total, attack, wait] 1m0s, 1m0s, 595.107µs Latencies [min, mean, 50, 90, 95, 99, max] 69.772µs, 1.452ms, 465.196µs, 1.469ms, 10.637ms, 11.475ms, 19.373ms Bytes In [total, mean] 42521098, 1030.22 Bytes Out [total, mean] 158840920, 3848.45 Success [ratio] 90.91% Status Codes [code:count] 200:22514 201:3752 204:11256 400:3752 Error Set: 400 Bad Request Done running on HEAD. ```

develop7 commented 1 year ago

On the other hand, the loadtest against the commit before the v11.2.0 reports throughput = 261.92, while the one before the v11.1.0 does 323.83 which results in 20% drop.

develop7 commented 1 year ago

Ran postgrest-loadtest-against v11.1.0 locally, no luck — less that 5% drop

Reviewed loadtest results for handful of commits between v11.1.0 and v11.2.0. Most of the tests have their throughput in 320-350, and while there are some drops to 260s, they do not hold — loadtest against next commit has throughput back in 300s.

To me it looks like an irregular fluke so far.

postgrest-loadtest-against v11.1.0

```console ❯ postgrest-loadtest-against v11.1.0 Running loadtest on "v11.1.0"... Preparing worktree (detached HEAD 1f13e43a) postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-uj1/socket' -U postgres postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-uj1/db.log delaying data to/from postgres by 0ms Building postgrest (nix)... done. Starting postgrest... done. delaying data to/from PostgREST by 0ms 2023/09/12 17:17:00 accept unix /run/user/1000/postgrest/postgrest-with-slow-postgrest-3O6/postgrest.socket: use of closed network connection 2023/09/12 17:17:00 accept unix /run/user/1000/postgrest/postgrest-with-slow-pg-YDd/socket/.s.PGSQL.5432: use of closed network connection Requests [total, rate, throughput] 41890, 698.08, 634.62 Duration [total, attack, wait] 1m0s, 1m0s, 427.59µs Latencies [min, mean, 50, 90, 95, 99, max] 74.201µs, 1.431ms, 480.298µs, 1.356ms, 10.445ms, 10.965ms, 19.283ms Bytes In [total, mean] 42382350, 1011.75 Bytes Out [total, mean] 161211680, 3848.45 Success [ratio] 90.91% Status Codes [code:count] 200:22850 201:3808 204:11424 400:3808 Error Set: 400 Bad Request Done running on "v11.1.0". Running loadtest on HEAD... postgrest-with-postgresql-15: You can connect with: psql 'postgres:///postgres?host=/run/user/1000/postgrest/postgrest-with-postgresql-15-0iR/socket' -U postgres postgrest-with-postgresql-15: You can tail the logs with: tail -f /run/user/1000/postgrest/postgrest-with-postgresql-15-0iR/db.log delaying data to/from postgres by 0ms Building postgrest (nix)... done. Starting postgrest... done. delaying data to/from PostgREST by 0ms 2023/09/12 17:18:04 accept unix /run/user/1000/postgrest/postgrest-with-slow-postgrest-oDJ/postgrest.socket: use of closed network connection 2023/09/12 17:18:05 accept unix /run/user/1000/postgrest/postgrest-with-slow-pg-Nt9/socket/.s.PGSQL.5432: use of closed network connection Requests [total, rate, throughput] 40614, 676.90, 615.36 Duration [total, attack, wait] 1m0s, 1m0s, 479.307µs Latencies [min, mean, 50, 90, 95, 99, max] 74.712µs, 1.476ms, 483.312µs, 1.504ms, 10.708ms, 11.614ms, 19.803ms Bytes In [total, mean] 41841298, 1030.22 Bytes Out [total, mean] 156300820, 3848.45 Success [ratio] 90.91% Status Codes [code:count] 200:22154 201:3692 204:11076 400:3692 Error Set: 400 Bad Request Done running on HEAD. ```

steve-chavez commented 1 year ago

While there is a slowdown, it's not awful lot. Might be a fluke; worst case is the slowdown is amplified by the environment Github actions are running in.

@develop7 Those are great news! Thanks for reporting!

I'd suggest we'd try running loadtest in a dedicated VM (in AWS for now) and see if we could reproduce the issue.

That would be great. Having a dedicated AWS instance where postgrest-loadtest runs on every CI run would give the most stable results. Seems doable considering we already use AWS for our ARM builds.

steve-chavez commented 1 year ago

Having a dedicated AWS instance where postgrest-loadtest runs on every CI run

@laurenceisla Just told me that the above is not possible, the github secrets required for the AWS instance to run are not exposed to PRs, only for the main branch. This is why our ARM build only runs on main.

develop7 commented 1 year ago

I'll summarize here (for posterity's sake) our private chat with @steve-chavez from yesterday.

I've reviewed the loadtest reports across commits in main branch and, while there are significant drops here and there, they are not preserved further in history. Therefore I think these drops are caused by uneven resources budgeting for action runners on GitHub. We could avoid that by moving runners to the environment with more control on resources (dedicated server, spot instances on AWS, etc).
Although, while producing unstable results across runs, loadtest CI job does two separate loadtest runs (with postgrest-loadtest-against script) and these numbers look in line within a run, doing their job after all.
Essentially, comparing loadtest metrics across CI runs is unreliable.
On the other hand, loadtest against main branch runs twice against the same commit. I've suggested we'd make it run against latest release instead, which would require some tweaking, which @steve-chavez approved.

Bottom line is: while there's probably no performance drop against latest release, I'll fix our loadtest CI to make sure there isn't for sure.

steve-chavez commented 1 year ago

Closing this as it was already clarified there was no noticeable perf loss.

PostgREST / postgrest

Perf drop in latest releases - unstable `postgrest-loadtest` results #2842

Problem

Proposal