
Performance plugin from the WordPress Performance Group: a collection of standalone performance modules.
https://wordpress.org/plugins/performance-lab/

Performance Metrics Stabilization #849

Open felixarntz opened 9 months ago

felixarntz commented 9 months ago

Overview issue for improving the reliability of performance benchmarks.

TL;DR: In an ideal world, running a benchmark twice against the exact same WordPress site should result in the exact same load time metrics. Realistically, this is probably impossible, but at the moment the variance is often too high to draw meaningful conclusions. So we need to find ways to reduce it.

swissspidy commented 8 months ago

Currently doing some in-depth testing on a dedicated VM to see how we could reduce variance and bring some consistency to the test results.

felixarntz commented 8 months ago

Sharing this Slack thread here for visibility.

Two very high level questions to answer:

swissspidy commented 8 months ago

Ah neat, haven't seen that post yet 👍 Looks like we're doing similar things, though I am also looking at web vitals, as well as things like better resets, network throttling, etc. But no R unfortunately 😅

Ultimately I think we won't get around using some proper statistical approach if we want as much accuracy as possible.

dmsnell commented 8 months ago

It looks like GitHub Actions workers aren't producing consistent enough results.

I've spent a considerable amount of time working with GitHub Actions in Gutenberg and overall it has been more reliable than I expected based on all the testing we've done. That is, all of the problems we've experienced with performance testing have been resolved in the Gutenberg test runners themselves, not by jumping off of GitHub Actions.

I'm guessing this will be harder in Core because I think there's evidence that we should be running our tests for at least a couple of hours before drawing conclusions when testing "the small", and GitHub Actions has a cutoff of six hours. Still, I wonder what is making us not trust it?

Is there more we can talk about to uncover the problems there? I suspect some of the problems we've witnessed on GitHub Actions are present in every testing environment, even the most isolated and controlled.

running a benchmark twice against the exact same WordPress site should result in the exact same load time metrics

Want to concur with this, and also note the major caveat that we often have different runtime modes in Core that depend on whether certain computations and caches have already been handled or not. For instance, a site quickly reaches out to download the list of plugin versions for updates, and that slows down one or more early requests. We could disable this and all of the transient behaviors, but then we'd be intentionally overlooking things that have real and measurable performance impacts for the sake of establishing stability in the reported numbers.

We can force identical results from the benchmark and still miss the mark. A different way I think we could talk about your goal is to eliminate as much external bias from the benchmarks as is possible.

it's quite possible that in the pursuit of eliminating variability in the testing we might find issues that are in Core. one possibility here would be finding things that themselves are performance optimizations that make average requests faster but occasionally make requests much slower.

we may not be able to eliminate variability because it might be inherent in the code (could be something as indirect as MySQL performing maintenance work, which isn't related to WordPress, but is related in that it happens "out there" in the deployed world and WordPress has some way it responds to that which affects sites). we may be able to start identifying clusters that represent those different code paths and try to benchmark them separately to understand their behavior, or examine ways to eliminate the clusters and re-unify the runtime performance.

at the cost of intentionally ignoring the worst performance issues, one way we can start if we haven't already, is to ignore outliers in the data before reporting any metrics. this can't happen if we are updating a metric as we go, such as the average. we'd need to have every sample before computing the aggregate measures so that we can determine if a given sample is an outlier or not. (and I say "outlier" but mostly I think many of these will be page renders that are much slower because they are doing more work than most requests).
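as a sketch of that two-pass approach (the 1.5 × IQR fence below is just an example cutoff I'm choosing for illustration, not something our scripts currently do):

```ts
// Sketch: collect every sample first, then drop outliers before aggregating.
// The 1.5 × IQR fence is a conventional choice, not what the current scripts do.

function quantile(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const base = Math.floor(pos);
  const rest = pos - base;
  return sorted[base] + rest * ((sorted[base + 1] ?? sorted[base]) - sorted[base]);
}

function dropOutliers(samples: number[]): { kept: number[]; dropped: number[] } {
  const sorted = [...samples].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const lower = q1 - 1.5 * iqr;
  const upper = q3 + 1.5 * iqr;
  return {
    kept: sorted.filter((v) => v >= lower && v <= upper),
    dropped: sorted.filter((v) => v < lower || v > upper),
  };
}

// Example: hypothetical wp-total timings in ms; the last value is a slow, cache-cold render.
const { kept, dropped } = dropOutliers([52, 54, 51, 55, 53, 52, 118]);
console.log(kept, dropped); // the dropped samples stay available as raw data for review
```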

taking this one step further we can always expose that raw data for further review. I've done this a few times in Gutenberg where we can save the raw data as an artifact on a CI performance test run in addition to reporting the averages and deltas for a PR. usually that artifact is ignored, but it's incredibly useful to have when you find something surprising in a test run.


I'm on the hunt for a large and representatively complicated website. empty testing sites are so unrealistic that the numbers don't mean too much. it's like comparing different implementations of a function to count grapheme clusters in a string, but only testing on empty strings. I'm guessing that certain things in WordPress that aren't performance sensitive are being over-emphasized.

On that note, the twentytwentyfour home page might be a good change because it introduces a lot of complexity, which will recenter the metrics on things that have more of a normal impact. It would still be better, I think, if we had a site with a hundred posts, a thousand comments, lots of media, and lots of customization, maybe even a handful of common plugins installed.

We can perform database dumps and restores to start in a pristine state on each run.
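a rough sketch of what that reset could look like from a Node harness using WP-CLI (the snapshot path and exact commands here are illustrative, not part of any existing workflow):

```ts
// Sketch: snapshot the database once, then restore it before every benchmark run
// so each run starts from the same pristine state. Paths are illustrative.
import { execSync } from 'node:child_process';

const SNAPSHOT = '/tmp/benchmark-baseline.sql';

export function snapshotDatabase(): void {
  execSync(`wp db export ${SNAPSHOT}`, { stdio: 'inherit' });
}

export function restoreDatabase(): void {
  execSync(`wp db import ${SNAPSHOT}`, { stdio: 'inherit' });
  // Flush the object cache too, so cache priming is identical across runs.
  execSync('wp cache flush', { stdio: 'inherit' });
}
```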

swissspidy commented 8 months ago

Still, I wonder what is making us not trust it?

Simply put: sometimes the reported median wp-total can be e.g. 50ms, and in the next run it can be 80ms. Makes it a bit hard to build trust and to find outliers.

eliminate as much external bias from the benchmarks as is possible.

Yeah that's probably better worded. This is partially what I'm looking into right now. Cron, external http requests, etc.

at the cost of intentionally ignoring the worst performance issues, one way we can start if we haven't already, is to ignore outliers in the data before reporting any metrics. this can't happen if we are updating a metric as we go, such as the average.

At least we're using medians right now to ignore outliers, so that's a start.

we'd need to have every sample before computing the aggregate measures so that we can determine if a given sample is an outlier or not. taking this one step further we can always expose that raw data for further review.

That's a long-term goal: just putting all the raw numbers into something like Grafana and then doing further analysis there.

empty testing sites are so unrealistic that the numbers don't mean too much.

FWIW core is not using empty sites. We use https://github.com/WordPress/theme-test-data as the base. Not a hundred posts, but it's a start. But agreed, that's still far away from a "typical" small WordPress site.

swissspidy commented 8 months ago

My draft doc with some preliminary testing results can be found here: https://docs.google.com/document/d/1elp9i9syVR6hCZEc_mRUl4E58JIbgBeO-Y6Xj85xiHE/edit

I would appreciate some more eyes on it already, even though I am not sure yet what to make of these results.

dmsnell commented 8 months ago

While this is to be expected in an inconsistent environment like GitHub Actions

For what it's worth, I spent a good amount of time trying to hunt down variability related to GitHub Actions test runners and in the end I couldn't find sufficient evidence to indicate that it was a real problem.

This high variance in performance metrics can be seen in places like the WordPress Code Vitals dashboard, which is fed data from GitHub Actions on every WordPress core commit.

The high variance can in some ways be a boost to performance tracking, because it filters out false positives in minor impacts. For something to register as a real performance improvement or regression it has to have an impact that's measurable above the noise.

also with the built-in Docker environment

Do we really want this since it introduces some artificial performance changes to the environment? I/O is particularly affected by this. Seems like running PHP, Apache, and mysql natively outside of Docker would lead to more realistic measurements.

Although the standard deviation was lower within the Docker environment, the tests also took around 16% longer to run. I wonder if some I/O bottleneck within Docker is actually behind this reduction in standard deviation; for example, code that ran more erratically was covered up and normalized because the I/O wait time became more dominant in the metrics. The numbers might not be better because the variability is lower; it could be that we're masking the variability and hiding the measurements we care about.

Mask these effects by applying some sort of sleep() between requests or some request latency in its network throttling.

Collecting results at a range of these delays might be quite revealing. I'm going to try and get some tests going on a Playground instance, because we control the entire environment within the Playground, letting us do interesting things like add arbitrary delay or bursting effects on file I/O and database I/O.


Thanks for the comprehensive post, @swissspidy. One thing that jumps out to me is that the count of test runs doesn't seem to correlate, at a glance, with a reduction in standard deviation across the various testing methodologies. I think there's probably something else hiding in there. Maybe it's more an expression of a bi-modal or multi-modal distribution. I sure would like to understand better why things seem inherently so variable, but I suppose the answer may be that WordPress runtime performance is inherently variable.

For what it's worth I think time is on our side. Since we're still in our infancy of tracking performance across commits I don't see why we can't start slow and run experiments that take more time and result in more useful data. With Gutenberg I often wish we didn't run performance tests on every PR, but rather on trunk at some periodic schedule: once a day, once an hour, as fast as the tests allow etc… I still think there's value in running a test not only across multiple samples, but across multiple hours due to bias I've seen on my laptop and on a dedicated server. Something happens and requests take longer or shorter and it's hard to extract this bias from the results if the test run takes less wall time than a few hours.

Running tests on a time schedule loses the direct association of a performance change with a given PR, but I've also found that the promise of that per-PR test run has failed us just as much. Performance regressions in Gutenberg have often lagged the PR that introduced them because of one reason or another, often that the regression isn't triggered until some other work coincides with it. At least if tests run every day there's a small set of PRs that might have introduced it, so that means it's not that difficult to start bisecting and figuring out where the problem lies.


The WordPress organization on GitHub has an enterprise allotment of GitHub Action runners; we shouldn't need to worry about using up any quotas there.


There could be value in having some normal tests that are more variable and report over time (as I think @youknowriad was trying to communicate, given how valuable that historical tracker has been for Gutenberg), and then shifting into a separate statistical process when we think we have a regression or an improvement. The statistical methods can help uncover the real impact even in a highly variable environment, but they take a lot of time and are best for these individual questions: did this change impact performance, and how?

In other words, maybe we start by getting some metrics going and not worry so much about the variability, because that variability doesn't obscure longer-term trends, and can even help keep us from misdiagnosing tiny performance impacts that may or may not exist. We can improve it with time, and even run experiments on our experiments in incremental ways. I ran many experiments in Gutenberg's performance tests to address different variability and bias issues, and most turned out to lack evidence that they improved things even though they initially seemed to help.

swissspidy commented 8 months ago

Do we really want this since it introduces some artificial performance changes to the environment? I/O is particularly affected by this. Seems like running PHP, Apache, and mysql natively outside of Docker would lead to more realistic measurements.

Docker is not uncommon in the hosting space, plus it makes it easier to run tests locally. So there are pros and cons to this.

The WordPress organization on GitHub has an enterprise allotment of GitHub Action runners; we shouldn't need to worry about using up any quotas there.

Good to know! Then we could also try things like using larger runners with more cores.

In other words, maybe we start by getting some metrics going and not worry so much about the variability

I tend to agree, also with the point about testing multiple times a day, etc. If we can then send the raw numbers to the dashboard and do the statistical analysis there, that would certainly be valuable. That's one of the reasons I've been looking into things like Grafana.

joemcgill commented 8 months ago

I've spent a good amount of time over the last few weeks digging into the benchmarking automation that we've developed in https://github.com/swissspidy/compare-wp-performance for comparing different WP versions against each other. This automation makes use of the two main benchmarking CLI scripts for server timing and web vitals that we've been using as a performance team.

One of the questions I wanted to answer is whether there is a statistical way of representing the amount of variance that is expected during a set of runs rather than trying to eliminate the variance completely. That way, we could tell if a measured change was statistically significant, or if it was just within the expected variance that is inherent in our benchmarking methodology.
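(One possible shape for this: bootstrap the difference in medians between the two sets of samples and check whether the resulting interval clears zero. The sketch below is only an illustration of the idea, not something our scripts implement today.)

```ts
// Sketch: bootstrap a confidence interval for the difference in medians between
// two sets of benchmark samples. If the interval excludes zero, the change is
// unlikely to be explained by run-to-run variance alone. Illustrative only.

function median(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function resample(values: number[]): number[] {
  return values.map(() => values[Math.floor(Math.random() * values.length)]);
}

function medianDiffInterval(before: number[], after: number[], iterations = 2000): [number, number] {
  const diffs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    diffs.push(median(resample(after)) - median(resample(before)));
  }
  diffs.sort((a, b) => a - b);
  // 95% percentile interval.
  return [diffs[Math.floor(iterations * 0.025)], diffs[Math.floor(iterations * 0.975)]];
}

// If both bounds are positive, the new version is measurably slower; if both are
// negative, measurably faster; if the interval straddles zero, the difference is
// within the noise of the benchmark.
```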

While I haven't answered that question so far, I have noticed that even within the same environment, the dispersion between two sets of runs can have a big impact on how benchmarks of two versions compare. My working theory for reporting benchmarks was that if we took multiple runs of our comparison workflow (each of which takes 100 samples for each version) and averaged the median (p50) value across the runs, we would at least have a useful measurement to report. To do this efficiently, I modified the GH workflow to do 5 runs in parallel and post the results to the summary for the GH action. By taking multiple runs in separate GH workers, rather than larger numbers of samples on a single runner, we could at least account (in theory) for the bias each worker introduces into the results. Here's a recent example comparing WP 6.3.2 to 6.4-RC4.

These reports include some new variability numbers that we recently added to the benchmark scripts so we can see the standard deviation (SD), mean absolute deviation (MAD), and interquartile range (IQR) for each set of benchmark samples. Generally, I would expect two sets of benchmarks taken in the same GH worker to result in similar distributions, but that doesn't seem to be the case, which further undermines the usefulness of direct comparisons. For example, if you look at the LCP (p50) numbers for Run 1 of this set of benchmarks, it shows that LCP got 52.52% worse between WP 6.2.3 and WP 6.3.2, which we know is not the case. However, you can also see that the dispersion calculated by MAD and IQR differs by over 800% between the two runs, so these don't seem like useful comparisons at all because the shapes of the distributions are so divergent. When the dispersion is similar between the two benchmarks, we at least see results that are somewhat expected and repeatable, but I'm curious how to account for the comparisons where the distributions are so different.
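For reference, the three dispersion measures work roughly like this (a simplified sketch, not the exact code in the benchmark scripts):

```ts
// Sketch of the three dispersion measures referenced above, using a
// nearest-rank approximation for the quartiles. Illustrative only.

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function standardDeviation(values: number[]): number {
  const m = mean(values);
  return Math.sqrt(mean(values.map((v) => (v - m) ** 2)));
}

function meanAbsoluteDeviation(values: number[]): number {
  const m = mean(values);
  return mean(values.map((v) => Math.abs(v - m)));
}

function interquartileRange(values: number[]): number {
  const s = [...values].sort((a, b) => a - b);
  const q = (p: number) => s[Math.min(s.length - 1, Math.round(p * (s.length - 1)))];
  return q(0.75) - q(0.25);
}
```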

felixarntz commented 8 months ago

@joemcgill I had also run into a benchmark recently where all metrics were suddenly 50% worse. As you say, this is clearly not the case. While statistically this is probably not appropriate, I would suggest we discard such a run and conduct another one if we are going to use the "average" of the 5 medians. Alternatively, we could stick with the median of the 5 medians in order to avoid such problems. I think statistically speaking the latter would be more appropriate.

I don't feel strongly either way, but I do think we need to make sure that such extreme outlier runs which are clearly wrong don't impact the overall results too much.
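To illustrate the difference between the two aggregation options (with made-up numbers):

```ts
// Sketch: aggregating the per-run medians from 5 parallel workflow runs.
// Using the median of medians makes one wildly skewed run much less influential.
const runMedians = [52.1, 53.4, 51.8, 52.9, 80.6]; // hypothetical wp-total p50s; the last run is an outlier

const meanOfMedians = runMedians.reduce((a, b) => a + b, 0) / runMedians.length; // ≈ 58.2, pulled up by the bad run
const medianOfMedians = [...runMedians].sort((a, b) => a - b)[Math.floor(runMedians.length / 2)]; // 52.9
```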

joemcgill commented 8 months ago

but I do think we need to make sure that such extreme outlier runs which are clearly wrong don't impact the overall results too much

Yes of course, I totally agree. When I first encountered this problem I assumed it was likely an edge case (just some random anomaly because of the GH worker behaving oddly), but I'm seeing this type of thing show up often enough in multi-run benchmark results that it seemed important to raise, since even our WP Core and Gutenberg workflows are likely affected by the same type of dispersion anomaly.

One thing to keep in mind is that we are using very similar processes for slightly different purposes, so the tactics might be a bit different. Essentially we're using benchmarks for:

  1. Taking regular benchmarks during development to identify changes (positive or negative) to performance.
  2. Comparing two versions of WP in order to share a qualitative difference between versions.

For the former, we can accept a higher level of variance since we're really interested in the trends across time, not just one snapshot. For that use case, I think the article about step fitting that @swissspidy linked in his exploration doc is an interesting idea to explore further.

For the latter, we really need to improve the methodology in ways that make the qualitative comparisons understandable without obscuring all of the environmental factors that necessarily impact the actual measurements. Understanding why some of these runs are demonstrating such unexpected divergence in dispersion on the same system would help us here, I think.

youknowriad commented 8 months ago

Comparing two versions of WP in order to share a qualitative difference between versions.

If the comparison between the two WP versions is done in the same job (the same CI job runs the two versions), it is totally OK to trust the numbers IMO (assuming the numbers have shown some stability in the graphs).

It is true that the numbers between different jobs can't be trusted; that's why we "normalize" in the graphs. My guess is that GitHub runners use virtualization, which is highly impacted by the number of virtual instances on the same hardware. I'm not sure there's anything we can do here.

Edit: when I say trust the numbers, I mean trust the relative difference between the two versions; the absolute value of the numbers is very dependent on the runner instance and can't be used as a reference value.

joemcgill commented 8 months ago

If the comparison between the two WP versions is done in the same job (the same CI job runs the two versions), it is totally OK to trust the numbers IMO (assuming the numbers have shown some stability in the graphs).

This is the assumption that I had as well, but I'm surprisingly seeing many cases where, even in the same CI job, data collected from two versions are not comparable because they exhibit very different characteristics in variance. To try to visualize this, think about two sets of samples being represented as box plots. Generally, in the same job, I would expect the box plots for both versions being tested to have a similar shape, so we could measure how much the box has moved between test A and test B. However, I'm seeing many instances where the two boxes are completely different, even within the same CI run, which makes me doubt the initial assumption unless this could be accounted for (and hopefully eliminated).

youknowriad commented 8 months ago

I'm seeing many instances where the two boxes are completely different, even within the same CI run, which makes me doubt the initial assumption unless this could be accounted for (and hopefully eliminated).

We had cases like this in Gutenberg performance tests as well, once very recently, as you can see on the "first block (site editor)" metric: https://www.codevitals.run/project/gutenberg

Generally it's because our testing/metric is not deterministic enough and can be subject to things like "timing"/"autosave triggers"... or other random things. This, in general, has nothing to do with the environment and needs to be fixed metric by metric separately. For instance, this is how we fixed it for the metric above: https://github.com/WordPress/gutenberg/pull/55922

joemcgill commented 8 months ago

Generally it's because our testing/metric is not deterministic enough

I think there's something to this. Even in the runs where there is wild variance in CWV measurements, the server response time metrics are pretty stable. Likely we need to further investigate how to stabilize CWV specifically.

swissspidy commented 8 months ago

Likely we need to further investigate how to stabilize CWV specifically.

If you check out the doc I shared above, I got some promising results with network throttling à la Lighthouse

joemcgill commented 8 months ago

Yep, was just looking at that! The Fast 3G results in solution F are the best results I've seen. I'm curious how you implemented the throttling, and whether we could introduce network latency without changing download speed and still see similar results with less of a sacrifice on overall timing. Here are the throttling options under the hood for Puppeteer's "Fast 3G" settings, for comparison.
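For reference, this is roughly how the built-in profile and a latency-only variant could be applied with Puppeteer (the target URL and the 150 ms figure below are placeholders, not values from our scripts):

```ts
// Sketch: apply Puppeteer's predefined "Fast 3G" profile, or add latency only
// via the raw CDP command. Illustrative values.
import puppeteer, { PredefinedNetworkConditions } from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Full "Fast 3G" profile: limited throughput plus roughly 560 ms of added latency.
await page.emulateNetworkConditions(PredefinedNetworkConditions['Fast 3G']);

// Latency-only variant: leave throughput unthrottled (-1) and add 150 ms of RTT.
const client = await page.createCDPSession();
await client.send('Network.emulateNetworkConditions', {
  offline: false,
  latency: 150,
  downloadThroughput: -1,
  uploadThroughput: -1,
});

await page.goto('http://localhost:8889/', { waitUntil: 'networkidle0' });
await browser.close();
```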

swissspidy commented 8 months ago

Here's some of my throwaway code for that: https://github.com/swissspidy/wordpress-develop/pull/42/files#diff-3ffa7f38ac5fb9027780e74315b92dcbc18eec9042c34cfcfa7d9f3185d89b8f

felixarntz commented 8 months ago

+1 for looking more into the throttling (I just finished a review of the doc finally).

We could also try updating https://github.com/swissspidy/compare-wp-performance with it as a test, which should be easy given that benchmark-web-vitals already supports throttling (props @westonruter).

joemcgill commented 8 months ago

For grins, I pushed a branch of the Compare WP Performance workflow that uses throttling for benchmarking web vitals and reduced the number of samples to 20, so I could get a quick idea of the impact of that change. Here are the results. While there aren't any major outliers, the dispersion didn't seem as good as the example in @swissspidy's doc, so I'm not sure this is the answer.

felixarntz commented 8 months ago

Regardless of all the ideas explored in that doc, the medians in it are already much closer to each other between different runs than what we're seeing in the GitHub workflows. That makes me think that maybe the most important change is to use a dedicated cloud environment, like @swissspidy did for all of the numbers in the doc?

dmsnell commented 7 months ago

I would expect the box plot for both versions being tested to have a similar shape

@joemcgill we may want to be suspicious about this. any code change might have an impact on variance. in fact, the more caching we add the more variance we should expect, because that introduces at least two performance modes for a given code path: one primed, and one un-primed. another example would be if we add something that runs every X requests to perform some cleanup or maintenance, similar to a WP CRON job. every divergent path of code represents a different performance characteristic which will spread the expected variation in the samples.

it would be good to assume equal variance if we were measuring two groups from the same population, but in the case that we're changing code, we're actually changing the population, and the assumption can't hold. the same is true of a test runner; SSDs and spinning disks are likely to have different variance in their response times due to their different construction, as does system load on the test runner.

there's a lot of value in analyzing the variance on its own. often a major performance improvement might actually increase the median response time but reduce the variation or, more importantly, the long-tail skew, since a big improvement in the worst case is often better than a marginal improvement in the average case.
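a small sketch of what reporting the tail alongside the median could look like (the percentile math here is simplified, and none of this is in the current scripts):

```ts
// Sketch: report the long tail alongside the median so that a change which
// trades a slightly slower median for a much better worst case is visible.
function percentile(samples: number[], p: number): number {
  const s = [...samples].sort((a, b) => a - b);
  return s[Math.min(s.length - 1, Math.ceil(p * s.length) - 1)];
}

function tailSummary(samples: number[]) {
  const p50 = percentile(samples, 0.5);
  const p95 = percentile(samples, 0.95);
  const p99 = percentile(samples, 0.99);
  return { p50, p95, p99, tailRatio: p95 / p50 }; // tailRatio grows with long-tail skew
}
```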

This, in general, has nothing to do with the environment and needs to be fixed metric by metric separately

Want to echo what @youknowriad said here. I think there are some biases that are inherent to every testing environment (things I've only been able to mitigate by testing across multiple hours of real clock time) and biases that are inherent to the code we're testing. That is, we're not accounting for something inside of WordPress that causes divergence in the render time for a given page, even in a very controlled setting.

we are using very similar processes for slightly different purposes, so the tactics might be a bit different

👍 entirely agree. seeing the trend is relatively easy. determining if a specific change made a real impact, and what kind of impact it is, is much, much harder, and we're almost always bound to get it wrong, but we can provide insight into the changes.

obvious wins will always stick out no matter how poor the testing environment, but small or micro-optimizations are always going to take somewhat extreme measures to validate, and those validations will often be misleading because we're not asking the right questions in our tests. they are just as likely to harm performance in a place we aren't looking while they improve performance where we are.

swissspidy commented 3 months ago

Collecting results at a range of these delays might be quite revealing. I'm going to try and get some tests going on a Playground instance, because we control the entire environment within the Playground, letting us do interesting things like add arbitrary delay or bursting effects on file I/O and database I/O.

@dmsnell Did you ever end up doing that? It's something we haven't explored yet, so I'd be curious to see if that's feasible.

dmsnell commented 3 months ago

Unfortunately @swissspidy I haven't. I keep starting but getting pulled into other work.

there's an example of adding filesystem journaling which could be followed for introducing delays:

https://github.com/WordPress/wordpress-playground/blob/71419eb36bad02884809b7c751a6457d715da9aa/packages/php-wasm/fs-journal/src/lib/fs-journal.ts#L151-L159

my intention was to start with something basic, like a gaussian random distribution based on a reasonable SSD and a reasonable server HDD. that is, a function intercepting I/O would add a given latency to the requests based on something like µ = 9ms for an HDD simulation. over time, if the results warrant it, that would be a great opportunity to expand the modeling of real disk I/O and caching effects.

there's surely another function or interface for network lag, though I would imagine I/O is more important.
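to make that concrete, a minimal sketch of a gaussian I/O delay wrapper (the µ/σ values and the wrapper shape are assumptions for illustration; Playground would hook something like this into its filesystem layer rather than use this helper directly):

```ts
// Sketch: add a gaussian-distributed delay to each I/O operation to simulate a
// slower disk, e.g. µ = 9 ms, σ = 2 ms for a spinning-disk server.

function gaussian(mean: number, stdDev: number): number {
  // Box-Muller transform; clamp to zero so we never "sleep" a negative duration.
  const u1 = 1 - Math.random();
  const u2 = Math.random();
  const z = Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
  return Math.max(0, mean + stdDev * z);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export function withDiskLatency<T extends unknown[], R>(
  io: (...args: T) => Promise<R>,
  mean = 9,
  stdDev = 2
): (...args: T) => Promise<R> {
  return async (...args: T) => {
    await sleep(gaussian(mean, stdDev));
    return io(...args);
  };
}

// Usage sketch: const slowReadFile = withDiskLatency(fs.promises.readFile);
```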