WordPress / gutenberg

The Block Editor project for WordPress and beyond. Plugin is available from the official repository.
https://wordpress.org/gutenberg/

Perf Testing: Incorrect numbers via a headless browser? #51376

Open WunderBart opened 1 year ago

WunderBart commented 1 year ago

While migrating performance specs to Playwright, I noticed that I'm getting vastly different numbers for the theme specs while running headfully vs. headlessly. From a local run:

| Metric | Headless | Headful |
| --- | --- | --- |
| timeToFirstByte | 176.7ms | 727.8ms |
| largestContentfulPaint | 251.6ms | 812.1ms |

I googled a bit but couldn't find a good read on testing performance headfully vs. headlessly, so I asked GPT. Here's the relevant bit:

> (…) Since these (headless) browsers don't spend time loading images or executing JavaScript to render UI elements, they're often faster than traditional browsers.
>
> (…) For measuring Time to First Byte (TTFB), headless testing should suffice, as it's about measuring the time between the client making an HTTP request and receiving the first byte of the response from the server. This measurement shouldn't be impacted by whether the browser is headless or not.
>
> However, if you're measuring other performance metrics that might be affected by the rendering of the page (like First Contentful Paint, Time to Interactive, etc.), then you might want to consider headful testing, as it simulates a more realistic user experience.

The part about TTFB doesn't seem to check out, though, because I'm getting much larger numbers for that metric as well. Anyhow, the part about simulating a more realistic user experience makes sense to me, so I would like to discuss if we should start measuring via a headed browser instance to get more realistic metrics.

I'm curious what your thoughts are on this, @youknowriad, @dmsnell, @oandregal, @kevin940726 🙌

(I'm unsure yet if running headfully in CI is even possible, but I'm currently trying to verify it.)
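
For what it's worth, the headed run itself is just a launch flag; the open question is the display in CI. A minimal sketch, assuming a standalone Playwright script rather than the project's actual test setup (on a Linux CI runner the command would additionally need a virtual display, e.g. wrapping it in xvfb-run):

```js
// Minimal sketch, not the project's setup: launch a headed (non-headless)
// Chromium with Playwright and collect metrics from a page.
const { chromium } = require( '@playwright/test' );

( async () => {
    const browser = await chromium.launch( { headless: false } );
    const page = await browser.newPage();
    await page.goto( 'http://localhost:8889/' ); // hypothetical local test site
    // ...measure TTFB/LCP here, as in the snippets below...
    await browser.close();
} )();
```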

kevin940726 commented 1 year ago

Seems like Chromium's headed mode has an overhead which skews the TTFB result when first loading the page. I tested it in both Puppeteer and Playwright and they gave the same result. However, try reloading after goto(), which seems to give a more consistent result for me.

await page.goto( '/' );
// Reload once: the first navigation in headed mode carries extra startup
// overhead that skews TTFB.
await page.reload();

const ttfb = await page.evaluate(
    () =>
        new Promise( ( resolve ) => {
            // Resolve with responseStart from the buffered navigation timing
            // entry; that value is effectively the TTFB.
            new window.PerformanceObserver( ( entryList ) => {
                const [ pageNav ] =
                    entryList.getEntriesByType( 'navigation' );
                resolve( pageNav.responseStart );
            } ).observe( {
                type: 'navigation',
                buffered: true,
            } );
        } )
);
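
A simpler variant, as a sketch (not necessarily what the perf suite should do): since the navigation entry is buffered, it can also be read directly once the page has loaded, without wiring up an observer.

```js
// Sketch: read the buffered navigation timing entry directly; responseStart
// (relative to the navigation start) is effectively the TTFB.
const ttfb = await page.evaluate( () => {
    const [ pageNav ] = performance.getEntriesByType( 'navigation' );
    return pageNav.responseStart;
} );
```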
dmsnell commented 1 year ago

> I'm curious what your thoughts are on this

I'm not sure what the question is. A difference between headless and normal modes is something I would expect, and I'd be really surprised if it weren't there.

What's the problem we're attempting to solve? In our perf tests I would generally suggest that consistency is of far more value than absolute numbers because the test environment is unlike normal user environments and is unlikely to ever be so. We want to be able to track the direction and relative magnitude of changes, but don't need the measurements themselves.

WunderBart commented 1 year ago

> What's the problem we're attempting to solve?

I'm trying to find out if paint-based metrics like FCP, LCP or TTI are correctly measured in a headless browser. How do we know that those metrics are meaningful in a headless env?

dmsnell commented 1 year ago

> How do we know that those metrics are meaningful in a headless env?

Aha, as an experiment then this makes sense, and I guess the answer is "no"? It's not strong evidence, but the fact that those are such drastically different numbers suggests to me (in corroboration with some intuition) that headless tests are not going to be a reliable predictor of real in-browser rendering performance.

WunderBart commented 1 year ago

> Seems like Chromium's headed mode has an overhead which skews the TTFB result when first loading the page. I tested it in both Puppeteer and Playwright and they gave the same result. However, try reloading after goto(), which seems to give a more consistent result for me.

I think we can skip the first sample to account for the initialization overhead. For the subsequent samples, IMO, it would be best to fire a fresh context/page for each iteration. Something like this:

https://github.com/WordPress/gutenberg/blob/4bf1c779c9c60dc57252beb0dc0c3fc19416a8ce/test/e2e/specs/performance/site-editor.spec.js#L95-L128
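
To illustrate the idea (just a sketch, not the linked spec; the URL and round count are placeholders):

```js
// Sketch only: take each sample from a freshly created browser context so
// samples don't share caches, service workers, or JS heap state.
const { chromium } = require( '@playwright/test' );

( async () => {
    const browser = await chromium.launch();
    const rounds = 17; // e.g. 16 samples plus one throwaway warm-up run
    const samples = [];

    for ( let i = 0; i < rounds; i++ ) {
        const context = await browser.newContext();
        const page = await context.newPage();
        await page.goto( 'http://localhost:8889/' ); // hypothetical test site
        const ttfb = await page.evaluate( () => {
            const [ pageNav ] = performance.getEntriesByType( 'navigation' );
            return pageNav.responseStart;
        } );
        // Skip the first sample to discount cold-start/initialization overhead.
        if ( i > 0 ) {
            samples.push( ttfb );
        }
        await context.close();
    }

    await browser.close();
    console.log( samples );
} )();
```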

kevin940726 commented 1 year ago

> that headless tests are not going to be a reliable predictor of real in-browser rendering performance.

I don't think the goal was ever to capture real-life metrics but to have a solid baseline that we can track over time for regressions. This is a Lab test after all.

WunderBart commented 1 year ago

> > that headless tests are not going to be a reliable predictor of real in-browser rendering performance.
>
> I don't think the goal was ever to capture real-life metrics but to have a solid baseline that we can track over time for regressions. This is a Lab test after all.

I think the question is if, using a headless browser, we will actually catch regressions on the paint-related metrics, like LCP. We could probably design a regression test to see if it gets reflected by the numbers. Or do we already have some historic data that shows the current metrics catching regressions?

dmsnell commented 1 year ago

@kevin940726 I was responding to the description

> Anyhow, the part about simulating a more realistic user experience makes sense to me, so I would like to discuss if we should start measuring via a headed browser instance to get more realistic metrics.

> I think the question is if, using a headless browser, we will actually catch regressions on the paint-related metrics, like LCP. We could probably design a regression test to see if it gets reflected by the numbers. Or do we already have some historic data that shows the current metrics catching regressions?

This topic is obviously incredibly complicated, and the more we look into it, the more meaningless our numbers become. Nothing about the testing environment is normal. Also, if we want to narrow down more tightly on the metrics, then we have to start making decisions that deviate from normal use cases. We should be taking the minimum value of the metrics, not the median, because nothing will ever be randomly fast that isn't inherently fast - random variation and process-competition only slow down the metrics. We should deterministically clear the garbage collector, avoid having multiple tabs open, shut down background processes on the testing server. And so these things end up giving us pure, reliable, clean numbers that don't correspond any more to the normative experience of someone using Gutenberg.

The fact that the testing server has a high-bandwidth low-latency connection is another major source of disconnect. The CI servers have top-quality infrastructure powering them, and active cooling infrastructure that helps CPUs avoid thermal throttling. Almost nobody using Gutenberg is experiencing the tests like CI does.

Our tests are also bizarre. We're typing hundreds of the same character with pauses in between. We're loading a post with 1000 blocks and instantly doing things with it.

My point is that it's a diversion to hunt for some truth in the performance metrics. Also, I think that very little has come through hard-to-measure changes. Most of the performance work that matters is instantly observable because it makes such a dramatic improvement. Everything else, including the main performance issues, is the kind of death by a thousand cuts that is almost impossible to measure, and impossible to measure improvement on.

Therefore I don't know if the relevant question is whether this catches regressions on the paint-related metrics. Will the existing tests? They haven't caught the most egregious ones that freeze up my writing experience. We probably don't have the right tests to catch that kind of regression, and so even today our performance numbers can improve while the editing experience gets slower.

You could try writing a new test and intentionally degrading the LCP in a branch. Compare the results as you did already in headless vs. normal mode. I'm not sure how much that would speak to the core question of whether this will catch regressions, but it could show if it's possible to catch some specific regressions.

oandregal commented 1 year ago

> I think the question is if, using a headless browser, we will actually catch regressions on the paint-related metrics, like LCP. We could probably design a regression test to see if it gets reflected by the numbers. Or do we already have some historic data that shows the current metrics catching regressions?

@WunderBart I can share my experiments on this topic. In https://github.com/WordPress/gutenberg/pull/48728 I disabled per-block stylesheets, meaning that we'd enqueue the stylesheets of all core blocks instead of only the stylesheets of the blocks in use on a given page. This means more work for the browser, so I expected the LCP-TTFB metric (the time spent painting) to increase with that PR.

I then measured using three different mechanisms: the CI job, curl, and a script created by the performance team. All of them confirmed the intuition, meaning they are catching regressions.
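
To make the metric concrete, LCP-TTFB can be sampled in-page roughly like this (a sketch of the idea only, not the exact script any of those mechanisms use; `page` is assumed to be a Playwright page already navigated to the front end):

```js
// Sketch: LCP minus responseStart approximates the time spent rendering and
// painting rather than waiting on the server.
const lcpMinusTtfb = await page.evaluate(
    () =>
        new Promise( ( resolve ) => {
            new PerformanceObserver( ( entryList ) => {
                const entries = entryList.getEntries();
                // The last buffered entry is the current LCP candidate.
                const lcp = entries[ entries.length - 1 ].startTime;
                const [ pageNav ] = performance.getEntriesByType( 'navigation' );
                resolve( lcp - pageNav.responseStart );
            } ).observe( {
                type: 'largest-contentful-paint',
                buffered: true,
            } );
        } )
);
```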

oandregal commented 1 year ago

Regarding the numbers being different: it's my experience that they will be. While improving front-end performance and comparing numbers with several folks from the core performance team, we ran into this same thing.

oandregal commented 1 year ago

As for the migration to Playwright: it's unfortunate that the numbers will be affected, though it's not a blocker from the point of view of the front-end performance tests. The major bumps I've seen in codevitals are related to the environment rather than a PR itself (changing the WordPress base, etc.). Riad and I talked at some point about being able to "annotate" the graph, so it's clear what the bumps are.

For reassurance, we could use the experiment I shared above https://github.com/WordPress/gutenberg/pull/48728 to make sure the paint metrics reported by Playwright are working as expected.

WunderBart commented 1 year ago

> The fact that the testing server has a high-bandwidth low-latency connection is another major source of disconnect. The CI servers have top-quality infrastructure powering them, and active cooling infrastructure that helps CPUs avoid thermal throttling. Almost nobody using Gutenberg is experiencing the tests like CI does.

The codevitals milliseconds are indeed unrealistic, but I don't think we should look at them as an actual performance indicator. Since they show the relative changes from commit to commit, their main objective should be catching regressions. That can only happen if the measurements are reliable, though. I think they're currently a bit too unstable due to many factors, but migrating to Playwright should be a good first step here.

Regarding real-world numbers, I don't think we'll ever be able to get them via a CI job running in a headless browser. The only way to get such numbers I can currently think of would be through some client-side hooks that would measure what we want to measure and send those measurements. I have no idea if that's even remotely feasible though. Any ideas?
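
To sketch what I mean by client-side hooks (purely hypothetical, not existing Gutenberg code): collect metrics from real visitors with Google's `web-vitals` library and beacon them somewhere for aggregation. `/collect` is a made-up endpoint.

```js
// Hypothetical real-user monitoring sketch.
import { onLCP, onTTFB } from 'web-vitals';

function sendToAnalytics( metric ) {
    // `metric` contains { name, value, id, ... } for the current page load;
    // `/collect` stands in for whatever backend would aggregate the data.
    navigator.sendBeacon( '/collect', JSON.stringify( metric ) );
}

onLCP( sendToAnalytics );
onTTFB( sendToAnalytics );
```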

> We should be taking the minimum value of the metrics, not the median, because nothing will ever be randomly fast that isn't inherently fast - random variation and process-competition only slow down the metrics.

Interesting! I compared the standard deviation for the perf metrics from the last 10 (published) jobs, and the minimum value seems to be the most stable for most metrics. Not FCP though, where it doubled the deviation compared to the average. It might be an issue with how we currently sample FCP, though, so we'll need to check again after migrating to Playwright. /cc @youknowriad
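
For reference, that kind of comparison can be done along these lines (a sketch with made-up names and data; `jobs` would hold the raw per-round samples of one metric from each published job):

```js
// Sketch with made-up data: one entry per published job, each holding the
// raw per-round samples of a single metric (e.g. FCP in ms).
const jobs = [
    { samples: [ 251, 260, 247, 255, 249 ] },
    { samples: [ 253, 258, 250, 262, 248 ] },
    // ...more jobs...
];

const mean = ( xs ) => xs.reduce( ( a, b ) => a + b, 0 ) / xs.length;
const median = ( xs ) => {
    const s = [ ...xs ].sort( ( a, b ) => a - b );
    const mid = Math.floor( s.length / 2 );
    return s.length % 2 ? s[ mid ] : ( s[ mid - 1 ] + s[ mid ] ) / 2;
};
const stdDev = ( xs ) => {
    const m = mean( xs );
    return Math.sqrt( mean( xs.map( ( x ) => ( x - m ) ** 2 ) ) );
};

// Aggregate each job with a given statistic, then see how much that aggregate
// jitters across jobs: the smaller the spread, the more stable the aggregate
// is for regression tracking.
const spreadOf = ( aggregate ) =>
    stdDev( jobs.map( ( job ) => aggregate( job.samples ) ) );

console.log( {
    min: spreadOf( ( xs ) => Math.min( ...xs ) ),
    median: spreadOf( median ),
    mean: spreadOf( mean ),
} );
```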

> The major bumps I've seen in codevitals are related to the environment rather than a PR itself (changing the WordPress base, etc.). Riad and I talked at some point about being able to "annotate" the graph, so it's clear what the bumps are.

The annotations sound good. Adding the WP version used and commit info would be much better than just the hash.

> (…) I then measured using three different mechanisms: the CI job, curl, and a script created by the performance team. All of them confirmed the intuition, meaning they are catching regressions.

That's great to hear! I think the best we can do now is to ensure our metrics are reliable. We can also design more experiments like https://github.com/WordPress/gutenberg/pull/48728 to ensure we can catch specific regressions.

youknowriad commented 1 year ago

> migrating to Playwright should be a good first step here.

This seems too confident to me :P Any data that proves that playwright numbers are more stable than puppeteer ones?

> Regarding real-world numbers, I don't think we'll ever be able to get them via a CI job running in a headless browser. The only way to get such numbers I can currently think of would be through some client-side hooks that would measure what we want to measure and send those measurements. I have no idea if that's even remotely feasible though. Any ideas?

We don't really care about having real-world numbers. We want numbers that we improve over time. "Real-world numbers" is a meaningless concept because it depends on the environment, and there are so many real-world environments.

> The annotations sound good. Adding the WP version used and commit info would be much better than just the hash.

It would be cool to update codevitals to support some kind of "comment" per log and we could include the wp version in that comment.

> They are catching regressions

💯 I've used the graphs so many times to catch regressions. The most recent example is the one fixed by this commit: https://github.com/WordPress/gutenberg/pull/51675. Without the graphs, we wouldn't have caught this in time and we wouldn't even have known where to start looking to fix it. So yeah, while they may seem unstable, I think they are stable enough to be very useful to us. Of course, the more stable they become, the better it is.

dmsnell commented 1 year ago

> even remotely feasible though. Any ideas?

no, and my point was just that. we don't need to be shooting for realistic numbers because that's just not possible. it's liberating. as @youknowriad mentioned, all that really matters is long-term consistency.

> they're currently a bit too unstable due to many factors

I've found good reliability since @sgomes fixed the open-a-new-tab issues in the tests. when I run the tests with --rounds 60 or --rounds 180 it's easy to discriminate changes from the noise.

in separate work it is worth storing raw data so we can see the distributions of the measurements. those are much more illuminating than any mean or minimum. but it doesn't surprise me that the variance of the minimum is smaller. that's a capped lower bound.

I hope the Playwright changes help, but they aren't going to drastically change the realities of the perf tests. currently the perf tests rarely fail anyway, so that's something they have that the e2e tests don't have.

WunderBart commented 1 year ago

> > even remotely feasible though. Any ideas?
>
> no, and my point was just that. we don't need to be shooting for realistic numbers because that's just not possible. it's liberating. as @youknowriad mentioned, all that really matters is long-term consistency.

Alright, I was under the wrong impression then - I thought the metrics were a) not stable enough and b) not realistic enough. Now I'm learning that the current numbers are stable enough to catch regressions and that we're not really looking for realistic numbers. It's good news, of course – only less room for improvement than I thought.

> > migrating to Playwright should be a good first step here.
>
> This seems too confident to me :P Any data that proves that playwright numbers are more stable than puppeteer ones?

By good first step, I meant using a framework that makes it easier to write stable tests and provides better debugging tools. I don't know if the numbers will improve, but at least (from my own experience) Playwright will make it much easier to analyze the current tests and work on improving them. Looking at the numbers from the migrated tests, I can only say that they don't seem to be less stable. Having said that, I can do --rounds=180 to be extra sure, which I planned to do after making a few improvements there, like ensuring each sample is taken from an isolated page. edit: Running the comparison here

WunderBart commented 1 year ago

> Any data that proves that playwright numbers are more stable than puppeteer ones?

Here's a comparison of the numbers I just did: https://github.com/WordPress/gutenberg/pull/51084#issuecomment-1607658262