GoogleChrome / lighthouse

Automated auditing, performance metrics, and best practices for the web.
https://developer.chrome.com/docs/lighthouse/overview/

DZL Tracking Issue #6775

Closed: patrickhulce closed this issue 4 years ago

patrickhulce commented 5 years ago

See #6152 for historical discussion and DZL's purpose. Comment or edit this issue to add feature requests.

Features

Bugs?

patrickhulce commented 5 years ago

Alright folks, if you're like me, you've found DZL results not very useful. They're interesting, but not really actionable. Every time it fails something, I assume it wasn't really the PR, and when it passes it doesn't really give me confidence that the PR is safe. Not a resounding success :/

Instead of adding features, I've spent my DZL time over the past (slow holiday) month or so playing around with different URLs and intentionally regressive LH versions to suss out what's going on. Here's a general dump of what I've learned and what I think it means for LH.

tl;dr - we need to isolate and surface every aspect of variance and their sources

We must separate "good", "bad", and "unnecessary" variance

There's "good", "bad", and "unnecessary" variance. "Good variance" is the variance due to changes in the user experience that were within the developer's control, i.e. resources changed and payloads got heavier, scripts got less efficient and took more time executing, etc. "Bad variance" is the variance due to changes in the user experience that were mostly outside the developer's control, i.e. connection speed changes, random spikes in server traffic that slow responses, etc. "Unnecessary variance" is a change in our metrics that does not track any change in the user experience. I'm not currently aware of any such variance, but it's worth calling out and eliminating if we find it :)

Metric graphs without page vitals are meaningless

Every time I see a graph of an FCP that's different, I want to know why it's different. Were there different resources loaded? Were the same resources just larger? Did our network analysis estimate different RTTs/server latencies for the origins? Did the CPU tasks take longer? Why? Without this answer, I have no reason to believe the implementation change caused the difference or even had anything to do with it.

To me, a page vital is anything that Lantern uses in its estimates.

  1. Number of requests
  2. Total byte weight
  3. Total CPU tasks
  4. Total CPU time
  5. Estimated Connection RTT
  6. Estimated RTT by origin
  7. Estimated server latency by origin

Any variation in performance metrics basically comes down to one of these things varying, so measuring each of these is going to be critical. We can then define the success of our implementation as multiples of the variance of these underlying sources.
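
For concreteness, here's roughly what pulling those page vitals out of an LHR could look like. Sketch only: the audit IDs and the `numericValue`/`details.items` shapes are assumptions based on today's audits, not anything DZL actually does.

```js
// Sketch: extract the "page vitals" listed above from a Lighthouse result (LHR).
// Assumed audit IDs: network-requests, total-byte-weight, mainthread-work-breakdown,
// network-rtt, network-server-latency. Adjust to whatever the LH version under test exposes.
function extractPageVitals(lhr) {
  const audit = id => lhr.audits[id] || {};
  const items = id => (audit(id).details && audit(id).details.items) || [];

  return {
    requestCount: items('network-requests').length,
    totalByteWeight: audit('total-byte-weight').numericValue,
    // Rough proxy: these are breakdown rows, not individual tasks; a true task count
    // would come from the trace's main-thread tasks.
    cpuTaskGroups: items('mainthread-work-breakdown').length,
    totalCpuTime: audit('mainthread-work-breakdown').numericValue,
    rttByOrigin: items('network-rtt').map(item => [item.origin, item.rtt]),
    serverLatencyByOrigin: items('network-server-latency')
      .map(item => [item.origin, item.serverResponseTime]),
  };
}
```

If we surface these next to every metric graph, the "why is FCP different" question at least has somewhere to start.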

We are tracking too many metrics for p-values to be meaningful

This one we kinda knew going in, but I didn't realize how strong the effect would actually be. In the standard set of DZL runs, we're looking at 10 URLs with 243 timings and 119 audits. That's 10 * (243 + 119) = 3620 data points that could differ. A low p-value means either A) the null hypothesis is false, or B) something unusual happened. When we roll the dice 3000+ times, lots of "unusual" things happen at least a few times. The only solutions to this problem are to drastically increase the number of observations and drastically lower our p-value threshold (which isn't very feasible for this number of data points on every PR), or to decrease the set of metrics we're observing for changes.
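
To put rough numbers on that (nothing DZL-specific, just the arithmetic of multiple comparisons):

```js
// Back-of-the-envelope illustration of the multiple-comparisons problem described above.
const urls = 10;
const metricsPerUrl = 243 + 119;          // timings + audits
const comparisons = urls * metricsPerUrl; // 3620

const alpha = 0.05;
// Chance of at least one false positive if every comparison is an independent test at p < alpha:
const familyWiseErrorRate = 1 - Math.pow(1 - alpha, comparisons);
console.log(familyWiseErrorRate.toFixed(6)); // ~1.000000, i.e. "unusual" results are guaranteed

// Bonferroni-style correction: per-comparison threshold needed to keep the overall rate at 5%:
console.log((alpha / comparisons).toExponential(2)); // ~1.38e-5
```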

Decreasing the set of metrics we're observing for changes actually makes a lot of sense. If we think a few particular metrics are likely to change with a given PR, we can compare those few, and the mechanics of p-value testing start to hold once again. We can still monitor other metrics for curiosity and exploration, but we shouldn't read too much into any one of them changing.

We need few metrics, identical environments, and stable URLs to PR-gate LH

All of the above has strong implications for gating PRs. We'll need to select a narrow set of criteria that we're concerned about and ensure that all page vitals are similar before failing a PR. This generally means identical environments and stable URLs whose code we control, or that at least changes infrequently with little non-determinism.
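
As a sketch of what such a gate might look like (hypothetical helper names, shapes, and tolerances, not DZL's actual API): check that the page vitals match first, and only then judge the narrow set of metrics the PR was expected to move.

```js
// Hypothetical PR gate: refuse to judge metrics if the page vitals drifted between runs.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  return sorted[Math.floor(sorted.length / 2)];
}

// `baselineRuns`/`prRuns`: arrays of {vitals: {...}, metrics: {...}}, one entry per Lighthouse run.
function gatePr(baselineRuns, prRuns, watchedMetrics, tolerance = 0.05) {
  // 1. If any page vital drifted, the environment changed and the comparison is meaningless.
  for (const vital of ['requestCount', 'totalByteWeight', 'totalCpuTime']) {
    const base = median(baselineRuns.map(r => r.vitals[vital]));
    const pr = median(prRuns.map(r => r.vitals[vital]));
    if (Math.abs(pr - base) / base > tolerance) {
      return {verdict: 'inconclusive', reason: `page vital '${vital}' drifted; rerun before judging metrics`};
    }
  }

  // 2. Only now compare the few metrics this PR is expected to affect.
  const regressions = watchedMetrics.filter(metric => {
    const base = median(baselineRuns.map(r => r.metrics[metric]));
    const pr = median(prRuns.map(r => r.metrics[metric]));
    return pr > base * (1 + tolerance);
  });

  return regressions.length
    ? {verdict: 'fail', reason: `regressed: ${regressions.join(', ')}`}
    : {verdict: 'pass'};
}
```

The exact tolerances matter less than the ordering: if the vitals drifted, the metric comparison doesn't tell us anything about the PR.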

Analyzing/tracking real-world variance and getting an effective LH regression-testing system are two very different problems

When we take a look at all the variance we need to eliminate before concluding that a difference was the fault of a code change, we're eliminating lots of things that will be encountered in the real world. These are two very different use cases, and I think we need to accept the fact that it will take different strategies and potentially even different tools to solve both of these problems.

I think this is all I've got for now, but I've updated this issue with some of the action items I've mentioned here and we can discuss more at the inaugural variance meeting :)

patrickhulce commented 5 years ago

In our last variance meeting we discussed the future of DZL and things we want it to do.

@brendankenny said...

It needs to be as easily accessible as the report deployments we have now

Thoughts: Will have to explore more what easily accessible DZL results should look like :) We currently have the results link commented on any PR that has DZL enabled which isn't the most elegant, but this may have been referring more to the consumability of what lies at the end of the link.

@paulirish said...

I want to know how async stack gathering changes will affect our results

Thoughts: IMO, this is actually the only thing that DZL does well at the moment :) The issue for PRs is that we test on such a small basket of sites that my confidence level is not very high. The obvious solution to me here is to radically increase our basket of sites. The speed with which DZL returns results doesn't seem to have been the main problem so far, and I'd obviously rather have the results be useful but slow than fast but unhelpful.

I want to know how Chromium changes will affect our results

Thoughts: This and the previous request from Paul are what I see as the most promising for DZL's future. DZL was super helpful for reproing and finding URLs for the m74 series of issues. It could easily be modified to continuously test Chromium versions against our master and alert on error rate and performance changes, along the lines of the sketch below.
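
Roughly what I have in mind, using the public lighthouse/chrome-launcher node APIs. The basket of URLs, the thresholds, and `chromiumPath` are placeholders, and the error/metric bookkeeping is just a sketch:

```js
// Sketch: run a basket of URLs against a specific Chromium build and summarize error rate + FCP.
const lighthouse = require('lighthouse');
const chromeLauncher = require('chrome-launcher');

async function checkChromiumBuild(chromiumPath, urls) {
  let errors = 0;
  const fcps = [];

  for (const url of urls) {
    const chrome = await chromeLauncher.launch({chromePath: chromiumPath, chromeFlags: ['--headless']});
    try {
      const {lhr} = await lighthouse(url, {port: chrome.port, onlyCategories: ['performance']});
      if (lhr.runtimeError) errors++;
      else fcps.push(lhr.audits['first-contentful-paint'].numericValue);
    } catch (err) {
      errors++;
    } finally {
      await chrome.kill();
    }
  }

  fcps.sort((a, b) => a - b);
  // Comparing these against the same basket run on stable Chrome / LH master is where the alerting happens.
  return {errorRate: errors / urls.length, medianFcp: fcps[Math.floor(fcps.length / 2)]};
}
```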

I want to track performance regressions in our HTML report

Thoughts: IMO, this is a very similar goal to Lighthouse CI, so if this is one of our focuses I'd want to invest my time in CI, where it'll benefit all Lighthouse users rather than something specific to our internal infra.

brendankenny commented 5 years ago

It needs to be as easily accessible as the report deployments we have now

Thoughts: Will have to explore more what easily accessible DZL results should look like :) We currently have the results link commented on any PR that has DZL enabled which isn't the most elegant, but this may have been referring more to the consumability of what lies at the end of the link.

I mostly meant auto-running on every PR (if that's feasible) and having a big old link when it's done :)

Not sure if we'd also want some kind of status posted, or if clicking through is sufficient. I feel like none of us ever clicks through on our statuses unless something is broken or we're expecting something interesting, like the deploy links for PRs changing the report.

IMO, this is a very similar goal to Lighthouse CI, so if this is one of our focuses I'd want to invest my time in CI, where it'll benefit all Lighthouse users rather than something specific to our internal infra.

👍 👍

patrickhulce commented 4 years ago

I'm not sure why this is still open :)

We have halted most effort here for over a year. We have some scripts, like the GCP lantern collection, that could help revive one-off comparisons as we need them, but as a general CI practice I think this is abandoned.