GoogleChrome / lighthouse

Automated auditing, performance metrics, and best practices for the web.
https://developer.chrome.com/docs/lighthouse/overview/
Apache License 2.0

One common interface for deriving score from raw values & deriving the weighted average. #11881

Open koraa opened 3 years ago

koraa commented 3 years ago

Hi there!

It would be really useful if Audit featured a static method scoreFromRaw(raw: Number) => Number that could be used to map a raw value into the score range (and maybe even an inverse rawFromScore(score)).
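For illustration, something like this is the shape being asked for (a rough sketch only; scoreFromRaw/rawFromScore are hypothetical names, not existing Lighthouse methods, and the control points are made up):

// Hypothetical static methods on an audit class; not part of Lighthouse today.
class SomeNumericAudit extends Audit {
  // Map a raw measurement (ms, bytes, ...) into the 0..1 score range.
  static scoreFromRaw(raw) {
    // Made-up control points; a real audit would use its own p10/median options.
    return Audit.computeLogNormalScore({p10: 1000, median: 4000}, raw);
  }

  // Optional inverse: which raw value would produce a given score?
  static rawFromScore(score) {
    // Would invert the same curve; omitted here.
  }
}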

This is currently mostly available with the p10 and median config options, but some classes like UnusedBytes/ByteEfficiencyMethod use custom conversion methods.

This feature would be really useful for anyone doing custom processing of measurements. E.g. I am trying to experiment with new methods of deriving average scores.

(In my case: generating an estimate of each audit's raw values separately, mapping the estimates into score space, and then taking the weighted average, instead of calculating the weighted average multiple times and then generating an estimate from that; as previously discussed in https://github.com/GoogleChrome/lighthouse/issues/11570.)

In the same vein, it would be really useful if lighthouse-core/scoring.js was exposed and featured a function that could derive the weighted average from an auditName => auditScore mapping, instead of requiring the entire auditName => auditObject mapping, where auditObject at least needs to include scoreDisplayMode. I understand that scoreDisplayMode is used here to convey error information, but that error information might be better encoded as an undefined or null score value, and the scoring modes MANUAL & INFORMATIVE should be compile-time constants, so their weight should be zero anyway (or am I missing something here?).

Thank you!!

connorjclark commented 3 years ago

This may be immediately useful to you: https://github.com/paulirish/lh-scorecalc/blob/4c7724ad6587595bda807e1c2278a07c9f55e7ec/script/math.js#L78

koraa commented 3 years ago

Thank you!

EDIT: To expand on this a little bit: my biggest issue right now isn't really the math portion, it's the special cases I have to consider. My code looks something like this right now:

const lhScoreFromRaw = curry('lhScoreFromRaw', (raw, audit) => {
  // Helpers like curry, is_a, isReal and typename are from my own utility code;
  // lhAuditCls/lhScoreParams look up the audit class and its p10/median options.
  const a = is_a(audit, String) ? lhAuditCls(audit) : audit;

  // (1) Binary audits: the raw value already is the score.
  if (a.meta.scoreDisplayMode === 'binary')
    return raw; // Already either 1 or 0

  // (2) Byte efficiency audits: custom wasted-ms conversion.
  // (a is the audit class itself, so check the subclass relationship.)
  if (a.prototype instanceof ByteEfficiencyAudit)
    return ByteEfficiencyAudit.scoreForWastedMs(raw);

  // (3) Numeric audits: log-normal curve from the audit's p10/median options.
  const lnp = lhScoreParams(a);
  if (isReal(lnp.p10) && isReal(lnp.median))
    return a.computeLogNormalScore(lnp, raw);

  assert(false, `Don't know how to score ${typename(a)}`);
});

It considers: (1) binary scoring (which doesn't have raw values), (2) numeric scoring from a log-normal curve (using the private computeLogNormalScore interface), (3) byte efficiency audits, and (4) just now I discovered that I neglected to consider #11882, so I'll have to add another special case for that…

koraa commented 3 years ago

Just found another special case: the redirects audit, which is 1 for zero and one redirects and uses the method outlined in #11883 in other cases. The raw value is set to the wasted ms in either case.
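Paraphrased, that boils down to something like this (my reading of it, not the actual redirects.js source):

// Paraphrase of the behaviour described above, not the actual redirects.js source.
const redirectCount = documentRequests.length - 1;
const score = redirectCount <= 1
  ? 1                                                     // zero or one redirect always passes
  : ByteEfficiencyAudit.scoreForWastedMs(totalWastedMs);  // otherwise score the wasted ms
// numericValue is set to the wasted ms in both branches.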

This seems a bit odd, since in the zero-or-one-redirects case the raw value and the score are entirely uncorrelated. This is also the first scoring function I've seen that depends on more than one variable.

Isn't the latency tested somewhere else too? One of the findings from harmonicobserver is that measurements tend to be correlated (graph: the experiment is artificial because it shows the graph for an empty page; the effect is still present, but less impactful, for less extreme sites). This increases variance, because the weighted average of multiple correlated functions still has a pretty high variance.

If latency is measured somewhere else too, it might be better (just for our mental models) to define the score based solely on the number of redirects (something like p90=.5, median=1.5).

Are you interested in more suggestions like these? Otherwise I'll leave it at making-lh-continuous suggestions :)

brendankenny commented 3 years ago

These would be great improvements. They would allow use cases like yours, as well as making counterfactual tools like the score calculator ("what happens to my scores if I improve this audit by X?") easier to implement without duplicating large chunks of internal code that's subject to change.

I'm sure it's also not the prettiest process to duplicate :) Scoring is a mix of some of the oldest code still around (chunks of aggregate.js live on!), code written while Lighthouse was still in flux but we needed to stabilize output for the PSI API (scoreDisplayMode, etc), and some of the most recent changes (mobile and desktop curves, the switch to p10 instead of podr, etc).

The good news is that this is almost all internal code, so we have a ton of leeway to change things. We also have pretty good input/output test coverage that's relatively agnostic to the implementation, so we can be bold without worrying too much about breaking anything.

But there are some other constraints that we can't really live without that make this a difficult externalized (and especially static) API design space. Here are the ones I can think of; others can chime in with any more, or if they think any in my list aren't actually a problem :)

Metric scoring

Metrics are probably the most straightforward case for the current functionality. We could pretty easily expose a scoreFromNumericValue(numericValue: number) => number as suggested.

The main other thing needed would be the formFactor from the run's settings, since 'mobile' or 'desktop' determines which score-curve control points to use. This is available in the final LHR at lhr.configSettings.formFactor.

The open question is: do we want to commit to always scoring metrics like this? formFactor-based scoring is a relatively new addition, and in theory there could be other audit options we want to add in the future, or the score could depend on more computed state than just the raw metric value. Both of those would require state that isn't in the final LHR, but comes from the initial config and the run's artifacts, respectively.
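A rough sketch of what that could look like (scoreFromNumericValue is hypothetical, and both the control-point numbers and the option shape here are only illustrative):

// Hypothetical API sketch; not something Lighthouse exposes today.
class SomeMetricAudit extends Audit {
  // Per-device-type curve control points (numbers illustrative).
  static get defaultOptions() {
    return {
      mobile: {p10: 2500, median: 4000},
      desktop: {p10: 1200, median: 2400},
    };
  }

  // formFactor would come from the run's settings (lhr.configSettings.formFactor).
  static scoreFromNumericValue(numericValue, formFactor) {
    const controlPoints = this.defaultOptions[formFactor];
    return Audit.computeLogNormalScore(controlPoints, numericValue);
  }
}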

Audit scoring

This is a lot harder, as even for classes of audits that have a comparable numericValue -> score relationship there are exceptions, e.g. redirects, one of the byte efficiency audits:

https://github.com/GoogleChrome/lighthouse/blob/0d25b6b34baea44198b588297ccd92216d03e69a/lighthouse-core/audits/redirects.js#L149

For many of the less straightforward audits there are multiple levels of thresholds and heuristics to prune out things that aren't worth bringing to the user's attention, and (stepping out of the performance section) there are many audits with no meaningful numericValue at all (that's why we moved from the required rawValue to the optional numericValue on audit results; see e.g. https://github.com/GoogleChrome/lighthouse/issues/6199#issuecomment-494943887).

If we limit ourselves to just perf audits, the other question is: does the score matter? All the non-metric audits in perf are scored, but the score isn't visible in the report, and they all have weight 0, so they don't contribute to the overall category score. Maybe it makes more sense to be looking at e.g. overallSavingsMs, which is what's shown in the report for those audits?

Report scoring

> In the same vein, it would be really useful if lighthouse-core/scoring.js was exposed and featured a function that could derive the weighted average from an auditName => auditScore mapping, instead of requiring the entire auditName => auditObject mapping, where auditObject at least needs to include scoreDisplayMode. I understand that scoreDisplayMode is used here to convey error information, but that error information might be better encoded as an undefined or null score value, and the scoring modes MANUAL & INFORMATIVE should be compile-time constants, so their weight should be zero anyway (or am I missing something here?).

This is an area where old code could maybe be cleared up in the split between scoring.js and the methods on audit.js to make it easier to follow.

If the score has come out of an audit, it will be null if the audit was MANUAL or INFORMATIVE, but it's also null if the audit marked itself as NOT_APPLICABLE at runtime or threw an error (so its display mode was ERROR):

https://github.com/GoogleChrome/lighthouse/blob/0d25b6b34baea44198b588297ccd92216d03e69a/lighthouse-core/audits/audit.js#L225-L229

All those except ERROR then get weight 0 over in scoring.js:

https://github.com/GoogleChrome/lighthouse/blob/0d25b6b34baea44198b588297ccd92216d03e69a/lighthouse-core/scoring.js#L65-L69

to make sure the category isn't scored if one of the weighted audits threw an error. So a null score isn't sufficient to determine if the weight should be 0 or not, and since NOT_APPLICABLE is usually determined at runtime, null + known MANUAL/INFORMATIVE isn't enough either.

For the general case, we also can't assume the audit weights will be whatever's in the default-config, so we need the weights too.

Maybe I'm missing the reasoning for wanting to avoid passing in a full audit-result object, but all this information is in the Lighthouse JSON output (scores, scoreDisplayMode, audit weights). Could we instead make it easier to synthesize input to scoring.js from an existing LHR with changed audit scores in it? The questions that arise from taking a number of Lighthouse results and summarizing them as one would have to be dealt with regardless of this method (e.g. discard runs with errors in them? discard runs where audits became N/A?), and after that, most of the audit object should be straightforward to reconstruct.
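For example, something along these lines (a sketch of consuming an LHR directly; recomputeCategoryScore is not an existing scoring.js function):

// Sketch: recompute a category score from an existing LHR whose audit scores
// may have been edited. Mirrors the weighting rules described above.
function recomputeCategoryScore(lhr, categoryId) {
  let weightSum = 0;
  let weightedScoreSum = 0;
  for (const ref of lhr.categories[categoryId].auditRefs) {
    const audit = lhr.audits[ref.id];
    const mode = audit.scoreDisplayMode;
    // manual/informative/notApplicable audits contribute nothing...
    const weight = ['manual', 'informative', 'notApplicable'].includes(mode) ? 0 : ref.weight;
    if (weight === 0) continue;
    // ...but an error on a weighted audit voids the whole category score.
    if (mode === 'error' || audit.score == null) return null;
    weightSum += weight;
    weightedScoreSum += audit.score * weight;
  }
  return weightSum > 0 ? weightedScoreSum / weightSum : 0;
}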

koraa commented 3 years ago

I just pushed my reimplementation of the scoring for reference; the list of special cases should be easy to spot. The only thing left out is the redirects audit: https://github.com/koraa/helix-harmonicabsorber/blob/6562a2757fd72b79fea07cf55895bb35a6fde1e4/src/report.js#L76

Thanks for the exhaustive reply :) I'll reread it after my vacation!

connorjclark commented 3 years ago

FWIW, all of the "calculate audit score" logic is rather straightforward: take a number, return a number. Only csp-xss is different.

patrickhulce commented 3 years ago

The other note I'll throw out there: the score will also depend on the device type (for metrics, today) and possibly on fraggle rock modes in the future.

connorjclark commented 3 years ago

Many audits use data from the context to define the scoring curve, so passing a value plus the context should be enough.
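i.e. roughly this pattern (a sketch; the exact option shape varies per audit):

// Sketch of the pattern: an audit pulling its scoring curve out of the context.
class SomeAudit extends Audit {
  static audit(artifacts, context) {
    const {formFactor} = context.settings;
    const controlPoints = context.options[formFactor]; // per-device curve, e.g. {p10, median}
    const numericValue = 0; // would be computed from artifacts
    return {
      numericValue,
      score: Audit.computeLogNormalScore(controlPoints, numericValue),
    };
  }
}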