WebKit / Speedometer

An open source repository for the Speedometer benchmark
Other
592 stars 68 forks source link

With privacy.resistFingerprinting set to true in Firefox, Speedometer shows the result Infinity #399

Closed julienw closed 1 month ago

julienw commented 5 months ago

This was reported by some people on social networks. We found that setting privacy.resistFingerprinting to true was triggering this. But this might happen in some other cases too.

When doing it, some subtests take 0ms on every run, this may mess up the results. Any chance we can fix this?

bgrins commented 5 months ago

The simplest fix here IMO would be to detect an infinity score and instead show an error message

bgrins commented 5 months ago

I could imagine a warmup step where it looks for anomalies and shows a warning, like MotionMark does with refresh rate, but this case is likely rare in practice so a guardrail on the score display may be sufficient.

julienw commented 5 months ago

But I don't understand why one single substep being at 0 would yield an Infinity result. I'm not great at statistics, but I smell something fishy here.

julienw commented 5 months ago

(update: I got it wrong, I rewrote the comment)

But I don't understand why one single substep being at 0 would yield an Infinity result.

The problem isn't a single substep, but all substeps for 1 test being 0. I've seen it happening for TodoMVC-Preact-Complex-DOM for example, but svelte-complex and lit-complex could happen too.

The geomean is a product, so a suite at 0 poisons the geomean for this run. Then I think that we geomean the geomeans to get the result for all iterations. Then a score is 1000 / geomean, so we get Infinity.

The wikipedia page for geomean says:

The geometric mean can be an unreliable measure of central tendency for a dataset where one or more values are extremely close to zero in comparison to the other members of the dataset.

Therefore I wonder if the use of geomean is appropriate. Geomeaning the scores between runs, or the tests between runs, makes sense. But geomeaning the various tests together, I'm not so sure. I guess this ship has sailed for v3 though.

A quick fix might be to replace 0 by 1 when adding values in https://github.com/WebKit/Speedometer/blob/bd9ab77462618ce857c1e0aaf890936de18a33ad/resources/benchmark-runner.mjs#L530-L531 (possibly also in https://github.com/WebKit/Speedometer/blob/bd9ab77462618ce857c1e0aaf890936de18a33ad/resources/metric.mjs#L65-L69 but after a quick look I don't think the geomean calculated here is actually used).

What do you think @camillobruni ?

julienw commented 5 months ago

Note: I also noticed errors with NaN values in the SVG rendering of the Details view, I believe this comes from the same issue.

julienw commented 5 months ago

I find our score computation very difficult to follow, where we do the same thing in different ways in different files: benchmark-runner.mjs vs metric.mjs vs Statistics.mjs, with no clear comment about the formula we use to calculate the score. It's very unclear to understand where the geomean that we use as the basis for the score is computed from.

camillobruni commented 5 months ago

You are absolutely right :) I kind of kept some backwards compatibility with v2.1 but ultimately we should all migrate to metrics.mjs (the Statistics.mjs is from the original v.21 version as to no alter any existing scoring). So

Geomean over all the different tests makes sense to not favour the longer running ones disproportionally (we could spend some effort to make each test run the same length in the future?). For everything else, the arithmeticMean makes sense IMO.

Given that this is something that should never happen, we could also just fail and not display anything if the sum of all subtests equal to 0.

julienw commented 5 months ago

Yeah indeed, 0 is an absurd number in this case. Is it practical to ignore the test instead of failing generally ?

rniwa commented 5 months ago

The point of using geometric mean across suites is to give each suite the same weight. If we used arithmetic mean, we would give more weights to the long running / slow suites.