GoogleChrome / lighthouse

Automated auditing, performance metrics, and best practices for the web.
https://developer.chrome.com/docs/lighthouse/overview/
Apache License 2.0

Operation Point Reyes (and WPT throttling history) #9887

Closed: paulirish closed this issue 3 years ago

paulirish commented 5 years ago

Harkening to the days of Operation Yaquina Bay, we've got a new challenge in front of us...

Point Reyes is the windiest place on the Pacific Coast. And much like wind makes the physical world oscillate, variance makes our numbers vibrate.

image

We have a few questions we need answered to get our Lantern-driven simulation in tip-top shape.

Questions

Actions


Team, please update this with anything it's missing.

connorjclark commented 5 years ago

We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol. We'd only do this for mobile runs. This will allow us to keep lantern the same and still get accurate results for mobile. But first, let's confirm that OOPIFs are a source of error in (...traces with oopif disabled task above).
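
For reference, site isolation (and with it OOPIFs) can already be turned off at launch time with Chrome flags, even if the protocol-level toggle doesn't pan out. A minimal sketch using the chrome-launcher and lighthouse Node APIs; the exact flag combination is an assumption about what fully disables OOPIFs:

```js
// Sketch: run Lighthouse against a Chrome launched with site isolation
// disabled, so cross-origin iframes stay in-process (no OOPIFs).
const chromeLauncher = require('chrome-launcher');
const lighthouse = require('lighthouse');

async function runWithoutOopifs(url) {
  const chrome = await chromeLauncher.launch({
    chromeFlags: [
      '--headless',
      // Assumed to be sufficient to disable OOPIFs for the run.
      '--disable-site-isolation-trials',
      '--disable-features=IsolateOrigins,site-per-process',
    ],
  });
  try {
    const {lhr} = await lighthouse(url, {port: chrome.port});
    return lhr;
  } finally {
    await chrome.kill();
  }
}

runWithoutOopifs('https://example.com')
  .then(lhr => console.log(lhr.audits['interactive'].numericValue));
```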

connorjclark commented 5 years ago

For comparison, this is the accuracy of the dataset the lantern test has today:

image

With newly corrected traces:

With OOPIFs on local desktop:

image

Without OOPIFs on local desktop:

image

Data sets: https://drive.google.com/file/d/1-3IjtVsllDgcSsY0S49S1ZeTd1p13-IH/view?usp=sharing

patrickhulce commented 5 years ago

So definitely better for TTI, but not nearly enough :/ I'm gonna start looking into the FCP traces here.

brendankenny commented 5 years ago

this seems implicit in the above, but just to be explicit, is the plan to

vs simply tune numbers based on the times alone, presumably because the dependency graph would be so completely different as to be useless for predicting the mobile speed? e.g. lots of long tasks from iframes.

That does mean we're going to have to stay on our toes for more site isolation changes. And for

We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol

I wonder if there's a way on desktop we could get the password-triggered mobile behavior that android is getting if the G4 gets it or gets it someday soon.

Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.

patrickhulce commented 5 years ago

Oh wait, @connorjclark these are the same WPT traces from earlier, right?

So all the FCP issues discussed in https://github.com/GoogleChrome/lighthouse/pull/9662#issuecomment-542873598 and https://github.com/GoogleChrome/lighthouse/pull/9662#issuecomment-542875869 apply here.

connorjclark commented 5 years ago

Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.

The G4 and N5 are under the threshold regardless, so no need to intervene. But yes, it's possible (wpt has a cmdline option)
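
For reference, a minimal sketch of passing that cmdline option through WPT's public runtest API; the location label and flag choice are assumptions, only the runtest.php parameters follow the documented API:

```js
// Sketch: kick off a WPT run with Chrome site isolation disabled via the
// pass-through `cmdline` parameter.
const https = require('https');

const params = new URLSearchParams({
  url: 'https://www.example.com/',
  k: process.env.WPT_API_KEY,          // API key (env var name is hypothetical)
  location: 'Dulles_MotoG4:Chrome',    // assumed device/location label
  runs: '9',
  f: 'json',
  cmdline: '--disable-site-isolation-trials', // assumed flag choice
});

https.get(`https://www.webpagetest.org/runtest.php?${params}`, res => {
  let body = '';
  res.on('data', chunk => (body += chunk));
  res.on('end', () => console.log(JSON.parse(body).data.jsonUrl));
});
```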

Oh wait, @connorjclark these are the same WPT traces from earlier, right?

Yeah, the WPT numbers are still from the N5 run. I can try getting G4 numbers tomorrow and see if that changes anything.

brendankenny commented 5 years ago

The G4 and N5 are under the threshold regardless, so no need to intervene

"While we investigate how to bring this support to more devices..."

I meant for better (near) future proofing but good to know for the present.

connorjclark commented 5 years ago

Ran WPT again for a subset of URLs, but this time on the G4. n=5, chose median.

Note: I changed the WPT Chrome channel to Stable. But apparently it takes many weeks for Stable to release to Play Store, so I actually got M77. Hence the lack of LCP.

OOPIFs:

image

No OOPIFs:

image

patrickhulce commented 5 years ago

Suggested Action Items / tl;dr

Details

Redirects

We have a few URLs that are redirecting. Because of the existing discrepancy between how observed and simulated values are calculated (which I believe is being fixed for 6.0, #8984), we need to use the resolved URL after redirects. Probably my bad for saving the golden set with the clean URL instead of the URL that was audited, sorry :/
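
A minimal sketch of keying the comparison on the URL that was actually audited; `finalUrl` is a standard LHR field, while the normalization rules and golden-entry shape are assumptions:

```js
function normalizeUrl(url) {
  const u = new URL(url);
  // Collapse the redirect differences seen in this data set: scheme upgrades,
  // added/removed "www." or "m." prefixes, and trailing slashes.
  return `${u.hostname.replace(/^(www|m)\./, '')}${u.pathname.replace(/\/$/, '')}`;
}

function matchGoldenToRuns(goldenEntries, lighthouseResults) {
  // `finalUrl` is the post-redirect URL Lighthouse actually audited.
  const byFinalUrl = new Map(
    lighthouseResults.map(lhr => [normalizeUrl(lhr.finalUrl), lhr]),
  );
  return goldenEntries
    .map(entry => ({entry, lhr: byFinalUrl.get(normalizeUrl(entry.url))}))
    .filter(pair => pair.lhr); // URLs with no match are dropped, not guessed
}
```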

Subset Comparison

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like Connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, set, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure whether they were skipped just for the sake of time. The same redirect resolution point above will need to be applied to these too. That being said, after adjusting our script to just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact for TTFCPUI they were actually lower than normal), so this doesn't totally explain the error rate differences.

image

- https://flipkart.com not found in new set
- https://vine.co/ not found in new set
- https://weather.com/ not found in new set
- http://www.4399.com/ not found in new set
- http://www.58.com/ not found in new set
- http://www.7k7k.com/ not found in new set
- http://www.amazon.co.jp/ not found in new set
- http://www.blogspot.com/ not found in new set
- http://www.brothersoft.com/ not found in new set
- http://www.china.com.cn/ not found in new set
- http://www.cntv.cn/ not found in new set
- http://www.conduit.com/ not found in new set
- http://www.craigslist.org/ not found in new set
- http://www.dawn.com/ not found in new set
- http://www.dion.ne.jp/ not found in new set
- http://www.ebay.com/ not found in new set
- http://www.espn.com/ not found in new set
- http://www.fc2.com/ not found in new set
- http://www.filestube.com/ not found in new set
- http://www.getpersonas.com/ not found in new set
- http://www.globo.com/ not found in new set
- http://www.hatena.ne.jp/ not found in new set
- http://www.hotfile.com/ not found in new set
- http://www.hp.com/ not found in new set
- http://www.huffingtonpost.com/ not found in new set
- http://www.hulu.com/ not found in new set
- http://www.java.com/ not found in new set
- http://www.livedoor.jp/ not found in new set
- http://www.liveperson.net/ not found in new set
- http://www.maktoob.com/ not found in new set
- http://www.metrolyrics.com/ not found in new set
- http://www.mlb.com/ not found in new set
- http://www.mozilla.org/ not found in new set
- http://www.optmd.com/ not found in new set
- http://www.orange.fr/ not found in new set
- http://www.orkut.com/ not found in new set
- http://www.partypoker.com/ not found in new set
- http://www.pcpop.com/ not found in new set
- http://www.pdfqueen.com/ not found in new set
- http://www.pptv.com/ not found in new set
- http://www.rakuten.co.jp/ not found in new set
- http://www.rakuten.ne.jp/ not found in new set
- http://www.scribd.com/ not found in new set
- http://www.shopping.com/ not found in new set
- http://www.skype.com/ not found in new set
- http://www.so-net.ne.jp/ not found in new set
- http://www.softonic.com/ not found in new set
- http://www.sogou.com/ not found in new set
- http://www.soso.com/ not found in new set
- http://www.symantec.com/ not found in new set
- http://www.t-online.de/ not found in new set
- http://www.tabelog.com/ not found in new set
- http://www.thefreedictionary.com/ not found in new set
- http://www.thepiratebay.org/ not found in new set
- http://www.thestar.com.my not found in new set
- http://www.tianya.cn/ not found in new set
- http://www.torrentz.com/ not found in new set
- http://www.tumblr.com/ not found in new set
- http://www.twitpic.com/ not found in new set
- http://www.typepad.com/ not found in new set
- http://www.verizonwireless.com/ not found in new set
- http://www.vevo.com/ not found in new set
- http://www.weather.com/ not found in new set
- http://www.wikipedia.org/ not found in new set
- http://www.ynet.com/ not found in new set
- http://www.youdao.com/ not found in new set
- http://www.zol.com.cn/ not found in new set

Unreproducible WPT Variance

This is likely selection bias: we're looking into sites that performed particularly poorly, partly by chance, and when we try to reproduce those results we just get more reasonable ones. We probably didn't experience much of this in the first set because we determined the 100-URL golden set from a much larger 1000-URL set, so runs with unpredictable behavior and high variance were simply excluded. AT&T is a good example: our golden value is 18.5s, yet the median of 9 runs redone through the WPT UI is 10s. More on this below.

Lantern Intentionally Makes Optimistic Decisions But We Measure Against Median

The worst error rates that aren't redirect-driven are all dramatic underestimations. We've made lots of improvements and decisions over the past two years that make lantern intentionally optimistic. Even in our pessimistic simulations we use optimistic per-origin RTTs, optimistic server response times, and optimistic SSL and HTTP/2 setups. The decision to exclude highly variable runs from the initial golden set is likely responsible for providing a rosier outlook on the difference between predicting the median vs. predicting the minimum than was realistically achievable.

image

As a result, we're systematically underestimating. Explanations for the top few errors are below.

  1. https://www.att.com - golden says 18.5s, WPT traces say 10.5s-20s, we say 7s. Here we get unlucky that the median-TTI run actually has one of the longest FCPs, so it's not a median-median comparison. Perhaps we should be using a different median selector? Median FCP might have more reasonable network characteristics for all metrics, and TTI is so variable anyhow.
  2. https://www.56.com - golden says 7.5s, WPT traces say 6-11s, we say 3s. Systematic and intentional lantern error. Here we are bitten by our intentional optimism in per-origin RTTs. The servers are in APAC, which results in very high and very variable RTTs. We use the min (and plan to not observe it at all), so we greatly underestimate the time taken to download required resources.
  3. https://www.deviantart.com/ - golden says 8.5s, WPT traces say 5.1-8.5s, we say 3.5s. Here we again get unlucky that the median-TTI run actually has the max FCP. Extremely variable root document request that sometimes takes ~10x the lantern estimate.
  4. https://www.linkedin.com/ - golden says 4.1s, WPT traces say 2-5.1s, we say 1.9s. Here we again get unlucky that the median-TTI run actually has the max FCP. Extremely variable root document request that sometimes takes ~5x the lantern estimate.

1 = Either we track our accuracy against what we're explicitly attempting to simulate, or we change our simulation to inject more pessimism. I strongly dislike the latter approach but am happy to discuss if you disagree :)

connorjclark commented 5 years ago

Re: Redirects

I did correct many of the redirects, but I did not take into account m.* redirects based on UA. Good catch.

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like Connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, set, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure whether they were skipped just for the sake of time. The same redirect resolution point above will need to be applied to these too. That being said, after adjusting our script to just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact for TTFCPUI they were actually lower than normal), so this doesn't totally explain the error rate differences.

I definitely did a subset in the interest of time. I attempted to make sure the sets were equivalent, but I was admittedly a bit cavalier.

When we try to reproduce those results, we just get the more reasonable results. We probably didn't experience much of this in the first set because we determined the 100 golden set based on a much larger 1000 set so runs with unpredictable behavior and high variance were just excluded.

I didn't know about the larger 1000 URL set. So it was whittled down based on just variance? We should definitely do something similar again, or at least measure the variance in our existing set of URLs and see what bad apples we have.

either we track our accuracy against what we're explicitly attempting to simulate or we change our simulation to inject more pessimism

Taking the minimum seems like maximum optimism. Would you describe lantern as having the same level of optimism?

connorjclark commented 5 years ago

For making collection easier, I would like to transition to a cloud-based operation. (It is very time consuming right now, especially since my corp machine falls asleep and I have to babysit it.) I don't exactly know how to go about that. @patrickhulce any ideas? Probably the same approach you did with DZL (whatever that was).

In addition to convenience, it'd be a necessary step to automating the collection on a somewhat-regular basis.

patrickhulce commented 5 years ago

So it was whittled down based on just variance?

More or less randomness + variance. If I had thought ahead I would have been much more methodical about it :) Basically I randomly selected 110% of the URLs we wanted to keep in a basket and threw out the worst 10% that had very high variance.
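
A minimal sketch of that selection, assuming each candidate carries its per-run TTI values; using the coefficient of variation as the variance measure is my assumption of a reasonable stand-in:

```js
// Sample ~110% of the target count at random, then drop the ~10% with the
// highest run-to-run variance.
function coefficientOfVariation(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance) / mean;
}

function pickGoldenSet(candidates, targetCount) {
  // `candidates` is assumed to be [{url, ttiRuns: number[]}, ...].
  const sampled = candidates
    .slice()
    .sort(() => Math.random() - 0.5)          // crude random shuffle
    .slice(0, Math.ceil(targetCount * 1.1));  // take ~110% of the target
  return sampled
    .sort((a, b) =>
      coefficientOfVariation(a.ttiRuns) - coefficientOfVariation(b.ttiRuns))
    .slice(0, targetCount);                   // keep the least variable ones
}
```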

Taking the minimum seems like maximum optimism. Would you describe lantern as having the same level of optimism?

Given a fixed graph and a move to totally ignore observed per-origin RTT, yeah I would say that the characteristics of our simulation produce the maximally optimistic result. The only thing pushing results to be more pessimistic is the pessimistic graph that includes things that potentially shouldn't be included, but in many cases there simply aren't any such things to include and so we remain maximally optimistic. If min is too extreme, something like the 25th or 10th percentile might make sense. Part of the motivation here is that there's frequently a bimodal distribution and by choosing the TTI-median we end up comparing ourselves with something that is absolute worst-case scenario, which lantern is simply never going to try to match.
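
To make the min-vs-percentile tradeoff concrete, here's a hypothetical per-origin RTT aggregation (not Lantern's actual NetworkAnalyzer code), where p = 0 reproduces today's "take the min" behavior and p = 10 or 25 injects the mild pessimism suggested above:

```js
function percentile(sortedValues, p) {
  const index = Math.min(
    sortedValues.length - 1,
    Math.floor((p / 100) * sortedValues.length),
  );
  return sortedValues[index];
}

function estimateRttPerOrigin(samplesByOrigin, p = 25) {
  // `samplesByOrigin` is assumed to be a Map<origin, number[]> of RTT samples in ms.
  const estimates = new Map();
  for (const [origin, samples] of samplesByOrigin) {
    const sorted = samples.slice().sort((a, b) => a - b);
    // p = 0 is the maximally optimistic min; p = 10 or 25 tolerates the
    // highly variable origins (e.g. the APAC servers above) a bit better.
    estimates.set(origin, percentile(sorted, p));
  }
  return estimates;
}
```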

For making collection easier, I would like to transition to a cloud-based operation.

I agree, though this will likely come with its own subtly different perf characteristics, too, to complicate things 😞

connorjclark commented 5 years ago

I agree, though this will likely come with its own subtly different perf characteristics, too, to complicate things

If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful :)

patrickhulce commented 5 years ago

@patrickhulce any ideas?

ya, if the lantern collection script is G2G as-is I can put a script together for automating it on there and dumping results to cloud storage
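
A minimal sketch of that automation loop, assuming a collectTracesForUrl() stand-in for the existing collection script and a hypothetical bucket name; only the @google-cloud/storage calls are the library's real API:

```js
// Run the collection for each URL, then dump the output files to a bucket.
const {Storage} = require('@google-cloud/storage');
const fs = require('fs');

const bucket = new Storage().bucket('lantern-collection-results'); // hypothetical bucket

async function collectAndUpload(urls, collectTracesForUrl) {
  for (const url of urls) {
    // collectTracesForUrl is a stand-in for the existing collection script;
    // it's assumed to return the directory holding traces/devtools logs.
    const outDir = await collectTracesForUrl(url);
    for (const file of fs.readdirSync(outDir)) {
      await bucket.upload(`${outDir}/${file}`, {
        destination: `${new URL(url).hostname}/${file}`,
      });
    }
  }
}
```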

patrickhulce commented 5 years ago

If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful

Ooooh this is a great idea! Any chance of exposing a "get trace and devtools log" to public PSI?

connorjclark commented 5 years ago

Not public, but we have the capability to grab that stuff if we hit LR internally. So it'd amount to running the collection script on borg (+ hitting the LR api, which is easy enough).

connorjclark commented 5 years ago

quick summary of more action items we decided today:

patrickhulce commented 5 years ago

pick 75%ile (near-best) WPT run based on TTI (of 9 runs, that's the third best)

I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.
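
A quick sketch of a golden-run selector parameterized on the metric, so swapping TTI for FCP is a one-argument change; the run shape ({tti, fcp} in ms) is an assumption:

```js
// Pick the "golden" run from a batch of WPT runs at a chosen percentile from
// best, where lower metric values are better.
function pickGoldenRun(runs, metric = 'tti', percentileFromBest = 75) {
  const sorted = runs.slice().sort((a, b) => a[metric] - b[metric]); // best first
  const index = Math.min(
    sorted.length - 1,
    Math.round(((100 - percentileFromBest) / 100) * (sorted.length - 1)),
  );
  // For 9 runs at the 75th percentile from best this is index 2, the third
  // best run, matching the "of 9 runs, that's the third best" note above.
  return sorted[index];
}
```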

connorjclark commented 5 years ago

I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.

sgtm. let's try varying that next time we do a full collection.

connorjclark commented 4 years ago

This PR added the collection scripts: #9662

Just realized we are selecting the golden run based on 75pct FCP. Apparently we want to use TTI instead (https://github.com/GoogleChrome/lighthouse/issues/9887#issuecomment-547709048). Gotta fix that.

paulirish commented 4 years ago

CPU throttling on WPT braindump

My understanding of CPU throttling in WPT:

Status as of Jan 2020:


Updated Timeline of changes

Summarizing as good/bad months...

connorjclark commented 4 years ago

  1. Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?
  2. Paul, did you resolve your things w/ WPT?
  3. I think Moto G4 still has OOPIF disabled, FWIW
  4. one day we'll do the same for LR, which will be useful for making sure that env is good too -> Should we do this?

patrickhulce commented 4 years ago

Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?

There's a document outlining our plans here and we'll be converting it to issues shortly (doc).

Paul, did you resolve your things w/ WPT?

Update in https://github.com/GoogleChrome/lighthouse/issues/9887#issuecomment-580837292, but tl;dr: some things are fixed, though not everything looks right yet.

one day we'll do the same for LR, which will be useful for making sure that env is good too -> Should we do this?

tracked by #10358

patrickhulce commented 3 years ago

The variance mission will never end, but we're done with this as a standalone effort for now.