GoogleChrome / lighthouse

Automated auditing, performance metrics, and best practices for the web.
https://developer.chrome.com/docs/lighthouse/overview/
Apache License 2.0

Operation Point Reyes (and WPT throttling history) #9887

Closed: paulirish closed this issue 3 years ago

paulirish commented 5 years ago

Harkening to the days of Operation Yaquina Bay, we've got a new challenge in front of us...

Point Reyes is the windiest place on the Pacific Coast. And much like wind makes the physical world oscillate, variance makes our numbers vibrate.

image

We have a few questions we need answered to get our Lantern-driven simulation in tip-top shape.

Questions

Actions


Team, please update this with anything it's missing.

connorjclark commented 5 years ago

We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol. We'd only do this for mobile runs. This will allow us to keep lantern the same and still get accurate results for mobile. But first, let's confirm that OOPIFs are a source of error in (...traces with oopif disabled task above).
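
For reference, site isolation (and with it OOPIFs) can already be turned off at launch time with Chrome flags, even if the protocol-level toggle doesn't pan out. A minimal sketch using the chrome-launcher and lighthouse Node APIs; the exact flag combination is an assumption about what fully disables OOPIFs:

```js
// Sketch: run Lighthouse against a Chrome launched with site isolation
// disabled, so cross-origin iframes stay in-process (no OOPIFs).
const chromeLauncher = require('chrome-launcher');
const lighthouse = require('lighthouse');

async function runWithoutOopifs(url) {
  const chrome = await chromeLauncher.launch({
    chromeFlags: [
      '--headless',
      // Assumed to be sufficient to disable OOPIFs for the run.
      '--disable-site-isolation-trials',
      '--disable-features=IsolateOrigins,site-per-process',
    ],
  });
  try {
    const {lhr} = await lighthouse(url, {port: chrome.port});
    return lhr;
  } finally {
    await chrome.kill();
  }
}

runWithoutOopifs('https://example.com')
  .then(lhr => console.log(lhr.audits['interactive'].numericValue));
```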

connorjclark commented 5 years ago

For comparison, this is the accuracy of the dataset the lantern test has today:

image

With newly corrected traces:

With OOPIFs on local desktop:

image

Without OOPIFs on local desktop:

image

Data sets: https://drive.google.com/file/d/1-3IjtVsllDgcSsY0S49S1ZeTd1p13-IH/view?usp=sharing

patrickhulce commented 5 years ago

So definitely better for TTI, but not nearly enough :/ I'm gonna start looking into the FCP traces here.

brendankenny commented 5 years ago

this seems implicit in the above, but just to be explicit, is the plan to

vs simply tune numbers based on the times alone, presumably because the dependency graph would be so completely different as to be useless for predicting the mobile speed? e.g. lots of long tasks from iframes.

That does mean we're going to have to stay on our toes for more site isolation changes. And for

We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol

I wonder if there's a way on desktop we could get the password-triggered mobile behavior that android is getting if the G4 gets it or gets it someday soon.

Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.

patrickhulce commented 5 years ago

Oh wait, @connorjclark these are the same WPT traces from earlier, right?

So all the FCP issues discussed in https://github.com/GoogleChrome/lighthouse/pull/9662#issuecomment-542873598 and https://github.com/GoogleChrome/lighthouse/pull/9662#issuecomment-542875869 apply here.

connorjclark commented 5 years ago

Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.

The G4 and N5 are under the threshold regardless, so no need to intervene. But yes, it's possible (wpt has a cmdline option)
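
For reference, a minimal sketch of passing that cmdline option through WPT's public runtest API; the location label and flag choice are assumptions, only the runtest.php parameters follow the documented API:

```js
// Sketch: kick off a WPT run with Chrome site isolation disabled via the
// pass-through `cmdline` parameter.
const https = require('https');

const params = new URLSearchParams({
  url: 'https://www.example.com/',
  k: process.env.WPT_API_KEY,          // API key (env var name is hypothetical)
  location: 'Dulles_MotoG4:Chrome',    // assumed device/location label
  runs: '9',
  f: 'json',
  cmdline: '--disable-site-isolation-trials', // assumed flag choice
});

https.get(`https://www.webpagetest.org/runtest.php?${params}`, res => {
  let body = '';
  res.on('data', chunk => (body += chunk));
  res.on('end', () => console.log(JSON.parse(body).data.jsonUrl));
});
```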

Oh wait, @connorjclark these are the same WPT traces from earlier, right?

Yeah, the WPT numbers are still from the N5 run. I can try getting G4 numbers tomorrow and see if that changes anything.

brendankenny commented 5 years ago

The G4 and N5 are under the threshold regardless, so no need to intervene

"While we investigate how to bring this support to more devices..."

I meant for better (near) future proofing but good to know for the present.

connorjclark commented 5 years ago

Ran WPT again for a subset of URLs, but this time on the G4. n=5, chose median.

Note: I changed the WPT Chrome channel to Stable. But apparently it takes many weeks for Stable to release to Play Store, so I actually got M77. Hence the lack of LCP.

OOPIFs:

image

No OOPIFs:

image

patrickhulce commented 5 years ago

Suggested Action Items / tl;dr

Details

Redirects

We have a few URLs that are redirecting. Because of the existing discrepancy between how observed and simulated values are calculated (which I believe is being fixed for 6.0, #8984), we need to use the resolved URL after redirects. Probably my bad for saving the golden set with the clean URL instead of the URL that was audited, sorry :/
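
A minimal sketch of keying the comparison on the URL that was actually audited; `finalUrl` is a standard LHR field, while the normalization rules and golden-entry shape are assumptions:

```js
function normalizeUrl(url) {
  const u = new URL(url);
  // Collapse the redirect differences seen in this data set: scheme upgrades,
  // added/removed "www." or "m." prefixes, and trailing slashes.
  return `${u.hostname.replace(/^(www|m)\./, '')}${u.pathname.replace(/\/$/, '')}`;
}

function matchGoldenToRuns(goldenEntries, lighthouseResults) {
  // `finalUrl` is the post-redirect URL Lighthouse actually audited.
  const byFinalUrl = new Map(
    lighthouseResults.map(lhr => [normalizeUrl(lhr.finalUrl), lhr]),
  );
  return goldenEntries
    .map(entry => ({entry, lhr: byFinalUrl.get(normalizeUrl(entry.url))}))
    .filter(pair => pair.lhr); // URLs with no match are dropped, not guessed
}
```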

Subset Comparison

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like Connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, set, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure whether they were skipped just for the sake of time. The same redirect resolution point above will need to be applied to these too. That being said, after adjusting our script to just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact for TTFCPUI they were actually lower than normal), so this doesn't totally explain the error rate differences.

image

- https://flipkart.com not found in new set
- https://vine.co/ not found in new set
- https://weather.com/ not found in new set
- http://www.4399.com/ not found in new set
- http://www.58.com/ not found in new set
- http://www.7k7k.com/ not found in new set
- http://www.amazon.co.jp/ not found in new set
- http://www.blogspot.com/ not found in new set
- http://www.brothersoft.com/ not found in new set
- http://www.china.com.cn/ not found in new set
- http://www.cntv.cn/ not found in new set
- http://www.conduit.com/ not found in new set
- http://www.craigslist.org/ not found in new set
- http://www.dawn.com/ not found in new set
- http://www.dion.ne.jp/ not found in new set
- http://www.ebay.com/ not found in new set
- http://www.espn.com/ not found in new set
- http://www.fc2.com/ not found in new set
- http://www.filestube.com/ not found in new set
- http://www.getpersonas.com/ not found in new set
- http://www.globo.com/ not found in new set
- http://www.hatena.ne.jp/ not found in new set
- http://www.hotfile.com/ not found in new set
- http://www.hp.com/ not found in new set
- http://www.huffingtonpost.com/ not found in new set
- http://www.hulu.com/ not found in new set
- http://www.java.com/ not found in new set
- http://www.livedoor.jp/ not found in new set
- http://www.liveperson.net/ not found in new set
- http://www.maktoob.com/ not found in new set
- http://www.metrolyrics.com/ not found in new set
- http://www.mlb.com/ not found in new set
- http://www.mozilla.org/ not found in new set
- http://www.optmd.com/ not found in new set
- http://www.orange.fr/ not found in new set
- http://www.orkut.com/ not found in new set
- http://www.partypoker.com/ not found in new set
- http://www.pcpop.com/ not found in new set
- http://www.pdfqueen.com/ not found in new set
- http://www.pptv.com/ not found in new set
- http://www.rakuten.co.jp/ not found in new set
- http://www.rakuten.ne.jp/ not found in new set
- http://www.scribd.com/ not found in new set
- http://www.shopping.com/ not found in new set
- http://www.skype.com/ not found in new set
- http://www.so-net.ne.jp/ not found in new set
- http://www.softonic.com/ not found in new set
- http://www.sogou.com/ not found in new set
- http://www.soso.com/ not found in new set
- http://www.symantec.com/ not found in new set
- http://www.t-online.de/ not found in new set
- http://www.tabelog.com/ not found in new set
- http://www.thefreedictionary.com/ not found in new set
- http://www.thepiratebay.org/ not found in new set
- http://www.thestar.com.my not found in new set
- http://www.tianya.cn/ not found in new set
- http://www.torrentz.com/ not found in new set
- http://www.tumblr.com/ not found in new set
- http://www.twitpic.com/ not found in new set
- http://www.typepad.com/ not found in new set
- http://www.verizonwireless.com/ not found in new set
- http://www.vevo.com/ not found in new set
- http://www.weather.com/ not found in new set
- http://www.wikipedia.org/ not found in new set
- http://www.ynet.com/ not found in new set
- http://www.youdao.com/ not found in new set
- http://www.zol.com.cn/ not found in new set

Unreproducible WPT Variance

This is likely selection bias: we're looking into sites that performed particularly poorly, partly by chance, and when we try to reproduce those results we just get more reasonable ones. We probably didn't experience much of this in the first set because we determined the 100-URL golden set from a much larger 1000-URL set, so runs with unpredictable behavior and high variance were simply excluded. AT&T is a good example: our golden value is 18.5s, yet the median of 9 runs redone through the WPT UI is 10s. More on this below.

Lantern Intentionally Makes Optimistic Decisions But We Measure Against Median

The worst error rates that aren't redirect-driven are all dramatic underestimations. We've made lots of improvements and decisions over the past two years that make lantern intentionally optimistic. Even in our pessimistic simulations we use optimistic per-origin RTTs, optimistic server response times, and optimistic SSL and HTTP/2 setups. The decision to exclude highly variable runs from the initial golden set is likely responsible for providing a rosier outlook on the difference between predicting the median vs. predicting the minimum than was realistically achievable.

image

As a result, we're systematically underestimating. Explanations for the top few errors are below.

  1. https://www.att.com - golden says 18.5s, WPT traces say 10.5s-20s, we say 7s. Here we get unlucky that the median-TTI run actually has one of the longest FCPs, so it's not a median-median comparison. Perhaps we should be using a different median selector? Median FCP might have more reasonable network characteristics for all metrics, and TTI is so variable anyhow.
  2. https://www.56.com - golden says 7.5s, WPT traces say 6-11s, we say 3s. Systematic and intentional lantern error. Here we are bitten by our intentional optimism in per-origin RTTs. The servers are in APAC, which results in very high and very variable RTTs. We use the min (and plan to not observe it at all), so we greatly underestimate the time taken to download required resources.
  3. https://www.deviantart.com/ - golden says 8.5s, WPT traces say 5.1-8.5s, we say 3.5s. Here we again get unlucky that the median-TTI run actually has the max FCP. Extremely variable root document request that sometimes takes ~10x the lantern estimate.
  4. https://www.linkedin.com/ - golden says 4.1s, WPT traces say 2-5.1s, we say 1.9s. Here we again get unlucky that the median-TTI run actually has the max FCP. Extremely variable root document request that sometimes takes ~5x the lantern estimate.

1 = Either we track our accuracy against what we're explicitly attempting to simulate, or we change our simulation to inject more pessimism. I strongly dislike the latter approach but am happy to discuss if you disagree :)

connorjclark commented 5 years ago

Re: Redirects

I did correct many of the redirects, but I did not take into account m.* redirects based on UA. Good catch.

It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like Connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, set, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure whether they were skipped just for the sake of time. The same redirect resolution point above will need to be applied to these too. That being said, after adjusting our script to just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact for TTFCPUI they were actually lower than normal), so this doesn't totally explain the error rate differences.

I definitely did a subset in the interest of time. I attempted to make sure the sets were equivalent, but I was admittedly a bit cavalier.

When we try to reproduce those results, we just get the more reasonable results. We probably didn't experience much of this in the first set because we determined the 100 golden set based on a much larger 1000 set so runs with unpredictable behavior and high variance were just excluded.

I didn't know about the larger 1000 URL set. So it was whittled down based on just variance? We should definitely do something similar again, or at least measure the variance in our existing set of URLs and see what bad apples we have.

either we track our accuracy against what we're explicitly attempting to simulate or we change our simulation to inject more pessimism

Taking the minimum seems like maximum optimism. Would you describe lantern as having the same level of optimism?

connorjclark commented 5 years ago

For making collection easier, I would like to transition to a cloud-based operation. (It is very time consuming right now, especially since my corp machine falls asleep and I have to babysit it.) I don't exactly know how to go about that. @patrickhulce any ideas? Probably the same approach you did with DZL (whatever that was).

In addition to convenience, it'd be a necessary step to automating the collection on a somewhat-regular basis.

patrickhulce commented 5 years ago

So it was whittled down based on just variance?

More or less randomness + variance. If I had thought ahead I would have been much more methodical about it :) Basically I randomly selected 110% of the URLs we wanted to keep in a basket and threw out the worst 10% that had very high variance.
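
A minimal sketch of that selection, assuming each candidate carries its per-run TTI values; using the coefficient of variation as the variance measure is my assumption of a reasonable stand-in:

```js
// Sample ~110% of the target count at random, then drop the ~10% with the
// highest run-to-run variance.
function coefficientOfVariation(values) {
  const mean = values.reduce((a, b) => a + b, 0) / values.length;
  const variance =
    values.reduce((sum, v) => sum + (v - mean) ** 2, 0) / values.length;
  return Math.sqrt(variance) / mean;
}

function pickGoldenSet(candidates, targetCount) {
  // `candidates` is assumed to be [{url, ttiRuns: number[]}, ...].
  const sampled = candidates
    .slice()
    .sort(() => Math.random() - 0.5)          // crude random shuffle
    .slice(0, Math.ceil(targetCount * 1.1));  // take ~110% of the target
  return sampled
    .sort((a, b) =>
      coefficientOfVariation(a.ttiRuns) - coefficientOfVariation(b.ttiRuns))
    .slice(0, targetCount);                   // keep the least variable ones
}
```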

Taking the minimum seems like maximum optimism. Would you describe lantern as having the same level of optimism?

Given a fixed graph and a move to totally ignore observed per-origin RTT, yeah I would say that the characteristics of our simulation produce the maximally optimistic result. The only thing pushing results to be more pessimistic is the pessimistic graph that includes things that potentially shouldn't be included, but in many cases there simply aren't any such things to include and so we remain maximally optimistic. If min is too extreme, something like the 25th or 10th percentile might make sense. Part of the motivation here is that there's frequently a bimodal distribution and by choosing the TTI-median we end up comparing ourselves with something that is absolute worst-case scenario, which lantern is simply never going to try to match.
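
To make the min-vs-percentile tradeoff concrete, here's a hypothetical per-origin RTT aggregation (not Lantern's actual NetworkAnalyzer code), where p = 0 reproduces today's "take the min" behavior and p = 10 or 25 injects the mild pessimism suggested above:

```js
function percentile(sortedValues, p) {
  const index = Math.min(
    sortedValues.length - 1,
    Math.floor((p / 100) * sortedValues.length),
  );
  return sortedValues[index];
}

function estimateRttPerOrigin(samplesByOrigin, p = 25) {
  // `samplesByOrigin` is assumed to be a Map<origin, number[]> of RTT samples in ms.
  const estimates = new Map();
  for (const [origin, samples] of samplesByOrigin) {
    const sorted = samples.slice().sort((a, b) => a - b);
    // p = 0 is the maximally optimistic min; p = 10 or 25 tolerates the
    // highly variable origins (e.g. the APAC servers above) a bit better.
    estimates.set(origin, percentile(sorted, p));
  }
  return estimates;
}
```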

For making collection easier, I would like to transition to a cloud-based operation.

I agree, though this will likely come with its own subtly different perf characteristics, too, to complicate things 😞

connorjclark commented 5 years ago

I agree, though this will likely come with its own subtly different perf characteristics, too, to complicate things

If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful :)

patrickhulce commented 5 years ago

@patrickhulce any ideas?

ya, if the lantern collection script is G2G as-is I can put a script together for automating it on there and dumping results to cloud storage
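
A minimal sketch of that automation loop, assuming a collectTracesForUrl() stand-in for the existing collection script and a hypothetical bucket name; only the @google-cloud/storage calls are the library's real API:

```js
// Run the collection for each URL, then dump the output files to a bucket.
const {Storage} = require('@google-cloud/storage');
const fs = require('fs');

const bucket = new Storage().bucket('lantern-collection-results'); // hypothetical bucket

async function collectAndUpload(urls, collectTracesForUrl) {
  for (const url of urls) {
    // collectTracesForUrl is a stand-in for the existing collection script;
    // it's assumed to return the directory holding traces/devtools logs.
    const outDir = await collectTracesForUrl(url);
    for (const file of fs.readdirSync(outDir)) {
      await bucket.upload(`${outDir}/${file}`, {
        destination: `${new URL(url).hostname}/${file}`,
      });
    }
  }
}
```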

patrickhulce commented 5 years ago

If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful

Ooooh this is a great idea! Any chance of exposing a "get trace and devtools log" to public PSI?

connorjclark commented 5 years ago

Not public, but we have the capability to grab that stuff if we hit LR internally. So it'd amount to running the collection script on borg (+ hitting the LR api, which is easy enough).

connorjclark commented 5 years ago

quick summary of more action items we decided today:

patrickhulce commented 5 years ago

pick 75%ile (near-best) WPT run based on TTI (of 9 runs, that's the third best)

I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.
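
A quick sketch of a golden-run selector parameterized on the metric, so swapping TTI for FCP is a one-argument change; the run shape ({tti, fcp} in ms) is an assumption:

```js
// Pick the "golden" run from a batch of WPT runs at a chosen percentile from
// best, where lower metric values are better.
function pickGoldenRun(runs, metric = 'tti', percentileFromBest = 75) {
  const sorted = runs.slice().sort((a, b) => a[metric] - b[metric]); // best first
  const index = Math.min(
    sorted.length - 1,
    Math.round(((100 - percentileFromBest) / 100) * (sorted.length - 1)),
  );
  // For 9 runs at the 75th percentile from best this is index 2, the third
  // best run, matching the "of 9 runs, that's the third best" note above.
  return sorted[index];
}
```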

connorjclark commented 5 years ago

I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.

sgtm. let's try varying that next time we do a full collection.

connorjclark commented 4 years ago

This PR added the collection scripts: #9662

Just realized we are selecting the golden run based on 75pct FCP. Apparently we want to use TTI instead (https://github.com/GoogleChrome/lighthouse/issues/9887#issuecomment-547709048). Gotta fix that.

paulirish commented 4 years ago

CPU throttling on WPT braindump

My understanding of CPU throttling in WPT:

Status as of Jan 2020:


Updated Timeline of changes

Summarizing as good/bad months...

connorjclark commented 4 years ago

  1. Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?
  2. Paul, did you resolve your things w/ WPT?
  3. I think Moto G4 still has OOPIF disabled, FWIW
  4. one day we'll do the same for LR, which will be useful for making sure that env is good too -> Should we do this?

patrickhulce commented 4 years ago

Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?

There's a document outlining our plans here and we'll be converting it to issues shortly (doc).

Paul, did you resolve your things w/ WPT?

Update in https://github.com/GoogleChrome/lighthouse/issues/9887#issuecomment-580837292, but tl;dr: some things are fixed, though not everything looks right yet.

one day we'll do the same for LR, which will be useful for making sure that env is good too -> Should we do this?

tracked by #10358

patrickhulce commented 3 years ago

The variance mission will never end, but we're done with this as a standalone effort for now.