We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol. We'd only do this for mobile runs. This will allow us to keep lantern the same and still get accurate results for mobile. But first, let's confirm that OOPIFs are a source of error (see the "traces with oopif disabled" task above).
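If the protocol route doesn't pan out, one fallback sketch is launching Chrome with site isolation disabled via a flag and pointing LH at that instance. The flag below is a real Chrome switch, but whether it fully suppresses OOPIFs on a given channel/version is something we'd need to verify in the traces:

```js
// Sketch only — launch-flag fallback, not the protocol-based approach discussed above.
const chromeLauncher = require('chrome-launcher');
const lighthouse = require('lighthouse');

async function runWithoutOopifs(url) {
  const chrome = await chromeLauncher.launch({
    chromeFlags: ['--headless', '--disable-site-isolation-trials'],
  });
  try {
    // Mobile-only per the plan above.
    const {lhr} = await lighthouse(url, {port: chrome.port, emulatedFormFactor: 'mobile'});
    return lhr;
  } finally {
    await chrome.kill();
  }
}
```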
For comparison, this is the accuracy of the dataset the lantern test has today:
With newly corrected traces:
With OOPIFs on local desktop:
Without OOPIFs on local desktop:
Data sets: https://drive.google.com/file/d/1-3IjtVsllDgcSsY0S49S1ZeTd1p13-IH/view?usp=sharing
So definitely better for TTI, but not nearly enough :/ I'm gonna start looking into the FCP traces here.
This seems implicit in the above, but just to be explicit: is the plan to do that rather than simply tune numbers based on the times alone, presumably because the dependency graph would be so completely different as to be useless for predicting the mobile speed? e.g. lots of long tasks from iframes.
That does mean we're going to have to stay on our toes for more site isolation changes. And as for this:
We're gonna explore adding the ability to disable OOPIFs at runtime via the protocol
I wonder if there's a way on desktop we could get the password-triggered site isolation behavior that Android is getting, if the G4 gets it now or someday soon.
Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.
Oh wait, @connorjclark these are the same WPT traces from earlier, right?
So all the FCP issues discussed in https://github.com/GoogleChrome/lighthouse/pull/9662#issuecomment-542873598 and https://github.com/GoogleChrome/lighthouse/pull/9662#issuecomment-542875869 apply here.
Or should we just force site isolation off for the mobile device WPT runs as well, assuming WPT can do that? That might better simulate lower powered devices, since they're unlikely to get site isolation any time soon, and you'd get more main thread blocking.
The G4 and N5 are under the threshold regardless, so no need to intervene. But yes, it's possible (WPT has a cmdline option).
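For reference, a rough sketch of what forcing it off through the WPT API might look like, assuming the `cmdline` option accepts extra Chrome switches (the location string below is just an example):

```js
// Sketch: start a WPT run with site isolation forced off via the custom
// command-line option. Parameter names are my best reading of the WPT API.
const fetch = require('node-fetch');

async function runWptWithoutSiteIsolation(url, apiKey) {
  const params = new URLSearchParams({
    url,
    k: apiKey,
    location: 'Dulles_MotoG4:Chrome.3G', // example device/connectivity string
    runs: '9',
    f: 'json',
    cmdline: '--disable-site-isolation-trials',
  });
  const res = await fetch(`https://www.webpagetest.org/runtest.php?${params}`);
  return res.json();
}
```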
Oh wait, @connorjclark these are the same WPT traces from earlier, right?
Yeah, the WPT numbers are still from the N5 run. I can try getting G4 numbers tomorrow and see if that changes anything.
The G4 and N5 are under the threshold regardless, so no need to intervene
"While we investigate how to bring this support to more devices..."
I meant for better (near) future proofing but good to know for the present.
Ran WPT again for a subset of URLs, but for the G4. n=5, chose median.
Note: I changed the WPT Chrome channel to Stable. But apparently it takes many weeks for Stable to release to the Play Store, so I actually got M77. Hence the lack of LCP.
OOPIFs:
No OOPIFs:
We have a few URLs that are redirecting. Because of the existing discrepancy between how observed and simulated are calculated (which I believe is being fixed for 6.0, #8984), we need to use the resolved URL after redirects. Probably my bad for saving the golden set with the clean URL instead of the URL that was audited, sorry :/
It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like Connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, subset, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure if they just weren't run for the sake of time. The same redirect-resolution point above will need to be applied to these too (see the matching sketch after the list below). That being said, after adjusting our script for just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact, for TTFCPUI it was actually lower than normal), so this doesn't totally explain the error-rate differences.
- https://flipkart.com not found in new set
- https://vine.co/ not found in new set
- https://weather.com/ not found in new set
- http://www.4399.com/ not found in new set
- http://www.58.com/ not found in new set
- http://www.7k7k.com/ not found in new set
- http://www.amazon.co.jp/ not found in new set
- http://www.blogspot.com/ not found in new set
- http://www.brothersoft.com/ not found in new set
- http://www.china.com.cn/ not found in new set
- http://www.cntv.cn/ not found in new set
- http://www.conduit.com/ not found in new set
- http://www.craigslist.org/ not found in new set
- http://www.dawn.com/ not found in new set
- http://www.dion.ne.jp/ not found in new set
- http://www.ebay.com/ not found in new set
- http://www.espn.com/ not found in new set
- http://www.fc2.com/ not found in new set
- http://www.filestube.com/ not found in new set
- http://www.getpersonas.com/ not found in new set
- http://www.globo.com/ not found in new set
- http://www.hatena.ne.jp/ not found in new set
- http://www.hotfile.com/ not found in new set
- http://www.hp.com/ not found in new set
- http://www.huffingtonpost.com/ not found in new set
- http://www.hulu.com/ not found in new set
- http://www.java.com/ not found in new set
- http://www.livedoor.jp/ not found in new set
- http://www.liveperson.net/ not found in new set
- http://www.maktoob.com/ not found in new set
- http://www.metrolyrics.com/ not found in new set
- http://www.mlb.com/ not found in new set
- http://www.mozilla.org/ not found in new set
- http://www.optmd.com/ not found in new set
- http://www.orange.fr/ not found in new set
- http://www.orkut.com/ not found in new set
- http://www.partypoker.com/ not found in new set
- http://www.pcpop.com/ not found in new set
- http://www.pdfqueen.com/ not found in new set
- http://www.pptv.com/ not found in new set
- http://www.rakuten.co.jp/ not found in new set
- http://www.rakuten.ne.jp/ not found in new set
- http://www.scribd.com/ not found in new set
- http://www.shopping.com/ not found in new set
- http://www.skype.com/ not found in new set
- http://www.so-net.ne.jp/ not found in new set
- http://www.softonic.com/ not found in new set
- http://www.sogou.com/ not found in new set
- http://www.soso.com/ not found in new set
- http://www.symantec.com/ not found in new set
- http://www.t-online.de/ not found in new set
- http://www.tabelog.com/ not found in new set
- http://www.thefreedictionary.com/ not found in new set
- http://www.thepiratebay.org/ not found in new set
- http://www.thestar.com.my not found in new set
- http://www.tianya.cn/ not found in new set
- http://www.torrentz.com/ not found in new set
- http://www.tumblr.com/ not found in new set
- http://www.twitpic.com/ not found in new set
- http://www.typepad.com/ not found in new set
- http://www.verizonwireless.com/ not found in new set
- http://www.vevo.com/ not found in new set
- http://www.weather.com/ not found in new set
- http://www.wikipedia.org/ not found in new set
- http://www.ynet.com/ not found in new set
- http://www.youdao.com/ not found in new set
- http://www.zol.com.cn/ not found in new set
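Here's a hypothetical take on the fixUrl matching that keys both data sets off the final (post-redirect) URL and treats `m.`/`www.` variants as the same site — the `requestedUrl`/`finalUrl` shape is assumed to match an LHR:

```js
// Hypothetical matching sketch: compare golden and new runs by final URL.
function normalizeUrl(rawUrl) {
  const u = new URL(rawUrl);
  u.protocol = 'https:';
  u.hostname = u.hostname.replace(/^(m|www)\./, '');
  return u.origin + u.pathname.replace(/\/$/, '');
}

function matchRuns(goldenRuns, newRuns) {
  const byUrl = new Map(newRuns.map(run => [normalizeUrl(run.finalUrl || run.requestedUrl), run]));
  const matched = [];
  const missing = [];
  for (const golden of goldenRuns) {
    const match = byUrl.get(normalizeUrl(golden.finalUrl || golden.requestedUrl));
    if (match) matched.push({golden, match});
    else missing.push(golden.requestedUrl);
  }
  return {matched, missing};
}
```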
This is likely selection bias, because we're looking into sites that performed particularly poorly partly by chance. When we try to reproduce those results, we just get more reasonable results. We probably didn't experience much of this in the first set because we determined the 100-URL golden set from a much larger 1000-URL set, so runs with unpredictable behavior and high variance were simply excluded. AT&T is a good example of this: our golden value is 18.5s, yet the median of 9 runs redone through the WPT UI is 10s. More on this below.
The worst error rates that aren't redirect-driven are all dramatic underestimations. We've made lots of improvements and decisions over the past two years that make lantern intentionally optimistic. Even in our pessimistic simulations we use optimistic per-origin RTTs, optimistic server response times, and optimistic SSL and HTTP/2 setups. The decision to exclude highly variable runs from the initial golden set is likely responsible for providing a rosier outlook on the difference between predicting the median vs. predicting the minimum than was realistically achievable.
As a result, we're systematically underestimating. Explanations for the top couple of errors are below.
1. Either we track our accuracy against what we're explicitly attempting to simulate, or we change our simulation to inject more pessimism. I strongly dislike the latter approach but am happy to discuss if you disagree :)
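To illustrate the "optimistic per-origin RTTs" point above, here's a simplified sketch of what an optimistic estimate looks like — an illustration only, not the actual network-analyzer code, and the connect-timing heuristic is just one possible estimator:

```js
// Illustration: derive an optimistic (near-minimum) RTT estimate per origin
// from network records, using TCP connect timing as a rough proxy.
function optimisticRttByOrigin(networkRecords) {
  const rttByOrigin = new Map();
  for (const record of networkRecords) {
    if (!record.timing) continue;
    const origin = new URL(record.url).origin;
    const estimate = record.timing.connectEnd - record.timing.connectStart;
    if (!(estimate > 0)) continue;
    const current = rttByOrigin.get(origin);
    rttByOrigin.set(origin, current === undefined ? estimate : Math.min(current, estimate));
  }
  return rttByOrigin;
}
```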
Re: Redirects
I did correct many of the redirects, but I did not take into account `m.*` redirects based on UA. Good catch.
It seems like we're only comparing a subset of the same URLs (32/99 with my fixUrl script, though it looks like Connor managed to match up 43). When the basket of URLs is significantly smaller than the original, already small, subset, it's difficult to compare error characteristics reliably. Lots of these URLs are in Connor's list in the PR though, so I'm not sure if they just weren't run for the sake of time. The same redirect-resolution point above will need to be applied to these too. That being said, after adjusting our script for just the ones in the basket, our original error rates on this subset were higher but not that much higher (in fact, for TTFCPUI it was actually lower than normal), so this doesn't totally explain the error-rate differences.
I definitely did a subset in the interest of time. I attempted to make sure the sets were equivalent, but I was admittedly a bit cavalier.
When we try to reproduce those results, we just get more reasonable results. We probably didn't experience much of this in the first set because we determined the 100-URL golden set from a much larger 1000-URL set, so runs with unpredictable behavior and high variance were simply excluded.
I didn't know about the larger 1000 URL set. So it was whittled down based on just variance? We should definitely do something similar again, or at least measure the variance in our existing set of URLs and see what bad apples we have.
either we track our accuracy against what we're explicitly attempting to simulate or we change our simulation to inject more pessimism
Taking the minimum seems like maximum optimism. Would you describe lantern with the same level of optimism?
For making collection easier, I would like to transition to a cloud-based operation (it is very time consuming right now, especially since my corp machine falls asleep, so I have to babysit it). I don't exactly know how to go about that. @patrickhulce any ideas? Probably the same approach you did with DZL (whatever that was).
In addition to convenience, it'd be a necessary step to automating the collection on a somewhat-regular basis.
So it was whittled down based on just variance?
More or less randomness + variance. If I had thought ahead I would have been much more methodical about it :) Basically I randomly selected 110% of the URLs we wanted to keep in a basket and threw out the worst 10% that had very high variance.
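Something like this is how I'd reconstruct that culling step — assuming we already have repeated TTI runs per URL, and using coefficient of variation as just one way to define "very high variance":

```js
// Rough reconstruction of the golden-set culling: start with ~110% of the
// target count (already randomly sampled), then drop the most variable 10%.
function pickGoldenSet(candidates, targetCount) {
  // candidates: Array<{url: string, ttiRuns: number[]}>
  const pool = candidates.slice(0, Math.ceil(targetCount * 1.1));
  const scored = pool.map(candidate => {
    const runs = candidate.ttiRuns;
    const mean = runs.reduce((sum, value) => sum + value, 0) / runs.length;
    const variance = runs.reduce((sum, value) => sum + (value - mean) ** 2, 0) / runs.length;
    return {...candidate, cv: Math.sqrt(variance) / mean}; // coefficient of variation
  });
  scored.sort((a, b) => a.cv - b.cv); // most stable first
  return scored.slice(0, targetCount);
}
```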
Taking the minimum seems like maximum optimism. Would you describe lantern with the same level of optimism?
Given a fixed graph and a move to totally ignore observed per-origin RTT, yeah, I would say that the characteristics of our simulation produce the maximally optimistic result. The only thing pushing results to be more pessimistic is the pessimistic graph that includes things that potentially shouldn't be included, but in many cases there simply aren't any such things to include, and so we remain maximally optimistic. If min is too extreme, something like the 25th or 10th percentile might make sense. Part of the motivation here is that there's frequently a bimodal distribution, and by choosing the TTI-median we end up comparing ourselves with something that is the absolute worst-case scenario, which lantern is simply never going to try to match.
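If we went the percentile route, the target computation is simple enough — a sketch with linear interpolation between sorted observations:

```js
// Sketch: compare lantern against a less extreme target than the median or the
// minimum, e.g. the 25th or 10th percentile of the observed runs for a URL.
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const index = (sorted.length - 1) * p;
  const lower = Math.floor(index);
  const upper = Math.ceil(index);
  const weight = index - lower;
  return sorted[lower] * (1 - weight) + sorted[upper] * weight;
}

// const target = percentile(observedTtiRuns, 0.25); // optimistic-but-not-minimum target
```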
For making collection easier, I would like to transition to a cloud-based operation.
I agree, though this will likely come with its own subtly different perf characteristics to complicate things.
I agree, though this will likely come with its own subtly different perf characteristics to complicate things.
If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful :)
@patrickhulce any ideas?
ya, if the lantern collection script is G2G as-is I can put a script together for automating it on there and dumping results to cloud storage
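Very rough sketch of what that automation could look like — the script path, bucket name, and output filename below are all placeholders:

```js
// Sketch: run the existing collection script on a cloud box and dump the
// results to Cloud Storage. Paths/names are placeholders.
const {execFileSync} = require('child_process');
const {Storage} = require('@google-cloud/storage');

async function collectAndUpload() {
  // Placeholder invocation of the existing lantern collection script.
  execFileSync('node', ['./collect-lantern-traces.js'], {stdio: 'inherit'});

  const storage = new Storage();
  const bucket = storage.bucket('lantern-collection-results'); // placeholder bucket
  const stamp = new Date().toISOString().replace(/[:.]/g, '-');
  // Placeholder output artifact from the collection run.
  await bucket.upload('./lantern-traces.zip', {destination: `runs/${stamp}/lantern-traces.zip`});
}
```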
If we used PSI / LR (for the "unthrottled desktop" runs), at least we'd be accounting for things that will possibly be meaningful
Ooooh, this is a great idea! Any chance of exposing a "get trace and devtools log" to public PSI?
Not public, but we have the capability to grab that stuff if we hit LR internally. So it'd amount to running the collection script on borg (+ hitting the LR api, which is easy enough).
Quick summary of more action items we decided today:
- pick the 75th %ile (near-best) WPT run based on TTI (of 9 runs, that's the third best)
I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.
I'd be very curious to see how many metrics change if we base this on FCP instead. TTI is so variable anyway I have a hunch that it's less indicative for the other metrics.
sgtm. let's try varying that next time we do a full collection.
This PR added the collection scripts: #9662
Just realized we are selecting the golden run based on the 75th percentile FCP. Apparently we want to use TTI instead (https://github.com/GoogleChrome/lighthouse/issues/9887#issuecomment-547709048). Gotta fix that.
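For reference, the selection itself is a small change: sort ascending by the chosen metric and take the near-best index (with 9 runs, the 75th %ile "near-best" works out to the third fastest). A sketch, assuming we have the LHRs per URL:

```js
// Sketch: pick the golden run as the near-best (75th %ile toward the fast end)
// of the repeated runs, keyed on TTI instead of FCP.
function pickGoldenRun(lhrs, metric = 'interactive') {
  const sorted = [...lhrs].sort(
    (a, b) => a.audits[metric].numericValue - b.audits[metric].numericValue
  );
  // Fastest first, so near-best is 25% of the way in: index 2 of 9 runs.
  const index = Math.floor((sorted.length - 1) * 0.25);
  return sorted[index];
}
```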
- `self.options` are global for the WPT instance.
- A `job` is a WPT run. Within the run there may be "first view" and "repeat view"; those individual loads are `task`s. The Lighthouse run is a `job`, though there will then be a LH `task` within it.
- `--throttling` on the WPT instance will allow it to use cgroup CPU throttling, but only if each specific job also wants throttling.
- If `--throttling` was not set but a mobile-emulation (on desktop host) job wants throttling, WPT uses devtools CPU throttling.
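Paraphrasing those throttling rules as code, just to check my understanding — this is my reading of the notes above, not WPT's actual implementation:

```js
// My reading of the throttling notes above — not WPT's actual code.
function chooseCpuThrottling({instanceHasThrottlingFlag, jobWantsThrottling, isMobileEmulationOnDesktop}) {
  if (!jobWantsThrottling) return 'none';
  if (instanceHasThrottlingFlag) return 'cgroup';
  if (isMobileEmulationOnDesktop) return 'devtools';
  return 'none';
}
```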
One day we'll do the same for LR, which will be useful for making sure that env is good too. -> Should we do this?

Lantern doesn't know about OOPIFs, which might affect the simulation. Is that captured anywhere?
There's a document outlining our plans here (doc), and we'll be converting it to issues shortly.
Paul, did you resolve your things w/ WPT?
Update in https://github.com/GoogleChrome/lighthouse/issues/9887#issuecomment-580837292, but tl;dr: some things are fixed, but not everything looks right yet.
one day we'll do the same for LR, which will be useful for making sure that env is good too -> Should we do this?
tracked by #10358
The variance mission will never end, but we're done with this as a standalone effort for now.
Hearkening back to the days of Operation Yaquina Bay, we've got a new challenge in front of us...
Point Reyes is the windiest place on the Pacific Coast. And much like wind makes the physical world oscillate, variance makes our numbers vibrate.
We have a few questions we need answered to get our Lantern-driven simulation in tip-top shape.
Questions
Actions
- unthrottled-assets
- traces with oopif disabled [@connorjclark]

Team, please update this with anything it's missing.