HTTPArchive / almanac.httparchive.org

HTTP Archive's annual "State of the Web" report made by the web community
https://almanac.httparchive.org
Apache License 2.0

Improve CJK font detection #1641

Open rviscomi opened 3 years ago

rviscomi commented 3 years ago

WebPageTest installs Noto fonts locally to better emulate Android devices, which come with them preinstalled. As a result, an analysis of network activity for fonts misses Noto.

My knee-jerk reaction is that it would improve our font detection to ignore locally installed Noto at the WebPageTest level, so we can measure the number and size of font requests over the network. But it's more complicated than that. If we tested on native mobile hardware rather than emulated, using the locally installed fonts would be technically correct.

For example, if a website uses Helvetica, testing on Windows vs Mac would affect whether that font appears as a system vs web font, assuming it doesn't fall back to other system fonts like Arial or sans-serif.

This is something we should discuss and improve for 2021.

rsheeter commented 3 years ago

Do we have enough data on @font-face use of local(), the breakdown of users by (browser, OS+version), and what system fonts exist on each OS to do some napkin math on how significantly we believe we might be skewing the result?

rviscomi commented 3 years ago

> Do we have enough data on @font-face use of local()

Yes, we have more data than we know what to do with when it comes to stylesheet contents, and that's after having written 100+ queries for the CSS chapter! It's a bit tougher to extract granular data like local fonts within @font-face declarations, but possible.
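As a rough illustration of what extracting local fonts from @font-face declarations could look like, here is a minimal Python sketch. It is not the query used for the Almanac; the function name `local_fonts` and the regex-based approach are assumptions made for illustration, and a real analysis would more likely use a proper CSS parser, since regexes can be fooled by comments or unusual formatting.

```python
import re

# Body of each @font-face rule (these rules do not contain nested braces).
FONT_FACE_RE = re.compile(r"@font-face\s*\{([^}]*)\}", re.IGNORECASE)
# local(...) with an optionally quoted font name; \1 matches the same quote again.
LOCAL_RE = re.compile(r"local\(\s*(['\"]?)(.*?)\1\s*\)", re.IGNORECASE)


def local_fonts(css: str) -> list[str]:
    """Return every font name referenced via local() inside @font-face rules."""
    names = []
    for body in FONT_FACE_RE.findall(css):
        for _quote, name in LOCAL_RE.findall(body):
            names.append(name.strip())
    return names


if __name__ == "__main__":
    sample = (
        '@font-face { font-family: "Noto Sans"; '
        'src: local("Noto Sans"), local(NotoSans-Regular), '
        'url(/fonts/noto.woff2) format("woff2"); }'
    )
    print(local_fonts(sample))
```

Names collected this way could then be compared against the fonts installed on the test machine to estimate how often a page would skip the network request.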

> the breakdown of users by (browser, OS+version), and what system fonts exist on each OS to do some napkin math on how significantly we believe we might be skewing the result?

Not from HTTP Archive data (see the Methodology for more info). The dataset is based on the Chrome UX Report, which does include a coarse phone/tablet/desktop breakdown, but only includes usage from non-iOS Chrome browsers.

rviscomi commented 3 years ago

Since we're primarily interested in detecting web fonts by their network log, I'm inclined to explore the option of disabling the WPT functionality that emulates native mobile system fonts. This would arguably add unrealistic bytes and load time to the page, but I think the advantages outweigh that cost. @rsheeter do you agree with this approach?

@pmeenan how much flexibility do we have to turn off system fonts like Noto in WPT? Are there any other special-case fonts like it that may be skewing our font analysis?

rsheeter commented 3 years ago

I worry that disabling system font emulation entirely might change the results enough to matter. Rather than disabling it immediately, we should first try to estimate what impact it is having.

Or, perhaps even better, run an experiment where we gather a given run's data a second time with native font emulation disabled and see whether the results look alarmingly different?

tunetheweb commented 3 years ago

With Google Fonts dropping use of local(), it might not make as much difference as it used to...

pmeenan commented 3 years ago

FWIW, it's not "emulation". The Noto fonts are installed on the VMs. The only way to disable them at the system level would be to completely uninstall them.

rsheeter commented 3 years ago

> Google Fonts dropping use of local()

We still issue it in specific high-traffic cases, such as Roboto on Android.

rsheeter commented 3 years ago

> The Noto fonts are installed on the VMs

If we install only Noto, as opposed to say the exact fonts available on some version of Android, that's likely going to over-represent Android's other system fonts. Less of an issue for iOS as users don't usually fetch those fonts over the network.

Maybe we should back up and ask what environment the VM is meant to match? Initially I thought Android, but now I'm less sure.

pmeenan commented 3 years ago

Specifically, these are the font packages installed:

ttf-mscorefonts-installer fonts-noto fonts-roboto fonts-open-sans

https://github.com/WPO-Foundation/wptagent-install/blob/master/debian.sh#L125

We test both desktop and mobile from the same VMs, so it is a mix of Windows, Android and CJK fonts. The goal at the time was to have a representative set of fonts that users in the relevant countries would likely have installed on their systems, so that we don't over-represent the font bytes downloaded when local fallbacks are used and frequently available.

rsheeter commented 3 years ago

My gut reaction is that sounds reasonable. We could try to have different VMs that install different fonts to approximate different environments but I'm guessing that would be a significant nuisance.

It would be very interesting to know how much this is influencing the result. Can we tell from the archive data when a font resolves to a local font? If not I suppose an experiment might be needed?

rviscomi commented 3 years ago

> Can we tell from the archive data when a font resolves to a local font?

This might be good enough for font-related analysis.

We could also scan all CSS for @font-face declarations but I'm not sure if that'd have too many false positives for sites that never use the font.
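To make the false-positive concern concrete, here is a hedged Python sketch that flags @font-face families a stylesheet declares but never references in any other `font-family` property. The function name `unused_font_faces` is hypothetical, and the sketch is deliberately simplified: it ignores the `font` shorthand, CSS custom properties, comments, and fonts applied from JavaScript or other stylesheets, so it can only approximate "never used".

```python
import re

# Body of each @font-face rule (these rules do not contain nested braces).
FONT_FACE_RE = re.compile(r"@font-face\s*\{([^}]*)\}", re.IGNORECASE)
# Value of a font-family declaration, up to the next ';' or '}'.
FAMILY_RE = re.compile(r"font-family\s*:\s*([^;}]+)", re.IGNORECASE)


def _families(value: str) -> set[str]:
    """Split a font-family value into lowercase, unquoted family names."""
    return {
        part.strip().strip("'\"").lower()
        for part in value.split(",")
        if part.strip()
    }


def unused_font_faces(css: str) -> set[str]:
    """Return families declared via @font-face but not referenced elsewhere."""
    declared: set[str] = set()
    for body in FONT_FACE_RE.findall(css):
        for value in FAMILY_RE.findall(body):
            declared |= _families(value)

    # Drop the @font-face rules so only ordinary style rules remain.
    rest = FONT_FACE_RE.sub("", css)
    used: set[str] = set()
    for value in FAMILY_RE.findall(rest):
        used |= _families(value)

    return declared - used
```

Counting how many declared families end up in the returned set across a sample of sites would give a first estimate of how noisy a pure @font-face scan is.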

davelab6 commented 1 month ago

@bramstein do you know the latest status on this? cc @charlesberret