Expand third-party classification to include unrecognized entities

alexnj commented 2 years ago

The Lighthouse Third-Party Summary audit currently drops third-party origins that have no entity match in the third-party-web library. As a side effect, several third-party origins with high transfer size, long main-thread blocking time, or both are missing from the audit. For example, if we audit the theverge.com homepage, blismedia.com, narrativ.com, and vercel-insights.com are not part of the third-party summary. Numerous (newer) TikTok CDNs also drop out despite their large transfer size, because the data set has not been updated with the origin's recent changes. third-party-web defines a clear process for updating this data, but it is a manual operation. It is a disconnected process and could perhaps be improved if we model it as a feedback loop.

Granted, this issue is not severe when the same asset is flagged for transfer size or execution time by the other audits that focus on those. From an issue-detection perspective, the user might still find the third-party impact there. From a third-party classification perspective, however, this could be improved.

To show an example of a third party that blocks the main thread for a long time but gets excluded from the third-party summary, here's a test case:

http://lighthouse-thirdparty-mtb.alexnjose.com/ is a first party that integrates mtb-thirdparty.surge.sh
mtb-thirdparty.surge.sh blocks the main thread for 2000+ ms.

Currently, this is what the summary audit of this test-case origin would produce (and it passes):

where it should've been the following (and fail due to high blocking time):

Proposal

I think one option we could pursue is to make up a third-party entity based on the root-level domain instead of dropping unrecognized third-party domains. This should be fairly straightforward: maintain an in-memory lookup table of the entities made up during the audit, and keep them compatible with the IEntity interface exposed by third-party-web.
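A minimal sketch of that fallback, under stated assumptions: `EntityLike` is a hypothetical subset of the entity shape (the real IEntity has more fields), `isUnrecognized` is a made-up flag, `lookupKnown` stands in for whatever lookup third-party-web provides, and `naiveRootDomain` just keeps the last two hostname labels, which is wrong for suffixes like `co.uk` and would need a public-suffix-aware replacement.

```typescript
// Hypothetical subset of an entity; not the actual IEntity definition.
interface EntityLike {
  name: string;
  domains: string[];
  isUnrecognized?: boolean; // assumed flag, for illustration only
}

// Naive root-domain guess: keeps the last two labels of the hostname.
// A real implementation would need a public-suffix list.
function naiveRootDomain(hostname: string): string {
  return hostname.split(".").slice(-2).join(".");
}

// Returns a classifier that falls back to a made-up entity per root domain.
function createClassifier(
  lookupKnown: (hostname: string) => EntityLike | undefined
): (url: string) => EntityLike {
  // In-memory table of entities made up during this audit run.
  const madeUp = new Map<string, EntityLike>();

  return function classify(url: string): EntityLike {
    const hostname = new URL(url).hostname;

    // Prefer an entity that is already recognized.
    const known = lookupKnown(hostname);
    if (known) return known;

    // Otherwise reuse (or create) one entity per root domain.
    const root = naiveRootDomain(hostname);
    let entity = madeUp.get(root);
    if (!entity) {
      entity = { name: root, domains: [`*.${root}`], isUnrecognized: true };
      madeUp.set(root, entity);
    }
    return entity;
  };
}
```

Keying the table on the root domain means two hosts under the same unrecognized domain share one entity, so their transfer size and blocking time would be summed in one row rather than split across several.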

The drawback with this approach is that already-recognized entities get duplicated under their new, unrecognized origins. TikTok is an example of a recognized entity: a duplicate entry would be created for each CDN host that's not recognized (examples today: ttwstatic.com, tiktokcdn-us.com, etc.). This could be improved further as below:

Closing the loop with third-party-web

One option to reintegrate these unrecognized entities is to help the user contribute back to third-party-web. We could mark up the unrecognized links in the report, as below:

We could use a GitHub issue-creation link to automatically fill in the required title and metadata.
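As a rough sketch of such a link: GitHub's new-issue page accepts `title` and `body` query parameters, so the report could build the URL per unrecognized domain. The function name and the title and body wording here are hypothetical, and the sketch assumes issues go to the patrickhulce/third-party-web repository.

```typescript
// Assumption: issues are filed against the third-party-web repository.
const NEW_ISSUE_URL = "https://github.com/patrickhulce/third-party-web/issues/new";

// Builds a prefilled issue link for one unrecognized domain.
// The title and body text are placeholders, not an agreed-upon template.
function buildIssueLink(domain: string, exampleUrl: string): string {
  const url = new URL(NEW_ISSUE_URL);
  url.searchParams.set("title", `Add entity for ${domain}`);
  url.searchParams.set(
    "body",
    `Domain: ${domain}\nExample request: ${exampleUrl}`
  );
  return url.toString();
}
```

Using `URLSearchParams` rather than string concatenation keeps the domain and example URL correctly percent-encoded.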

brendankenny commented 2 years ago

For the "closing the loop" section, I did an HTTP Archive query based on the LHRs rather than the raw requests to get a sense of what kind of reports we would likely see when enabling those links.

The query looked for cross-origin network requests identified by the protocol as resourceType Script (with mainThreadTime for each script joined if available in the bootup-time audit), then eliminated any that have a known third-party-web entity as of yesterday's 0.20.2 release.

This listing then groups by domain and requires at least 50 occurrences, but we could set other thresholds (e.g. blocking time, transfer size) or group by NET.REG_DOMAIN or TLD+x or whatever.
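The grouping step can be approximated outside of SQL as well. This is only a sketch of the idea, not the query that was run: `ScriptRequest`, `DomainSummary`, `summarizeUncovered`, and the `isKnown` callback are hypothetical names, and it groups by full hostname with an occurrence threshold only.

```typescript
// Hypothetical input row: one cross-origin script request from a report.
interface ScriptRequest {
  url: string;
}

interface DomainSummary {
  domain: string;
  occurrences: number;
}

// Counts requests per hostname, skipping hostnames that already map to a
// known entity, and keeps only hostnames seen at least minOccurrences times.
function summarizeUncovered(
  requests: ScriptRequest[],
  isKnown: (hostname: string) => boolean,
  minOccurrences = 50
): DomainSummary[] {
  const counts = new Map<string, number>();
  for (const request of requests) {
    const hostname = new URL(request.url).hostname;
    if (isKnown(hostname)) continue;
    counts.set(hostname, (counts.get(hostname) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, occurrences]) => occurrences >= minOccurrences)
    .map(([domain, occurrences]) => ({ domain, occurrences }))
    .sort((a, b) => b.occurrences - a.occurrences);
}
```

Swapping the hostname for a registrable-domain key, or filtering on blocking time or transfer size instead of occurrences, would only change the grouping key and the `filter` step.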

2022_09_mobile: third-party origins not covered by third-party-web

(note that mainThreadTime is floored to 0 if < 50ms by the bootup-time audit, and "median" is actually "median of medians" just to give a ballpark)

There are definitely a lot of CDNs associated with particular sites in the top results that could be added (e.g. static.wikia.nocookie.net with fandom wikis, quoracdn.net for quora subdomains).

alexnj commented 2 years ago

Looping in @patrickhulce

patrickhulce commented 2 years ago

Love the concept of grouping cross-origin domain work in this audit and providing a prompt to file with third-party-web ❤️ 😃

Manual whack-a-mole worked quite well when I had bandwidth to run and investigate every month, but the web changes fast :)

paulirish commented 1 year ago

Mostly complete. The remaining bit is under "Closing the loop with third-party-web"

GoogleChrome / lighthouse

Expand third-party classification to include unrecognized entities #14440