HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents
Apache License 2.0

[Cookies Chapter 2024] Adding Privacy Sandbox well-known files #127

Closed yohhaan closed 3 months ago

yohhaan commented 3 months ago

This pull request modifies the well-known.js custom metric to parse two well-known files related to Google's Privacy Sandbox: /.well-known/related-website-set.json and /.well-known/privacy-sandbox-attestations.json.


Test websites:

github-actions[bot] commented 3 months ago
Custom metrics for https://almanac.httparchive.org/en/2022/
WPT test run results: http://webpagetest.httparchive.org/results.php?test=240603_5D_G
Custom metrics for https://mercadolibre.com
WPT test run results: http://webpagetest.httparchive.org/results.php?test=240603_G6_H
Changed custom metrics values:
```json
{
  "_well-known": {
    "/.well-known/assetlinks.json": { "found": true },
    "/.well-known/apple-app-site-association": { "found": true },
    "/.well-known/related-website-set.json": { "found": true },
    "/.well-known/privacy-sandbox-attestations.json": { "found": false },
    "/.well-known/gpc.json": { "found": false },
    "/robots.txt": {
      "found": true,
      "data": { "matched_disallows": {} }
    },
    "/.well-known/security.txt": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://mercadolibre.com/.well-known/security.txt",
        "signed": false
      }
    },
    "/.well-known/change-password": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://mercadolibre.com/.well-known/change-password"
      }
    },
    "/.well-known/resource-that-should-not-exist-whose-status-code-should-not-be-200/": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://mercadolibre.com/.well-known/resource-that-should-not-exist-whose-status-code-should-not-be-200/"
      }
    }
  }
}
```
Custom metrics for https://www.media.net
WPT test run results: http://webpagetest.httparchive.org/results.php?test=240603_1C_J
Changed custom metrics values:
```json
{
  "_well-known": {
    "/.well-known/assetlinks.json": { "found": false },
    "/.well-known/apple-app-site-association": { "found": false },
    "/.well-known/related-website-set.json": { "found": false },
    "/.well-known/privacy-sandbox-attestations.json": { "found": true },
    "/.well-known/gpc.json": { "found": false },
    "/robots.txt": {
      "found": true,
      "data": { "matched_disallows": {} }
    },
    "/.well-known/security.txt": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://www.media.net/.well-known/security.txt",
        "signed": false
      }
    },
    "/.well-known/change-password": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://www.media.net/.well-known/change-password"
      }
    },
    "/.well-known/resource-that-should-not-exist-whose-status-code-should-not-be-200/": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://www.media.net/.well-known/resource-that-should-not-exist-whose-status-code-should-not-be-200/"
      }
    }
  }
}
```
Custom metrics for https://yu.ru
WPT test run results: http://webpagetest.httparchive.org/results.php?test=240603_F7_K
Changed custom metrics values:
```json
{
  "_well-known": {
    "/.well-known/assetlinks.json": { "found": false },
    "/.well-known/apple-app-site-association": { "found": false },
    "/.well-known/related-website-set.json": { "found": false },
    "/.well-known/privacy-sandbox-attestations.json": { "found": false },
    "/.well-known/gpc.json": { "found": false },
    "/robots.txt": { "found": false },
    "/.well-known/security.txt": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://yu.ru/.well-known/security.txt",
        "signed": false
      }
    },
    "/.well-known/change-password": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://yu.ru/.well-known/change-password"
      }
    },
    "/.well-known/resource-that-should-not-exist-whose-status-code-should-not-be-200/": {
      "found": false,
      "data": {
        "status": 404,
        "redirected": false,
        "url": "https://yu.ru/.well-known/resource-that-should-not-exist-whose-status-code-should-not-be-200/"
      }
    }
  }
}
```
yohhaan commented 3 months ago

I meant https://ya.ru/ (not https://yu.ru) for the automated tests.

yohhaan commented 3 months ago

Good point @tunetheweb

The current plan is to use the HTTP Archive crawl to get a list of domains that potentially serve one of these 2 files. I would then use a custom crawler (which I already have) to check whether the detected JSON files comply with the expected JSON schemas, and run further analyses on the valid files.
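As a rough illustration of the first step, the `found` flags in the `_well-known` metric output (as in the test results above) could be filtered down to the two Privacy Sandbox files. This is only a sketch: the helper name `privacySandboxCandidates` is hypothetical, and the input shape is assumed to match the test output in this thread.

```javascript
// Paths of the two Privacy Sandbox well-known files discussed in this PR.
const PRIVACY_SANDBOX_PATHS = [
  "/.well-known/related-website-set.json",
  "/.well-known/privacy-sandbox-attestations.json",
];

// Hypothetical helper: given the "_well-known" object for one page,
// return the Privacy Sandbox paths whose "found" flag is true.
function privacySandboxCandidates(wellKnown) {
  return PRIVACY_SANDBOX_PATHS.filter(
    (path) => wellKnown?.[path]?.found === true
  );
}

// Example using a subset of the values reported for https://www.media.net.
const mediaNet = {
  "/.well-known/related-website-set.json": { found: false },
  "/.well-known/privacy-sandbox-attestations.json": { found: true },
  "/.well-known/gpc.json": { found: false },
};
console.log(privacySandboxCandidates(mediaNet));
```

A page with a non-empty result would go on the list handed to the follow-up crawler; `found` alone says nothing about whether the file is valid.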

I would be happy to do this parsing and JSON validation directly in the custom metric, but I would need to call a JSON schema validator such as the Ajv library, and I am not sure how to go about it. It is unclear to me whether, and how, I can install additional dependencies that these custom metrics would have access to. (I am also working on another metric where this would be useful, as I would need the Public Suffix List to extract the eTLD+1 of hostnames.)
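If no external validator can be loaded, a small hand-written structural check might cover part of the need inside the metric itself. The sketch below is not a JSON Schema validator and does not replace Ajv. The field names (`primary`, `associatedSites`, `serviceSites`) are an assumption about the Related Website Sets format and should be checked against the published schema; `checkRelatedWebsiteSet` and `isHttpsOrigin` are hypothetical names.

```javascript
// Returns true if value is a string that parses as an https:// URL.
function isHttpsOrigin(value) {
  if (typeof value !== "string") return false;
  try {
    return new URL(value).protocol === "https:";
  } catch (e) {
    return false;
  }
}

// Hypothetical structural check for the body of related-website-set.json.
// ASSUMPTION: field names follow the Related Website Sets submission format.
function checkRelatedWebsiteSet(text) {
  let json;
  try {
    json = JSON.parse(text);
  } catch (e) {
    return { valid: false, reason: "not JSON" };
  }
  if (json === null || typeof json !== "object" || Array.isArray(json)) {
    return { valid: false, reason: "not an object" };
  }
  if (!isHttpsOrigin(json.primary)) {
    return { valid: false, reason: "missing or invalid primary" };
  }
  // Optional site lists must be arrays of https:// URLs when present.
  for (const key of ["associatedSites", "serviceSites"]) {
    if (key in json) {
      const list = json[key];
      if (!Array.isArray(list) || !list.every(isHttpsOrigin)) {
        return { valid: false, reason: `invalid ${key}` };
      }
    }
  }
  return { valid: true, reason: null };
}
```

A check like this would at least separate real JSON files from HTML error pages served with a 200 status, leaving full schema validation to the post-crawl step.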

max-ostapenko commented 3 months ago

@yohhaan My understanding from the documentation is that Chrome consumes the list of submitted and validated domains, so I thought no additional scanning would be required. Is that not the case?

max-ostapenko commented 3 months ago

FYI: Privacy Sandbox attestations are implemented as a simple check in the Privacy chapter PR, as part of the privacy-sandbox custom metric.

yohhaan commented 3 months ago

@max-ostapenko:

Related Website Set: this is indeed supposed to be the "canonical" list consumed by Chrome, but in practice some domains are not listed there... As an example: https://google.com/.well-known/related-website-set.json

I would like to use the HTTP Archive crawl to find out which websites may serve this file, and then do further post-analysis to check whether they are in that "canonical" list, whether the file they host is exactly the same as the one published, etc.
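The post-analysis described here could look roughly like the following. It assumes the canonical list is a JSON object with a `sets` array whose entries carry a `primary` field; that shape and all helper names are assumptions for illustration. Arrays are compared as unordered sets, which seems reasonable for lists of sites but is a design choice, not a requirement of the format.

```javascript
// Recursively sort object keys and array elements so that two JSON values
// that differ only in ordering serialise to the same string.
function canonicalize(value) {
  if (Array.isArray(value)) {
    return value.map(canonicalize).sort((a, b) => {
      const sa = JSON.stringify(a);
      const sb = JSON.stringify(b);
      return sa < sb ? -1 : sa > sb ? 1 : 0;
    });
  }
  if (value !== null && typeof value === "object") {
    return Object.keys(value).sort().reduce((out, key) => {
      out[key] = canonicalize(value[key]);
      return out;
    }, {});
  }
  return value;
}

// Order-insensitive equality of two JSON values.
function sameJson(a, b) {
  return JSON.stringify(canonicalize(a)) === JSON.stringify(canonicalize(b));
}

// ASSUMPTION: canonicalList looks like { sets: [ { primary, ... }, ... ] }.
function findCanonicalSet(canonicalList, primary) {
  const sets = canonicalList.sets || [];
  return sets.find((set) => set.primary === primary) || null;
}

// Report whether the hosted file's primary is listed, and whether the
// hosted file matches the listed entry exactly (ignoring ordering).
function compareWithCanonical(canonicalList, hostedSet) {
  const listed = findCanonicalSet(canonicalList, hostedSet.primary);
  return {
    inCanonicalList: listed !== null,
    identical: listed !== null && sameJson(listed, hostedSet),
  };
}
```

A site whose file is found by the crawl but yields `inCanonicalList: false` would be one of the unlisted cases mentioned above.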

Attestation: yes, I saw the Privacy chapter PR (I am currently working on adding detection of other Privacy Sandbox APIs, based on the privacy-sandbox.js metric proposed by @Yash-Vekaria).

Checking in well-known.js and privacy-sandbox.js would actually be complementary as I see it: