HTTPArchive / custom-metrics

Custom metrics to use with WebPageTest agents
Apache License 2.0

Add custom metric listing the cookie store #116

Closed rviscomi closed 5 months ago

rviscomi commented 5 months ago

Progress on #112

Adds a custom metric to dump the contents of the cookie jar


Test websites:

rviscomi commented 5 months ago

@nrllh I've parsed out the httpOnly cookies from the response headers and merged them in the cookie array.

Here's a more complex example using the NY Times site: https://www.webpagetest.org/result/240409_BiDcJS_79A/1/details/

Results ```json { "allCookies": [ { "domain": "nytimes.com", "expires": 1744235576387.214, "name": "nyt-a", "partitioned": false, "path": "/", "sameSite": "none", "secure": true, "value": "kvLPILWWcZksweQIzek_84" }, { "domain": "nytimes.com", "expires": 1712721175988.096, "name": "nyt-gdpr", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "0" }, { "domain": "nytimes.com", "expires": 1744235560009.721, "name": "nyt-purr", "partitioned": false, "path": "/", "sameSite": "lax", "secure": true, "value": "cfshcfhshckfhdfshg" }, { "domain": "nytimes.com", "expires": 1712721160009.771, "name": "nyt-geo", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "US" }, { "domain": "nytimes.com", "expires": null, "name": "nyt-b3-traceid", "partitioned": false, "path": "/", "sameSite": "none", "secure": true, "value": "b80ea2fae43e46d7bf6a5e6173f731f5" }, { "domain": "nytimes.com", "expires": 1744235563215.201, "name": "nyt-jkidd", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "uid=0&lastRequest=1712699563194&activeDays=%5B0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C0%2C1%5D&adv=1&a7dv=1&a14dv=1&a21dv=1&lastKnownType=anon&newsStartDate=&entitlements=" }, { "domain": "nytimes.com", "expires": 1746395564000, "name": "__gads", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "ID=33af9a77d42a2448:T=1712699564:RT=1712699564:S=ALNI_MY9bOtga7JgSrtENBENRDaCC0Ofmg" }, { "domain": "nytimes.com", "expires": 1746395564000, "name": "__gpi", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "UID=00000ddc29532380:T=1712699564:RT=1712699564:S=ALNI_MaMDmuaf76zbnfw8SaKWQfJOUcDfg" }, { "domain": "nytimes.com", "expires": 1728251564000, "name": "__eoi", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": 
"ID=2c19d8ca5b1def9b:T=1712699564:RT=1712699564:S=AA-AfjbRnKD9hK_1SIu0f71CtK_z" }, { "domain": "www.nytimes.com", "expires": 1744235565280.481, "name": "datadome", "partitioned": false, "path": "/", "sameSite": "lax", "secure": true, "value": "PNzEPCDoc4b4D8LTHJ6zY7_iMlVP0gdQSv1W0EzrOPjnyGzVzWt1V2ddSDVIaJVGdFOuStnuFn6T_3w~26ZXOyn2cKdgolQMrE23wFzFcQfvxdrS_ZzKDgorih1ghW88" }, { "domain": "nytimes.com", "expires": 1746827567000, "name": "_cb", "partitioned": false, "path": "/", "sameSite": "lax", "secure": true, "value": "DcFyVCDKsokJChqbCb" }, { "domain": "nytimes.com", "expires": 1746827567000, "name": "_chartbeat2", "partitioned": false, "path": "/", "sameSite": "lax", "secure": true, "value": ".1712699567412.1712699567412.1.BcyRhCDzBEoEB9oAZHCkOBnXDCekYY.1" }, { "domain": "nytimes.com", "expires": 1712701367000, "name": "_cb_svref", "partitioned": false, "path": "/", "sameSite": "lax", "secure": true, "value": "external" }, { "domain": "nytimes.com", "expires": 1746827567000, "name": "_v__chartbeat3", "partitioned": false, "path": "/", "sameSite": "lax", "secure": true, "value": "fr4anC70D3oCCZb-a" }, { "domain": null, "expires": 1712785968521.009, "name": "_lr_geo_location_state", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "" }, { "domain": null, "expires": 1712785968522.1099, "name": "_lr_geo_location", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "US" }, { "domain": "nytimes.com", "expires": 1720475568000, "name": "_gcl_au", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "1.1.887181119.1712699569" }, { "domain": "nytimes.com", "expires": 1747259570204.0898, "name": "iter_id", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJhaWQiOiI2NjE1YjhiMmQzYjI0ZDAwMDEwMTk5MzQiLCJjb21wYW55X2lkIjoiNWMwOThiM2QxNjU0YzEwMDAxMmM2OGY5IiwiaWF0IjoxNzEyNjk5NTcwfQ.Xb-H1ZDdnwQ7LWj5UaMMlaXNrt5PPtpvKc-fadgoHto" 
}, { "domain": null, "expires": 1712700479000, "name": "_dd_s", "partitioned": false, "path": "/", "sameSite": "none", "secure": true, "value": "rum=0&expire=1712700460858" }, { "name": "receive-cookie-deprecation", "value": "1", "expires": 1744235580443, "domain": ".openx.net", "path": "/", "sameSite": "none", "httpOnly": true, "secure": true, "partitioned": true }, { "name": "receive-cookie-deprecation", "value": "1", "expires": 1720475580443, "domain": ".3lift.com", "path": "/", "sameSite": "none", "httpOnly": true, "secure": true, "partitioned": true }, { "name": "receive-cookie-deprecation", "value": "1", "expires": 2027195580443, "domain": ".adnxs.com", "path": "/", "sameSite": "none", "httpOnly": true, "secure": true, "partitioned": true }, { "name": "receive-cookie-deprecation", "value": "1", "expires": 1744235580443, "domain": "casalemedia.com", "path": "/", "sameSite": "none", "httpOnly": true, "secure": true, "partitioned": true }, { "name": "purr-cache", "value": "
```

rviscomi commented 5 months ago

Done

github-actions[bot] commented 5 months ago
Custom metrics for https://almanac.httparchive.org/en/2022/ WPT test run results: http://webpagetest.httparchive.org/results.php?test=240409_9Z_S
Custom metrics for https://example.com/ WPT test run results: http://webpagetest.httparchive.org/results.php?test=240409_SN_T

Changed custom metrics values:

```json
{
  "_cms": {
    "wordpress": {
      "block_theme": false,
      "has_embed_block": false,
      "embed_block_count": {
        "total": 0,
        "total_by_type": []
      },
      "scripts": [],
      "content_type": {
        "template": "unknown",
        "post_type": "",
        "taxonomy": ""
      }
    }
  },
  "_cookies": {
    "allCookies": []
  }
}
```
Custom metrics for https://web.dev/ WPT test run results: http://webpagetest.httparchive.org/results.php?test=240409_3D_V

Changed custom metrics values:

```json
{
  "_cms": {
    "wordpress": {
      "block_theme": false,
      "has_embed_block": false,
      "embed_block_count": {
        "total": 0,
        "total_by_type": []
      },
      "scripts": [],
      "content_type": {
        "template": "unknown",
        "post_type": "",
        "taxonomy": ""
      }
    }
  },
  "_cookies": {
    "allCookies": [
      { "domain": null, "expires": 1747266788519.529, "name": "_ga_devsite", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "GA1.2.3297890470.1712706776", "httpOnly": false },
      { "domain": null, "expires": 1746402781000, "name": "cookies_accepted", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "true", "httpOnly": false },
      { "domain": null, "expires": 1728258781000, "name": "django_language", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "en", "httpOnly": false },
      { "domain": "web.dev", "expires": 1747266782465.985, "name": "_ga", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "GA1.1.524183990.1712706782", "httpOnly": false },
      { "domain": "web.dev", "expires": 1747266782530.719, "name": "_ga_18JR3Q8PJ8", "partitioned": false, "path": "/", "sameSite": "lax", "secure": false, "value": "GS1.1.1712706782.1.1.1712706782.0.0.0", "httpOnly": false }
    ]
  }
}
```
nrllh commented 5 months ago

Just seeing it; @rviscomi do you know why some domain fields are null?

rviscomi commented 5 months ago

Domain is optional according to the docs. When it's omitted, the cookie's domain should fall back to the host of the current document URL, but I guess the literal value in the cookie store stays null.
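
For anyone post-processing the output, here is a minimal sketch of filling in the missing domains. It assumes a null `domain` means the cookie was set without a Domain attribute, as described above. `withEffectiveDomain`, `pageUrl`, and the added `hostOnly` flag are hypothetical names, not part of the metric.

```javascript
// Hypothetical helper: replace a null `domain` with the page's hostname.
// `pageUrl` is an assumed input; inside a custom metric it could be location.href.
function withEffectiveDomain(cookies, pageUrl) {
  const host = new URL(pageUrl).hostname;
  return cookies.map(cookie => ({
    ...cookie,
    // Record whether the domain was originally null so the information is not lost.
    hostOnly: cookie.domain === null,
    domain: cookie.domain ?? host,
  }));
}

// Two entries shaped like the web.dev results above (other fields omitted).
const sampleCookies = [
  { name: '_ga_devsite', domain: null },
  { name: '_ga', domain: 'web.dev' },
];
const normalized = withEffectiveDomain(sampleCookies, 'https://web.dev/');
console.log(normalized);
```

Keeping the `hostOnly` flag means an analysis can still tell cookies that omitted Domain apart from those that set it explicitly.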

bstandaert-wustl commented 3 months ago

We're interested in using this metric as part of the privacy 2024 almanac chapter (https://docs.google.com/document/d/1WJT9kfKHxwNl5HNAhIddefWC0vCvH3DUIxajT1gQWao/edit). However, we've noticed that for cookies set through JS, this metric only captures first-party cookies - third-party JS cookies set in an iframe aren't included, because cookieStore only returns first-party cookies.

@nrllh We're thinking that to capture all cookies, we would need to inject part of the custom metric code into third-party iframes to read the cookie store there. Do you know if that's possible, or whether we could extend the crawler to do that?

pmeenan commented 3 months ago

If it will help, it would be trivial for the agent to pull all of the cookies directly from dev tools network or storage interface and report it as part of the page data.

bstandaert-wustl commented 3 months ago

@pmeenan That would be helpful! Do you have an example of how that could be implemented?

pmeenan commented 3 months ago

It would be a raw dump of the dev tools output into the HAR. That might blow up the size of the page data though (raising the cost of all queries), so I'd want @rviscomi to weigh in. I could also do some level of processing to make it look more like the cookies custom metric output if size is a concern.

Currently, something like this:

                "_storage": {
                    "cookies": [
                        {
                            "name": "thirdparty",
                            "value": "yes",
                            "domain": "widgets.outbrain.com",
                            "path": "/nanoWidget/externals/cookie",
                            "expires": 1716142633.634125,
                            "size": 13,
                            "httpOnly": false,
                            "secure": true,
                            "session": false,
                            "sameSite": "None",
                            "priority": "Medium",
                            "sameParty": false,
                            "sourceScheme": "Secure",
                            "sourcePort": 443
                        },
                        {
                            "name": "sync",
                            "value": "CgoIoQEQtvztjvkxCgoI5gEQtvztjvkxCgoIhwIQtvztjvkxCgoItwIQtvztjvkxCgkIOhC2_O2O-TEKCQgbELb87Y75MQoKCIwCELb87Y75MQoKCKwCELb87Y75MQoKCK0CELb87Y75MQoJCF8Qtvztjvkx",
                            "domain": ".3lift.com",
                            "path": "/sync",
                            "expires": 1723915032.393132,
                            "size": 160,
                            "httpOnly": false,
                            "secure": true,
                            "session": false,
                            "sameSite": "None",
                            "priority": "Medium",
                            "sameParty": false,
                            "sourceScheme": "Secure",
                            "sourcePort": 443
                        },
                        {
                            "name": "receive-cookie-deprecation",
                            "value": "1",
                            "domain": ".adnxs.com",
                            "path": "/",
                            "expires": 1750699073.955507,
                            "size": 27,
                            "httpOnly": true,
                            "secure": true,
                            "session": false,
                            "sameSite": "None",
                            "priority": "Medium",
                            "sameParty": false,
                            "sourceScheme": "Secure",
                            "sourcePort": 443,
                            "partitionKey": "https://cnn.com"
                        }
                    ]
                },
pmeenan commented 3 months ago

The other option would be to make the cookie data available to custom metrics in their raw form (say, as $WPT_COOKIES) and let the cookies custom metric do whatever processing it wants (and use it as the source of the data).

bstandaert-wustl commented 3 months ago

The current custom metric's output is pretty long already - will that turn into an issue?

I think the only data we actually need is a list of top cookie names and domains setting cookies (perhaps split into tracking/non-tracking) - if we had $WPT_COOKIES, we could write a more tailored metric whose output would be pretty small.
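
A rough sketch of what such a tailored metric might look like, assuming the cookie array has `name` and `domain` fields as in the sample above. `summarizeCookies` and `topN` are hypothetical names, and the tracking/non-tracking split is left out.

```javascript
// Sketch only: count cookies per name and per domain, keeping the most frequent.
function summarizeCookies(cookies, topN = 10) {
  const list = cookies ?? [];
  const countBy = key => {
    const counts = new Map();
    for (const cookie of list) {
      counts.set(cookie[key], (counts.get(cookie[key]) ?? 0) + 1);
    }
    return [...counts.entries()]
      .sort((a, b) => b[1] - a[1]) // most frequent first
      .slice(0, topN)
      .map(([value, count]) => ({ value, count }));
  };
  return { total: list.length, names: countBy('name'), domains: countBy('domain') };
}

// Illustrative input using cookie names and domains that appear in this thread.
const exampleCookies = [
  { name: 'receive-cookie-deprecation', domain: '.openx.net' },
  { name: 'receive-cookie-deprecation', domain: '.3lift.com' },
  { name: 'receive-cookie-deprecation', domain: '.adnxs.com' },
  { name: 'sync', domain: '.3lift.com' },
];
const summary = summarizeCookies(exampleCookies);
console.log(JSON.stringify(summary, null, 2));

// In a custom metric, the body would presumably end with:
//   return summarizeCookies($WPT_COOKIES);
```

The output size is bounded by `topN` rather than by the number of cookies on the page, which is the point of summarizing.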

pmeenan commented 3 months ago

Done. $WPT_COOKIES is now available to custom metrics scripts and is an array that is a raw dump of the dev tools output.

It's worth noting that it is only available on the httparchive instance (webpagetest.httparchive.org) and as part of the HTTP Archive crawl and won't work on the public WebPageTest.

bstandaert-wustl commented 3 months ago

@pmeenan Awesome, thanks! Is there any other way to test the metric implementation, or should I just assume that the output is identical to the cookies array in your sample above?

pmeenan commented 3 months ago

You can test it with arbitrary custom metrics on https://webpagetest.httparchive.org/ or the PR activity for custom metrics will do it automatically.

I tested it manually with a metric that simply was:

[pat]
return $WPT_COOKIES;

Just to make sure it was working.

pmeenan commented 3 months ago

If I were building a new custom metric, I'd use something like the above to extract the value for a site (like cnn.com) and then write the custom metric in JS, testing locally with the value that came back from the one test and then when it is working the way I want, swap the test data back to $WPT_COOKIES and try it out as a custom metric.
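
That workflow could be sketched roughly as follows. `cookiesMetric` and `testData` are hypothetical names, the sample entry is the first cookie from the dump earlier in this thread, and it assumes a custom metric body can define a helper function before returning.

```javascript
// Metric logic written as a function of the cookie array so it can run locally.
function cookiesMetric(cookies) {
  return cookies?.map(({ name, domain }) => ({ name, domain }));
}

// Local test data: paste the array that a `return $WPT_COOKIES;` test run returned.
const testData = [
  {
    name: 'thirdparty',
    value: 'yes',
    domain: 'widgets.outbrain.com',
    path: '/nanoWidget/externals/cookie',
    httpOnly: false,
    secure: true,
  },
];

const localOutput = cookiesMetric(testData);
console.log(JSON.stringify(localOutput, null, 2));

// Once the output looks right, swap the test data for the real source.
// The custom metric body would then end with:
//   return cookiesMetric($WPT_COOKIES);
```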

bstandaert-wustl commented 3 months ago

@pmeenan @rviscomi Is anything relying on the cookies custom metric currently? Given Patrick's earlier comment:

> That might blow up the size of the page data though

It seems like that would be an issue with the current metric as well, but my understanding is that there hasn't been a large-scale crawl since this was merged, so we don't know whether it's a problem.

I'm proposing to replace the cookies metric with this:

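// [cookies]

// Keep only the fields needed for the analysis.
// `?.` yields undefined if $WPT_COOKIES is null or undefined.
return $WPT_COOKIES?.map(cookie => {
  const {name, value, domain, sameParty} = cookie;
  return {name, value, domain, sameParty};
});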

Which would capture more cookies but have a smaller total size. It would be sufficient for the privacy almanac chapter work, but I don't want to remove anything that other analyses depend on.

rviscomi commented 3 months ago

What if we omitted value?

bstandaert-wustl commented 3 months ago

@rviscomi That's fine with me. So you are OK with doing the above, but with just name, domain, sameParty?

nrllh commented 3 months ago

I would like to keep the structure shared in this comment, if the output size is not a significant issue. Other fields are essential for many potential future studies.

bstandaert-wustl commented 3 months ago

@nrllh OK, good to know. @pmeenan Is there any specific threshold or criterion for how large a data size we're willing to accept?

pmeenan commented 3 months ago

There's not really a hard limit, just side effects, so my ask would be to make sure you REALLY need everything you are collecting and, if it makes sense, to have protections in place to make sure it doesn't explode. We already have a couple of metrics that have that problem (rendered html and some of the CSS metrics - combine those with inline base64-encoded fonts and...).

Each page record has a 10MB limit, which includes the summary stats, Lighthouse data, and custom metrics. Records used to be dropped if they exceeded the limit, but we recently put in protection that drops parts of the data to get the record under the needed size (starting with individual custom metrics that are over 100k).

Probably more of a concern is that the size of the data directly impacts everyone's query costs for querying any custom metrics since they are all in the same column.

Neither is a hard limit but just things we should take into account when adding metrics that might explode in edge cases.
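
One possible protection, sketched under assumptions: cap the serialized size of the metric's output before returning it. `capOutput` and `MAX_BYTES` are hypothetical names. The 100k figure is borrowed from the comment above, and treating it as 100,000 bytes of UTF-8 JSON is an assumption about how the pipeline measures size.

```javascript
// Assumed threshold, loosely based on the "over 100k" figure mentioned above.
const MAX_BYTES = 100 * 1000;

// Keep items in order until the estimated JSON size would exceed maxBytes.
function capOutput(items, maxBytes = MAX_BYTES) {
  const list = items ?? [];
  const encoder = new TextEncoder();
  const kept = [];
  let bytes = 2; // the enclosing "[" and "]"
  for (const item of list) {
    // +1 for a separating comma, which slightly overestimates the real size.
    const size = encoder.encode(JSON.stringify(item)).length + 1;
    if (bytes + size > maxBytes) break;
    kept.push(item);
    bytes += size;
  }
  return { items: kept, truncated: kept.length < list.length };
}

// Small demonstration with an artificially low limit.
const demoItems = [{ name: 'a' }, { name: 'b' }, { name: 'c' }];
const capped = capOutput(demoItems, 30);
console.log(capped);

// In a custom metric, this might wrap the mapped cookie array:
//   return capOutput($WPT_COOKIES?.map(({name, domain}) => ({name, domain})));
```

Returning a `truncated` flag lets later analysis tell pages that hit the cap apart from pages that simply set few cookies.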

yohhaan commented 3 months ago

Hello,

Thanks for exposing $WPT_COOKIES to custom metrics! For the cookies chapter of the almanac 2024, it would be very nice to have the full dump of all cookies. But, for the analyses we expect to run, we would need only the following properties:

// [cookies]

// Keep every field except value, priority, sourceScheme, and sourcePort.
return $WPT_COOKIES?.map(cookie => {
  const {name, domain, path, expires, size, httpOnly, secure, session, sameSite, sameParty, partitionKey, partitionKeyOpaque} = cookie;
  return {name, domain, path, expires, size, httpOnly, secure, session, sameSite, sameParty, partitionKey, partitionKeyOpaque};
});

i.e., without these properties: value, priority, sourceScheme, sourcePort

Thanks!