rviscomi closed 5 months ago
@nrllh I've parsed out the httpOnly cookies from the response headers and merged them into the cookie array.
Here's a more complex example using the NY Times site: https://www.webpagetest.org/result/240409_BiDcJS_79A/1/details/
Done
Just seeing this; @rviscomi, do you know why some domain fields are null?
Domain is optional according to the docs. The domain property of the cookie should fall back to the current document URL, but I guess the literal value in the cookie store stays null when Domain is omitted.
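For downstream processing, a null domain could be normalized back to the host of the page under test, matching the fallback behavior described above. A minimal sketch (the function name and shape are illustrative, not part of the crawler):

```javascript
// Hypothetical post-processing sketch: when a cookie's "domain" is null
// (i.e. the Set-Cookie header omitted the Domain attribute), substitute
// the host of the page that set it. A host-only cookie keeps exactly the
// setting host, with no leading dot.
function fillDefaultDomain(cookies, pageUrl) {
  const host = new URL(pageUrl).hostname;
  return cookies.map(cookie => ({
    ...cookie,
    domain: cookie.domain ?? host
  }));
}
```

For example, a cookie stored with `domain: null` on `https://example.com/page` would come back with `domain: 'example.com'`, while cookies with an explicit domain are left untouched.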
We're interested in using this metric as part of the privacy 2024 almanac chapter (https://docs.google.com/document/d/1WJT9kfKHxwNl5HNAhIddefWC0vCvH3DUIxajT1gQWao/edit). However, we've noticed that for cookies set through JS, this metric only captures first-party cookies: third-party JS cookies set in an iframe aren't included, because cookieStore only returns first-party cookies.
@nrllh We're thinking that to capture all cookies, we would need to inject part of the custom metric code into third-party iframes to read the cookie store there. Do you know if that's possible, or whether we could extend the crawler to do that?
@pmeenan That would be helpful! Do you have an example of how that could be implemented?
It would be a raw dump of the DevTools output into the HAR. That might blow up the size of the page data though (raising the cost of all queries), so I'd want @rviscomi to weigh in. I could also do some level of processing to make it look more like the cookies custom metric output if size is a concern.
Currently, something like this:
"_storage": {
"cookies": [
{
"name": "thirdparty",
"value": "yes",
"domain": "widgets.outbrain.com",
"path": "/nanoWidget/externals/cookie",
"expires": 1716142633.634125,
"size": 13,
"httpOnly": false,
"secure": true,
"session": false,
"sameSite": "None",
"priority": "Medium",
"sameParty": false,
"sourceScheme": "Secure",
"sourcePort": 443
},
{
"name": "sync",
"value": "CgoIoQEQtvztjvkxCgoI5gEQtvztjvkxCgoIhwIQtvztjvkxCgoItwIQtvztjvkxCgkIOhC2_O2O-TEKCQgbELb87Y75MQoKCIwCELb87Y75MQoKCKwCELb87Y75MQoKCK0CELb87Y75MQoJCF8Qtvztjvkx",
"domain": ".3lift.com",
"path": "/sync",
"expires": 1723915032.393132,
"size": 160,
"httpOnly": false,
"secure": true,
"session": false,
"sameSite": "None",
"priority": "Medium",
"sameParty": false,
"sourceScheme": "Secure",
"sourcePort": 443
},
{
"name": "receive-cookie-deprecation",
"value": "1",
"domain": ".adnxs.com",
"path": "/",
"expires": 1750699073.955507,
"size": 27,
"httpOnly": true,
"secure": true,
"session": false,
"sameSite": "None",
"priority": "Medium",
"sameParty": false,
"sourceScheme": "Secure",
"sourcePort": 443,
"partitionKey": "https://cnn.com"
}
]
},
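A dump in this shape makes third-party analysis fairly direct. A rough sketch, assuming the field names from the sample above (the domain-matching rule here is a simplification of real cookie-domain scoping):

```javascript
// Illustrative sketch: given the raw DevTools cookie dump, split cookies
// into first-party and third-party relative to the page's host. A cookie
// counts as first-party when its domain (minus any leading dot) equals
// the page host or is a parent domain of it.
function splitByParty(cookies, pageHost) {
  const isFirstParty = domain => {
    const d = domain.replace(/^\./, '');
    return pageHost === d || pageHost.endsWith('.' + d);
  };
  return {
    firstParty: cookies.filter(c => isFirstParty(c.domain)),
    thirdParty: cookies.filter(c => !isFirstParty(c.domain))
  };
}
```

With the sample above and a page host of `cnn.com`, all three cookies (`widgets.outbrain.com`, `.3lift.com`, `.adnxs.com`) land in the third-party bucket.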
The other option would be to make the cookie data available to custom metrics in its raw form (say, as $WPT_COOKIES) and let the cookies custom metric do whatever processing it wants (and use it as the source of the data).
The current custom metric's output is pretty long already - will that turn into an issue?
I think the only data we actually need is a list of top cookie names and domains setting cookies (perhaps split into tracking/non-tracking) - if we had $WPT_COOKIES, we could write a more tailored metric whose output would be pretty small.
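A tailored metric of that shape could be quite small. An illustrative sketch, with the logic written as a plain function so it can be exercised against sample data (in a deployed metric it would be fed $WPT_COOKIES):

```javascript
// Hypothetical tailored-metric sketch: reduce the raw cookie dump to the
// unique cookie names and the unique domains setting cookies, which is
// the small output discussed above.
function summarizeCookies(cookies) {
  return {
    names: [...new Set(cookies.map(c => c.name))],
    domains: [...new Set(cookies.map(c => c.domain))]
  };
}
```

The output size then scales with the number of distinct names and domains rather than with the full cookie contents.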
Done. $WPT_COOKIES is now available to custom metrics scripts and is an array that is a raw dump of the dev tools output.
It's worth noting that it is only available on the httparchive instance (webpagetest.httparchive.org) as part of the HTTP Archive crawl, and won't work on the public WebPageTest.
@pmeenan Awesome, thanks! Is there any other way to test the metric implementation, or should I just assume that the output is identical to the cookies array in your sample above?
You can test it with arbitrary custom metrics on https://webpagetest.httparchive.org/ or the PR activity for custom metrics will do it automatically.
I tested it manually with a metric that simply was:
[pat]
return $WPT_COOKIES;
Just to make sure it was working.
If I were building a new custom metric, I'd use something like the above to extract the value for a site (like cnn.com), then write the custom metric in JS, testing locally with the value that came back from the one test. When it's working the way I want, I'd swap the test data back to $WPT_COOKIES and try it out as a custom metric.
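That workflow might look roughly like this (a sketch; the sample dump and the function name are made up for illustration):

```javascript
// Sketch of the local-testing loop described above. The metric body is
// written as a plain function; during development it is fed a saved dump
// copied out of a one-off test run, and once it behaves as intended the
// call site is switched back to $WPT_COOKIES inside the custom metric.
function cookieMetric(cookies) {
  // Keep only the fields of interest, discarding values and other bulk.
  return cookies.map(({name, domain, sameParty}) => ({name, domain, sameParty}));
}

// Stand-in for data captured from a single webpagetest.httparchive.org run:
const sampleDump = [
  {name: 'sync', domain: '.3lift.com', sameParty: false, value: 'x', size: 160}
];
const result = cookieMetric(sampleDump);
// In the deployed metric this would be: return cookieMetric($WPT_COOKIES);
```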
@pmeenan @rviscomi Is anything relying on the cookies custom metric currently? Given Patrick's earlier comment:
That might blow up the size of the page data though
It seems like that would be an issue with the current metric as well, but my understanding is that there hasn't been a large-scale crawl since this was merged, so we don't know whether it's a problem.
I'm proposing to replace the cookies metric with this:
// [cookies]
return $WPT_COOKIES?.map(cookie => {
const {name, value, domain, sameParty} = cookie
return {name, value, domain, sameParty}
})
Which would capture more cookies but have a smaller total size. It would be sufficient for the privacy almanac chapter work, but I don't want to remove anything that other analyses depend on.
What if we omitted value?
@rviscomi That's fine with me. So you are OK with doing the above, but with just name, domain, sameParty?
I would like to keep the structure shared in this comment, if the output size is not a significant issue. For many potential further studies, the other fields are essential.
@nrllh OK, good to know. @pmeenan Is there any kind of specific threshold or criteria for how large of a data size we're willing to accept?
There's not really a hard limit, just side effects, so my ask would be to make sure you REALLY need everything you are collecting, and to consider whether it makes sense to have protections in place so it doesn't explode. We already have a couple of metrics with that problem (rendered HTML and some of the CSS metrics - combine those with inline base64-encoded fonts and...).
Each page record has a 10MB limit, which includes the summary stats, Lighthouse data and custom metrics. It used to be the case that records would be dropped if they exceeded the limit, but we recently put in protection to drop parts of the data to get the record under the needed size (starting by dropping individual custom metrics that are over 100k).
Probably more of a concern is that the size of the data directly impacts everyone's query costs for querying any custom metrics since they are all in the same column.
Neither is a hard limit but just things we should take into account when adding metrics that might explode in edge cases.
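One possible protection, assuming the roughly 100k per-metric drop threshold mentioned above (the budget constant and the fallback order are illustrative, not an existing API):

```javascript
// Defensive sketch: a custom metric can check its own serialized size and
// degrade gracefully rather than risk being dropped entirely once it
// exceeds the per-metric cutoff. String length approximates bytes for
// ASCII-dominated JSON output.
const METRIC_BYTE_BUDGET = 100 * 1024; // per-metric drop threshold cited above

function fitWithinBudget(cookies) {
  let out = cookies;
  if (JSON.stringify(out).length > METRIC_BYTE_BUDGET) {
    // First fallback: drop the (potentially large) value field.
    out = out.map(({value, ...rest}) => rest);
  }
  if (JSON.stringify(out).length > METRIC_BYTE_BUDGET) {
    // Last resort: keep only names and domains.
    out = out.map(({name, domain}) => ({name, domain}));
  }
  return out;
}
```

A small result passes through unchanged; only pathological pages trigger the trimming.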
Hello,
Thanks for exposing $WPT_COOKIES to custom metrics!
For the cookies chapter of the almanac 2024, it would be very nice to have the full dump of all cookies. But for the analyses we expect to run, we would need only the following properties:
// [cookies]
return $WPT_COOKIES?.map(cookie => {
const {name, domain, path, expires, size, httpOnly, secure, session, sameSite, sameParty, partitionKey, partitionKeyOpaque} = cookie
return {name, domain, path, expires, size, httpOnly, secure, session, sameSite, sameParty, partitionKey, partitionKeyOpaque}
})
i.e., without these properties: value, priority, sourceScheme, sourcePort
Thanks!
Progress on #112
Adds a custom metric to dump the contents of the cookie jar
Test websites: