Closed Yash-Vekaria closed 3 months ago
@Yash-Vekaria @max-ostapenko what's the latest on this PR as need to merge it today if we want it included in the Web Almanac 2024.
@Yash-Vekaria @max-ostapenko what's the latest on this PR as need to merge it today if we want it included in the Web Almanac 2024.
@tunetheweb Topics API is ready to merge. Protected Audience API and Attribution APIs have been correctly implemented but I don't have a good example that simulates all the cases, so partial response was recorded on the current test website. I'll let @yohhaan comment if he plans to incorporate any other APIs. @max-ostapenko informed me that he shall be taking a look at the code later today to finalize these APIs and perhaps remove stale code to make the PR ready to merge.
@tunetheweb let's merge it. I'll make my PR on top of main then
Hi, I am modifying at the moment the Topics API implementation (some errors in the current one), I will push shortly and add more details about the rest.
Can we add the following test sites to this PR for unit tests (I am still checking if header detection is working as expected)?
https://protected-audience-demo-publisher.web.app/ (Protected Audience)
Also, I am not sure the async check for the attestation is working/returning
This sort of thing is very verbose and needlessly increases the size of custom metrics for all 15 million sites (most of which likely don't use shared storage):
"sharedStorage": {
"shared-storage": true,
"shared-storage-select-url": true,
"append": [],
"clear": [],
"delete": [],
"set": [],
"run": [],
"selectURL": [],
"addModule": []
},
Can we exclude empty values in all the data in this custom metric?
I made a similar comment on https://github.com/HTTPArchive/custom-metrics/pull/125#pullrequestreview-2097390279 because if EVERY custom metric adds this sort of thing then the custom metric (already 8.32TB or $50 for a single query) will become too large for anyone to query it without taking out a mortgage.
Sounds good, I am on it
This sort of thing is very verbose and needlessly increases the size of custom metrics for all 15 million sites (most of which likely don't use shared storage):
"sharedStorage": { "shared-storage": true, "shared-storage-select-url": true, "append": [], "clear": [], "delete": [], "set": [], "run": [], "selectURL": [], "addModule": [] },
Can we exclude empty values in all the data in this custom metric?
I made a similar comment on #125 (review) because if EVERY custom metric adds this sort of thing then the custom metric (already 8.32TB or $50 for a single query) will become too large for anyone to query it without taking out a mortgage.
Per @tunetheweb 's comment: I changed the format of the result object to make it smaller if APIs are not called.
@Yash-Vekaria: can you see if you can port the ProtectedAudience and Attribution Reporting APIs to a similar format?
@yohhaan I couldn't find any way to verify cross-origin files existence. Fetch doesn't fail because of 404. Do you have any ideas?
Otherwise I would say we remove the attestation checks from custom metrics completely.
Same comment, do we need these all to be set to empty values or can we skip them when nothing of interest in them to save storage space/query costs?
"fencedFrames": [],
"floc": [],
"privateAggregation": [],
"privateStateTokens": [],
"protectedAudienceAPI": {
"interestGroups": {
"joinAdInterestGroup": [],
"leaveAdInterestGroup": [],
"updateAdInterestGroups": [],
"clearOriginJoinedAdInterestGroups": []
},
"runAdAuction": [],
"generateBid": [],
"scoreAd": [],
"reportWin": [],
"reportResult": []
},
"sharedStorage": [],
"storageAccess": [],
"topicsAPI": [],
"userAgentClientHints": []
I am done with the edits I wanted to do on this custom metric, the tests have started to return results as expected since I fixed the double escape issue for the regex.
If people have more cycles to commit, they could try to find websites to test on for each API and see that the calls are actually correctly detected. I have done a few tests + the ones from the automated GitHub actions and things look good to me for now. Also, note: some websites listed to test on require user interaction, thus the empty results sometimes.
Thanks to @Yash-Vekaria for all the help along the implementation of this custom metric and to @max-ostapenko as well!
Thanks @yohhaan and @max-ostapenko for bringing this to completion.
@tunetheweb All the changes are locked from my side as well. We are ready to merge this PR.
@max-ostapenko Extraction of attestation has been removed as it was working in WPT without Privacy Sandbox API capturing in WPT but not with it on live checks (some async/promise issue thats hard to debug due to different behavior). We have decided to analyse it in analysis phase and remove it in custom metrics altogether.
I still think that is a lot of extra data to store per URL when in a lot of cases there won’t be any different. Is there no way to only report if the defaults are different?
I tried adding https://www.example.com to see what it would produce and reran the test but it didn’t pick them up. But I’m presuming it would produce the same as the blank ones?
@tunetheweb One way to optimize further is that we remove all permission-policy
flags (as shown below) because now that I'm thinking more about it, they should always be true
(except for "top-level-storage-access"). Using command line activation fo Privacy Sandbox APIs for the browser being used makes these set to true
by default.
"permissionsPolicy": {
"attribution-reporting": true,
"browsing-topics": true,
"identity-credentials-get": true,
"interest-cohort": true,
"join-ad-interest-group": true,
"private-aggregation": true,
"run-ad-auction": true,
"shared-storage": true,
"shared-storage-select-url": true,
"storage-access": true,
"top-level-storage-access": false
},
Next, for other APIs, we can have a domain-level mapping where key is domain and the value is a list of APIs used by that domain on the current website, rather than having separate keys for each API and same domain repeating in each of them.
So, reduced output will look like the following for operafootball.com:
{
"_privacy-sandbox": {
"top-level-storage-access": false
"securepubads.g.doubleclick.net": ["runAdAuction", "getHighEntropyValues"],
"pagead2.googlesyndication.com": ["runAdAuction", "joinAdInterestGroup", "getHighEntropyValues"],
"www.googletagmanager.com": ["joinAdInterestGroup", "getHighEntropyValues"],
"googleads.g.doubleclick.net": ["browsingTopicsHeader|true",],
}
}
That looks a lot better if you can do that. And happy to include permission policy data if it;'s different from the default (i.e. explicitly blocked) but seems silly to log that same default of all true millions of times.
@tunetheweb I have refactored the code as shared in the previous comment. If it looks good, we can merge.
Thanks for that. Merging!
Implemented capturing the following Privacy Sandbox APIs:
Test websites: