Open krgovind opened 3 years ago
It should be possible for anyone to run a crawler to figure out if a set is valid. (I would be interested in collaborating on a crawler project to produce a validation tool and directory of sets. If anyone else is working on a crawler/validator/directory for first party sets, I would appreciate a link.) Three categories of items that a crawler could check:
A set could define a list of resources under `/.well-known` that are required to be identical across site members. For example, a policy could require that if `/.well-known/gpc.json` is present on a site, then only sites with an identical resource at that path would be eligible to be a member of a first-party set.
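A crawler's consistency check for such a policy could be sketched as follows. This is a minimal illustration, not part of the proposal: the function name `wellknown_consistent` is hypothetical, and it assumes the crawler has already fetched the resource (or recorded its absence) for each member site.

```python
import hashlib
from typing import Optional

def wellknown_consistent(resources: dict[str, Optional[bytes]]) -> bool:
    """Return True if every site that serves the resource serves
    byte-identical content. Sites mapped to None did not serve the
    resource; under the policy sketched above, a separate check would
    decide whether absence itself disqualifies a member."""
    digests = {
        hashlib.sha256(body).hexdigest()
        for body in resources.values()
        if body is not None
    }
    # Zero or one distinct digest means all present copies are identical.
    return len(digests) <= 1
```

A validator would run this per required `/.well-known` path and flag the set if any check fails.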
Other resources outside of `/.well-known` can also be compared to determine the validity of a set, if present and identical. For example, if both site A and site B have an `/ads.txt` and the content does not match, then that is evidence that A and B are administered separately for purposes of some business relationships, and therefore not members of a valid set with each other.
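A byte-for-byte comparison of two `ads.txt` files would be too strict, since comments, blank lines, and whitespace carry no meaning. A crawler would more plausibly compare a canonicalized form; here is one possible sketch (the helper names are hypothetical, and the normalization rules are an assumption about what a validator would consider equivalent):

```python
def normalized_ads_txt(text: str) -> frozenset:
    """Canonicalize an ads.txt file: drop comments and blank lines,
    collapse whitespace around fields, and ignore record ordering."""
    records = set()
    for line in text.splitlines():
        # Everything after '#' is a comment per the ads.txt format.
        line = line.split("#", 1)[0].strip()
        if line:
            records.add(",".join(field.strip() for field in line.split(",")))
    return frozenset(records)

def ads_txt_match(a: str, b: str) -> bool:
    """True if two ads.txt files declare the same set of records."""
    return normalized_ads_txt(a) == normalized_ads_txt(b)
```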
Other items clearly need to be common across sites from the user point of view, but are more complicated to check. For example, the privacy policies for two members of a set could be identical in text, but different in content because of different styling. Privacy policies would have to use markup to facilitate comparison.
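As a rough illustration of why markup-assisted comparison helps: even without dedicated policy markup, a crawler could strip styling and compare the remaining text, so that two policies differing only in presentation compare equal. This sketch uses only the standard-library HTML parser; a real validator would need something more robust.

```python
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect only the text nodes of an HTML document."""
    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def policy_text(html: str) -> str:
    """Strip markup and collapse whitespace, so two policies that
    differ only in styling produce the same canonical text."""
    extractor = _TextExtractor()
    extractor.feed(html)
    return " ".join("".join(extractor.parts).split())
```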
Common branding resources and guidelines are also clearly necessary, so that a user is aware when they are using sites that share a set. This might include a common set of graphic elements and size at which the elements must be visible -- but there are a11y concerns. We would need to be confident that a user of assistive technologies will be able to recognize when they are using sites that are members of the same set.
> For example, if both site A and site B have an `/ads.txt` and the content does not match, then that is evidence that A and B are administered separately for purposes of some business relationships, and therefore not members of a valid set with each other.
My opinion is that using `ads.txt` content as a proxy for "owning entity" is probably not a good fit, as there may be reasons for different sites owned by the same publisher to have different `ads.txt` files, if the sites serve different purposes. Using `ads.txt` this way also seems to imply some connection between First-Party Sets and ads use cases, which would be unfortunate since First-Party Sets is not connected to ads use cases.
There are also good reasons for sites owned by the same publishing group not to be eligible for the same first-party set. Whether or not two sites can reasonably be parts of a first-party set is more about user-visible branding and expectations of data handling than about ownership structure. (For example, two independently owned radio station sites that are part of the same network and run the same news and talk shows might be part of the same first-party set, but a scientific journal and a local news site that are two divisions of the same corporation might not be.)
A crawler could reasonably produce one of two results from comparing two `ads.txt` files: either the files match, or these two sites have data sharing relationships that are different enough that they could not be a first-party set.
Common response headers could also help automatic verification of FPS members. A common `Permissions-Policy` and `Content-Security-Policy` should not be too hard to arrange, not only to show common ownership, but also to encourage good cross-site security practice. Perhaps it could be tightened further by restricting wildcard strings (`*`) in allow lists.
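A header-based check along these lines could be sketched as follows. The function name, the default header list, and the "no wildcards" rule are illustrative assumptions, not part of the proposal; the input is a mapping from site to its already-fetched response headers.

```python
def headers_consistent(site_headers: dict,
                       required=("Permissions-Policy", "Content-Security-Policy"),
                       forbid_wildcards=True) -> list:
    """Return a list of problems found when comparing the given
    response headers across set members; an empty list means the
    headers are consistent under these (assumed) rules."""
    problems = []
    for name in required:
        values = {site: headers.get(name) for site, headers in site_headers.items()}
        missing = [site for site, value in values.items() if value is None]
        if missing:
            problems.append(f"{name} missing on: {', '.join(missing)}")
            continue
        if len(set(values.values())) > 1:
            problems.append(f"{name} differs across sites")
        if forbid_wildcards and any("*" in value for value in values.values()):
            problems.append(f"{name} uses a wildcard allow list")
    return problems
```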
Other data points (to help automatic verification) could be:
```json
{
  "ownerName": "Example-Company Inc.",
  "indicatesWith": [
    {"DNS": ""},
    {"X.509-Subject": "CN"},
    {"WHOIS": "registrant"}
  ],
  "owner": "example.com",
  "members": ["member-one.com", "example.eu"]
}
```
"indicatesWith" is an array of objects to make it possible to identify the particular record.
Browsers/regulators could specify how many and what data points would be necessary to verify a valid set. Technical documents like this are machine readable, but could also eventually be seen as a legal declaration of identity/ownership of domain origins.
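A verifier applying a "minimum number of data points" rule to a declaration like the one above could look something like this. The function name, the required fields, and the threshold of two indicators are all hypothetical choices for illustration; the actual criteria would be set by browsers or regulators.

```python
import json

def validate_declaration(manifest_json: str, min_indicators: int = 2) -> list:
    """Check a set declaration for required fields and a minimum
    count of ownership indicators; returns a list of problems
    (empty if the declaration passes these assumed checks)."""
    manifest = json.loads(manifest_json)
    problems = []
    for key in ("ownerName", "owner", "members"):
        if key not in manifest:
            problems.append(f"missing required field: {key}")
    indicators = manifest.get("indicatesWith", [])
    if len(indicators) < min_indicators:
        problems.append(
            f"only {len(indicators)} ownership indicator(s) declared; "
            f"at least {min_indicators} required"
        )
    return problems
```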
@krgovind and others, way late here: in assessing the various options, was anything considered in which the browser would put something on the screen from a `/.well-known` resource that would "enforce visual co-branding"? Something that would "extend" the browser bar, like:
Even something that was "obtrusive" that allowed for greater flexibility and decentralization might be preferable for some businesses. With the new RWS Subsets concept, something like this could define a type of subset, and browsers might make different choices about what to allow in terms of storage/network access for those subsets (SAA auto-grants maybe, but other options: maybe Topics API considers all the sites in this type of set if the API is called on one of them, or Interest Group TTL can be reset based on a visit to one of the sites).
@thegreatfatzby There is a suggestion to check that some common branding element is present in the DOM: https://github.com/WICG/first-party-sets/issues/95 (There are probably some good browser-based software testing tools that could be repurposed to check that a specific element is present and viewable, or this might be a good use case for machine vision: render the page in a headless browser and check for common branding elements)
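As a simplified static version of that check (before reaching for a headless browser or machine vision), a crawler could at least verify that a shared branding element is present in the served markup and not trivially hidden. The element id `fps-brand` is an assumption for illustration, and a DOM-presence check says nothing about whether the element is actually perceivable, which is exactly the limitation raised in #95.

```python
from html.parser import HTMLParser

class BrandingFinder(HTMLParser):
    """Scan page markup for a shared branding element, identified
    here (as an assumption) by a well-known element id."""
    def __init__(self, branding_id: str):
        super().__init__()
        self.branding_id = branding_id
        self.found = False
        self.hidden = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if attributes.get("id") == self.branding_id:
            self.found = True
            # Only catches the most trivial way of hiding the element.
            style = attributes.get("style", "").replace(" ", "")
            self.hidden = "display:none" in style

def has_visible_branding(html: str, branding_id: str = "fps-brand") -> bool:
    finder = BrandingFinder(branding_id)
    finder.feed(html)
    return finder.found and not finder.hidden
```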
The challenge here is a11y though: is the common party or context clear to users who are visiting the site using a variety of assistive technologies?
@dmarti thanks for the info:
Think I get the above proposals, but those would not involve the browser actually placing the branding/links/something on the page to "enforce co-branding", right? They would be checking rather than injecting something. I'm thinking something like:
This would be more obtrusive, but allowing businesses to make their own choices about site structure with enforced branding might be preferable to making their own choices about branding under an enforced site structure.
This is an area I'll go dig on, but in the meantime can you help me understand the issue? I'm trying to think through what A11Y cases would be marginally worse (marginal in the economic sense, not size sense) in the case of an additional visual element used to indicate privacy scope.
@erik-anderson suggested over on the TAG review thread that we consider technical mechanisms in lieu of the "UA Policy" to verify formation of acceptable sets.
The proposal currently calls for a "UA Policy" (relevant issue) to ensure that site-declared sets meet acceptance criteria. This was added to the proposal primarily to address feedback received from Safari (#6) and Mozilla (#7):
Is there a combination of technical mechanisms, along with a revocation mechanism, transparency logs to aid auditability, etc., that could address these concerns?