WICG / floc

This proposal has been replaced by the Topics API.
https://github.com/patcg-individual-drafts/topics

transparency to users about cohort interpreted meanings #101

Open npdoty opened 3 years ago

npdoty commented 3 years ago

One way to support user privacy is to provide transparency to a user about what information they are disclosing using a particular technology.

On-device, in-browser learning could allow for an improvement in that kind of transparency -- a user can see exactly what identifier is being calculated and transmitted. However, opaque identifiers calculated from browsing history don't make that kind of inspection easy. (The explainer does argue that short names will show the user that "they cannot carry detailed information" -- a bold but dubious claim!)

Furthermore, if the design intends for separate parties to collect cohort IDs and browsing histories (perhaps combined with other data) and sell mappings to marketing sectors, the meanings could differ based on the recipient. This is true of many status quo systems, of course, but there is some existing work in ad management interfaces to disclose (with varying levels of detail) to the end user the inferences about them.

For example, testing with Chrome just now I can see that my cohort ID is 4724. What have I just revealed to you all? Could a browser (or some other party) give me some confidence about what is likely interpreted about my history from my id? Could access to my browser-generated cohort ID be limited to parties who disclose their inferences from it?

michaelkleber commented 3 years ago

Yes, the UX question of how to offer good transparency into the meaning of cohort IDs is extremely interesting.

In part this is bound up in the question of how a cohort is calculated. The clustering algorithm in the current Origin Trial (version=="chrome.2.1") is based on domain names, which means the browser could show which domains contributed to your cohort (based just on local information), and could show things like other popular domains in the cohort (if it communicated with some central server). This at least hints at what another party might infer from the cohort.

But that all completely changes if cohorts are, for example, clusters based on topics you're interested in (whether entered by the user or inferred by the browser).

Could a browser (or some other party) give me some confidence about what is likely interpreted about my history from my id?

The "some confidence" part here is the tricky bit, of course. We could let consumers make claims about what they infer, but if we're worried about malicious use of the signal then I don't think this helps.

Could access to my browser-generated cohort ID be limited to parties who disclose their inferences from it?

Very interesting. Is there any web platform precedent? Obviously we don't want to risk going the way of P3P, where everyone makes a required disclosure that says "we do not make a required disclosure".

npdoty commented 3 years ago

Could a browser (or some other party) give me some confidence about what is likely interpreted about my history from my id?

The "some confidence" part here is the tricky bit, of course. We could let consumers make claims about what they infer, but if we're worried about malicious use of the signal then I don't think this helps.

While some threats of cohort identifiers come from the use by malicious actors, there are also non-malicious consumers where transparency could be meaningful to a user.

Could access to my browser-generated cohort ID be limited to parties who disclose their inferences from it?

Very interesting. Is there any web platform precedent? Obviously we don't want to risk going the way of P3P, where everyone makes a required disclosure that says "we do not make a required disclosure".

I tend to think that the inferences drawn from the history of P3P compact policies have been overstated. However, it does seem likely that if disclosures were required in an automated way but never checked or used in any other way, there would be an incentive towards false or empty disclosures.

Here's one relevant proposal I read recently that connects a machine-readable policy assertion to further access to fingerprintable APIs:

We propose introducing a signed attestation (perhaps in the form of an HTTP header) that advertises the fact that a server masks IP addresses and other identifying network information from the application layer of the services that it hosts.

Happy to connect you and @bslassey if that would be helpful ;)

michaelkleber commented 3 years ago

Heh, yes, I've met that @bslassey guy. And certainly I'm open to policy assertions having a role to play, especially when past the limits of where we can rely purely on technical protections.

The assertions relevant to Willful IP Blindness are of the form "We don't do X", and they can be backed up by an audit that is plausibly able to tell whether the server actually does X or not. So what is the analogy here?

It seems to me that your "parties who disclose their inferences from it" would have to mean something like: Any party that uses FLoC must host an endpoint that responds to requests of the form https://adtech.example/wtflock_is/12345.chrome.2.1 with a best-effort human-readable explanation of what they think {id: 12345, version: "chrome.2.1"} means.
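A minimal sketch of what such an endpoint's lookup might involve, assuming the path layout from the example URL above. The names `CohortRef`, `parse_cohort_path`, `explain`, and the `EXPLANATIONS` table are hypothetical, and the explanation string is a placeholder, not anything a real party has published.

```python
from typing import NamedTuple, Optional


class CohortRef(NamedTuple):
    """A cohort as it appears in the example URL: numeric id plus version label."""
    cohort_id: int
    version: str


def parse_cohort_path(path: str, prefix: str = "/wtflock_is/") -> Optional[CohortRef]:
    """Split '/wtflock_is/12345.chrome.2.1' into (12345, 'chrome.2.1').

    Returns None for anything that does not match that shape.
    """
    if not path.startswith(prefix):
        return None
    token = path[len(prefix):]
    # The id is everything before the first dot; the rest is the version.
    cohort_id, sep, version = token.partition(".")
    if not sep or not version:
        return None
    if not (cohort_id.isascii() and cohort_id.isdigit()):
        return None
    return CohortRef(int(cohort_id), version)


# Hypothetical table a FLoC consumer would maintain; the text is a placeholder.
EXPLANATIONS = {
    CohortRef(12345, "chrome.2.1"): "Placeholder: this party's best-effort description of the cohort.",
}


def explain(path: str) -> str:
    """Return the human-readable explanation for a request path, if any."""
    ref = parse_cohort_path(path)
    if ref is None:
        return "Unrecognized cohort reference."
    return EXPLANATIONS.get(ref, "No inference recorded for this cohort.")


print(explain("/wtflock_is/12345.chrome.2.1"))
```

Note that this only shows that a page can exist for each cohort; it says nothing about whether the text on it reflects what the party actually infers.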

That sounds like a fascinating effort in ML Explainability. But the parallel to P3P does seem warranted here: the browser (or a human auditor) could check that these pages exist, but I don't know how to tell that the information on those pages really is responsive to the question being asked or truly embodies what the party believes.

Is this along the lines of what you're imagining, or am I completely off base here?

[edit: corrected to @bslassey instead of blassey who is probably confused about why I claimed to know her 😳 ]

dmarti commented 3 years ago

Revealing cohort inferences might also reveal sensitive cohorts that are not pre-screened by the browser, including cohorts that are not sensitive in the context they're used but may be sensitive in other parts of the world. (For example, a California supermarket might not flag beef or pork shoppers as sensitive cohorts.) Related: #71

skaurus commented 3 years ago

If cohorts are based on browsing history, isn't the meaning of a cohort just the set of sites that contributed to it?

Meanings like "shoe lover" are not revealed by the cohort itself but learned through an experiment, where the experiment is "try to show that cohort some ads and see what happens". So meanings will be DSP-specific (or "party-that-uses-FLoC"-specific in general).

npdoty commented 3 years ago

The assertions relevant to Willful IP Blindness are of the form "We don't do X", and they can be backed up by an audit that is plausibly able to tell whether the server actually does X or not. So what is the analogy here?

It seems to me that your "parties who disclose their inferences from it" would have to mean something like: Any party that uses FLoC must host an endpoint that responds to requests of the form https://adtech.example/wtflock_is/12345.chrome.2.1 with a best-effort human-readable explanation of what they think {id: 12345, version: "chrome.2.1"} means.

Something along those lines, yes. We could speculate on other designs (e.g. publishing a full list rather than querying a single value, or returning a set of information about how an ad was targeted that included but wasn't limited to the interest cohort interpretation).

That sounds like a fascinating effort in ML Explainability.

I don't know that any groundbreaking research work is needed here. For one example, Google provides an Ad personalization interface (maybe previously known as the "ad preferences manager") with at least some of the inferences that they've drawn about a logged-in user from their browsing activity (and other data sources), presented in human-readable text.

If the only use-case for interest cohorts is targeting advertising to groups based on inferences about those groups, then it seems that consumers should, definitionally, be able to provide the inferences they are acting on.

But the parallel to P3P does seem warranted here: the browser (or a human auditor) could check that these pages exist, but I don't know how to tell that the information on those pages really is responsive to the question being asked or truly embodies what the party believes.

Auditing would be required and auditors would need some access to internal systems to have confidence about how the data is being used, but that seems very similar to the access needed to confirm IP blindness. While completely external auditing (just confirming that pages exist and that they seem to return plausible or consistent results) may not provide the same level of confirmation, having that information documented and actually exposed to end users (as opposed to the brief P3P CP historical example) could provide a hook for regulatory intervention if that information were falsified.

michaelkleber commented 3 years ago

@skaurus

If cohorts are based on browsing history, isn't the meaning of a cohort just the set of sites that contributed to it?

I think that's too limiting a notion of "meaning".

Consider a somewhat extreme version of cohort assignment, in which all FLoC does is give each person a random cohort ID, shared with 2000 other people. Now there is no information about you that contributed to the cohort assignment, so you might think it automatically has zero meaning.

But now suppose 5% of people are interested in motorcycles. Then in each cohort, you'd expect 100 motorcycle enthusiasts on average, but it might be higher or lower just due to random chance. The probability of a given cohort having >133 is about 0.05% — so if there were 64K cohorts total, then around 30 of them would "mean" that a person in that cohort is 1/3 more likely to be into motorcycles than people in general.
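The arithmetic in this example can be checked with a short Python sketch. It models the number of enthusiasts in a randomly assigned cohort as Binomial(2000, 0.05) and sums the upper tail in log space to avoid overflow. Reading "64K" as 65,536 is an assumption, and `binomial_tail` is a helper written for this illustration.

```python
import math


def binomial_tail(n: int, p: float, threshold: int) -> float:
    """Return P(X > threshold) for X ~ Binomial(n, p).

    Each term is computed via log-gamma so that the large binomial
    coefficients never have to be represented as floats.
    """
    log_p = math.log(p)
    log_q = math.log1p(-p)
    total = 0.0
    for k in range(threshold + 1, n + 1):
        log_pmf = (
            math.lgamma(n + 1)
            - math.lgamma(k + 1)
            - math.lgamma(n - k + 1)
            + k * log_p
            + (n - k) * log_q
        )
        total += math.exp(log_pmf)
    return total


COHORT_SIZE = 2000       # people per cohort, from the example
BASE_RATE = 0.05         # share of people interested in motorcycles
THRESHOLD = 133          # "1/3 more" than the expected 100 enthusiasts
NUM_COHORTS = 64 * 1024  # assumption: "64K" means 65,536

expected_enthusiasts = COHORT_SIZE * BASE_RATE
tail = binomial_tail(COHORT_SIZE, BASE_RATE, THRESHOLD)
expected_cohorts = NUM_COHORTS * tail

print(f"expected enthusiasts per cohort: {expected_enthusiasts:.0f}")
print(f"P(more than {THRESHOLD} enthusiasts): {tail:.4%}")
print(f"expected cohorts above threshold: {expected_cohorts:.1f}")
```

The printed tail probability and cohort count can be compared directly with the figures quoted above.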

Of course we don't intend to assign cohorts randomly, we intend them to be influenced by browsing behavior. So maybe that leads to a situation where some cohorts contain 10% or 15% motorcycle enthusiasts, 2x or 3x the base rate. From an advertising utility point of view, that's probably useful — I'd expect motorcycle companies would like their ad dollars to be twice or three times as effective.

But even in this case, does being in such a cohort "mean" that you are into motorcycles? Of course not; indeed 85% or 90% of the people in the cohort are not.

@npdoty How does this example map onto your disclosure ideas?

npdoty commented 3 years ago

Of course we don't intend to assign cohorts randomly, we intend them to be influenced by browsing behavior. So maybe that leads to a situation where some cohorts contain 10% or 15% motorcycle enthusiasts, 2x or 3x the base rate. From an advertising utility point of view, that's probably useful — I'd expect motorcycle companies would like their ad dollars to be twice or three times as effective.

But even in this case, does being in such a cohort "mean" that you are into motorcycles? Of course not; indeed 85% or 90% of the people in the cohort are not.

@npdoty How does this example map onto your disclosure ideas?

I think at this point we don't know what the probabilities are likely to be for various categories, under the domain-hashing approach or under alternative algorithms.

I don't think it eliminates the privacy concern to say "we don't know with a high probability that you are into motorcycles, but we do think you are X% more likely than the general population to be into motorcycles". People have privacy concerns about incorrect or uncertain conclusions made about them. But this is a useful explanation of why the inferences to marketing categories are not identical to the list of relevant domains -- in some cases users might easily understand (ilikemotorcycles.com) but in other cases they may very much not.

Going further, if the ad preferences categories were displayed in the browser UI (or even explicitly selected by the user as intentional interests), that might narrow that gap significantly. If a user selects {"interested in motorcycles", "shopping for men's apparel", "home improvement"}, the user may realize that some consumers of that information will draw additional inferences: some correct and some incorrect, some the user can anticipate, and occasionally some the user didn't anticipate at all. But those meanings are more transparent and useful than domain lists or numeric codes.

michaelkleber commented 3 years ago

Yes, if the browser had an intrinsic way of associating cohorts with interests (or if it observed interests and derived cohorts from those), this would be a very powerful approach.

My example with motorcycles vs. random cohort assignment was intended to show why this kind of explanation might be wholly impossible using on-device info. But of course the truth is somewhere in between.