Publish the meaning of cohorts to even the playing field, enable transparency features, and enable scrutiny of protection against sensitive targeting

WICG / floc

This proposal has been replaced by the Topics API.

https://github.com/patcg-individual-drafts/topics

Publish the meaning of cohorts to even the playing field, enable transparency features, and enable scrutiny of protection against sensitive targeting #104

Open johnwilander opened 3 years ago

johnwilander commented 3 years ago

Issue https://github.com/WICG/floc/issues/101 argues that the browser should "offer transparency to users about cohort interpreted meanings." An even better way is for the browser vendor approving the cohorts to make their meaning public.

If the meaning of cohort IDs is not made public, these things seem to hold:

The browser vendor who approved the cohorts based on centralized browsing history and t-closeness analysis will have a significant advantage if the vendor is also in the web ad tech business.
Prevalent trackers who have managed to convince many, many sites to execute their third-party JavaScript will have a significant advantage since they will see cohorts on all these sites and can much more quickly "decipher" them. Note that the prevalent scripts don't have to be for ad tech but could for instance be analytics scripts from the same vendor.
Trackers will be incentivized to convince as many websites as possible to deploy their third-party JavaScript so that they can decipher cohort IDs. This would be the case even for sites that don't have ads and will lead to more third-party JavaScript.
Trackers will likely form consortiums where they build up shared knowledge of what lies behind cohort IDs. This will lead to an advantage for the trackers who are on the inside of that consortium.
What trackers know about cohort IDs will be deliberately kept secret since not even the source of the cohort IDs is willing to share their meaning. This tells the world that knowing about cohort IDs is for insiders.

However, if the meaning of cohort IDs is made public, we'd get this:

A reasonably even playing field for any ad tech vendor.
The ability for extensions and even websites to help inform users of what cohort ID they've been assigned and how they might be targeted with ads but also which type of content they might not be seeing because someone decided to filter it out for their cohort.
The ability for advertising experts, lawmakers, and privacy advocates to scrutinize existing cohorts and the effectiveness of the t-closeness sensitivity protection deployed by the browser vendor.

I'd be surprised if listing the meaning of cohort IDs would be deemed sensitive in any way. If so, the whole premise of ad tech "deciphering" cohort IDs is equally sensitive and the privacy analysis doesn't hold up. It would then be "privacy by obscurity."

The only way to prove that the browser vendor believes in the privacy aspects of FLoC would be to make all its own knowledge about cohort IDs public.

pdehaye commented 3 years ago

(cue GDPR, Art. 15 and 22)

benjaminsavage commented 3 years ago

Issue #101 argues that the browser should "offer transparency to users about cohort interpreted meanings." An even better way is for the browser vendor approving the cohorts to make their meaning public.

FLoC cohorts do not have any "meanings". They are simply projections onto randomly selected vectors. There is no meaning or interpretation of what they connote. Chrome is simply creating a vector space that represents browsing histories and performing a clustering algorithm to group similar browsing histories into randomly selected groupings based on randomly selected projection vectors.
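To make "projections onto randomly selected vectors" concrete, here is a minimal sketch of a SimHash-style locality-sensitive hash over a set of visited domains. It is an illustration under stated assumptions, not Chrome's implementation: the helper names `domain_vector` and `simhash_cohort`, the 8-bit output, and the use of SHA-256 to seed the per-domain vectors are all hypothetical choices.

```python
import hashlib
import random


def domain_vector(domain: str, dims: int) -> list[float]:
    """Hypothetical helper: a deterministic pseudo-random Gaussian vector per domain.

    Coordinate i is the weight that random projection vector i assigns to this domain.
    """
    seed = int.from_bytes(hashlib.sha256(domain.encode("utf-8")).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(dims)]


def simhash_cohort(domains: set[str], bits: int = 8) -> int:
    """Project the browsing history onto `bits` random vectors and keep only the signs."""
    totals = [0.0] * bits
    # Sort so the floating-point sums do not depend on set iteration order.
    for domain in sorted(domains):
        for i, weight in enumerate(domain_vector(domain, bits)):
            totals[i] += weight
    cohort = 0
    for value in totals:
        # One output bit per projection: 1 if the projection is positive, else 0.
        cohort = (cohort << 1) | (1 if value > 0 else 0)
    return cohort


history = {"news.example", "recipes.example", "shoes.example"}
cohort_id = simhash_cohort(history)
```

Similar histories tend to land on the same side of most projection vectors and so share most bits, but no individual bit is chosen to stand for a topic, which is the sense in which the IDs carry no designed "meaning".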

The closest thing to a "meaning" that I can think of would be a histogram. For each FLoC cohort, Chrome should be capable of producing a histogram which shows, for each of the browsers in that cohort, how often each domain appears in the browsing history (with differential privacy noise added of course, and with rare outliers removed... in fact, just the top 20 domains is probably all you really need to get some kind of an impression of it). Is this what you had in mind?
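A rough sketch of such a per-cohort histogram follows. It assumes each browser's history is available as a list of domains, caps each browser's contribution, adds Laplace noise, and keeps only the top entries. The function names, the `max_domains` cap, and the `min_count` threshold are hypothetical. It is also not a complete differential privacy mechanism, because publishing a domain name at all can reveal that someone in the cohort visited it; a real deployment would need a stricter threshold or a fixed public domain list.

```python
import random
from collections import Counter


def laplace_noise(scale: float, rng: random.Random) -> float:
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_top_domains(histories, top_k=20, epsilon=1.0, max_domains=50,
                      min_count=0.0, rng=None):
    """Return up to `top_k` (domain, noisy_count) pairs for one cohort."""
    if rng is None:
        rng = random.Random()
    counts = Counter()
    for history in histories:
        # Each browser adds at most 1 to at most `max_domains` counts,
        # so adding or removing one browser changes the histogram by at most that much.
        for domain in sorted(set(history))[:max_domains]:
            counts[domain] += 1
    scale = max_domains / epsilon
    noisy = {d: c + laplace_noise(scale, rng) for d, c in counts.items()}
    # Drop rare outliers whose noisy count falls below the threshold.
    kept = [(d, n) for d, n in noisy.items() if n >= min_count]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]
```

With realistic settings (small `epsilon`, cohorts of thousands of browsers) the noise is large relative to rare domains, which is why only the most common domains would give a usable impression.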

The ability for advertising experts, lawmakers, and privacy advocates to scrutinize existing cohorts and the effectiveness of the t-closeness sensitivity protection deployed by the browser vendor.

I too believe this is a desirable outcome. Although there is no "meaning" as such to the cohorts, here are a few things we could try to do, to try to detect cohorts that might be highly correlated with "sensitive" characteristics not detectable from browsing history alone (and which should be invalidated for this reason). Basically, we could try to perform the same type of "t-closeness" approach, looking at other types of sensitive data to see if FLoC cohort IDs might inadvertently be exposing such data.

Facebook could potentially measure the age and gender composition of each FLoC and see how much variability there is along these dimensions. If there are cohorts that skew particularly towards particular demographics, this might be something the Chrome team would be interested in knowing so as to invalidate cohorts of that nature.
It might be possible for the Chrome team to set up some kind of a 2-party "Secure Multiparty Computation" with some entity who has access to other types of sensitive information (e.g. Census data about race or religion) to privately measure the compositions of FLoCs along dimensions for which the Chrome team has no data (and would probably never WANT to have this type of data). If such an analysis showed certain FLoCs to skew particularly along such dimensions, that also seems like a reason to invalidate such FLoCs.
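The skew check described in these two suggestions can be sketched as follows, assuming some party holds (cohort ID, sensitive attribute) pairs. For a categorical attribute with equal distance between categories, the Earth Mover's Distance used in t-closeness reduces to the variational distance, which is what this sketch computes. The function names and the threshold `t=0.2` are hypothetical.

```python
from collections import Counter


def distribution(labels):
    """Turn an iterable of categorical labels into a {label: probability} dict."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


def variational_distance(p, q):
    """Half the L1 distance between two categorical distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def skewed_cohorts(members, t=0.2):
    """members: iterable of (cohort_id, attribute) pairs.

    Returns {cohort_id: distance} for cohorts whose attribute distribution
    differs from the overall population by more than `t`.
    """
    members = list(members)
    overall = distribution(attr for _, attr in members)
    by_cohort = {}
    for cohort_id, attr in members:
        by_cohort.setdefault(cohort_id, []).append(attr)
    flagged = {}
    for cohort_id, attrs in by_cohort.items():
        distance = variational_distance(distribution(attrs), overall)
        if distance > t:
            flagged[cohort_id] = distance
    return flagged
```

A cohort returned by `skewed_cohorts` would be a candidate for invalidation. In the multiparty-computation variant, the same comparison would run without either side seeing the other's raw data.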

joshuakoran commented 3 years ago

@benjaminsavage If cohorts are designed not to have any meaning, does this mean they are designed to have limited utility?

If any cohort discovered to have meaning ("demographic skew" in your example) would be invalidated and made unavailable, exactly which marketer use case(s) are they meant to address?

If I understand the proposal correctly, it seems they are not meant to address measurement, attribution or optimization use cases. If they are not meant to provide utility for focusing marketers' limited budgets when advertising across publishers, then what exactly are the success criteria against which we should be evaluating FLoCs?

dmarti commented 3 years ago

@joshuakoran Because the cohort is available to all sites, the set of personal attributes revealed by the cohort has to be the intersection of all the sets of personal attributes that the user would choose to reveal to each site they visit. For example, a user might be willing to reveal A/S/L to an online fashion retailer, but not to a local blogger. But you don't have a separate cohort for shopping and blog reading, so FLoC ends up having to treat a personal attribute that is sensitive in any web context as a sensitive attribute. (Measurement, attribution and conversion tracking are handled by other systems.)

joshuakoran commented 3 years ago

The original issue raised was to ensure FLoCs support a "level playing field."

@dmarti your answer is about how FLoCs are generated, rather than which marketer use cases they are designed to support or even how well.

I agree with @benjaminsavage that FLoCs are not intentionally designed to have any meaning, due in large part to the unsupervised clustering and large "crowd" of people grouped into each one.

So, back to my open question: what exactly are the success criteria against which we should be evaluating FLoCs?

benjaminsavage commented 3 years ago

@benjaminsavage If cohorts are designed not to have any meaning, does this mean they are designed to have limited utility?

Not necessarily. I cannot speak from experience - having not myself tested the utility of FLoC. But in principle, even if cohorts all contain a roughly equivalent distribution of people (as measured by things like demographics) that doesn't mean this information will be irrelevant from the perspective of selecting a relevant ad.

What will matter is to see if there is any correlation between membership in a given cohort and the likelihood to make a purchase on a given ad. Advertisers can simply try to run their ads to all people in all cohorts and see if there is a higher conversion rate from some cohorts compared to others. If there is, they can use that to "bid more" for the higher-converting cohorts and "bid less" for the lower-converting cohorts.
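As a minimal sketch of that feedback loop, assume an advertiser has per-cohort impression and conversion counts. The function name, the `min_impressions` floor, and the multiplier `cap` are hypothetical choices for illustration, not part of any proposal.

```python
def bid_multipliers(stats, min_impressions=1000, cap=2.0):
    """stats: {cohort_id: (impressions, conversions)}.

    Returns {cohort_id: multiplier}, where 1.0 means "bid the baseline amount".
    """
    total_impressions = sum(imp for imp, _ in stats.values())
    total_conversions = sum(conv for _, conv in stats.values())
    if total_impressions == 0 or total_conversions == 0:
        return {cohort_id: 1.0 for cohort_id in stats}
    baseline_rate = total_conversions / total_impressions
    multipliers = {}
    for cohort_id, (impressions, conversions) in stats.items():
        if impressions < min_impressions:
            # Too little data to trust the observed rate; keep the baseline bid.
            multipliers[cohort_id] = 1.0
        else:
            rate = conversions / impressions
            multipliers[cohort_id] = min(cap, rate / baseline_rate)
    return multipliers


observed = {1: (1000, 20), 2: (1000, 10), 3: (10, 5)}
multipliers = bid_multipliers(observed)
```

Here cohort 1 gets a multiplier above 1, cohort 2 below 1, and cohort 3 stays at 1.0 because it has too few impressions. Nothing in this loop requires knowing what a cohort "means".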

If any cohort discovered to have meaning ("demographic skew" in your example) would be invalidated and made unavailable, exactly which marketer use case(s) are they meant to address?

Well, the Google ads team used them to try to help with "interest based ads" shown on their 3rd party publisher ad network.

So let's say you're NOT Google search, or Facebook. You're just a small publisher. You know little to nothing about the visitors to your website. How will you select a relevant ad to show them?

One option is "retargeting ads", which Chrome is proposing TURTLEDOVE / FLEDGE to try to accomplish. Another option is "contextual ads", where the context (depending on your type of website) may or may not really contain any useful commercial intent. (That works great for Google search, less well for a news publisher writing about US politics... how do you use that context to select a "contextually relevant ad"?) Another option is to show them "interest based ads". FLoC is meant to address this use case. Today, Google's ad network serves these "interest based ads" based on that person's historical browsing data (as measured using 3rd party cookies).

The Google ads team ran a test that aimed to measure how much the performance of these "interest based ads" would suffer for publishers on their 3rd party ad network if instead of using full browsing histories, they just clustered all browsers into cohorts of thousands of people. It worked reasonably well as a replacement for that specific use case.

If I understand the proposal correctly, it seems they are not meant to address measurement, attribution or optimization use cases. If they are not meant to provide utility for focusing marketers' limited budgets when advertising across publishers, then what exactly are the success criteria against which we should be evaluating FLoCs?

Agree this proposal doesn't help with measurement or attribution. It does help with a specific "optimization" use-case (the small publisher use-case outlined above).

dmarti commented 3 years ago

@benjaminsavage There are also some interesting dynamic pricing use cases. Retail sites will likely be able to identify more or less price-sensitive and price-insensitive cohorts in order to optimize discount offers.

johnwilander commented 3 years ago

FLoC cohorts do not have any "meanings". They are simply projections onto randomly selected vectors. There is no meaning or interpretation of what they connote. Chrome is simply creating a vector space that represents browsing histories and performing a clustering algorithm to group similar browsing histories into randomly selected groupings based on randomly selected projection vectors.

The closest thing to a "meaning" that I can think of would be a histogram. For each FLoC cohort, Chrome should be capable of producing a histogram which shows, for each of the browsers in that cohort, how often each domain appears in the browsing history (with differential privacy noise added of course, and with rare outliers removed... in fact, just the top 20 domains is probably all you really need to get some kind of an impression of it). Is this what you had in mind?

It doesn't have to be a histogram but some representation of the browsing history that forms the cohort. It could be based on the website categorization that underlies the t-closeness analysis or even higher level labels.

Nit: I don't want to discuss this as a Chrome thing. It's a proposed web standard and we have to think about it as deployed by arbitrary browser vendors. I think the proposal says that each browser vendor would have to form their own validated cohorts and so we have to talk about "meanings" as something all potential implementers can produce.

othermaciej commented 3 years ago

FLoC cohorts do not have any "meanings". They are simply projections onto randomly selected vectors. There is no meaning or interpretation of what they connote. Chrome is simply creating a vector space that represents browsing histories and performing a clustering algorithm to group similar browsing histories into randomly selected groupings based on randomly selected projection vectors.

Seems like the vector space of browsing histories, the projection, and the clustering algorithm, if made public, would allow mapping a FLoC ID back to a set of projections in the vector space, and then inferring differences in likely browsing history for users with this FLoC ID relative to others, i.e. a meaning. Is there enough public information (either per spec, or in Chrome's current implementation) to do this kind of analysis?

npdoty commented 3 years ago

I was highlighting the need for transparency of interpretations of cohorts in part because the algorithm that generates the cohort identifier from browsing history (or even from some other set of data, like a user selecting topics of interest) won't tell the user all that can or will be inferred about them from the identifier, even if the browser's code is open source.

My expectation is that under a widely-deployed cohort system, some firms (especially in market research) will survey or otherwise gather various information from a panel of people with their cohorts, and then sell access to mappings of cohort identifiers to marketing categories. e.g. "to target Cooking Enthusiasts, buy ads for cohorts 12345, 45678, 98765 and 40404 for version chrome.1.1", or "women ages 25-34 are most heavily represented in cohorts 34567 and 87654". It definitely provides an ongoing incentive to gather data from a population of users to 1) enrich the cohort identifiers and 2) subdivide the cohorts (e.g. cohort 12345 merges two distinctive groups, but if this particular user has visited X.example, they're probably the first kind).
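A sketch of how such a mapping might be derived from panel data, assuming the firm holds (cohort ID, marketing segment) pairs for its surveyed users. The function name and the "lift" ranking are hypothetical; a real product would also need confidence intervals and per-version cohort handling.

```python
from collections import Counter, defaultdict


def top_cohorts_by_segment(panel, top_n=3):
    """panel: iterable of (cohort_id, segment) pairs from surveyed users.

    For each segment, rank cohorts by lift: the share of the cohort's panelists
    in that segment divided by the segment's share of the whole panel.
    """
    panel = list(panel)
    panel_size = len(panel)
    cohort_sizes = Counter(cohort for cohort, _ in panel)
    segment_sizes = Counter(segment for _, segment in panel)
    pair_counts = Counter(panel)
    ranking = defaultdict(list)
    for (cohort, segment), count in pair_counts.items():
        share_in_cohort = count / cohort_sizes[cohort]
        share_overall = segment_sizes[segment] / panel_size
        ranking[segment].append((share_in_cohort / share_overall, cohort))
    return {
        segment: [cohort for _, cohort in sorted(scored, reverse=True)[:top_n]]
        for segment, scored in ranking.items()
    }
```

The output is exactly the kind of lookup table described above: a segment name mapped to the cohort IDs where that segment is over-represented in the panel.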

The browser vendor publishing some data publicly (about the distinctive domains or other data sources for each cohort, or some market survey data) could help with review (by policymakers or researchers), transparency (to users) and lower the barrier to using cohorts for targeting, but I suspect it would always be incomplete.

michaelkleber commented 3 years ago

Hi folks, sorry for the delay in joining into the conversation. It was a busy week for FLoC.

I do think the discussion in @npdoty's #101 is of great relevance here. In essence, @johnwilander asked for "the browser vendor approving the cohorts to make their meaning public", while #101 is about asking the same thing from parties who want to use FLoC. These both seem like reasonable things to ask for, but perhaps ones where it's hard to know whether to be satisfied with the answer.

Do you have thoughts on how to decide what constitutes a good or bad answer to the question of what a particular FLoC id means? I see that John floated the proposal that "the browser vendor [...] make all its own knowledge about cohort IDs public", but that doesn't seem plausible to me — that would make sense if mere aggregation across a cohort were enough to protect privacy, but in fact the privacy properties here depend on a lot more than aggregation alone.

othermaciej commented 3 years ago

I see that John floated the proposal that "the browser vendor [...] make all its own knowledge about cohort IDs public", but that doesn't seem plausible to me — that would make sense if mere aggregation across a cohort were enough to protect privacy, but in fact the privacy properties here depend on a lot more than aggregation alone.

Does this mean the privacy properties do not hold with respect to the browser vendor or other party that generates the IDs?

michaelkleber commented 3 years ago

Does this mean the privacy properties do not hold with respect to the browser vendor or other party that generates the IDs?

I'm not sure what you mean. Of course my browser knows all the URLs I've visited; the "History" menu in Chrome or Safari gives access to that information. But equally obviously, we can't make that information public, even on a cohort level.

I'm just pointing out that the answer to "What should we make public about cohort X?" is definitely not "Everything the browser knows about everyone in cohort X."