WICG / floc

This proposal has been replaced by the Topics API.
https://github.com/patcg-individual-drafts/topics

Communication Between Google's FLoC Service and Chrome Browser #131

Closed geeeoff closed 2 years ago

geeeoff commented 2 years ago

I'd like to get a better understanding of the communication paths and data shared between Google and the Chrome browser in order to enable FLoC if/when this goes mainstream (no longer an origin trial).

From this site: https://web.dev/floc/

"The FLoC service used by the browser creates a mathematical model with thousands of "cohorts", each of which will correspond to thousands of web browsers with similar recent browsing histories." Assuming the flow here involves Google's FLoC service for the Chrome browser:

  1. How does the browser become aware of this model?
  2. How often does the browser receive a new model?
  3. Does the service ensure that any cohort included in the model is of sufficient size (k)? What happens when k becomes too small?
  4. If the user elects NOT to share browsing history with Google via sync, and visits a completely new site or sites not in the model, what is the behavior of the model? Does it matter that the model isn't aware of the site?
  5. For the origin trial(s), is there any requirement for participants to share browsing history with Google?
michaelkleber commented 2 years ago

For the first version of FLoC (from the now-completed Origin Trial), see additional technical details at https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc. I believe this answers all 5 of your questions already!

geeeoff commented 2 years ago

Thanks for sharing that document, Michael.

For the origin trial, did Google first generate the cohort space using anonymous, historical browsing history it has on hand?

I wasn't able to find an answer to //2. How often does the Chrome-operated, server-side pipeline provide a new model to the browsers? Apologies in advance if I overlooked this.

As a follow-up on //4, it seems possible that a browser may calculate its own 50-bit hash, and not find the unique prefix of that vector in the list of cohorts passed to the browser by the Chrome-operated server-side pipeline. If so, what is the behavior of document.interestCohort()? Would it simply generate a new cohort ID that would be valid in the cohort space if/when the server-side pipeline eventually became aware of this particular browsing history pattern?

Related to //5, if the server-side pipeline is counting based on the 50-bit hash sent by the browser, then why is syncing of browsing history data required by each browser participating in the trial?

michaelkleber commented 2 years ago

"For the origin trial, did Google first generate the cohort space using anonymous, historical browsing history it has on hand?"

For the origin trial, we first generated the cohort space using a Chrome-operated server-side pipeline that counted how many times each 50-bit hash occurred among qualifying users — those for whom we log cohort calculations along with their sync data. This doesn't require looking at browsing data, just at SimHash counts.

"2. How often does the Chrome-operated, server-side pipeline provide a new model to the browsers?"

The whole model was computed just one time. The SimHash-to-cohort mapping stayed the same for the entire origin trial.

"it seems possible that a browser may calculate its own 50-bit hash, and not find the unique prefix of that vector in the list of cohorts passed to the browser by the Chrome-operated server-side pipeline."

No. Every SimHash has one and only one prefix that places it into a cohort. Cohorts were created by starting with all SimHash values in one big pool and then dividing that pool over and over again until further division would violate the size constraint. ("The 50-bit hashes start in two big cohorts: all hashes whose first bit is 0, versus all hashes whose first bit is 1. Then each cohort is repeatedly divided into two smaller cohorts by looking at successive bits of the hash value, as long as such a division yields two cohorts each with at least 2000 qualifying users.") So even if a browser generates a never-seen-before SimHash, the same division based on successive bits of the SimHash means the browser can tell which cohort it belongs in.
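
The division described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual Chrome pipeline: hashes are represented as strings of '0'/'1' characters, the function names `build_cohort_prefixes` and `cohort_for` are hypothetical, and the demo uses 4-bit hashes with made-up counts so the result is easy to follow.

```python
HASH_BITS = 50          # hash length stated in the thread
MIN_COHORT_SIZE = 2000  # minimum qualifying users per cohort, from the quoted text


def build_cohort_prefixes(hash_counts, min_size=MIN_COHORT_SIZE, bits=HASH_BITS):
    """Return the bit-string prefixes that define the cohorts.

    hash_counts maps each observed hash (a string of '0'/'1', `bits` long)
    to the number of qualifying users that reported it (illustrative format).
    """
    def users_under(prefix):
        return sum(n for h, n in hash_counts.items() if h.startswith(prefix))

    cohorts = []
    pending = ["0", "1"]  # the two initial cohorts: first bit 0 vs. first bit 1
    while pending:
        prefix = pending.pop()
        left, right = prefix + "0", prefix + "1"
        # Split only if BOTH halves would still meet the size constraint.
        if (len(prefix) < bits
                and users_under(left) >= min_size
                and users_under(right) >= min_size):
            pending.extend([left, right])
        else:
            cohorts.append(prefix)
    return sorted(cohorts)


def cohort_for(simhash, cohorts):
    """Find the single cohort prefix that a hash falls under."""
    matches = [p for p in cohorts if simhash.startswith(p)]
    # Every split keeps both children, so exactly one prefix matches any hash.
    assert len(matches) == 1
    return matches[0]


# Toy data: 4-bit hashes and invented user counts, for illustration only.
demo_counts = {"0000": 3000, "0100": 2500, "1000": 1500, "1100": 1000}
demo_cohorts = build_cohort_prefixes(demo_counts, min_size=2000, bits=4)
never_seen = cohort_for("0111", demo_cohorts)  # hash absent from demo_counts
```

Because each split keeps both halves, the prefixes always cover the whole hash space, which is why a hash that was never counted by the server (such as "0111" in the toy data) still lands in exactly one cohort.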

"if the server-side pipeline is counting based on the 50-bit hash sent by the browser, then why is syncing of browsing history data required by each browser participating in the trial?"

Great question! This is because hashing is not anonymization. (https://twitter.com/LeaKissner/status/1327404252268875776)

If a Chrome browser instance sends our server the 50-bit SimHash of the sites you've visited in the last week, then, with a lot of computing power, we might be able to figure out what the sites were. For people who are already syncing their browsing history, there's no risk of us doing this kind of brute-force computation and learning new, previously private data.
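
To make the "hashing is not anonymization" point concrete, here is a minimal sketch of the kind of enumeration described above. It rests on loud assumptions: `toy_hash` is a stand-in built from truncated SHA-256, not FLoC's SimHash (which is locality-sensitive and computed differently), and the domain list and function names are hypothetical. It only shows that when the set of plausible inputs is small enough to enumerate, a hash can be matched back to its input.

```python
import hashlib
from itertools import combinations


def toy_hash(domains, bits=50):
    """Stand-in for SimHash (NOT the real algorithm): SHA-256 of the
    sorted domain list, truncated to the top `bits` bits."""
    data = "\n".join(sorted(domains)).encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest, "big") >> (256 - bits)


def brute_force(target, candidate_domains, max_sites):
    """Return every candidate history whose hash equals `target`."""
    matches = []
    for size in range(1, max_sites + 1):
        for history in combinations(sorted(candidate_domains), size):
            if toy_hash(history) == target:
                matches.append(history)
    return matches


# Hypothetical candidate sites an attacker might consider.
candidates = ["news.example", "shop.example", "health.example",
              "bank.example", "travel.example"]
secret_history = {"health.example", "bank.example"}

observed = toy_hash(secret_history)  # what the server would receive
recovered = brute_force(observed, candidates, max_sites=3)
```

With five candidate sites and histories of up to three sites, only 25 combinations need to be tried. A real attempt would face a far larger search space, which is where the "lot of computing power" comes in.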