WICG / floc

This proposal has been replaced by the Topics API.
https://github.com/patcg-individual-drafts/topics

Lookalike targeting using FLoC? #24

benjaminsavage commented 4 years ago

I'm not sure if you've seen this proposal: https://github.com/w3c/web-advertising/blob/master/privacy_preserving_lookalike_audience_targeting.md

The key idea there is to use the Aggregated Reporting API to perform logistic regression on embedding vectors with boolean labels. In that proposal, the suggestion was for publishers to provide custom embedding vectors for use in this process. I am wondering if FLoCs could be used as well?
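
For concreteness, here is what that core computation looks like in the clear: plain logistic regression on embedding vectors with boolean labels. This is only a sketch of the math, not the Aggregated Reporting API or an MPC protocol; the embedding dimension, data, and names are all illustrative assumptions, not part of either proposal.

```python
import numpy as np

# Illustrative stand-ins: neither the dimension nor the data comes from
# the proposal; a real deployment would run this inside MPC.
rng = np.random.default_rng(0)
DIM = 64                                  # hypothetical embedding dimension
X = rng.normal(size=(1000, DIM))          # per-user embedding vectors
y = rng.integers(0, 2, size=1000)         # boolean labels (e.g. "converted")

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Plain batch gradient descent on the logistic loss.
w = np.zeros(DIM)
lr = 0.1
for _ in range(500):
    grad = X.T @ (sigmoid(X @ w) - y) / len(y)
    w -= lr * grad

# Score a new user's embedding: higher = more "lookalike".
score = sigmoid(X[0] @ w)
```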

While the proposal talks about FLoCs as "cohorts", I get the sense that they are not meaningless, arbitrary numbers. Specifically, this part:

> The browser uses machine learning algorithms to develop a flock based on the sites that an individual visits. The algorithms might be based on the URLs of the visited sites, on the content of those pages, or other factors. The central idea is that these input features to the algorithm, including the web history, are kept local on the browser and are not uploaded elsewhere — the browser only exposes the generated flock.

I assume what this means is:

  1. The browser will use "Federated Learning" to train a machine learning model.
  2. This model will use the user's complete browsing history as its "features".
  3. This model will use XXX as labels (unknown and not stated... but super important, and I hope you will clarify).
  4. The trained model will be used to produce an "embedding vector" for each browser instance that captures the concept of "similarity" between different users.
  5. To preserve privacy, the full, raw embedding is not shareable (it has too much entropy and could be used as a fingerprinting vector). As such, it is reduced down to just 16 bits (using something like Locality-Sensitive Hashing). Possibly some kind of differentially private noise is added to those 16 bits after that, and there is probably some server-side coordination to ensure the distribution isn't too skewed and that each of the 65,536 FLoCs contains a minimum number of browsers. (A rough sketch of the LSH step follows this list.)
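
To make step 5 concrete, here is a minimal sketch of the textbook random-hyperplane variant of Locality-Sensitive Hashing, reducing an embedding to a 16-bit code. This is just one standard construction under the assumptions above, not the actual FLoC algorithm, and the noise and server-side coordination steps are omitted.

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 64                               # hypothetical embedding dimension
planes = rng.normal(size=(16, DIM))    # 16 random hyperplanes -> 16 bits

def lsh_16bit(embedding: np.ndarray) -> int:
    """The sign of the projection onto each hyperplane gives one bit.
    Nearby embeddings (small angle between them) tend to agree on more
    bits, so Hamming distance between codes approximates similarity."""
    bits = (planes @ embedding) > 0
    code = 0
    for b in bits:
        code = (code << 1) | int(b)
    return code

u = rng.normal(size=DIM)
v = u + 0.1 * rng.normal(size=DIM)     # a slightly perturbed neighbour
print(hex(lsh_16bit(u)), hex(lsh_16bit(v)))  # codes differ in few bits
```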

My question is:

Will step 5 render the FLoC ID useless as anything but a random "cohort ID"? Or will it retain some of the meaning of the original embedding vector?

Let me give a concrete example to make my question clearer. Assume:

- Person A is in FLoC 0x1FEB (0001111111101011)
- Person B is in FLoC 0x1BEB (0001101111101011)
- Person C is in FLoC 0xE223 (1110001000100011)

Are person A and person B "more similar" than person A and person C? The Hamming distance between person A and person B is 1. The Hamming distance between person A and person C is 10.
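
Those distances are easy to check, e.g. in Python:

```python
a, b, c = 0x1FEB, 0x1BEB, 0xE223  # persons A, B, C from the example

def hamming(x: int, y: int) -> int:
    # Number of differing bits = popcount of the XOR.
    return bin(x ^ y).count("1")

print(hamming(a, b))  # 1
print(hamming(a, c))  # 10
```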

If step 5 preserves some kind of meaning (for example, if the Hamming distance between FLoCs can be used as some kind of measure of similarity), then it seems like one could potentially apply the same "Logistic Regression in MPC" approach to FLoC IDs.
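
If that holds, the 16 bits could be unpacked into a feature vector and fed to the same kind of model. A speculative sketch (the ±1 encoding is one common choice, not anything specified by FLoC):

```python
import numpy as np

def floc_features(floc_id: int, n_bits: int = 16) -> np.ndarray:
    """Unpack a FLoC ID into a +/-1 feature vector so a linear model
    can weight each bit. The dot product of two such vectors equals
    n_bits - 2 * hamming_distance, so a linear model on this encoding
    implicitly "sees" Hamming similarity."""
    bits = [(floc_id >> i) & 1 for i in range(n_bits)]
    return np.array([1.0 if b else -1.0 for b in bits])

x = floc_features(0x1FEB)  # usable with the logistic-regression sketch above
```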

michaelkleber commented 4 years ago

Hi Ben,

It's still an open question whether there might be meaning in the individual bits of the ID assigned by FLoC. There are a wide range of possible clustering algorithms to consider, and they don't all have the same properties.

But even aside from the question of whether logistic regression on flocks is semantically meaningful, I would have expected the large cohort sizes to leave something too coarse to be useful for most lookalike audience use cases.