
This proposal has been replaced by the Topics API.
https://github.com/patcg-individual-drafts/topics

FLoC Simulation and Algorithm #95

Closed · shigeki closed this issue 3 years ago

shigeki commented 3 years ago

I've made a FLoC simulator that calculates the cohort ID from a host list and a cluster file: https://github.com/shigeki/floc_simulator. It has helped me understand the FLoC algorithm and its behavior in depth.

I have some questions regarding the current FLoC algorithms.

  1. As described in https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc, the current algorithm differs from the one described in the white paper: SimHash does not use the sum of random projections but the sum of random Gaussians. Do you have any plan to publish a new paper explaining it in detail?
  2. CityHash64 v103 is used for hashing. It is not the latest version and may have vulnerabilities. Why is v103 used here?
  3. The SortingLshClusters file is an important piece of data. How often is it updated? Is the version number related to this file format?
  4. The blocked flag is included in the cluster file. If one SimHash has t-closeness > 0.1, are other SimHashes in the same cohort ID blocked as well? It seems that their number is more than 2000. Right?
michaelkleber commented 3 years ago

Bravo! Thank you for putting this together.

  1. [...] the current algorithm differs from the one described in the white paper: SimHash does not use the sum of random projections but the sum of random Gaussians.

These are two different mathematical ways of describing the same thing.

In the white paper it's described this way: The original vector is a 1-hot encoding of the 64-bit hashes of the visited domains — that is, take a vector of length 2^64, with a 1 in location h if you've visited a domain that has h as its 64-bit hash. Then you perform dimension reduction from dim=2^64 to dim=50 by multiplying by a 2^64-by-50 matrix whose entries are pseudorandom draws from a gaussian distribution.

To implement that in code, of course, you don't really want to compute all 2^64 * 50 matrix entries! So instead you only compute the height-50 columns of the matrix that correspond to the 64-bit hashes of the domains actually visited. (All the other matrix entries would get multiplied by 0 anyway, so they don't matter.) Adding up the 50-dimensional gaussian vectors is exactly the matrix multiplication operation.
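
To make that column-sampling view concrete, here is a minimal Python sketch (not Chrome's actual code): one simple way to get a reproducible per-domain "matrix column" is to seed a Gaussian generator with the domain's 64-bit hash. SHA-256 stands in for CityHash64 v103 here and Chrome's exact derivation may differ, so the output will not match Chrome's cohort computation.

```python
import hashlib
import numpy as np

def domain_hash64(domain: str) -> int:
    # Stand-in for Chrome's CityHash64 v103; any stable 64-bit hash works
    # for illustration, but the outputs will not match Chrome's.
    return int.from_bytes(hashlib.sha256(domain.encode()).digest()[:8], "little")

def simhash(domains, num_bits=50):
    """Sum one pseudorandom 50-dim Gaussian 'matrix column' per visited
    domain, then take the sign of each coordinate."""
    total = np.zeros(num_bits)
    for d in set(domains):
        # Seeding the generator with the domain's 64-bit hash makes this
        # column the same for every user who visited the domain.
        rng = np.random.default_rng(domain_hash64(d))
        total += rng.standard_normal(num_bits)
    bits = 0
    for i, coord in enumerate(total):
        if coord > 0:          # bit i of the SimHash is the sign of coordinate i
            bits |= 1 << i
    return bits

print(f"{simhash(['example0.com', 'example1.com', 'example2.com']):050b}")
```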

  2. CityHash64 v103 is used for hashing. It is not the latest[...]

See here for Chrome's guidance on choice of hash functions. Basically, we have one standard 64-bit hash function that we never change, so that it's OK to persist hashes across Chrome versions. There's no harm from a domain name hash collision, so the precise choice is not particularly important.

  3. The SortingLshClusters file is an important piece of data. How often is it updated?[...]

The clusters file will mostly stay the same for as long as we're using the current clustering algorithm ("Chrome.2.1"). The prefixes certainly will not change. We may need to set the is_blocked bit on some additional clusters, for example if their size gets too small.
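
To make the mapping from SimHash to cohort ID concrete, here is a toy version of the sorting-LSH idea. This is not the real "Chrome.2.1" cluster file, which records SimHash bit prefixes together with the is_blocked flag; the function names, the boundary encoding, and the tie handling here are simplifications of mine. The minimum cohort size of 2000 matches the figure discussed later in this thread.

```python
from bisect import bisect_left

def build_boundaries(simhashes, min_cohort_size=2000):
    """Toy sorting LSH: sort all users' SimHashes, then cut the sorted list
    into contiguous blocks of at least `min_cohort_size` users. Each boundary
    is the largest SimHash in a block. (Ties across a cut are a detail the
    real prefix-based encoding handles properly.)"""
    ordered = sorted(simhashes)
    boundaries = []
    count = 0
    for h in ordered:
        count += 1
        if count >= min_cohort_size:
            boundaries.append(h)
            count = 0
    if count and boundaries:
        boundaries[-1] = ordered[-1]   # merge any remainder into the last block
    return boundaries

def cohort_id(simhash, boundaries):
    """Cohort ID of a SimHash = index of the first boundary that is >= it."""
    return bisect_left(boundaries, simhash)
```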

  4. The blocked flag is included in the cluster file. If one SimHash has t-closeness > 0.1, are other SimHashes in the same cohort ID blocked as well?

We perform the t-closeness computation for the people in each cohort, not for each SimHash. That is, if we detect a large enough number of people in a particular cohort visiting some type of sensitive site, then we block the use of that entire cohort. You can't tell that just from looking at the SimHash or the visited domains — the sensitive sites might even be sites whose domains do not contribute to the FLoC calculation at all.
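
As a hedged sketch of what a per-cohort check like this could look like (Chrome's actual t-closeness computation over sensitive categories is more involved; the function name, inputs, and single-category simplification below are mine), the decision is made for the cohort as a whole rather than for any individual SimHash:

```python
def blocked_cohorts(cohort_members, visited_sensitive, t=0.1):
    """Flag whole cohorts, not individual SimHashes.
    cohort_members: cohort ID -> list of user IDs.
    visited_sensitive: user ID -> True if that user visited a site in some
    sensitive category (such sites need not even contribute to the FLoC input).
    A cohort is blocked when its rate of sensitive-site visitors deviates
    from the overall population rate by more than t."""
    all_users = [u for members in cohort_members.values() for u in members]
    overall = sum(visited_sensitive[u] for u in all_users) / len(all_users)

    blocked = set()
    for cohort, members in cohort_members.items():
        rate = sum(visited_sensitive[u] for u in members) / len(members)
        if abs(rate - overall) > t:
            blocked.add(cohort)   # the entire cohort is blocked
    return blocked
```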

[...] It seems that their number is more than 2000. Right?

Sorry, I don't understand what you're asking here.

shigeki commented 3 years ago

Thanks, @michaelkleber. Your answers have greatly helped my understanding.

1. Sum of Random Gaussians

Adding up the 50-dimensional gaussian vectors is exactly the matrix multiplication operation.

I've got it. Thanks. I had misunderstood the original x: (0.3, 0.8) described in Fig. 2 of the white paper. I drew the figure of the matrices and vectors below for my own understanding. Please correct me if I'm wrong.

[figure: FLoC_Calculation 001]

2. CityHash64 v103

There's no harm from a domain name hash collision, so the precise choice is not particularly important.

I agree with that. The only annoyance for me is that there are no libraries other than Chrome that support v103.

3. SortingLshClusters and version

The prefixes certainly will not change. We may need to set the is_blocked bit on some additional clusters, for example if their size gets too small.

It is good for us that compatibility will be kept for some time. We only have to take care that the set of blocked cohorts may change.

4. Blocked cohort

It seems that their number is more than 2000. Right?

Sorry, I don't understand what you're asking here.

I thought that the t-closeness computation was made for each SimHash. So if one SimHash in a certain cohort had to be blocked, other innocent SimHashes in the cohort would also be blocked, even though more than the k-anonymity threshold of 2000 people are included. I now understand that this was wrong. Thanks.

Simple FLoC Simulation Results

I've made two simple FLoC simulations using domain names of the form example[0-n].com to see how cohort IDs behave between two users who have the same number of domains in their histories.

Simulation #1: Effect of a single domain difference

User A and User B have the same number of domains, n, in their histories, but exactly one domain differs and the rest are the same. As the number of domains increases, the difference between the two users' cohorts becomes smaller. With example[0-n].com domains, a history of more than 1000 domains seems better, and more than 5000 is best, for comparing their cohort IDs.

[figure: cohortIddiff_10000domains]

Simulation #2: Effect of the number of differing domains

This simulation shows how many differing domains still leave User A and User B in the same cohort when n = 1000. Even a difference of up to 1.2% of the 1000 domains in their histories leads to the same cohort.

[figure: cohortId_diff_1000domains]

The simulation results above would differ greatly with other domain names and history sizes, but they helped me understand how FLoC works and how cohorts behave.
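
For anyone who wants to reproduce a rough version of these two simulations without the cluster file, the sketch below reuses the simhash() function from the earlier comment and compares only the leading SimHash bits of the two users as a stand-in for the cohort ID, since the real cohort ID also depends on the SortingLshClusters file. Keeping 20 of the 50 bits is an arbitrary choice for this proxy, and the otherdomain{i}.com names are hypothetical.

```python
# Assumes simhash() from the sketch earlier in this thread.

def leading_bits(h, total_bits=50, keep_bits=20):
    # Use the top `keep_bits` SimHash bits as a rough proxy for the cohort.
    return h >> (total_bits - keep_bits)

def same_proxy_cohort(n_domains, n_diff):
    """User A visits example0..example{n_domains-1}.com; User B replaces the
    last `n_diff` of them with hypothetical otherdomain{i}.com names."""
    a = [f"example{i}.com" for i in range(n_domains)]
    b = a[:n_domains - n_diff] + [f"otherdomain{i}.com" for i in range(n_diff)]
    return leading_bits(simhash(a)) == leading_bits(simhash(b))

for n in (10, 100, 1000):
    for k in (1, 12):
        if k < n:
            print(f"n={n:4d}  differing domains={k:2d}  "
                  f"same proxy cohort: {same_proxy_cohort(n, k)}")
```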

michaelkleber commented 3 years ago

Great, thank you for the analysis so far. Yes, I think you understand all the details correctly.

For the initial origin trial (Version "chrome.2.1"), the input to the SimHash calculation is limited only to domains that were visited in the past seven days. So I think that we need to expect each person to visit only a small number of domains, certainly much less than 1000.

This is an important difference from your simulation. If a person visits 1000 domains, then the 50-dimensional point corresponding to their history is probably a large distance away from the origin, and even a large distance away from any of the <=20 hyperplanes that separate regions from one another. So I'm not very surprised that you need to change more than 12 domains to get to a different cohort — after 1000 random steps, the walls are quite far away.

If a person has only visited ten domains in the past seven days, though, they are much more likely to be near some of the walls, and I would expect it to be much easier to observe the cohort changing when you add, remove, or modify even a single domain.
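
Here is a quick numerical check of that intuition (illustrative only): one coordinate of the summed point is a sum of n independent Gaussians, so its typical magnitude, i.e. the distance to the "wall" where that bit would flip, grows roughly like the square root of n, while swapping a single domain changes the coordinate by roughly a constant amount.

```python
import numpy as np

rng = np.random.default_rng(0)
trials = 2000
for n in (10, 100, 1000):
    # One SimHash coordinate is the sum of n Gaussian contributions.
    margins = np.abs(rng.standard_normal((trials, n)).sum(axis=1))
    # Swapping one domain replaces one Gaussian contribution with another.
    swaps = np.abs(rng.standard_normal(trials) - rng.standard_normal(trials))
    print(f"n={n:4d}  median |coordinate| = {np.median(margins):5.1f}  "
          f"median single-swap change = {np.median(swaps):4.2f}")
```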

shigeki commented 3 years ago

Thanks for your comments.

So I think that we need to expect each person to visit only a small number of domains, certainly much less than 1000.

In https://www.chromium.org/Home/chromium-privacy/privacy-sandbox/floc, I found this description:

Minimum number of different qualifying Chrome user browsing histories (sets of visited domains) in a cohort: 735

So I thought that 1000 domains could be reasonable for the simulations. Is the minimum number of 735 not the number of unique domains a user visited? Or is the history not limited to 7 days but includes all stored history?

shigeki commented 3 years ago

If a person has only visited ten domains in the past seven days, though, they are much more likely to be near some of the walls, and I would expect it to be much easier to observe the cohort changing when you add, remove, or modify even a single domain.

That's right. The plot below shows the cohort ID difference as a function of the number of differing domains between two users' histories when each visited 10 domains. They end up with the same cohort ID only when User A's domain list is exactly equal to User B's.

[figure: cohortId_diff_10domains]

I thought that this case does not bring many benefits for FLoC, because it is rare for two users' domain lists to be exactly the same. Instead, do we need to look for users who share some fraction of the domains they visited, according to their interests?

michaelkleber commented 3 years ago

Ahhh I'm sorry, what I wrote on the floc page is not clear enough.

Minimum number of different qualifying Chrome user browsing histories (sets of visited domains) in a cohort: 735

What I meant is: Every cohort has at least 2000 different people in it, and also, among those 2000 people there are at least 735 different sets of visited domains. Each person has only one set of visited domains, let's say example0.com, example1.com, ..., example9.com if they visited 10 different domains in the past week. And sometimes that whole set is the same for two different people.

From the SimHash calculation point of view: Each person has a history which maps to a 50-dimensional real-valued point, and then taking the signs and masking off many bits is a way of carving that 50-dim space up into regions. There are at least 2000 people in each region of 50-dim space, and also there are at least 735 different points in each region of 50-dim space. But each point is only based on the sites a person visited in the past week, so probably involves only a short list of domains.

Is that clearer now?

shigeki commented 3 years ago

What I meant is: Every cohort has at least 2000 different people in it, and also, among those 2000 people there are at least 735 different sets of visited domains.

Thanks, I understand it clearly now. To be honest, I thought 1000 domains in a user's history seemed too large. In the case above, on average 2000 / 735 ≈ 2.7 people share one set of visited domains, though the ratio would vary depending on the history size.

But each point is only based on the sites a person visited in the past week, so probably involves only a short list of domains.

I think that one of the objectives of this origin trial is to estimate how effective cohorts are for audience interests, even when users have only a small number of domains in their histories.

Recently, a weighting feature was deployed in Chrome. If each domain is weighted based on how often it appears in the history, the divergence of users' cohorts would decrease according to the sites they view frequently.
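
As a purely hypothetical illustration (the exact weighting Chrome uses is not described here), such a weighting could be layered on the earlier simhash() sketch by scaling each domain's Gaussian contribution by its visit count:

```python
import numpy as np

# Assumes domain_hash64() from the sketch earlier in this thread.

def weighted_simhash(visit_counts, num_bits=50):
    """Hypothetical weighted variant: each domain's Gaussian column is scaled
    by how many times the domain appears in the history window.
    visit_counts: domain -> number of visits."""
    total = np.zeros(num_bits)
    for domain, count in visit_counts.items():
        rng = np.random.default_rng(domain_hash64(domain))
        total += count * rng.standard_normal(num_bits)
    bits = 0
    for i, coord in enumerate(total):
        if coord > 0:
            bits |= 1 << i
    return bits
```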

michaelkleber commented 3 years ago

I think that one of the objectives of this origin trial is to estimate how effective cohorts are for audience interests, even when users have only a small number of domains in their histories.

That is true, but for this version of clustering we're requiring at least seven different domains visited. But like everything else, we might change that in the future, as we get better data to understand the privacy and utility of the choices.

Recently, a weighting feature was deployed in Chrome. If each domain is weighted based on how often it appears in the history, the divergence of users' cohorts would decrease according to the sites they view frequently.

The "one-hot encoding", where we just look at the boolean of whether or not someone visited a domain, is better for privacy. But we are investigating other clustering algorithms that take more information into account.

shigeki commented 3 years ago

Thanks for your explanation. I will submit a new issue if I have any new findings from my simulations. Closing this.