WICG / floc

This proposal has been replaced by the Topics API.
https://github.com/patcg-individual-drafts/topics

Randomly join cohorts to frustrate tracking #59

Open kuro68k opened 3 years ago

kuro68k commented 3 years ago

One major concern that has come up several times is that FLoC membership will be used to track users, through de-anonymization, combination with PII, observation over time and other techniques.

One way to frustrate this would be to randomly join cohorts, in addition to those joined based on perceived interests. For example, 25% of the cohorts a user is part of might be randomly selected. The advertiser is able to target with 75% accuracy, and the data becomes far less useful for tracking purposes.

The randomization would need to be done carefully to avoid it being detected, e.g. it would need to change at approximately the same rate as genuine cohort membership.

phaabe commented 3 years ago

The question is, how can you access and change it? Just by visiting random websites, say, about roller blades?

And what would be the impact if only a few people did this?

kuro68k commented 3 years ago

I should have mentioned, the proposed opt-out mechanism is to send random cohorts. If that is adopted it will have to be done in a way that cannot be detected, e.g. it can't be used to detect private browsing mode.

So I propose incorporating that into the opt-in use case.

michaelkleber commented 3 years ago

@kuro68k A person is only in a single cohort at any time. So I'm not sure what "25% of the cohorts a user is part of might be randomly selected" means.

Our plan is for a person who doesn't want to participate in FLoC to just give out no signal at all, rather than a random one. This is the same as how we plan to handle people with too little browsing history, or people in incognito mode, for example. The TAG review recommended this over random cohorts, and we agree.

kuro68k commented 3 years ago

@michaelkleber I don't think sending no signal is a good idea; it reveals too much information. Either the user has disabled FLoC, which is itself a cohort, or they are in private browsing mode. I will check the TAG review.

What I mean by 25% is that 25% of the time the FLoC should be randomized.
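To make the 25% idea concrete, here is a minimal sketch of reporting a random cohort some fraction of the time. It is only an illustration, not anything from the FLoC explainer: the function names, the cohort count of 1000, and the true cohort 42 are all made up for the example.

```python
import random

def report_cohort(true_cohort: int, num_cohorts: int, p_random: float,
                  rng: random.Random) -> int:
    """Return the true cohort, or with probability p_random a uniformly random one."""
    if rng.random() < p_random:
        return rng.randrange(num_cohorts)
    return true_cohort

def expected_accuracy(num_cohorts: int, p_random: float) -> float:
    """Chance that a single report equals the true cohort.

    A random draw can still land on the true cohort by luck, hence the
    extra p_random / num_cohorts term.
    """
    return (1 - p_random) + p_random / num_cohorts

# Hypothetical parameters, chosen only for illustration.
NUM_COHORTS = 1000
rng = random.Random(0)
reports = [report_cohort(42, NUM_COHORTS, 0.25, rng) for _ in range(10_000)]
observed = sum(r == 42 for r in reports) / len(reports)
print(f"expected {expected_accuracy(NUM_COHORTS, 0.25):.4f}, observed {observed:.4f}")
```

With a large number of cohorts the accuracy of a single report stays very close to 1 - p_random, which is where the "75% accuracy" figure comes from.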

Sora2455 commented 3 years ago

Keep in mind that if the site can track the user independently (say, they're logged in), it will be able to observe the changing cohorts and figure out the "real" one, unless the browser shows that site the same random one the entire time.
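A rough sketch of that attack, assuming the site sees a fresh draw on every visit of a user it already recognizes. The helper names and all parameters (1000 cohorts, 25% randomization, 20 visits) are hypothetical.

```python
import random
from collections import Counter

def observe(true_cohort: int, num_cohorts: int, p_random: float,
            visits: int, rng: random.Random) -> list[int]:
    """Cohort values a site sees across visits if each visit is randomized independently."""
    return [
        rng.randrange(num_cohorts) if rng.random() < p_random else true_cohort
        for _ in range(visits)
    ]

def infer_real_cohort(observations: list[int]) -> int:
    """The site's guess: the most frequently reported value."""
    return Counter(observations).most_common(1)[0][0]

rng = random.Random(1)
obs = observe(true_cohort=42, num_cohorts=1000, p_random=0.25, visits=20, rng=rng)
guess = infer_real_cohort(obs)
print(f"site's guess after {len(obs)} visits: {guess}")
```

Because random values rarely repeat while the genuine cohort shows up on most visits, a simple majority vote is enough to strip the noise.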

dmarti commented 3 years ago

If there is an adverse consequence to being in a lower-rated FLoC, then the same consequence is likely to happen to users who turn off FLoC and are assumed to have a low rating.

For example, when you visit a site to apply for a visa to visit certain countries, the cohort (or the fact that the cohort is missing) will likely be collected to be used later as one input to the algorithm to decide which visitors receive an extra search on arrival.

TheMaskMaker commented 3 years ago

If I understand this correctly, the suggestion here is for opted-in users to have 25% of their cohorts randomly assigned to thwart the tracking of opted-out users? If so, what would be the impact on the originally reported accuracy and usefulness of cohorts?

dmarti commented 3 years ago

@Sora2455 The longitudinal privacy section is relevant to this. Once a site has been given one cohort, either that value can't change or the rate of change needs to be limited.
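One way to picture the rate-limited variant is a per-site cache, so that a given site keeps seeing the same value until some minimum interval has passed. This is only a sketch under made-up assumptions: the class name, the 7-day interval, and the other parameters are hypothetical and are not taken from the FLoC explainer.

```python
import random

class StickyCohortReporter:
    """Per-site cache: a site sees one value until min_interval has elapsed."""

    def __init__(self, num_cohorts: int, p_random: float,
                 min_interval: float, rng: random.Random):
        self.num_cohorts = num_cohorts
        self.p_random = p_random
        self.min_interval = min_interval
        self.rng = rng
        self._per_site = {}  # site -> (reported cohort, time it was chosen)

    def report(self, site: str, true_cohort: int, now: float) -> int:
        cached = self._per_site.get(site)
        if cached is not None and now - cached[1] < self.min_interval:
            return cached[0]  # still inside the interval: repeat the old value
        if self.rng.random() < self.p_random:
            value = self.rng.randrange(self.num_cohorts)
        else:
            value = true_cohort
        self._per_site[site] = (value, now)
        return value

# Hypothetical setup: time measured in days, one draw per site per 7 days.
reporter = StickyCohortReporter(num_cohorts=1000, p_random=0.25,
                                min_interval=7.0, rng=random.Random(2))
week_one = {reporter.report("example.test", true_cohort=42, now=day)
            for day in range(7)}
print(f"distinct values seen in the first week: {len(week_one)}")
```

This slows the majority-vote attack down to one sample per interval rather than one per visit; it does not remove it.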

samuelweiler commented 3 years ago

Our plan is for a person who doesn't want to participate in FLoC to just give out no signal at all, rather than a random one. This is the same as how we plan to handle people with too little browsing history, or people in incognito mode, for example. The TAG review recommended this over random cohorts, and we agree.

I read the TAG review differently. I see:

In what circumstances in regular browsing mode [would sites calling the API receive an invalid/null result]? When a user hasn't been assigned to a valid cohort yet? Is that a common enough case that the probability of a 'null' result being due to use of incognito mode is relatively low? (Sites should not be able to detect the use of incognito mode.)

@lknik built a demonstration of using this odd response to (help) detect incognito mode: https://github.com/w3ctag/design-reviews/issues/601#issuecomment-799539696
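The TAG question about how common a null result is in regular browsing can be framed with Bayes' rule. The sketch below assumes incognito sessions always return null, and every number in it is a placeholder rather than measured data.

```python
def p_incognito_given_null(p_incognito: float, p_null_in_regular: float) -> float:
    """P(incognito | null result), assuming incognito sessions always return null.

    p_incognito:        assumed share of sessions that are incognito
    p_null_in_regular:  assumed share of regular sessions that return null
                        (opted out, too little history, and so on)
    """
    p_null = p_incognito + (1 - p_incognito) * p_null_in_regular
    if p_null == 0:
        raise ValueError("a null result never occurs under these inputs")
    return p_incognito / p_null

# Hypothetical inputs: 10% incognito sessions, varying null rates in regular mode.
for null_rate in (0.01, 0.05, 0.20):
    posterior = p_incognito_given_null(0.10, null_rate)
    print(f"null rate {null_rate:.2f} in regular mode -> P(incognito | null) = {posterior:.3f}")
```

The rarer a null result is in regular browsing, the more a null result gives away about incognito mode, which is the concern raised in the review.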

lknik commented 3 years ago

One major concern that has come up several times is that FLoC membership will be used to track users, through de-anonymization, combination with PII, observation over time and other techniques.

One way to frustrate this would be to randomly join cohorts, in addition to those joined based on perceived interests. For example, 25% of the cohorts a user is part of might be randomly selected. The advertiser is able to target with 75% accuracy, and the data becomes far less useful for tracking purposes.

The randomization would need to be done carefully to avoid it being detected, e.g. it would need to change at approximately the same rate as genuine cohort membership.

I wonder, though: if FLoC is not a tracking mechanism, and not facilitating tracking is among its design goals, how would one "frustrate tracking"?

kuro68k commented 3 years ago

Tracking can be done by collecting as many unique bits of data about the user as possible, and then inferring that it is the same user even in incognito mode. For example, an adversary might record:

  • Browser and version
  • HTTP headers
  • Canvas fingerprint
  • Battery level
  • IP address

They can then note that, although a client is now reporting no cohort membership, all the rest of the data indicates that it is the same user they saw previously, so they can re-use the previously collected cohort data.

The cohort itself is another item of data to add to the fingerprint of the client used for tracking. Therefore, just like with canvas fingerprinting and similar techniques, privacy conscious users will want to randomize it. In that sense it may be good for some users as polluting the FLoC data will probably be quite helpful privacy-wise, but of course at the expense of all the other users who aren't aware that they need to do it.

lknik commented 3 years ago

Tracking can be done by collecting as many unique bits of data about the user as possible, and then inferring that it is the same user even in incognito mode. For example, an adversary might record:

  • Browser and version
  • HTTP headers
  • Canvas fingerprint
  • Battery level
  • IP address

This can be done regardless of FLoC.

The cohort itself is another item of data to add to the fingerprint of the client used for tracking. Therefore, just like with canvas fingerprinting and similar techniques, privacy conscious users will want to randomize it. In that sense it may be good for some users as polluting the FLoC data will probably be quite helpful privacy-wise, but of course at the expense of all the other users who aren't aware that they need to do it.

In this case you treat FLoC as a mere identifier. If one wants to randomise this (or anything else), so be it.

kuro68k commented 3 years ago

The point I was trying to make is that FLoC is adding more bits of identifying information to what adversaries can collect, at a time when many of us are working hard to reduce the number of bits, or the ratio of static bits to random ones.
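As a back-of-the-envelope illustration of "more bits": an identifier that splits users evenly into N groups carries log2(N) bits. The cohort count and population below are hypothetical, and the real gain for a tracker is smaller whenever the cohort correlates with attributes it already has.

```python
import math

def shannon_bits(probabilities: list[float]) -> float:
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

def uniform_bits(num_values: int) -> float:
    """Upper bound on bits carried by an identifier with num_values equally likely values."""
    return math.log2(num_values)

# Hypothetical numbers, for illustration only.
NUM_COHORTS = 1024
USERS_SHARING_FINGERPRINT = 1_000_000

added_bits = uniform_bits(NUM_COHORTS)
remaining = USERS_SHARING_FINGERPRINT / NUM_COHORTS
print(f"cohort ID adds up to {added_bits:.1f} bits")
print(f"anonymity set shrinks from {USERS_SHARING_FINGERPRINT} to about {remaining:.0f} users")
```

Uneven cohort sizes carry fewer bits on average than the uniform bound, which `shannon_bits` can be used to check for any assumed size distribution.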

TheMaskMaker commented 3 years ago

The point I was trying to make is that FLoC is adding more bits of identifying information to what adversaries can collect, at a time when many of us are working hard to reduce the number of bits, or the ratio of static bits to random ones.

I think you are right to point out that FLoC is a tracker. I'd recommend identifying specific ways of de-anonymizing and patching each one, rather than making the cohorts less useful. Currently some of the biggest privacy backdoors in FLoC skip cohorts entirely.

To my current understanding, FLoC does not intend to eliminate user experience customization. It is intended to increase privacy. What constitutes 'privacy' is under debate in the various proposals, but none of them wish to eliminate customization.

FLoC 'privacy' is based on the concept of cohorts. Making cohorts only 75% accurate would, combined with other factors such as imperfect prediction, drop customization accuracy so low that a random guess would be more efficient. At best, it would make only 50% of a cohort valuable, with variance that could drop it lower.

Quite a few publishers would be very concerned with this. I think a more focused approach would be more effective. I understand your desire to address the general issue, but the complexities make that difficult.