WICG / floc

This proposal has been replaced by the Topics API.
https://github.com/patcg-individual-drafts/topics

Cohort IDs can be collected over time to create cross-site tracking IDs #100

Open johnwilander opened 3 years ago

johnwilander commented 3 years ago

In https://github.com/WICG/floc/issues/99, it is stated that "FLoC is not useful for tracking." I don't think that's accurate.

As far as I know, the user's cohort will not be partitioned per first-party site, so multiple sites can observe the cohort ID in sync as it changes week after week. A hash of the cohorts seen so far will likely get more and more unique as the weeks go by.

Websites or tracker scripts on websites can expose the cohorts they've seen, keyed by week, to help all trackers identify the user, like this:

let cohortCollectionForWebsiteA = {
  "week01_2022": "0666",
  "week03_2022": "A566",
  "week04_2022": "2111",
  "week05_2022": "1171",
  "week07_2022": "749B",
};

let cohortCollectionForWebsiteB = {
  "week01_2022": "0666",
  "week02_2022": "0030",
  "week05_2022": "1171",
  "week06_2022": "7311",
  "week07_2022": "749B",
};

Trackers send these to a server for matching across websites; in the example above, this results in the intersection [ "week01_2022", "week05_2022", "week07_2022" ].
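For concreteness, a minimal sketch of that server-side matching step, assuming per-site collections like the ones above are reported to the tracker's server (the function and variable names are illustrative, not part of any API):

// Hypothetical server-side matcher: find the weeks in which two sites
// observed the same cohort ID, i.e. likely the same user.
function matchCohortHistories(collectionA, collectionB) {
  return Object.keys(collectionA).filter(
    (week) => week in collectionB && collectionA[week] === collectionB[week]
  );
}

// With the example collections above:
// matchCohortHistories(cohortCollectionForWebsiteA, cohortCollectionForWebsiteB)
// => ["week01_2022", "week05_2022", "week07_2022"]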

The cohort collections can be tied to PII on sites that have access to such information about the user. This would allow a tracker with just a collection on one site to call a server and get back PII for that user.

If cohorts were partitioned (maybe they are?), the tracking effort would take longer, but observed partitioned cohort IDs can be sorted to potentially create a unique ID across websites. You get something like snippets of DNA that eventually become unique as a set, and trackers will know which cohorts are widespread and which ones quickly narrow the search space.

Even if the tracker cannot get to a unique ID for a particular user, the entropy boost from collected cohort IDs is tremendous and can easily be combined with existing fingerprinting entropy such as language settings.
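As a rough illustration of that combination (the function, its output format, and the choice of extra signal are hypothetical, and this assumes a browser context), a tracker could reduce the cohorts observed so far to an order-independent string and append other fingerprinting entropy:

// Illustrative only: an order-independent digest of the cohorts observed on
// this site, combined with one existing fingerprinting signal (language).
function cohortFingerprint(observedCohorts) {
  const sortedCohorts = [...new Set(observedCohorts)].sort().join("-");
  return `${sortedCohorts}|${navigator.language}`;
}

// e.g. cohortFingerprint(["749B", "0666", "1171"]) might yield "0666-1171-749B|en-US"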

Sorry if I'm missing something in the above analysis or if this was filed earlier.

johnwilander commented 3 years ago

To take this to the crowd metaphor: Before the pandemic and some time back, I attended a Mew concert, a Ghost concert, Disney on Ice, and a Def Leppard concert. At each of those events I was part of a large crowd. But I bet you I was the only one to attend all four.

dmarti commented 3 years ago

There is a suggestion to make the cohort "sticky" for a given site, so that once a site has seen the cohort ID once, it will not see a different one. ("Longitudinal Privacy" section: https://github.com/WICG/floc/commit/d822a35f4bfe7d5003fda4a7628fca2da8ace8d3 )

johnwilander commented 3 years ago

> There is a suggestion to make the cohort "sticky" for a given site, so that once a site has seen the cohort ID once, it will not see a different one. ("Longitudinal Privacy" section: d822a35 )

Thanks. That's more or less the same analysis.

I don't think updating cohort IDs at different times solves the problem. See the sorting attack I mentioned.

Making it sticky is interesting. First of all, I assume the website will not be allowed to delete it, so it becomes a persistent "visited" flag. Second, the sticky cohort ID becomes a persistent fingerprinting signal per website that carries over even if different accounts log in to the site or the site tries to clear its state. Third, sticky cohort IDs could be set up for a small set of bounce tracking domains and be used to pick up a persistent ID. Finally, being persistently assigned a cohort for ad targeting purposes can be really bad for users (see the stories on baby ads after a miscarriage and marriage ads after a cancelled wedding) and probably not popular with advertisers who want "fresh" interest signals to target.

dmarti commented 3 years ago

Thank you, good points about how sticky cohort IDs could interact with other state preserved by a site. (#77 covers the similar issue of sites being able to observe the timing of when a user joins and leaves the "null cohort").

othermaciej commented 3 years ago

If cohort IDs are sticky, would the user still be able to delete/reset their cohort ID, e.g. by deleting website data or clearing history?

michaelkleber commented 3 years ago

Hi John,

Right, this is indeed the "Longitudinal privacy" question. We've been considering a few different mitigations. As you know, this is an iterative and open process, and we expect to implement one or more of these solutions in future versions of FLoC. (Remember that third-party cookies are still around in Chrome, so FLoC-based "slow fingerprinting" does not pose any tracking risk beyond what 3p cookies are already offering today.)

  1. There's stickiness, as Don pointed out — or maybe not permanent stickiness, but the cohort changing only slowly on each site. (Of course it would still need to be cleared along with any other first-party state, for the reasons you mentioned above. That would put a person into the has-no-cohort category until the next time it would get re-calculated.)

  2. There's the related idea of computing a person's cohort at different times on different sites. This isn't the same as updating at different times, which I think you were referring to above. The idea here is that different sites a person visits would see a flock derived from a different time window.

    As @npdoty pointed out in #69, how useful this is depends on how different a person's browsing is on different days. Real-world data seems like the best way to measure the decrease in fingerprintability here.

  3. There's the idea of adding per-site noise to the output of the hash function, as mentioned in the original explainer. This is mostly a Differential Privacy approach to further address the concern about leaking browsing history. But once you're adding noise, that noise can vary by which site you're on, so that your histories of cohorts-over-time on different sites look pretty different from each other.

    This requires measuring the privacy/utility trade-off as you vary the amount of noise. If the noise weight is 0, we need to worry about the attack you described; if the noise weight is large, it drowns out the browsing-based signal entirely, and the cohort ID is effectively your per-site random number of the week. The question is whether there is a useful value in between (a rough sketch of this knob follows below).

Those aren't the only possibilities, but they do seem collectively promising enough to warrant further exploration.
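To illustrate the noise-weight knob in option 3 only (this is not Chrome's actual implementation; the hash function, the parameters, and the integer cohort are all invented for the example), per-site output noise could look something like this:

// Illustrative sketch of option 3: perturb the cohort value with noise derived
// from the visited site, so different sites see (somewhat) different cohort IDs
// for the same user. trueCohort is treated as an integer here.
function simpleHash(str) {
  let h = 0;
  for (const ch of str) h = (h * 31 + ch.charCodeAt(0)) >>> 0;
  return h;
}

function noisyCohortForSite(trueCohort, siteOrigin, noiseBits = 2) {
  // XOR a few low-order bits chosen per site; noiseBits = 0 reproduces the
  // un-noised cohort, while a large noiseBits drowns out the signal.
  const mask = simpleHash(siteOrigin) & ((1 << noiseBits) - 1);
  return trueCohort ^ mask;
}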

fabiomariotti commented 3 years ago

It would be interesting to explore these limitations for third parties. It somehow puts an upper bound on quality: a third party would fall into either tracking or low quality, whatever the standard quality would be.

But if this comment is correct, we would have GDPR issues right away. I think in the EU this will not pass.

TheMaskMaker commented 3 years ago

> This requires measuring the privacy/utility trade-off as you vary the amount of noise. If the noise weight is 0, we need to worry about the attack you described; if the noise weight is large, it drowns out the browsing-based signal entirely, and the cohort ID is effectively your per-site random number of the week. The question is whether there is a useful value in between.

@michaelkleber @dmarti

I've read several issues describing noise-based solutions for tracking cases, but I wonder if this approach is possibly a problematic antipattern for the following reason:

I am not sure a 'fine line' exists between 'noise low enough to not hurt the publisher' and 'noise high enough to thwart the tracking case', because the tracker actually gains more value and accuracy, relatively, from the noisy signal. If anything, it promotes reliance on trackers and the need to do extra tracking.

If the noise is low enough that the signal has value but is imperfect, then any abusive server-side tracking system could simply keep working as before, add an equal uncertainty to its algorithm's acceptable parameters, and widen the scope. In fact, coordinated trackers will get a more accurate view of reality than the layman who simply accepts the noisy signal, and their actions are further rewarded. I worry noisy approaches will only serve to hurt small publishers and those who use the system as intended.

michaelkleber commented 3 years ago

Can you explain how "the tracker actually gains more value and accuracy from the noisy signal"?

It seems like you're thinking about an attacker who already knew your identity on a bunch of different sites, who could see multiple different noised versions of the underlying flock and so better guess what the un-noised version would be. That can't possibly get them more information than the un-noised flock, though.

But in any case, I don't see how that would help them learn your real flock on some other site where they don't already know you're the same person.

TheMaskMaker commented 3 years ago

To your first question, the 'more' refers to more value and accuracy relative to an 'average system' that uses FLoC as you intend it. To put it another way, assume the tracker and the average system compete to get the best data, once in a noisy world and once in a noiseless world. The tracker's information in the noisy world is worse than in the noiseless world, but, relative to the average system in the noisy world, the tracker has the capacity to reduce error past the noise and achieve an accuracy higher than the noise would suggest. The limit is, as you say, the un-noised floc; but the average system only has a noised floc, which is worse than what the tracker can reconstruct. Thus the tracker is rewarded even more in the noisy system, even though everyone's data, average and tracker alike, is worse. This is because the tracker has taken less of a hit to its accuracy, which is the opposite of the intention.

Let's say, to use purely imaginary numbers, that in a noisy world the tracker has 5% less accuracy, but an average system has 10% less accuracy. The difference has only grown; the average system is merely hurt. Naturally, whether this occurs depends on many factors, including the tracker's ability to gain access to associating data (which it has in John's example).

In the context of John's example, the tracker does know you are the same person, or failing that, can make a probabilistic guess with increased accuracy from tracked personal data. Thus the tracker can remove the noise to some extent or entirely, while the average system cannot remove the noise at all. I fear this may apply to many other noise-based thwarting methods as well.

Does this clarify my concern?

michaelkleber commented 3 years ago

Thank you, @TheMaskMaker, good point and good explanation.

It seems to me that the concern you raise is even more of an issue in a setting without FLoC, where a "well-behaved" ad network has only contextual information, but one who performs some covert tracking gains a very large advantage by combining data across sites.

I do agree, though, that we want to minimize the advantage gap that you point out.

TheMaskMaker commented 3 years ago

@michaelkleber Happy to help. Many publishers would be quite worried about receiving such noisy signals and I think it would do more harm than good. Also glad to see you care as much as I do, working on this on the weekend!

I do want to clarify, though, that outside of a FLoC system, the behavior I am describing can in fact be 'well behaved', depending on what the system is. It is important to separate FLoC's privacy definition from those of other proposals.

For example in SWAN this behavior (or similar, still reading it over) is privacy-preserving and acceptable, because SWAN has a very different definition of what constitutes privacy, and a different system of safeguards. So I don't think it is as much of a risk outside of floc, as long as the alternative system accounts for it.

In FLoC, this goes against FLoC's definition of privacy, and I fear one proposed safeguard (noisy signals) would worsen the FLoC-specific problem, so I point it out quite emphatically, whereas in another proposal it might not be an issue at all.

It does get rather confusing to address each proposal from a different perspective, but of course I try to speak in each proposal with the definitions and morals of that particular proposal in mind!

TheMaskMaker commented 3 years ago

@johnwilander I'm curious about your take: do you think any of the proposed solutions (under the FLoC perspective) solve the tracking case in FLoC, and what do you think of my concern that noisy signals would worsen it? I don't want to digress too far from your original issue, so let's say

let cohortCollectionForWebsiteA = { "week01_2022": "0666", "week03_2022": "A566", "week04_2022": "2111", "week05_2022": "1171", "week07_2022": "749B" };

let cohortCollectionForWebsiteB = { "week01_2022": "0666", "week02_2022": "0030", "week05_2022": "1171", "week06_2022": "7311", "week07_2022": "749B" };

If we make these signals 20% less reliable for the website, or let's say flip 1/5 of the signals, I feel like it is a huge noise hit to the publisher, which now must stack 20% error on top of its already imperfect analysis; and by adding a third or fourth site C, D, couldn't a tracker easily recover any data lost to the 20% noise?
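A hedged sketch of that recovery step, under the assumption that the coordinated sites share a user identifier and that each site's weekly observation is independently wrong about 20% of the time (all names and numbers here are hypothetical):

// With several sites reporting, a per-week majority vote across sites recovers
// the true cohort for most weeks, while a single publisher is stuck with its
// own 20% error.
function majorityVote(observationsPerSite, week) {
  const counts = {};
  for (const siteObservations of observationsPerSite) {
    const cohort = siteObservations[week];
    if (cohort !== undefined) counts[cohort] = (counts[cohort] || 0) + 1;
  }
  let best = null;
  for (const cohort of Object.keys(counts)) {
    if (best === null || counts[cohort] > counts[best]) best = cohort;
  }
  return best; // most frequently observed cohort for that week
}

// e.g. majorityVote([siteA, siteB, siteC, siteD], "week05_2022")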

What are your thoughts?

michaelkleber commented 3 years ago

@TheMaskMaker Ah, I think there is an important reason that I don't find this as worrisome as you do: when I talk about "adding noise" to FLoC, that does not necessarily mean making the underlying signal substantially less useful!

For example, in the way FLoC is calculated in Chrome's first Origin Trial, your cohort is some sort of LSH ("locality-sensitive hash") of the set of websites you've visited recently. A noised version could mean: when you go to site X, your cohort is the LSH of your true browsing history plus one randomly-chosen additional site Rx from the 1000 most popular in your country. When you go to site Y, the LSH input uses a different randomly-chosen additional site Ry, etc.

Now each FLoC is probably still reasonably good, as a targeting signal! After all, you wouldn't expect a person visiting a single additional popular site to completely change what you'd want to target with. But nevertheless this would probably make your FLoC different on different sites, to defeat the cross-site tracking threat that John started with.
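To make the shape of this concrete, here is a hedged sketch under the assumptions Michael describes (a locality-sensitive hash over recently visited domains, plus one extra popular site mixed in per visited site); the hash functions, the 8-bit cohort size, and all names are invented for illustration and are not the Origin Trial's actual algorithm:

// Sketch of the idea only: the cohort is a locality-sensitive hash of
// recently visited domains, and the per-site "noise" is one extra popular
// domain added to the input.
function domainHash(domain) {
  let h = 2166136261;
  for (const ch of domain) h = ((h ^ ch.charCodeAt(0)) * 16777619) >>> 0;
  return h;
}

function simHash8(domains) {
  // 8-bit SimHash: each domain votes on each bit of the output.
  const votes = new Array(8).fill(0);
  for (const d of domains) {
    const h = domainHash(d);
    for (let bit = 0; bit < 8; bit++) votes[bit] += (h >> bit) & 1 ? 1 : -1;
  }
  return votes.reduce((acc, v, bit) => acc | ((v > 0 ? 1 : 0) << bit), 0);
}

function noisedCohort(browsingHistory, siteOrigin, popularSites) {
  // Deterministically pick one extra popular site per visited site (Rx, Ry, ...).
  const extra = popularSites[domainHash(siteOrigin) % popularSites.length];
  return simHash8([...browsingHistory, extra]);
}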

TheMaskMaker commented 3 years ago

@michaelkleber I see what you mean, but I think it affects this variant as well. Let me give you an example, and it will either reveal a misunderstanding on my part with flocs/cohorts or, if I understand correctly, help you to see my point about the threat with noise. I think this example will be useful in either event in explaining the problem.

Suppose systems A, B, C, and D are collaborators in a FLoC world. They use outside tracking and work together. Meanwhile, Website X does not.

A.com and B.com are websites and use login to recognize a user. C-ware is a tablet OS that has visibility into any app installed on it, linked to the owner's account; its default browser can access flocs. D-browser has a user account and so can also recognize the user, and sends browsing data home with its device sync functionality.

A, B, C, and D sell data not just for web advertising, but also to e-mail lists, snooping groups, for improving their own products with data science, etc. Thus any floc data is valuable to them even outside of web advertising, though of course it is useful for web advertising as well.

We have 4 or more cooperating groups that can access the floc ID, recognize an individual, and send data to a server as per @johnwilander 's example, but with 2 more players.

Let's also assume 2 different floc hash models, to explore the major possibilities.

1. For the first 'experiment', let's assume a very convergent hashing algorithm. This algorithm aggressively places users in very focused flocs based on the major sites visited, so an added random site would be unlikely to change the floc ID. Let's say a change occurs at a rate of 25%.

This model is very useful for publishers and advertisers because you can trust each floc to reliably represent interests. You can trust them to mean roughly the same thing over time. You can target campaigns to them.

The random site only changes the ID on System B. Now the trackers (A, B, C, D) get even more information! They can say with confidence which floc the user is normally a part of, and they also know an additional floc the user might belong to! They even have information on which site caused this change. This could open the door to even more information.

Meanwhile, the publishers lose a targeting opportunity 25% of the time, as the randomized flocs appear too irregularly for them to understand their value, unless they join the tracking system.

2. In the second 'experiment', let's assume a very divergent algorithm. This algorithm severely alters the floc ID with small changes in sites visited/interests. Let's say the rate is 75%.

Now there is a good chance A, B, C, and D all have different flocs for this user. But they don't care. They know the user is a member of these 3-4 flocs, and can correlate those flocs into a floc group. They now know these flocs represent similar interests (assuming the algorithm is not so divergent as to make them useless). They can now target the floc group across the 2 websites and through any data sales opportunities presented to the OS and browser, and thus don't miss any targeting opportunities.

Meanwhile, the publisher that does not join the tracking group has a mess of random flocs, and only 25% appear on their 1 site often enough to be recognizable as useful. They lose 75% of targeting opportunities.
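A minimal sketch of the correlation step in this second experiment, assuming the collaborators already share a user identifier from login, OS account, or browser sync (all names are hypothetical):

// Pool the per-site floc IDs observed for a recognized user and treat the
// union as one "floc group" for targeting and data sales.
function buildFlocGroups(reports) {
  // reports: [{ userId, flocId }, ...] from A.com, B.com, C-ware, D-browser
  const groups = new Map();
  for (const { userId, flocId } of reports) {
    if (!groups.has(userId)) groups.set(userId, new Set());
    groups.get(userId).add(flocId);
  }
  return groups; // userId -> Set of flocs seen across the collaborators
}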


Even with this noise method you mention, I think the greater the noise, the more beneficial cross-site tracking is in FLoC, and the more incentive there is to use flocs as cross-site trackers.