FLoC+Server - Githubissues

WICG / floc

This proposal has been replaced by the Topics API.

https://github.com/patcg-individual-drafts/topics

Other

935 stars 90 forks source link

FLoC+Server #20

Open jdwieland8282 opened 4 years ago

jdwieland8282 commented 4 years ago

@jkarlin, Would you entertain a conversation/thread about running cohort assembly on a trusted server in addition to assembling cohorts inside the browsers construct. I've written a proposal called FLoC+Server (FKA Gatekeeper), that seeks to copy the concepts behind cohort assembly onto a trusted, transparent server run by a not-for-profit entity. That entity is TBD and who it can be is still undefined, but conceptually, I think cohort assembly is too important to the ad tech ecosystem to be possible by just browsers.

https://github.com/MagniteEngineering/Gatekeeper

jkarlin commented 4 years ago

I'm interested in the proposal. We've also thought about such things. I generally like the notion of providing primitives to let folks do what works best for their pages, rather than having the browser dictate algorithms. But my general concerns with strategies where multiple cohorts exist simultaneously are:

1) More cohorts mean more information to leak. E.g., if a site starts with one ad-tech's cohort and switches to a second, it can just learn them both and know more information.

2) The browser will do its best to ensure that cohorts include as little sensitive information as possible. It may not do a perfect job, but it will try, as it wants to do its best to preserve the privacy of its users as the user's agent. There is no guarantee that third-party platforms will be able to do as good of a job. And the more third-parties there are, the more likely there are to be mistakes or even intentional abuses.

edit: My specific concern with this proposal is that it exposes raw user browsing data off-browser. That goes against the goal of our work which is to expose as little user data beyond the browser as possible while still allowing web content to be monetized.

michaelkleber commented 4 years ago

Hi @jdwieland8282: I agree with both of Josh's concerns — and note that those apply to the Magnite's sibling proposal ProprietaryCohorts just as much as they do to the FLoC+Server (FKA Gatekeeper) that you explicitly asked about.

For FLoC+Server specifically, I expressed my strong reservations when Tom Kershaw presented this idea at the Web Adv BG 7/21 meeting: You're proposing a server whose job is to track information about individual people over time, and that is much less private and much more invasive than any of the proposals Chrome is interested in. By contrast, while Criteo's SPARROW also introduces a trusted server, one key thing we are trusting that server to do is to not track people!

jdwieland8282 commented 4 years ago

Hi @jkarlin, What kinds of primitives have you considered providing? One browser being in one cohort at a time seems like a reasonable compromise. We've always thought of running and cycling enthusiasts as two segments, that specificity let brands like Trek and Brooks advertise accordingly but modifying marketing strategies to advertise to "athletically minded people" shouldn't be too difficult.

Just to be clear we imagine that there will only be a few trusted cohort providers, we don't think switching from one to another will happen all that often, but to manage to that we could refresh a users cohort membership daily, weekly, etc. So that switching cohort assemblers would have little benefit.

wrt:

2. The browser will do its best to ensure that cohorts include as little sensitive information as possible. It may not do a perfect job, but it will try, as it wants to do its best to preserve the privacy of its users as the user's agent. There is no guarantee that third-party platforms will be able to do as good of a job. And the more third-parties there are, the more likely there are to be mistakes or even intentional abuses.

We believe we could build cohorts with just the TLD+1 and a user id. We will log all inbound and outbound request from the cohort assembler, those logs would be auditable by browsers. Further the cohort assemblers code would be open sourced so anyone could see how it operates.

jdwieland8282 commented 4 years ago

@michaelkleber Sparrow at the moment does not have cohort assembly as a feature, I don't see the two proposals as comparable wrt cohort assembly specifically.

How will FLoC build cohorts w/o a user id and some browsing information tied to that id?

tomkershaw1 commented 4 years ago

@michaelkleber just to clarify again, the intent of both the gatekeeper and prop cohorts proposals is not to build a new server side tracking mechanism -- it is simply to externalize the cohort creation process rather than keep it inside the browser or (worse) inside a server-side assembly process in the browser platforms. it is clear that some cross-browser assembly process is required here, and for transparency, trust, and user control we feel this process should be done in the light of day, not inside black boxes. We are also extremely concerned with the computational intensity of resigning cohort creation to inside the browser. We acknowledge that having cohort assembly take place anywhere (including in browsers or in servers) requires a governance process, but the implication that "its ok if we do it but its not ok if someone else does it" seems to me to be problematic.

michaelkleber commented 4 years ago

Sparrow at the moment does not have cohort assembly as a feature...

@jdwieland8282 SPARROW certainly does have cohort assembly as a feature; that's what Interest groups: audiences new building blocks is all about! This is @BasileLeparmentier's work, not mine, so I should let him advocate for it :-).

But anyway, I am very supportive of putting more work into ways to create interest groups to which we can serve ads with a TURTLEDOVE-or-SPARROW mechanism! The fact that the advertising interest group does not get joined with any publisher-site identifier gives us much more privacy protection.

the intent of both the gatekeeper and prop cohorts proposals is not to build a new server side tracking mechanism...

@tomkershaw1 Are you trying to make the distinction that the Magnite proposals do include a server-side tracking mechanism, but that it is only a means to an end, not a goal itself? I appreciate that! But for the browser to do a reasonable job protecting privacy, we need to take into account both the intended uses and the malicious abuses of any mechanism we launch.

it is clear that some cross-browser assembly process is required here...

I don't think that is clear at all! We're trying to design clustering in a way that does not involve data flowing across browsers, or where any data that does flow has no identifiers attached.

...we feel this process should be done in the light of day, not inside black boxes

I suspect we will need to agree to disagree about which proposal has a "black box" problem.

jdwieland8282 commented 4 years ago

@jdwieland8282 SPARROW certainly does have cohort assembly as a feature; that's what Interest groups: audiences new building blocks is all about! This is @BasileLeparmentier's work, not mine, so I should let him advocate for it :-).

:-) @michaelkleber It literally says in @BasileLeparmentier description "not currently described". We should let him weigh in, but what I believe he is describing is TD style interest groups, AKA retargeting segments. What we are referring to hear are lookalike segments. There is a big difference between the two, interest segments are built by a domain seeing a users and adding them to an interest group segment, lookalike segments are predictive, a domain need not necessarily have seen a user but in aggregate users can be assigned to cohorts based on browsing behavior.

The focus of both our proposals, is lookalike/FLoC/ML based audiences, not interest group aka retargeting segments.

Not sure if this was an over site or if my last question got mixed up in the shuffle, but I'd be curious to hear your thoughts on it. "How will FLoC build cohorts w/o a user id and some browsing information tied to that id?"

BasileLeparmentier commented 4 years ago

Hi all,

I will have to say that I do agree with both of you:

There is a cohort assembly in SPARROW, that we think is quite flexible and would cover many use cases, not only retargeting but also segmentation, interest based targeting etc..
However, by design it is still impossible to do something as powerful as in FLOC and the Magnite proposals. Indeed, using interest group as building blocks means that we cannot leverage the user full browsing history to create cohorts (as we never have access to it). It would probably be possible to mimic Lookalike thanks to meta interest group, but it is likely to be significantly more complex than with a clustering algorithm.

From our understanding, it seems you do want to support this use case with the FLOC proposal, and you want to create cohorts thanks to the full user browsing history. This would be done by the browser and the cohort would be available at all time. this should allow to cover better / help cover lookalike and other similar use cases.

Form what I understand, Magnite are proposing that somebody else than the browser should also be able to create such cohorts. And I do understand the merit that having another entity could bring. What we need to find out, is how we can set up such an entity, and with which guarantee so that it can do the cohort assembly job (In the FLOC setting).

Did i get this right?

jdwieland8282 commented 4 years ago

Thanks @BasileLeparmentier yes that is indeed what we are proposing. For everyone's sanity I'll offer a general definition of the 2 types of segments being discussed.

TURTLEDOVE - interest segments, the ad tech industry commonly refers to these as retargeting segments
FLoC - Cohort segments, the ad tech industry commonly refers to these as lookalike segments

The Magnite proposal, FLoC+server (FKA Gatekeeper), seeks to move the assembly of cohorts onto a trusted server (in addition to continuing to allow browser based cohort assembly) because philosophically don't think browsers should be the only entity with access to the ingredients needed to create them. That's our basic premise, the rest of the details are negotiable. I'd ask that the Chrome team weigh in on the concept of whether cohort assembly (lookalike segments) outside the browser is something they would support?

michaelkleber commented 4 years ago

Oooh, no, I don't agree with @jdwieland8282's definitions at all!

I'd say there are three types of segments that we should talk about:

TURTLEDOVE serving: 1a. "Retargeting segments" which you can construct using the simple API in the original TURTLEDOVE explainer, or maybe a modest expansion like the boolean operations proposed in SPARROW. A key feature here is that you actually observed a person doing some event that led you to believe they should be in a segment.

1b. "Lookalike segments" where you have some group of people (maybe a retargeting segment from 1a) and you want to construct a larger group of people who are kind of similar to that group. It would be fine to advertise to these folks using a private mechanism like TURTLEDOVE; the problem is finding them in the first place.
FLoC serving: "Cohort segments" where you don't need to know something about a large cohort of people at the outset; rather, you are content to observe a cohort and learn something about it based on its behavior. If each person were just given a persistent random number from 1 to 1000, you would get a cohort segment, but since we'd expect all 1000 cohorts would be pretty similar, the division into 1000 buckets wouldn't be very useful for many things. As the people in a cohort become more similar in some way, that division into cohorts becomes more useful. Assignments to these cohorts are only useful if we can observe the behavior of the cohort to learn something about them — therefore it would not be useful to use them for TURTLEDOVE-style serving.

I think we're talking about what I've called 1b.

Since TURTLEDOVE interest group memberships cannot be joined with first-party user identity, there's a lot of latitude in how we build those groups. I think figuring out better ways to build 1b's is a rich area that we should work on. On-device and off-device methods both seem reasonable.

I think trying to use 2 as a replacement for 1b is bound to end in tears. FLoC's cohorts use an extremely restrictive model — each person is only in one flock! You might try to build 2's for something like demographic segments, I guess, but for lookalike segments it seems like a poor fit.

Thank you, Jeff, for making it clear enough that we can finally figure out why we're not connecting.

tomkershaw1 commented 4 years ago

@michaelkleber i think it makes sense to focus on 1b at the moment, though if a trusted entity is required (or allowed) for the lookalike use case, using it for other use cases would seem sensible. i think the other point we are making is that whether or not we need cross browser communications is material. if we are stating as a rule that no inter-browser sharing of history can occur, ever, then that is different than saying we will "try" to not have that happen, or it doesn't happen "except" when we need to do x, y or z (such as calculating the correct noise level to maintain anonymity). if cross browser sharing is needed for any use case, our argument is that an off device method should be allowed.

i also think, since this is a sandbox exercise, that it would be extremely useful to augment the FLOC and 1b use cases to include a scenario where there is a trusted entity outside of the browser. the current design assumption is that the browser should treat all external entities as hostile and untrustworthy, and make it as close to impossible as we can for those entities to do something nefarious. that leads to some potentially problematic outcomes in terms of computational intensity on the browser, impact on the device, and process for change management -- as well as transparency. while i am fine with that being the default, having a system that slightly reduces those assumptions and allows for a trusted entity to process some level of browsing history would be very helpful to help us collectively assess the cost and trade-offs associated with the complete derisking of the process.

michaelkleber commented 4 years ago

We're very willing to explore the possibilities for "a trusted entity outside of the browser" as part of the Privacy Sandbox. And indeed for the Chrome proposal for an Aggregate Measurement Service we put a lot of effort into that line of thinking!

But not all trusted entity ideas are the same:

For that Aggregate Measurement Service, there are two servers run by different organizations — and even if one of them goes rogue, it's cryptographically impossible for it to learn any private information as long as the other server keeps playing by the rules.
For the Gatekeeper in Criteo's SPARROW, a single server does indeed see private information — but it does not see any cross-site user identifier, and does not store any user profile. If it goes rogue, it could try to fingerprint users based on their interest groups; that's what it's being trusted not to do.
For your FLoC+Server proposal, a single server sees private information and sees it attached to a stable user identifier and build profiles as a result — so if it goes rogue, all privacy is lost.
For Chrome's FLoC proposal, we're working on clustering approaches like Locality-Sensitive Hashing, where no server sees private information, and the information it does see isn't attached to a stable user identifier. (@jdwieland8282 This is the one-sentence answer to your hanging question "How will FLoC build cohorts w/o a user id and some browsing information tied to that id?"; sorry I missed saying that sooner.)

So let's be careful to not sweep these all together. There is a huge range of what capabilities a trusted server might be granted.

tomkershaw1 commented 4 years ago

@michaelkleber this is very helpful, thanks. we are most familiar with the Sparrow proposal, which i consider a limited trust model where no single entity ever has the ability to identify a user unless some very unnatural acts occur. i would argue that for the FLOC proposal its a bit melodramatic to say "all privacy is lost." it would be more accurate to say that whatever identifying information is stored or available on that server could be lost, but the general idea is that this information would be extremely limited and would become irrelevant quite quickly. Its also the case that the browser could revoke that Gatekeeper's access at any time, which provides an additional level of protection (assuming rogue events are detected, of course). and there is no reason why additional anonymity cannot be added by distributing these functions across multiple machines as proposed for other use cases. i think the important thing is that we are acknowledging all four of the above use cases and all can be explored. that was not clear to me at all as previously the discussion focused only on Sparrow/Turtledove and that admittedly limited use case. I do think its important to have data from several approaches to understand the risk v complexity trade offs associated with each.

michaelkleber commented 4 years ago

Yes, there are definitely a wide range of trusted-server approaches (the four above and surely others too) that we should explore. But of course browsers will be more skeptical of the ones with more privacy risk.

i would argue that for the FLOC proposal its a bit melodramatic to say "all privacy is lost." it would be more accurate to say that whatever identifying information is stored or available on that server could be lost, but the general idea is that this information would be extremely limited and would become irrelevant quite quickly.

As I understand the FLoC+Server existing-user flow (steps 8-9 here), the server receives and logs a unique user ID, the domain the user is on, and arbitrary contextual information about what they're doing. That seems like an unbounded privacy risk to me — am I missing something about the proposal that causes the data to be limited or to become irrelevant quickly?

Its also the case that the browser could revoke that Gatekeeper's access at any time, which provides an additional level of protection

If the server goes rogue, it surely has the ability to tell each of its member sites the unique userID for every one of their users, just by correlating historical logs. ("Hey, you know that guy with flock 1234 who visited URL X at 12:34 yesterday? He's unique ID 9876543210.") Since each site could also have first-party cookies, the browser revoking the rogue server's access at that point seems like quite a weak response — full cross-site tracking would still last until the user deletes all of their data on every member site.

jdwieland8282 commented 4 years ago

Yes, there are definitely a wide range of trusted-server approaches (the four above and surely others too) that we should explore. But of course browsers will be more skeptical of the ones with more privacy risk.

We'd expect nothing less. What kinds of security guarantees would you need to see from a trusted entity before chrome considers them trustworthy? Any thoughts on how we should begin?

As I understand the FLoC+Server existing-user flow (steps 8-9 here), the server receives and logs a unique user ID, the domain the user is on, and arbitrary contextual information about what they're doing. That seems like an unbounded privacy risk to me — am I missing something about the proposal that causes the data to be limited or to become irrelevant quickly?

Correct, there are two storage components, 1) request/response logs these are encrypted and for the purpose of auditing. If browser wanted to make sure the Gatekeeper was behaving, these logs would show all inbound and outbound data for some TBD period of time & 2) "unique user ID, the domain the user is on, and arbitrary contextual information", we need this information for cohort assembly. The idea is user ids go into the Gatekeeper and Cohort Ids come out.

Lookalike audiences require some historical knowledge of the users browsing habits as compared to an index. In our proposal the index is constructed by what the Gatekeeper knows in aggregate, then we compare each browser to that avg and assign them to a lookalike audience. This is how we've thought about 1b to date, have you thought about constructing these audiences differently?

michaelkleber commented 4 years ago

Yup, these are two good questions.

What kinds of security guarantees would you need to see from a trusted entity before Chrome considers them trustworthy?

This is about what browsers would need, not Chrome specifically. We need to find something we can standardize.

This is how we've thought about 1b to date, have you thought about constructing these audiences differently?

Chrome hasn't published any 1b-type proposals yet.

This Lookalike Audience proposal by @benjaminsavage could be a starting point: it contemplates a private alternative to your "what the Gatekeeper knows in aggregate". A private approach to "compare each browser to that avg and assign them to a lookalike audience" seems viable, since that involves comparisons which could happen on the user's device.

But there are surely other ways to go about this as well.

benjaminsavage commented 4 years ago

Chrome hasn't published any 1b-type proposals yet.

OK, I'll bite =). I've just filed https://github.com/jkarlin/floc/issues/24 as one way FLoC + the Aggregated Reporting API could be extended to do a 1b-type of "Lookalike Targeting" in a similar fashion to my previous proposal that @michaelkleber referenced.