WICG / sparrow


Questions on Information Flow #5

Open michaelkleber opened 4 years ago

michaelkleber commented 4 years ago

Thank you for your attention to TURTLEDOVE and your desire to improve on it! To help me understand SPARROW, let me ask some questions about the flow of information between the browser, the Gatekeeper, and the ad network.

For the purposes of this Issue, I'll assume that everyone agrees that the Gatekeeper is perfectly trustworthy.

At that point, the ad network calls regular DSPs for contextual bids and the browser calls the Gatekeeper with interest groups and page contextual data.

Does this mean the "contextual data" is all computed by the browser and the publisher's ad network? In TURTLEDOVE, by contrast, it's possible for each DSP to learn the URL of the page and compute its own contextual signals (discussion).

Resulting bids are returned to the ad network which selects the winner.

First, this seems like a very large information channel, probably hundreds of bits. What prevents the set of bids from encoding lots of data that we are trying to keep private from the ad network?

Second, it seems like the ad network sees these bids (from the Gatekeeper) and the contextual request (directly from the browser) at nearly the same moment. It seems like it would be straightforward to match the two up, since the bids can be influenced by (and therefore can encode) contextual signals. This would make the information leak in the previous paragraph even worse.

If the Gatekeeper is already being trusted to run the ad network's code faithfully and keep it secret (when producing the bids), could it run the auction code that selects the winning bid as well?

The Gatekeeper notifies the advertiser, with a variable delay around one minute, of the display, including the winning interest group, the ad data (campaign, product, layout), the bid value, and the publisher it was displayed on.

This event-level reporting seems like another opportunity for the ad network to join contextual information with interest group membership directly (or to conclusively join the two ad requests of a minute earlier, if that hasn't already been done).

TURTLEDOVE deals with this by only allowing aggregated reporting for information derived from both contextual and interest-group information, and we've had some discussion about the latency. But it doesn't sound like this is just about an hour vs a minute, since you later say "reporting data at the display level, with the interest group and publisher information, allows for the advertisers to learn better ML models".

It seems like knowing the interest group, the publisher, the bid, and the minute is more than enough to know the exact event, and so join the interest group with the user's publisher-site identity.

Can you see any way to avoid this?

The Gatekeeper receives interest group x publisher data, but cannot link this data to individual users since it has no user-level information.

Maybe this sentence sums up part of my worries about this proposal. The publisher knows the user's first-party identity ("Hi! I'm NYTimes subscriber 12345!"). All publisher data could be influenced by this: any signals that feed into bidding, the contents of the page, even the URL might contain PII.

I don't see any way to separate publisher data from user-level information. Are you looking at something differently?

BasileLeparmentier commented 4 years ago

Hi Michael, thank you for your reply and interest in SPARROW!

Before going through the various questions: we think you made a valid point, and having the Gatekeeper also handle the auction could add additional privacy protections. In this context, what do you think of the following proposals:

  1. The publisher is able to audit ad safety, which is paramount for online advertising, and can act on it by further blacklisting images, targeted domains, etc.
  2. The publisher is not able to link the ad with a specific user, and never knows (as before) the interest groups the user belonged to.
  3. Billing is transparent.

Taking into account the above, please find below the answers to your questions:

For the purposes of this Issue, I'll assume that everyone agrees that the Gatekeeper is perfectly trustworthy.

That’s an assumption we also make. We think that it could be enforced through contractual agreements and audit procedures, but we would welcome any technical idea (cryptography?) that would further ensure the Gatekeeper's trustworthiness. The Gatekeeper role definitely needs to be discussed and developed in upcoming W3C discussions.

At that point, the ad network calls regular DSPs for contextual bids and the browser calls the Gatekeeper with interest groups and page contextual data.

Does this mean the "contextual data" is all computed by the browser and the publisher's ad network?

The contextual data for the interest group bid is computed by the browser, in accordance with the publisher policy. It should contain the page URL, the user-agent, information about the ad (format, placement...). In order to make sure that the URL doesn't convey any user-identifying information, the browser or the Gatekeeper could edit it.

An example could be:

If the exact URL is lemonde.fr/specific_section/Specific_articles/userid=****, the browser or the Gatekeeper should only keep "lemonde.fr/specific_section/Specific_articles" in the IG request.
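A minimal sketch of this trimming rule, in Python (the function name and the drop-query-and-fragment policy are illustrative assumptions, not part of the proposal; identifiers embedded in the path itself, as in the example above, would additionally need pattern-based rules):

```python
from urllib.parse import urlsplit, urlunsplit

def trim_for_ig_request(url: str) -> str:
    """Keep only scheme, host, and path for the interest-group request,
    dropping the query string and fragment, which commonly carry user IDs."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

# e.g. a query-string variant of the example above:
trim_for_ig_request("https://lemonde.fr/specific_section/Specific_articles?userid=1234")
# -> "https://lemonde.fr/specific_section/Specific_articles"
```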

Please note that the contextual bid, run through the ad network and the DSP, is fully independent from the interest group bid.

Resulting bids are returned to the ad network which selects the winner.

First, this seems like a very large information channel, probably hundreds of bits. What prevents the set of bids from encoding lots of data that we are trying to keep private from the ad network?

Second, it seems like the ad network sees these bids (from the Gatekeeper) and the contextual request (directly from the browser) at nearly the same moment. It seems like it would be straightforward to match the two up, since the bids can be influenced by (and therefore can encode) contextual signals. This would make the information leak in the previous paragraph even worse.

If the Gatekeeper is already being trusted to run the ad network's code faithfully and keep it secret (when producing the bids), could it run the auction code that selects the winning bid as well?

Yes, the Gatekeeper running the auction would actually solve this concern.

The Gatekeeper notifies the advertiser, with a variable delay around one minute, of the display, including the winning interest group, the ad data (campaign, product, layout), the bid value, and the publisher it was displayed on.

This event-level reporting seems like another opportunity for the ad network to join contextual information with interest group membership directly (or to conclusively join the two ad requests of a minute earlier, if that hasn't already been done).

TURTLEDOVE deals with this by only allowing aggregated reporting for information derived from both contextual and interest-group information, and we've had some discussion about the latency. But it doesn't sound like this is just about an hour vs a minute, since you later say "reporting data at the display level, with the interest group and publisher information, allows for the advertisers to learn better ML models".

It seems like knowing the interest group, the publisher, the bid, and the minute is more than enough to know the exact event, and so join the interest group with the user's publisher-site identity.

Can you see any way to avoid this?

Our understanding is that in the case where the interest group bid wins,

The publisher gets:

  • At bid time: exact bid time, contextual data (URL), including local user id
  • In reporting, with one minute delay: bid time with some noise, contextual data, targeted domain, ad bundle - but not the interest group

The advertiser gets:

  • In the reporting, with one minute delay: bid time with some noise, contextual data (URL), interest group

Adding some noise in bid time reporting should be enough to prevent bridging the data between advertiser and publisher in almost all cases, making such an attack useless. Some similar corner cases could be found with TURTLEDOVE, but without any material impact for the proposal.
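As a rough illustration of the noising idea (the uniform distribution and the one-minute jitter are assumptions; SPARROW does not specify the exact mechanism):

```python
import random

def noised_report_time(bid_time_s: float, jitter_s: float = 60.0) -> float:
    """Report the event at bid time plus a random delay of up to jitter_s
    seconds, so reports cannot be lined up with exact bid times."""
    return bid_time_s + random.uniform(0.0, jitter_s)
```

With many ad requests per minute on a publisher, each noised report is then compatible with many candidate bid events, which is the ambiguity this paragraph relies on.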

The Gatekeeper receives interest group x publisher data but cannot link this data to individual users since it has no user-level information.

The publisher knows the user's first-party identity ("Hi! I'm NYTimes subscriber 12345!"). All publisher data could be influenced by this: any signals that feed into bidding, the contents of the page, even the URL might contain PII.

I don't see any way to separate publisher data from user-level information. Are you looking at something differently?

As we said above, the browser sends the request which contains contextual information. It is in charge of ensuring (by trimming the URL, down to the domain level should it be necessary) that no PII is contained in the request.

Should the browser see actors systematically trying to share PII using SPARROW or TURTLEDOVE, it could choose (via a well-defined procedure) to prevent them from participating in interest group bids.

bmilekic commented 4 years ago

Bonjour Basile, Hi Michael

Great proposals and discussion -- it feels like good progress is being made. I've been following both TURTLEDOVE and SPARROW proposals with great interest. Quick comments on a couple of the points to add to the discussion:

If the Gatekeeper is already being trusted to run the ad network's code faithfully and keep it secret (when producing the bids), could it run the auction code that selects the winning bid as well?

Yes, the Gatekeeper running the auction would actually solve this concern.

If I'm not mistaken, this sounds more and more like certain SSPs/Exchanges could play the role of Gatekeeper, provided they keep the Interest Group based bid req path completely separate and isolated from the Contextual bid req path.

The Gatekeeper notifies the advertiser, with a variable delay around one minute, of the display, including the winning interest group, the ad data (campaign, product, layout), the bid value, and the publisher it was displayed on.

This event-level reporting seems like another opportunity for the ad network to join contextual information with interest group membership directly (or to conclusively join the two ad requests of a minute earlier, if that hasn't already been done).

TURTLEDOVE deals with this by only allowing aggregated reporting for information derived from both contextual and interest-group information, and we've had some discussion about the latency. But it doesn't sound like this is just about an hour vs a minute, since you later say "reporting data at the display level, with the interest group and publisher information, allows for the advertisers to learn better ML models".

It seems like knowing the interest group, the publisher, the bid, and the minute is more than enough to know the exact event, and so join the interest group with the user's publisher-site identity.

Can you see any way to avoid this?

Our understanding is that in the case where the interest group bid wins,

The publisher gets:

  • At bid time: exact bid time, contextual data (URL), including local user id
  • In reporting, with one minute delay: bid time with some noise, contextual data, targeted domain, ad bundle - but not the interest group

The advertiser gets:

  • In the reporting, with one minute delay: bid time with some noise, contextual data (URL), interest group

Adding some noise in bid time reporting should be enough to prevent bridging the data between advertiser and publisher in almost all cases, making such an attack useless. Some similar corner cases could be found with TURTLEDOVE, but without any material impact for the proposal.

I believe the concern is that if the advertiser/DSP also gets access to a publisher-provided "1st party" user ID in the Contextual bid req path (sans Interest Group), there exists an attack vector where a bidding advertiser/DSP can try to link that pub-provided user ID (and all associated on-pub-site behaviours), as observed from the Contextual bid req path, with the Gatekeeper-sent Interest Group bid req events (and slightly delayed reporting events), which now also identify the publisher and context.

Therefore, I'm assuming that there's an assumption / implicit suggestion being made that the Contextual request path should not contain any personal identifiers, including 1st party/pub-provided "user IDs"?

The Gatekeeper receives interest group x publisher data but cannot link this data to individual users since it has no user-level information.

As above... I think there's an assumption being made that the publisher-triggered Contextual request path shall contain no pub-provided User ID.

Pl-Mrcy commented 4 years ago

Hi @bmilekic,

SSPs are, from our perspective, in a good position to step up and assume the role of gatekeepers. Other actors such as cloud providers could be interested as well. Another possibility is for current buyers to "split" into independent entities with a strict Chinese wall implementation. All in all, there are definitely several pre-existing actors that could take on this new role, which would provide a hefty dose of variety and competition, resulting in more innovation.

I want to clarify one point: in SPARROW, contrary to TURTLEDOVE, there is only one request including contextual AND interest-based signals. In the diagram here, the contextual request (grey arrows) passes through a direct relationship between the advertiser and the publisher, completely outside of the Privacy Sandbox (such direct contextual calls would also exist in a more complete TURTLEDOVE diagram, on top of those going through the Privacy Sandbox).

The point you make about the potential attack vector is still a valid one. However, we think that the delay we propose would give this kind of attack such a low return on investment that it becomes irrelevant from a business perspective. Linking the two requires a very low volume of ad requests per minute, making this attack impossible to scale by definition.
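The scaling argument can be made concrete with a back-of-envelope calculation (the function and figures below are illustrative assumptions, not numbers from the proposal):

```python
def expected_candidates(requests_per_minute: float, noise_window_minutes: float) -> float:
    """Expected number of bid events an attacker must consider as possible
    matches for one noised report; the larger this is, the harder linking is."""
    return requests_per_minute * noise_window_minutes

# A publisher serving ~200 ad requests/minute with a 1-minute noise window
# leaves ~200 candidate events per report; only at ~1 request/minute does a
# single report pin down a single event.
```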

michaelkleber commented 4 years ago

If possible, I'd like to keep the question "How can we trust the Gatekeeper?" separate from this issue. We'll certainly need to talk about who could be appropriately trustworthy. But I plan to focus on what we can design if we assume the Gatekeeper is trusted by browsers and ad tech alike.

In order to make sure that the URL doesn't convey any user-identifying information, the browser or the Gatekeeper could edit it.

@BasileLeparmentier This seems hard even in the case where it's unintentional — which evidently happens plenty, if the search results for [PII in URLs] are any indication. It would be much worse if the information were being deliberately hidden.

I would much rather have a system in which the contextual ad request is allowed to contain all the context, including the real URL and any other first-party information the publisher wants to use. (Note that this is exactly the opposite of what @bmilekic said, but I think in line with @Pl-Mrcy's reply.)

And more broadly, I don't want to put the browser or the Gatekeeper in the position of needing to police the information sent through some channel. Instead I want a design in which there just is no channel to join up information that needs to remain separate.

Suppose we agree that (1) we don't want a way for the publisher to learn the interest groups of a visitor to their site, and (2) we don't want to police the contents of the contextual targeting URL. The only way to satisfy those requirements is if the Gatekeeper is the only server that gets to know both the URL and the interest group at the same time.

Is there some variant of SPARROW that meets this bar?

brodrigu commented 4 years ago

Suppose we agree that (1) we don't want a way for the publisher to learn the interest groups of a visitor to their site, and (2) we don't want to police the contents of the contextual targeting URL. The only way to satisfy those requirements is if the Gatekeeper is the only server that gets to know both the URL and the interest group at the same time.

I think we first need to call out whether the advertiser needs to know the publisher or, more granularly, the page (not necessarily the full URL). I think we can solve for this either way, but we should start with what is needed.

I don't think the solution lies in policing per se, but in having a protocol for providing only the information necessary for economic viability, in a way that promotes the privacy goals.

Pl-Mrcy commented 4 years ago

We understand your concerns about how user information may leak using the contextual information in the request. Please be assured that we share your concerns and that we want to find the best solution for all actors.

Even interest-based advertising requires some form of contextual data. Information about the "printing environment" (placement size, nature of the page content, etc.) is not a nice bonus but a must-have if we want the solution to actually be used by advertisers and publishers.

The contextual information could be used for many different purposes:

We obviously want to curtail the first and champion the others, since they (among others) are essential to the ad business. We think that adding some constraints on the contextual data that passes through (including the URL), or some additional latency, would make this breach inoperable.

It would still theoretically be possible to leverage it in a very limited way and it indeed doesn't cover all possible cases on paper. However, we want to make sure that the attack is arduous enough to make it economically irrelevant. Although an attacker could theoretically work to expose the interest groups of some users and eventually succeed every so often, we could be assured that this attack won't ever occur at scale, thus making the cost-benefit ratio strongly unfavourable, eroding the very motive for such attacks.

We are working on putting figures on the actual privacy risks associated with different levels of granularity for publisher information/latency, in a way that would be easily replicable by other parties.

Is this line of reasoning acceptable to you? Or, are you keen to accept only a solution that would present no theoretical breach, no matter how small, and would cover ALL cases by technical means only?

michaelkleber commented 4 years ago

I'm confused about what you think requires less privacy here.

Take the use case of brand safety for publisher and advertiser. This is explicitly within scope even for TURTLEDOVE; I don't see why SPARROW changes any of this. Quoting from my original explainer:

This model can support the use case where an advertiser is unwilling to run their ad on pages about a certain topic, for example: the contextual response signals could include the ad network's analysis of the topics on the page; the interest-group response could contain a signal that is a block-list of disallowed topics; and the JS bidding function could include logic to compare those two signals and in case of a match return a negative value, guaranteeing the interest-group-response ad would not show.

Publisher brand safety could work the same way: the metadata about an interest-group-targeted ad includes its topics (as determined by a server, just like today), publisher controls let them pick what topics are allowed/blocked, and on-device JS compares the two.

The key point here is that the sell-side contributes contextual topics, and rules about ad topics, while the buy-side contributes ad topics, and rules about publisher context. Sure, maybe the DSP wants to evaluate the URL on its own and not trust the SSP; it can do that as part of the contextual ad request. Likewise the SSP may want to run its own analysis of the creative, and it can do that during some review, just as it does today. None of that needs to change.
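The topic-matching logic quoted from the explainer could look something like the following (shown in Python for brevity; in TURTLEDOVE it would be the on-device JS bidding function, and all names and values here are illustrative assumptions):

```python
def compute_bid(contextual_topics: set[str], blocked_topics: set[str],
                base_bid: float) -> float:
    """Return a negative value when the page's topics (from the contextual
    response) intersect the advertiser's block-list (from the interest-group
    response), guaranteeing the interest-group ad will not show."""
    if contextual_topics & blocked_topics:
        return -1.0
    return base_bid
```

Publisher-side brand safety works symmetrically: swap in the ad's topics and the publisher's blocked-topic list.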

At some point these two things need to be joined, with each set of rules evaluating the corresponding state. Whether that's implemented in JS (TURTLEDOVE) or on a trusted server (SPARROW) doesn't change the fact that it can be done without giving any information back to the publisher or advertiser.

I understand many advantages of moving things to a trusted server — for example, freedom to have large ML models, real-time adjustment of campaigns, and not needing to expose your decision logic to competitors. Those are all clear benefits.

But if we worked out the server trust question, it seems like you can get all of those without offering new opportunities for tracking.

Pl-Mrcy commented 4 years ago

I am glad that we agree on the benefits brought by a gatekeeper.

Our perspective is that brand safety is two-fold: there is a live component (at bidding time) and a reporting element. One cannot work without the other: if you can't observe any "wrong-doing" in the reporting, you cannot properly update the rules or block-lists applied at bid time.

Let me take two simple examples to highlight where the system you describe wouldn't be enough:

These two examples particularly underline the fact that brand safety cannot be handled with an "on average" policy. Even one case could go viral and damage the advertiser brand or the publisher brand. This is also true of fraud, invalid traffic, and other use-cases which require detailed reporting (even though most of them can bear quite some delay) to be handled appropriately.

Currently, the availability of detailed information at the display granularity ensures that any infringement upon the defined policy (by a deliberate attacker or by mistake) can be spotted and those responsible held accountable. Many efforts in the industry are driven by this sense of traceability and strong accountability. TURTLEDOVE would undermine these efforts without offering any possible remedy.

michaelkleber commented 4 years ago

Based on your description, it seems like this has nothing to do with the question of whether the auction happens on a trusted server vs in the browser.

In both of your examples, the goal is the publisher seeing a report of "all interest group ads that appeared on my site." The aggregate reporting API would already offer that capability for any interest group that triggers showing enough ads on the site. Your worry is about ads that appear very few times (below some aggregate reporting threshold) and also are mis-classified by the ad servers responsible for filtering ad eligibility.
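A minimal sketch of such a thresholded aggregate report (the threshold value and the report shape are assumptions; TURTLEDOVE does not fix hard numbers):

```python
from collections import Counter

def aggregate_report(events: list[str], threshold: int = 10) -> dict[str, int]:
    """Per-interest-group impression counts on a site, suppressing any
    group whose count falls below the reporting threshold."""
    counts = Counter(events)
    return {ig: n for ig, n in counts.items() if n >= threshold}
```

The gap under discussion is exactly the suppressed entries: campaigns whose per-site counts stay below the threshold never appear in the report.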

But it seems to me that TURTLEDOVE puts the publisher in a better position to handle this threat than in the current RTB market, for two different reasons:

  1. When an ad targets an interest group, it is served as a web package with all subresources included, so at interest-group serving time it is possible to know how the ad will actually appear on the screen. In today's RTB, ad rendering can touch an unlimited number of servers, with responses that can change on different impressions, with no guarantee of reproducibility. So an ad served via TURTLEDOVE permits far more accountability.

  2. When an ad targets an interest group, the detail of which interest group won lets you trace down the problematic campaign. In regular RTB, the browser has no notion of the associated ad campaign, so there is no way it can help with this. In TURTLEDOVE, the browser knows what interest group won, and so what campaign the ad came from. Anyone seeing a bad ad would be in an ideal place to report not just the fact that it happened, but enough detail to root out the problem — either in the moment or after the fact, if the browser e.g. retains metadata about "ads you saw yesterday."

These guarantees, which the browser can be sure are true and which dramatically improve accountability, seem to me like a huge win to offset the risk of a mis-classified ad campaign that happens to show a single-digit number of impressions on a site.

Pl-Mrcy commented 4 years ago

Whether the auction happens on a trusted server vs in the browser has nothing to do with it indeed.

We really like this idea of a central ad control dashboard such as what you described above. It does not put the publisher in a better place, though. It does indeed give more information and tools to the browser/user. However, the publisher cannot pro-actively monitor its own property without the intervention of users. Furthermore, it is not at all clear how the publisher would access this data, and in what form, in the case of a problematic campaign reported by a user.

the risk of a mis-classified ad campaign that happens to show a single-digit number of impressions on a site.

Are you saying that the threshold for reporting would be around 10? The TURTLEDOVE proposal doesn't specify hard numbers for the various thresholds (reporting, and minimum interest-group size in particular). Can you give us more clarity on this part?

Lastly, the risk here is that there could be many campaigns with single digits of impressions and thus without any reporting. Since it is likely that there will be many rather small interest groups (advertisers will try to have as much granularity as possible), it may be that these unreported campaigns represent a significant chunk of displays on a given publisher. Even if we suppose that a handful of unsafe ads may not be so terrible, N times a handful of ads (for N unsafe campaigns) would raise some concerns.

Pl-Mrcy commented 4 years ago

We published an analysis on the impact of thresholds on publisher and advertiser reporting in this repository

We added the analysis pseudo-script so that other actors in the industry can run it on their own data. Don't hesitate to share your results and comment.