Open Pl-Mrcy opened 4 years ago
Hi Paul,
The Multi-Browser Aggregation Service Explainer, over in the Conversion Measurement API repo, contains our ongoing work on figuring out how to handle noise in reports.
There shouldn't be a problem supporting the use case of a DSP who needs to know all the domains on which they showed ads. The need for thresholding only comes up in a special case:
DP in the face of an unknown output domain
There are particular challenges with maintaining differential privacy if the output domain (e.g. the aggregation keys) are unknown prior to beginning the computation. In particular, the mere fact that an aggregation key appears in the output could be sensitive, even if we do things like add noise to the values associated with it.
In these cases, the tool available to us is to use thresholding to add protection to keys with low counts. See the Appendix for more details here.
In TURTLEDOVE, the set of domains your ads might have appeared on is a known output domain: The contextual ad requests all contain the domain names, and so you've seen the list of all possible domains even without an aggregation service being involved.
In that case there's no need to threshold. Adding noise to the reported number of impressions would mean that you can't tell which domains had 1 vs 2 impressions, but the set of domains would still be correct.
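To make the known-domain case concrete, here is a minimal Python sketch. It is not the aggregation service's actual mechanism: the Laplace distribution, the `scale` parameter, the function names, and the example domains are all assumptions for illustration. It only shows that when the key set is fixed in advance, every key gets a noisy value and none is dropped.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_report_known_domain(true_counts, known_domain, scale, rng):
    """Noise every key in the known domain, including keys whose true count is 0."""
    return {
        key: true_counts.get(key, 0) + laplace_noise(scale, rng)
        for key in known_domain
    }


rng = random.Random(0)  # seeded only so the sketch is reproducible
known_domain = ["news.example", "blog.example", "shop.example"]  # hypothetical
true_counts = {"news.example": 2, "blog.example": 1}  # no impressions on shop.example

report = noisy_report_known_domain(true_counts, known_domain, scale=1.0, rng=rng)
for domain, value in report.items():
    print(domain, round(value, 2))
```

Every domain in `known_domain` appears in `report`, so the key set is unchanged; only the values are perturbed.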
Thank you for your answer.
We understood that the DSP would have to receive and process ALL contextual ad requests, regardless of whether the user is part of one of its interest groups (to prevent any privacy attack by "absence of ad request"). For a DSP to know the set of domains its ads might have appeared on, it would have to keep a record of every domain for which ANY ad request was sent during the period. Am I understanding this correctly?
Secondly:
In that case there's no need to threshold
The Multi-Browser Aggregation Service Explainer says:
some keys (potentially) dropped if they do not have counts above some threshold
Can you confirm that thresholding only applies if the output domain is unknown?
Finally: despite the added noise, keys whose real value is 0 (i.e. no ad was displayed on this domain) would always get 0, and keys whose real value is greater than 0 (i.e. at least one ad was displayed on this domain) would always get an output value greater than 0 in the final, private output. Is that correct?
We understood that the DSP would have to receive and process ALL contextual ad requests, regardless of whether the user is part of one of its interest groups (to prevent any privacy attack by "absence of ad request"). For a DSP to know the set of domains its ads might have appeared on, it would have to keep a record of every domain for which ANY ad request was sent during the period. Am I understanding this correctly?
Yes, that's correct. If the output space for aggregation (e.g. the set of domains) is not known, you need to do some form of thresholding to get DP properties.
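For contrast with the known-domain case, thresholding over an unknown output space could be sketched roughly as follows. This is an illustration under stated assumptions, not the explainer's specification: the Laplace noise, the `scale` and `threshold` values, the function names, and the example domains are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_report_with_threshold(observed_counts, scale, threshold, rng):
    """Noise each observed key, then drop keys whose noisy count is below the threshold."""
    report = {}
    for key, count in observed_counts.items():
        noisy = count + laplace_noise(scale, rng)
        if noisy >= threshold:
            report[key] = noisy
    return report


rng = random.Random(0)  # seeded only so the sketch is reproducible
observed_counts = {"big.example": 1000, "small.example": 1}  # hypothetical keys
report = noisy_report_with_threshold(observed_counts, scale=1.0, threshold=5.0, rng=rng)
print(sorted(report))
```

Only keys that clear the threshold after noising are released, which is why low-count keys can disappear from the output.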
Can you confirm that thresholding only applies if the output domain is unknown?
Yeah what you linked is the "basic, no enhancements" version of the proposal. The issue of known output domains is described here in the explainer.
Despite the added noise, keys whose real value is 0 (i.e. no ad was displayed on this domain) would always get 0, and keys whose real value is greater than 0 (i.e. at least one ad was displayed on this domain) would always get an output value greater than 0 in the final, private output. Is that correct?
No, the noise is two-sided: values that are 0 or positive get unbiased noise, which could lead to negative counts. If you could distinguish keys with zero counts from keys with non-zero counts, that would be precisely the privacy leak we are trying to prevent with the DP thresholding (hiding the presence or absence of a key).
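A small simulation can illustrate why the sign of a noisy value says little about the true count. This is only a sketch: the Laplace distribution, the `scale` of 1.0, and the function names are assumptions, not the service's actual parameters.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def fraction_reported_positive(true_count, scale, trials, rng):
    """Estimate how often a key with the given true count gets a noisy value above 0."""
    positive = sum(
        1 for _ in range(trials) if true_count + laplace_noise(scale, rng) > 0
    )
    return positive / trials


rng = random.Random(0)  # seeded only so the sketch is reproducible
zero_key = fraction_reported_positive(0, scale=1.0, trials=10_000, rng=rng)
one_key = fraction_reported_positive(1, scale=1.0, trials=10_000, rng=rng)
print(f"true count 0 reported as positive: {zero_key:.2f}")
print(f"true count 1 reported as positive: {one_key:.2f}")
```

Because the noise is symmetric around zero, a key with a true count of 0 comes out positive about half the time, and a key with a true count of 1 still comes out at or below 0 in a noticeable share of draws.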
In TURTLEDOVE, the set of domains your ads might have appeared on is a known output domain: The contextual ad requests all contain the domain names, and so you've seen the list of all possible domains even without an aggregation service being involved.
Based on the contextual requests, an advertiser would know all domains that "broadcast" at least one contextual request during the period. If we consider top-level domains, we are talking about millions of entries daily. The advertiser would have displayed ads on only a small fraction of this list. In addition, the advertiser would consider many domains in this list unsafe, and the goal of the ex-post audit would be to verify that no ad was displayed on any of these "unsafe" websites.
No, the noise is two-sided: values that are 0 or positive get unbiased noise, which could lead to negative counts. If you could distinguish keys with zero counts from keys with non-zero counts, that would be precisely the privacy leak we are trying to prevent with the DP thresholding (hiding the presence or absence of a key).
Out of the millions of input domains, the report would not help distinguish those with zero displays from those with non-zero displays. Many domains considered unsafe would be reported as having received a non-zero number of displays from this advertiser, even though they received none in practice. Conversely, domains that actually received displays could be reported as not having received any.
This does not meet the requirements for Brand Safety and Ad Quality and, as I stated above, does not allow DSPs to comply with the law.
Thanks, Paul, I think you're quite right — in a world with TURTLEDOVE, we would need a reporting mechanism that supports something like the SPARROW "Report on Ads Served" functionality, and the DP-style reporting infrastructure here is not enough for that requirement.
As described in this post, the obligation to reach a certain number k of "identical reports" before they can be accessed would dramatically impact reporting for advertisers and publishers, and it may even prevent DSPs from complying with the law. In France (this may also apply to other countries), DSPs are legally required to disclose to their clients the comprehensive list of publishers on which they displayed ads (cf. here). This would not be possible using this API, as a significant share of the publishers would be hidden from the DSPs, even with a low threshold. How do you plan to tackle this use case?