Choice of parameters in the aggregation API

alois-bissuel commented 3 years ago

Hi, Thanks again for all the proposals and the interesting discussions. We believe the conversion measurement API has a great potential. There are a few limits we would like to understand though, especially in the aggregation API. In the explainer for the aggregation API, there are limitations on the contributions which can be made to the histogram. While the cap on L1 value is obvious for differential privacy to work, we do not understand the limit on the number of contributions to be made (small, eg 3). Also, why should the aggregate report be scoped on the tuple (source_site, attribution_destination)? We believe that there are great benefits in aggregating reports from multiple source_site (or attribution_destination, depending on the use case) in a single request, to lower the overall level of noise. Thanks a lot for you answer, and we would be happy to discuss this live in the next meeting if needed.

csharrison commented 3 years ago

Thanks for filing:

The limit on contributions can help in some cases with tighter privacy analysis, but it also helps with communication and computation costs. The current design of the MPC protocol with DPFs might mean that sending way too many separate histogram contributions could lead to a lot of bandwidth and processing costs. That being said, this isn't a firm limit and we are open to exploring options where more contributions are used. Can you share anything about your use-case and why it needs more histogram contributions?
The report is scoped to the tuple (source_site, attribution_destination) for privacy budgeting purposes to limit the information learned about the join of those two sites. It should be fine to aggregate across multiple source sites from a privacy POV as far as I know. This was discussed a bit in this issue (https://github.com/WICG/conversion-measurement-api/issues/190)

Feel free to add to the agenda if you want to discuss in the meeting (https://bit.ly/ara-meeting-notes).

AlexandreGilotte commented 2 years ago

Hi, Thanks for those clarifications. About our use cases with a larger number of histogram contribution: We are currently working on some solutions to try to learn our ML models using some aggregated data. Let me first state that this is not an easy problem. The current "best" (or maybe "least bad" would be more accurate) solutions we have are as follow:

the methods which emerged from the challenge we organized at AdKDD. You may find a short description of those methods in a blog post I wrote here: https://medium.com/criteo-engineering/results-from-the-criteo-adkdd-2021-challenge-50abc9fa3a6 Note that those methods however still require some granular data, and they use some aggregation on all pairs of features. With a number of only 19 features (by comparison , our production models has >100 features) in the challenge, that makes about 200 histogram contributions per display.
a method I personally worked on, using only aggregated data, which could be summarized as modelling the distribution of displays to obtain "fake" granular samples, and learning on those samples. We may find it here: https://github.com/criteo-research/ad_click_prediction_from_aggregated_data/blob/main/README.md . One key point of this method is to observe as many correlations between features as possible, to build a reasonable model of their joined distribution. That means having a large number of aggregation tables. (typically "all pairs of features") While I am still working on some possible improvements, its performances are quite far below the methods having access to some granular samples. I might of course present those different methods at a next call if you think it would be of interest.

As a side note, we are aware that it is possible to sample the keys to keep only "a small number of keys" on each displays, and that the noise would be scaled up when the number of keys is large anyway. If I am not mistaken, with Laplace noise the signal to noise ratio is actually equivalent. But with a Gaussian (instead of Laplace) noise, allowing a large number of keys should give a much improved signal to noise ratio.

To summarize, this "small number of histogram contribution" might cause yet another big performances drop for the models we could learn; I am personally not confident that at the end of the road the models would be "good enough" to sustain a viable ecosystem.

csharrison commented 2 years ago

Thanks @AlexandreGilotte ! Yes I agree the ML use-case is a clear case where it seems useful to contribute to many aggregate keys at once. I am happy to consider changes to the API to make sure that the use-case can be done in a performant and private way.

One thing we should think about is how to make sure the entire system knows how to scale the noise. Our current design only has a single L1 sensitivity so some of the techniques necessarily for more advanced DP composition don't really work assuming worst-case input. This has great benefits in terms of simplicity (you can allocate the L1 budget however you want, and the system can be oblivious to it) but it does not maximize utility for this use-case.

This is on the agenda for the meeting today, so let's try to get into these problems :)

csharrison commented 2 years ago

We discussed this in the call yesterday (minutes).

I raised two concerns with supporting events contributing to many buckets (~hundreds):

Performance. We need to make sure that aggregate designs scale well (cpu / communication costs) to adding lots of contributions per event
Picking the right sensitivity constraints to satisfy all use-cases. Choosing something like a large L1 sensitivity (~200) and a small Linf sensitivity (~1) like used in the Criteo ML challenge will be great for ML but not so great for simpler reporting use-cases. There are a number of possible resolutions to this that we should consider:
- extend the design to support dynamic sensitivity selection
- Compromising between the two use-cases
- Allowing multiple different reporting types with fixed sensitivity constraints
- etc

alois-bissuel commented 2 years ago

Thanks for you answer. Regarding the two concerns you voiced, may I ask some further clarifications?

Performance. Could you confirm that the performance hit is due to MPC?
Regarding the sensitivity, is the Linf sensitivity a requirement for the differential privacy applied to the aggregation results? I have a hard time figuring out why this parameter would be needed, except if want to "trim" the contribution per time window (clipping the contributions to Linf is easy, whereas lowering the L1 norm means arbitrarily dropping some contributions I guess). In any case, I fully support the ability to configure the API for various use cases (ie dynamic sensitivity selection).

csharrison commented 2 years ago

I think MPC will be the biggest factor, although it seems even in naive non-MPC implementations there will be some kind of factor for the # of buckets contributed in terms of communication cost / cpu cost.
Linf is not a hard requirement, but it brings us many benefits. In the Criteo competition using a Linf of 1 and L1 of 171 you were able to achieve ~(5,1e-10) DP with Gaussian std of 17. If instead you only had an L1 of 171 you wouldn't be able to take advantage of the more advanced DP composition tricks and you would have had to use a distribution with standard deviation ~219 (of course, when optimizing for a single statistic, Laplace performs better here with a std of ~48). This is because an adversarial input could be that the user contributed the full L1 value 171 to a single bucket, rather than spreading it evenly across 171 different buckets. The Linf parameter means that you can't put all your contribution in one bucket. Let me know if that makes sense!

alois-bissuel commented 2 years ago

Got it! This definitely makes sense. Here, Linf is used to control the L2 sensitivity for Gaussian noise we applied. In other words, if Linf=1, the L2 sensitivity is bounded by sqrt(171). If there were no Linf, the L2 sensitivity would have been 171.

On a roughly connected subject, it seems to me that using a Gaussian noise (made worthwhile because of Linf) would be uniquely suited for MPC, as the two (or more) servers could independently add their noise, which is then a Gaussian. A quick look at appendix E of the DPF paper shows they use a sum of Laplace noise, which has not this summing property.

csharrison commented 2 years ago

Got it! This definitely makes sense. Here, Linf is used to control the L2 sensitivity for Gaussian noise we applied. In other words, if Linf=1, the L2 sensitivity is bounded by sqrt(171). If there were no Linf, the L2 sensitivity would have been 171.

Yeah that's exactly right. The L2 sensitivity is just sqrt(L1*Linf).

On a roughly connected subject, it seems to me that using a Gaussian noise (made worthwhile because of Linf) would be uniquely suited for MPC, as the two (or more) servers could independently add their noise, which is then a Gaussian. A quick look at appendix E of the DPF paper shows they use a sum of Laplace noise, which has not this summing property.

I don't think the DPF paper is doing the optimal approach. If both servers sampled from the difference of two Polya RVs the sum would be distributed according to Discrete Laplace (or two-sided geometric). This link has more details. I don't think MPC constrains the design choices here.

csharrison commented 2 years ago

Revisiting this, I am wondering if it is feasible for parties to advertise via some global configuration what kind of sensitivity bounding they are interested in. This would have to be global because we'd need all users to obey the same constraints, and have noise applied in a uniform way downstream in the aggregation service.

This would introduce a lot of complications, but it seems like a technique that would generally work without just picking a place in the constraint-space that's a middle ground position.

alois-bissuel commented 2 years ago

I am not sure to follow your proposal. What do you mean by parties and users here?

csharrison commented 2 years ago

Parties: reporting origins Users: devices / browsers

Essentially I am thinking of a speculative new mechanism where e.g. criteo.com hosts a file saying "Please bound my contributions such that the L2 sensitivity <= xxx".

Then browsers have a mechanism to read these files and apply the appropriate sensitivity bounds on the user contributions, instead of the default L1 bounds we have in the API today. This would increase the contributions allowed per user in cases when you are guaranteed to "spread it out" across multiple buckets.

WICG / attribution-reporting-api

Choice of parameters in the aggregation API #249