cfrg / draft-irtf-cfrg-vdaf

VDAF specification

Can / should VDAF support DP shufflers? #9

Closed · csharrison closed this 1 year ago

csharrison commented 2 years ago

Shuffle differential privacy (https://arxiv.org/pdf/1811.12469.pdf and many other papers) is a type of differential privacy where the privacy mechanism can be thought of as a composition of two pieces (a toy sketch of both pieces follows the list):

  1. A local randomizer applying some local noise to the data (and stripping identifiers, etc.)
  2. A central curator that takes as input a batch of client data and simply applies a permutation on it
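Here's a toy sketch of the two pieces in Python (my own illustration, not taken from the paper), using binary randomized response as the local randomizer; a real deployment would derive the flip probability from the target local epsilon and use a cryptographically secure shuffle:

```python
import random
import secrets

def local_randomizer(bit: int, p_flip: float) -> int:
    """Piece 1: each client noises its own bit before sending.
    With probability p_flip, the true bit is replaced by a fair coin
    (binary randomized response)."""
    if random.random() < p_flip:
        return secrets.randbelow(2)
    return bit

def shuffler(reports: list) -> list:
    """Piece 2: the curator only permutes the batch, breaking the
    link between a report and the client that sent it."""
    shuffled = list(reports)
    random.shuffle(shuffled)
    return shuffled

# Toy run: 1000 clients, each holding one bit.
n, p = 1000, 0.25
true_bits = [secrets.randbelow(2) for _ in range(n)]
batch = shuffler([local_randomizer(b, p) for b in true_bits])

# The analyst debiases the sum: E[noisy_sum] = (1 - p) * true_sum + p * n / 2.
estimate = (sum(batch) - p * n / 2) / (1 - p)
```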

The power of shuffle DP is that the server-side mechanism is simple and general purpose: the exact same shuffler can be used for different applications measuring different kinds of things. The privacy guarantee rests on two things:

  1. The level of local noise added to the data
  2. The number of clients that participate in a shuffle

At an intuitive level, the more reports you have and the more noise is injected into them, the better a given client's input is "hidden in the crowd".
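To make "hidden in the crowd" slightly more quantitative: amplification-by-shuffling results (e.g., Erlingsson et al., SODA 2019) show that, for small local epsilon, shuffling n reports that are each eps0-locally-DP yields a central (eps, delta)-DP guarantee with eps on the order of eps0 * sqrt(log(1/delta) / n). A back-of-the-envelope calculation, ignoring constants:

```python
import math

def central_eps(eps0: float, n: int, delta: float) -> float:
    """Order-of-magnitude central epsilon after shuffling n reports,
    each eps0-locally-DP. Ignores constants; NOT a tight bound."""
    return eps0 * math.sqrt(math.log(1 / delta) / n)

# More clients => smaller central epsilon for the same local noise.
for n in (1_000, 100_000):
    print(n, round(central_eps(eps0=1.0, n=n, delta=1e-9), 3))
```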

Is this a good fit for VDAFs? Shuffling is conceptually a very simple task to ask a 2-party MPC protocol to do. I could imagine slotting this into the VDAF structure like:

Possible concerns:

Anyways, thoughts appreciated! Shuffling can be an effective way to achieve good privacy with very simple mechanisms, but it doesn't fit neatly into VDAFs.

cc @schoppmp

cjpatton commented 2 years ago

Interesting idea! This does indeed sound like a practical way to get good privacy. Even without DP, you get unlinkability of measurements to the client that sent them. This is basically what a mixnet does.

My own conclusion is that this protocol shape isn't appropriate for the VDAF document; however, it may well be in scope for what the PPM working group does.

What do you think @bifurcation?

cjpatton commented 2 years ago

By the way, I'm reading https://eprint.iacr.org/2021/1490.pdf right now. This protocol seems to involve some shuffling, but it also seems to aim for a security property stronger than unlinkability. If we can achieve the same security for this protocol as for prio3/hits, I think it's worth considering including it here.

csharrison commented 2 years ago

Thanks @cjpatton, this makes sense to me. A few quick responses below:

> From a security perspective, unlinkability of measurements to senders is useful, but is weaker than what we hope to achieve for, say, prio3 or hits, which is that the aggregators learn only the aggregate result. Perhaps there's a way to patch the scheme to prevent leaking the set of measurements, as you suggest. But suppose we manage to ensure that all that the aggregators learn is the aggregate result: then what does shuffling buy us?

One thing shuffling buys you is flexibility. Let's say all my users submit values in [0, 10]. You could imagine several different ways of aggregating this data, e.g. reporting the sum, median, percentiles, etc. With a shuffled version of the data, you can compute all of these on the collector side without embedding the computations in the aggregators.
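A quick illustration of that point (toy values): once the collector holds the shuffled batch, picking the statistic is purely a collector-side decision and needs nothing but the standard library:

```python
import statistics

# Shuffled (and, in practice, locally noised) values in [0, 10].
shuffled_values = [3, 7, 0, 10, 5, 2, 8]

total = sum(shuffled_values)
median = statistics.median(shuffled_values)
p90 = statistics.quantiles(shuffled_values, n=10)[-1]  # ~90th percentile
```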

There is also flexibility in the data itself. In principle (as long as you trust the clients), the aggregators can just be shuffling opaque bytes. Clients could update to submit different kinds of data without any behavior change needed on the aggregator side; for instance, rather than shuffling integers, they could shuffle sketches or some other data structure.

> By the way, I'm reading https://eprint.iacr.org/2021/1490.pdf right now. This protocol seems to involve some shuffling, but it also seems to aim for a security property stronger than unlinkability. If we can achieve the same security for this protocol as for prio3/hits, I think it's worth considering including it here.

This proposal is interesting, but it isn't the same kind of general-purpose shuffling I'm discussing here: that paper only considers computing histograms.

cjpatton commented 2 years ago

I agree the flexibility is attractive. My point was that shuffling alone doesn't prevent the aggregators from learning the set of measurements. Indeed, this is what makes it so flexible!

csharrison commented 2 years ago

> I agree the flexibility is attractive. My point was that shuffling alone doesn't prevent the aggregators from learning the set of measurements. Indeed, this is what makes it so flexible!

Oh, sorry, I missed this point. It should be easy to ensure that the shuffled data is not visible to the aggregators as long as they don't collude with the collector: just ensure all the data is also encrypted to the collector's key. The requirement is simply that the MPC shuffles opaque blobs.
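A minimal sketch of that flow, assuming PyNaCl's SealedBox for the encrypt-to-the-collector step (any public-key encryption, e.g. HPKE, would do) and using random.shuffle as a stand-in for whatever permutation the MPC actually implements:

```python
import random
from nacl.public import PrivateKey, SealedBox  # pip install pynacl

# Collector key pair; the aggregators never see the private key.
collector_sk = PrivateKey.generate()
seal = SealedBox(collector_sk.public_key)

# Each client seals its (already locally noised) measurement to the
# collector, so the aggregators only ever handle opaque ciphertexts.
# Sealed boxes use fresh ephemeral keys, so equal plaintexts still
# produce different ciphertexts.
reports = [seal.encrypt(str(m).encode()) for m in (3, 7, 0, 10, 5)]

# Each aggregator applies its own secret permutation; neither alone
# knows the composed permutation linking inputs to outputs.
for _aggregator in range(2):
    random.shuffle(reports)

# Only the collector can open the blobs, now unlinkable to senders.
opened = [int(SealedBox(collector_sk).decrypt(ct).decode()) for ct in reports]
```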

cjpatton commented 2 years ago

Right, but the collector would see the set of measurements in the clear, right? All the aggregators have done is permute them.

csharrison commented 2 years ago

The collector sees the set of measurements "in the clear", but the technique is designed to be combined with local noise, so it isn't really accurate to say that the final release contains all of the original measurements.

cjpatton commented 2 years ago

Yes, true enough!

cjpatton commented 1 year ago

We are beginning to see the emergence of MPC techniques that involve some sort of in-MPC "shuffling" or "ordering" of the reports being aggregated. Example:

I think where we're landing is that this architecture is sufficiently different from VDAFs that we should not try to shoehorn it into this particular document.