OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0
275 stars 77 forks source link

Private Partition selection #8

Closed dvadym closed 3 years ago

dvadym commented 3 years ago

Context

Definition: The partition keys are called private if they are not known in advance but are determined based on the data contributed by the individuals in the datasets. More details.

The private partition selection is a procedure that ensures that the output partitions keys are selected in DP fashion. There are at least 2 methods for private partition selection. More details:

  1. Truncated geometric thresholding (paper)
  2. Laplace/Gaussian thresholding (paper).

There will be implemented Python wrappers for [C++ implementation]((https://github.com/google/differential-privacy/blob/main/cc/algorithms/partition-selection.h).

Goals

To implement a part of pipeline which performs private partition selection.

API

A method of DPEngine class

def _select_private_partitions(self, col)

where col is collection of (partition_key, privacy_id_count, data) (the type of data is not important)

Output: a collection of (partition_key, data) only with selected partition keys.

Notes on implementation

Essentially truncated geometric algorithm specifies probability that a partition is kept in the output based on privacy_id_count in this partition.

Note: I'm happy to help with figuring out open questions.

tudorcebere commented 3 years ago

I would like to work on truncated geometric thresholding (it might take some time, as I'm not the sharpest tool on DP).

dvadym commented 3 years ago

@tudorcebere sure, thank you!

dvadym commented 3 years ago

It was done an analysis of using Python Wrappers of Google C++ library vs implementation of the algorithms fully in Python. Creating of wrappers is pretty a straightforward procedure, but writing, testing and maintaining of algorithms are not. And moreover having implementations of algorithms both in C++ and Python is additional maintenance cost.

So, the current approach is to use wrappers for C++ algorithms, whenever it's possible. Wrappers for Private Partition selection have not been implemented yet. I've checked how to do wrappers and now I have a clear understanding what needs to be done. It will be done soon (PR).

So the main thing that is left in this issue is to implement a pipeline function def _select_private_partitions(self, col) (more details in the issue description).

@tudorcebere are you still interested in this task with these changes? If yes, do you have some estimates when you have time? I'm happy to help with any questions and if needed we can have pair programming.

tudorcebere commented 3 years ago

Hey @dvadym! Still interested in doing this, but got a few rough weeks around the work on PySyft. You can open this up and if it's still available in ~2weeks, I might take a stab. Sorry for not answering earlier, it fell off my radar in github.

levzlotnik commented 3 years ago

Hey, can I take this?

dvadym commented 3 years ago

Sure, thanks, go ahead!