OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Create Python wrappers for private partitions selection #52

Closed dvadym closed 3 years ago

dvadym commented 3 years ago

Context

Definition: Partition keys are called private if they are not known in advance and are instead determined from the data contributed by the individuals in the dataset. More details.

Private partition selection is a procedure that ensures the output partition keys are selected in a differentially private (DP) way. There are at least 2 methods for private partition selection. More details:

  1. Truncated geometric thresholding (paper)
  2. Laplace/Gaussian thresholding (paper).
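
To make the second method concrete, here is a minimal sketch of Laplace thresholding in plain Python. It is not the C++ library's implementation. It assumes each user contributes to at most one partition (sensitivity 1) and 0 < delta < 0.5, and the function names are hypothetical. The threshold is chosen so that a partition containing a single user is released with probability delta.

```python
import math
import random


def laplace_threshold(epsilon: float, delta: float) -> float:
    """Threshold T such that a one-user partition is kept with probability delta.

    Assumption: every user contributes to at most one partition, 0 < delta < 0.5.
    """
    return 1.0 + math.log(1.0 / (2.0 * delta)) / epsilon


def sample_laplace(scale: float, rng: random.Random) -> float:
    """Draw Laplace(0, scale) noise as the difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def should_keep(num_users: int, epsilon: float, delta: float,
                rng: random.Random) -> bool:
    """Keep the partition only if its noisy user count exceeds the threshold."""
    noisy_count = num_users + sample_laplace(1.0 / epsilon, rng)
    return noisy_count > laplace_threshold(epsilon, delta)
```

Floating-point noise drawn from `random` is fine for illustrating the idea but is not suitable for a real DP release. That is one reason to wrap the audited C++ implementation rather than reimplement it.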

The PyDP project provides wrappers for Google's C++ DP library, but wrappers for private partition selection are missing.

Goals

To implement wrappers for Truncated geometric thresholding and Laplace/Gaussian thresholding in PyDP.

C++ library API:

  1. Truncated geometric thresholding (in the C++ library it is called PreaggPartitionSelection, which is an old name; the wrappers should use the new name)
  2. LaplacePartitionSelection.
  3. GaussianPartitionSelection.

Python API

  1. Pybind wrapper for PartitionSelectionStrategy, with 1 method ShouldKeep.
  2. Functions:
def create_truncated_geometric_partition_strategy(<all needed params>) -> PartitionSelectionStrategy
def create_laplace_partition_strategy(<all needed params>) -> PartitionSelectionStrategy
def create_gaussian_partition_strategy(<all needed params>) -> PartitionSelectionStrategy

levzlotnik commented 3 years ago

Hey, I'd like to take over this issue.

dvadym commented 3 years ago

Sure, thanks!

levzlotnik commented 3 years ago

Hey, PR PyDP#374 was merged. The newly added API and pseudo-code of its usage:

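# PartitionSelectionStrategy is assumed to be exported alongside the factory function.
from pydp.algorithms.partition_selection import (
    PartitionSelectionStrategy,
    create_partition_strategy,
)

# epsilon, delta and max_partitions are the DP parameters chosen by the caller.
k_tsgd_selector = create_partition_strategy(
    "truncated_geometric", epsilon, delta, max_partitions
)  # type: PartitionSelectionStrategy
...
def get_dp_partitions(database, partition_selector: PartitionSelectionStrategy):
    # Pseudo-code: partition_by and num_users stand for whatever the data source provides.
    database_partitions = database.partition_by(KEY)
    for partition in database_partitions:
        # Keep a partition only if the DP strategy accepts its user count.
        if partition_selector.should_keep(partition.num_users):
            yield partition

dp_partitions = get_dp_partitions(data, k_tsgd_selector)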
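
For readers who want to run the flow without PyDP installed, the following is a rough, self-contained sketch of the same loop over in-memory records. `FixedThresholdSelector` is a hypothetical stand-in that is not differentially private. It only mimics the `should_keep(num_users)` shape so the filtering logic can be exercised. The `(user_id, partition_key)` record layout is likewise an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Protocol, Set, Tuple


class PartitionSelectionStrategy(Protocol):
    """Anything exposing should_keep(num_users) -> bool."""

    def should_keep(self, num_users: int) -> bool: ...


class FixedThresholdSelector:
    """Hypothetical stand-in, NOT differentially private."""

    def __init__(self, min_users: int) -> None:
        self.min_users = min_users

    def should_keep(self, num_users: int) -> bool:
        return num_users >= self.min_users


def get_dp_partitions(
    records: Iterable[Tuple[str, str]],
    partition_selector: PartitionSelectionStrategy,
) -> Iterator[str]:
    """Yield the partition keys that the selector decides to keep.

    Each record is a (user_id, partition_key) pair; a user is counted
    once per key, however many records they contributed to it.
    """
    users_per_key: Dict[str, Set[str]] = defaultdict(set)
    for user_id, key in records:
        users_per_key[key].add(user_id)
    for key, users in users_per_key.items():
        if partition_selector.should_keep(len(users)):
            yield key


records = [("u1", "a"), ("u2", "a"), ("u2", "a"), ("u3", "a"), ("u1", "b")]
kept = list(get_dp_partitions(records, FixedThresholdSelector(min_users=3)))
```

Here partition "a" has three distinct users and is kept, while "b" has one and is dropped. In real use, the stand-in would be replaced by the strategy object returned from `create_partition_strategy`.
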

dvadym commented 3 years ago

Thanks!