OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Post aggregation thresholding #494

Closed dvadym closed 8 months ago

dvadym commented 9 months ago

Theory

This PR implements post-aggregation thresholding, i.e. the following partition selection algorithm.

1. Noise stddev and threshold T computation: both are derived from (eps, delta,
  l0_sensitivity=max_partition_contributed).
2. Contribution bounding: for each privacy unit, the partitions to which it
  contributes are collected; if there are more than max_partition_contributed,
  max_partition_contributed of them are randomly sampled.
3. Aggregation: for each partition, the count of privacy units is computed.
4. Selection: a partition with num_privacy_units privacy units is released iff
   num_privacy_units + noise >= T. If it is released, the noisy value num_privacy_units + noise is released as well.
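
Steps 2 to 4 above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in this PR: the function name and signature are hypothetical, the noise is assumed to be Gaussian, and noise_stddev and threshold are taken as inputs because their computation (step 1) is delegated to PyDP.

```python
import random
from collections import defaultdict


def post_aggregation_thresholding(rows, max_partition_contributed,
                                  noise_stddev, threshold, rng=None):
    """Toy, single-machine sketch of steps 2-4 (hypothetical helper).

    rows: iterable of (privacy_id, partition_key) pairs.
    Returns {partition_key: noisy_count} for released partitions only.
    """
    rng = rng or random.Random()

    # Collect the distinct partitions each privacy unit contributes to.
    partitions_per_unit = defaultdict(set)
    for privacy_id, partition_key in rows:
        partitions_per_unit[privacy_id].add(partition_key)

    # Step 2 (contribution bounding) and step 3 (aggregation).
    counts = defaultdict(int)
    for partitions in partitions_per_unit.values():
        partitions = sorted(partitions)  # stable order before sampling
        if len(partitions) > max_partition_contributed:
            partitions = rng.sample(partitions, max_partition_contributed)
        for partition_key in partitions:
            counts[partition_key] += 1

    # Step 4 (selection): release the noisy count iff it reaches the threshold.
    released = {}
    for partition_key, num_privacy_units in counts.items():
        # Assumption: Gaussian noise; the real mechanism lives in PyDP.
        noisy = num_privacy_units + rng.gauss(0.0, noise_stddev)
        if noisy >= threshold:
            released[partition_key] = noisy
    return released
```

Note that the same noisy value is used both for the threshold comparison and as the released count, which is what distinguishes this scheme from selecting partitions first and adding noise separately afterwards.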

The details of computing the noise stddev and T can be found in the doc. These computations are implemented in Google's C++ building block libraries, and the Python wrappers from PyDP are used.

This algorithm is called post-aggregation thresholding because the threshold is applied to an aggregated value, namely the per-partition count of privacy units.

What this PR contains:

This PR contains the whole implementation of post-aggregation thresholding, namely:

  1. ThresholdMechanism class, which is a wrapper around the PyDP object. It is needed so that the combiner can use the PyDP object.
  2. PostAggregationThresholdingCombiner, a combiner which computes privacy id counts and applies ThresholdMechanism.
  3. Extending AggregateParams with the bool variable post_aggregation_thresholding.
  4. Creating the PostAggregationThresholdingCombiner object when post_aggregation_thresholding = True.
  5. Filtering out partitions with "privacy_id_count = None" when post_aggregation_thresholding = True.
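
To illustrate how a None value for privacy_id_count can mark a partition for removal, here is a minimal sketch. The class and helper below are hypothetical stand-ins, not the PipelineDP classes from this PR: the method names only mimic a combiner-style interface, and the mechanism is any callable that returns a noisy count or None.

```python
class ToyThresholdingCombiner:
    """Hypothetical stand-in for a thresholding combiner (illustration only)."""

    def __init__(self, mechanism):
        # mechanism: callable int -> float or None. It plays the role of the
        # threshold mechanism and returns None when the partition is dropped.
        self._mechanism = mechanism

    def create_accumulator(self, values):
        # Assumption: called once per (privacy unit, partition), so each
        # contributing privacy unit adds 1 to the partition's count.
        return 1 if values else 0

    def merge_accumulators(self, count1, count2):
        return count1 + count2

    def compute_metrics(self, count):
        return {"privacy_id_count": self._mechanism(count)}


def drop_unreleased(partition_metrics):
    """Keeps only partitions whose privacy_id_count is not None."""
    return {
        key: metrics
        for key, metrics in partition_metrics.items()
        if metrics["privacy_id_count"] is not None
    }
```

Signalling "not released" with None keeps selection inside the combiner, so the surrounding pipeline only needs one extra filter step after the metrics are computed.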