PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
This PR implements post-aggregation thresholding, that is, the following partition selection algorithm:
1. Noise stddev and threshold T computation: both are derived from (eps, delta,
l0_sensitivity=max_partition_contributed).
2. Contribution bounding: for each privacy unit, take the partitions to which
it contributes; if there are more than max_partition_contributed of them,
max_partition_contributed partitions are randomly sampled.
3. Aggregation: for each partition, the number of privacy units (num_privacy_units) is computed.
4. Selection: a partition with num_privacy_units privacy units is released iff
num_privacy_units + noise >= T. When a partition is released, num_privacy_units + noise is released as well.
The details of computing the noise stddev and T can be found in the doc. These computations are implemented in Google's C++ building block libraries and are used through the Python wrappers from PyDP.
This algorithm is called post-aggregation thresholding because it uses aggregated values of the number of privacy units.
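The four steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: it uses Laplace noise with a simple union-bound threshold, whereas the PR takes the noise stddev and T from PyDP, so the actual values may differ. The names `laplace_scale_and_threshold`, `bounded_privacy_unit_counts`, and `select_partitions` are hypothetical and not part of PipelineDP, and Python's `random` module is not a secure noise source.

```python
import math
import random
from collections import defaultdict


def laplace_scale_and_threshold(eps, delta, max_partition_contributed):
    """Step 1 (illustrative, NOT the PyDP computation).

    Assumes Laplace noise and splits delta evenly across partitions
    (union bound). Requires delta / max_partition_contributed < 0.5.
    """
    # One privacy unit changes at most max_partition_contributed counts by 1 each.
    scale = max_partition_contributed / eps
    delta_per_partition = delta / max_partition_contributed
    # Chosen so that P(1 + Laplace(scale) >= T) == delta_per_partition.
    threshold = 1 + scale * math.log(1 / (2 * delta_per_partition))
    return scale, threshold


def bounded_privacy_unit_counts(rows, max_partition_contributed, rng):
    """Steps 2 and 3: bound contributions, then count privacy units per partition."""
    partitions_by_unit = defaultdict(set)
    for privacy_id, partition in rows:
        partitions_by_unit[privacy_id].add(partition)

    counts = defaultdict(int)
    for partitions in partitions_by_unit.values():
        partitions = sorted(partitions)
        if len(partitions) > max_partition_contributed:
            # Keep a uniformly random subset of the unit's partitions.
            partitions = rng.sample(partitions, max_partition_contributed)
        for partition in partitions:
            counts[partition] += 1
    return dict(counts)


def select_partitions(rows, eps, delta, max_partition_contributed, rng):
    """Step 4: release a partition iff its noisy count reaches the threshold."""
    scale, threshold = laplace_scale_and_threshold(
        eps, delta, max_partition_contributed)
    counts = bounded_privacy_unit_counts(rows, max_partition_contributed, rng)

    released = {}
    for partition, num_privacy_units in counts.items():
        # The difference of two exponentials with mean `scale` is Laplace(0, scale).
        noise = rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        noisy_count = num_privacy_units + noise
        if noisy_count >= threshold:
            released[partition] = noisy_count  # the noisy count is released too
    return released


# Usage: rows are (privacy_id, partition_key) pairs.
rng = random.Random(0)
rows = [(f"user{i}", "popular") for i in range(1000)] + [("user0", "rare")]
released = select_partitions(
    rows, eps=1.0, delta=1e-5, max_partition_contributed=1, rng=rng)
```

The thresholding decision and the released value share the same noisy count, which is why no separate partition selection pass is needed in this sketch.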
What this PR contains:
This PR contains the whole implementation of post-aggregation thresholding, namely:
1. ThresholdMechanism class, which is a wrapper around the PyDP object. It is needed so that the combiner can use the PyDP object.
2. PostAggregationThresholdingCombiner, a combiner which computes privacy ID counts and applies ThresholdMechanism.
3. Extending AggregateParams with a bool variable post_aggregation_thresholding.
4. Creating the PostAggregationThresholdingCombiner object when post_aggregation_thresholding = True.
5. Filtering out partitions with "privacy_id_count = None" when post_aggregation_thresholding = True.
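As a rough sketch of how the combiner output and the filtering step could fit together, assume the combiner reports `privacy_id_count` as `None` for a partition that fails the threshold test, and that the flag enables the feature when it is `True`. The functions `compute_metrics` and `drop_unreleased_partitions` and the stand-in `mechanism` below are hypothetical simplifications, not PipelineDP's actual code.

```python
from typing import Callable, Dict, Optional

Metrics = Dict[str, Optional[float]]


def compute_metrics(privacy_id_count: int,
                    mechanism: Callable[[int], Optional[float]]) -> Metrics:
    """Hypothetical combiner output: noisy count if kept, otherwise None."""
    return {"privacy_id_count": mechanism(privacy_id_count)}


def drop_unreleased_partitions(results: Dict[str, Metrics],
                               post_aggregation_thresholding: bool
                               ) -> Dict[str, Metrics]:
    """Drop partitions whose privacy_id_count is None when the flag is set."""
    if not post_aggregation_thresholding:
        return dict(results)
    return {
        partition: metrics
        for partition, metrics in results.items()
        if metrics["privacy_id_count"] is not None
    }


def mechanism(count: int) -> Optional[float]:
    # Deterministic stand-in (no noise, fixed threshold of 10), for illustration only.
    return float(count) if count >= 10 else None


results = {
    partition: compute_metrics(count, mechanism)
    for partition, count in {"big": 50, "small": 3}.items()
}
kept = drop_unreleased_partitions(results, post_aggregation_thresholding=True)
# kept contains only the "big" partition.
```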