OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Do not sample per partition when min/max_sum_per_partition is set #481

Closed dvadym closed 10 months ago

dvadym commented 11 months ago

This PR introduces the combiner.expects_per_partition_sampling() method.

If at least one combiner returns true from expects_per_partition_sampling(), per-partition sampling is performed. Sampling within a partition is required for Mean/Variance/Quantiles. On the other hand, when SUM is computed with min/max_sum_per_partition contribution bounding, there should be no sampling, since SumCombiner first sums the contributions within each partition and then clips the result to [min_sum_per_partition, max_sum_per_partition].
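
The rule above can be sketched roughly as follows. This is a toy illustration, not PipelineDP's actual code: only the method name expects_per_partition_sampling() comes from this PR, while the Toy* classes, needs_per_partition_sampling(), bound_contributions() and the max_contributions_per_partition argument are hypothetical names made up for the example.

```python
import random
from typing import List, Sequence


class ToyCombiner:
    """Hypothetical stand-in for a combiner; not the PipelineDP base class."""

    def expects_per_partition_sampling(self) -> bool:
        raise NotImplementedError


class ToyMeanCombiner(ToyCombiner):
    """Mean needs a bounded number of contributions per partition."""

    def expects_per_partition_sampling(self) -> bool:
        return True


class ToySumCombiner(ToyCombiner):
    """Sum with min/max_sum_per_partition bounding: sum first, then clip."""

    def __init__(self, min_sum_per_partition: float,
                 max_sum_per_partition: float):
        self._min_sum = min_sum_per_partition
        self._max_sum = max_sum_per_partition

    def expects_per_partition_sampling(self) -> bool:
        # The per-partition sum is clipped as a whole, so no sampling needed.
        return False

    def clipped_sum(self, values: Sequence[float]) -> float:
        return min(max(sum(values), self._min_sum), self._max_sum)


def needs_per_partition_sampling(combiners: Sequence[ToyCombiner]) -> bool:
    """Sample if at least one combiner asks for it."""
    return any(c.expects_per_partition_sampling() for c in combiners)


def bound_contributions(values: Sequence[float],
                        combiners: Sequence[ToyCombiner],
                        max_contributions_per_partition: int,
                        rng: random.Random) -> List[float]:
    """Returns the values of one (privacy id, partition) pair after bounding."""
    if (needs_per_partition_sampling(combiners) and
            len(values) > max_contributions_per_partition):
        return rng.sample(list(values), max_contributions_per_partition)
    return list(values)  # no sampling: keep every contribution
```

With only ToySumCombiner present, all contributions are kept and the clipping happens on their sum; adding ToyMeanCombiner to the list turns sampling back on for that pipeline.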