PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
What was broken?

For simplicity, assume that `metrics = [count, sum]` and that there are 2 configurations to compute: `max_partition_contributed = [1, 2]`. A `UtilityAnalysisCombiner` is created for each metric and each input configuration. Before this PR, the code was the following:

```python
for metric in metrics:
    for configuration in configurations:
        budget = request_budget()  # one request per (metric, configuration) pair
        # create combiner
```

The problem is that this requests budget 2*2 = 4 times, which is incorrect, since different configurations have different budgets. This PR fixes that by ensuring that `request_budget` is called once per metric, independently of the number of configurations.
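The difference between the two call patterns can be illustrated with a minimal, self-contained sketch. Everything in it is a hypothetical stand-in, not PipelineDP's actual code: `CountingBudgetAccountant`, `create_combiners_before`, and `create_combiners_after` are invented for illustration, and a "combiner" is represented by a plain tuple. The sketch only counts how often `request_budget` is called in each pattern.

```python
# Hypothetical stand-ins for illustration only; not PipelineDP's actual API.
class CountingBudgetAccountant:
    """Counts how many times a budget is requested."""

    def __init__(self):
        self.requests = 0

    def request_budget(self):
        self.requests += 1
        return f"budget-{self.requests}"


def create_combiners_before(metrics, configurations, accountant):
    """Old pattern: one budget request per (metric, configuration) pair."""
    combiners = []
    for metric in metrics:
        for configuration in configurations:
            budget = accountant.request_budget()
            combiners.append((metric, configuration, budget))
    return combiners


def create_combiners_after(metrics, configurations, accountant):
    """Fixed pattern: one budget request per metric, reused by all configurations."""
    combiners = []
    for metric in metrics:
        budget = accountant.request_budget()
        for configuration in configurations:
            combiners.append((metric, configuration, budget))
    return combiners


metrics = ["count", "sum"]
configurations = [1, 2]  # the max_partition_contributed values from the example

old_accountant = CountingBudgetAccountant()
create_combiners_before(metrics, configurations, old_accountant)

new_accountant = CountingBudgetAccountant()
combiners = create_combiners_after(metrics, configurations, new_accountant)

print(old_accountant.requests)  # 4: one request per pair
print(new_accountant.requests)  # 2: one request per metric
```

In this sketch the number of combiners is the same in both patterns (4); only the number of budget requests changes, because every configuration of a given metric reuses that metric's budget.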