PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark and Apache Beam.
This PR implements running utility analysis on pre-aggregated data (previously it was possible only on raw data).
The implementation consists of the following:
- Extraction of the needed columns (e.g. `privacy_id`, `partition_key`) with extractors is moved to a separate `DPEngine._extract_columns` method, which is overridden in `UtilityAnalysisEngine` (because pre-aggregated data requires different data extraction).
- Computing count and sum per `(privacy_id, partition_key)` is moved from `CompoundCombiner` to `ContributionBounding` (because count and sum are already computed in pre-aggregated data).
- `NoOpContributionBounder` is introduced for use with pre-aggregated data (because everything needed for contribution bounding is already computed in pre-aggregated data).
- A `pre_aggregated_data` field is introduced in all structures where needed.
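To illustrate the no-op bounder idea above, here is a minimal, self-contained sketch. All names and signatures here (`PreaggregatedRow`, the `bound` method, the row layout) are hypothetical and not the actual PipelineDP API; the point is only that for pre-aggregated data the bounder can forward rows unchanged, since per-`(privacy_id, partition_key)` counts and sums already exist in the input:

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class PreaggregatedRow:
    # Hypothetical row of pre-aggregated data: contribution statistics
    # already computed per (privacy_id, partition_key).
    count: int
    sum: float


class NoOpContributionBounder:
    """Sketch of a bounder that passes pre-aggregated rows through
    unchanged: there is nothing to sample or clip, because counts and
    sums per (privacy_id, partition_key) are already in the input."""

    def bound(
        self, rows: Iterable[Tuple[Tuple[str, str], PreaggregatedRow]]
    ) -> Iterable[Tuple[Tuple[str, str], PreaggregatedRow]]:
        # No contribution bounding work is needed; forward as-is.
        return rows


# Usage: rows keyed by (privacy_id, partition_key) come out unchanged.
rows = [(("uid1", "pk1"), PreaggregatedRow(count=3, sum=7.5))]
bounded = list(NoOpContributionBounder().bound(rows))
```

By contrast, a bounder for raw data would sample or clip each privacy unit's contributions before any counts or sums are computed, which is why that computation moves out of `CompoundCombiner` in this PR.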
Note: Tests will be fixed and new tests will be added in the following commits.