OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0
275 stars, 77 forks

Run utility analysis on pre-aggregated data #373

Closed dvadym closed 1 year ago

dvadym commented 1 year ago

This PR implements running utility analysis on pre-aggregated data (previously it was possible only on the raw data).

The implementation consists of the following:

  1. Extraction of the needed columns (e.g. privacy_id, partition_key) with extractors is moved to a separate DPEngine._extract_columns method, which is overridden in UtilityAnalysisEngine (because pre-aggregated data requires different extraction).
  2. Computation of count and sum per (privacy_id, partition_key) is moved from CompoundCombiner to ContributionBounding (because count and sum are already computed in pre-aggregated data).
  3. NoOpContributionBounder is introduced for pre-aggregated data (because everything needed for contribution bounding is already computed in pre-aggregated data).
  4. A pre_aggregated_data field is introduced in all structures where it is needed.
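
To illustrate the idea behind items 1-3, here is a minimal, self-contained sketch. It is not PipelineDP code: the `Toy*` classes, the `per_key_stats` method, and the dict keys (`user`, `day`, `spent`, `count`, `sum`) are hypothetical names chosen for illustration, and the sketch skips the actual bounding and noise steps. It only shows how overriding the column extraction and swapping in a no-op bounder lets raw and pre-aggregated inputs reach the same (count, sum) per (privacy_id, partition_key).

```python
from collections import defaultdict


class ToyContributionBounder:
    """Hypothetical stand-in: computes (count, sum) per (privacy_id, partition_key)."""

    def bound(self, rows):
        acc = defaultdict(lambda: [0, 0.0])
        for privacy_id, partition_key, value in rows:
            entry = acc[(privacy_id, partition_key)]
            entry[0] += 1        # count of contributions
            entry[1] += value    # sum of contributed values
        return {key: (count, total) for key, (count, total) in acc.items()}


class ToyNoOpContributionBounder:
    """Hypothetical stand-in: input already holds (count, sum) per key, so pass it through."""

    def bound(self, rows):
        # rows are ((privacy_id, partition_key), (count, sum)) pairs
        return dict(rows)


class ToyEngine:
    """Works on raw data: one row per individual contribution."""

    bounder = ToyContributionBounder()

    def _extract_columns(self, data):
        # Assumed raw schema: user, day, spent
        return [(r["user"], r["day"], r["spent"]) for r in data]

    def per_key_stats(self, data):
        return self.bounder.bound(self._extract_columns(data))


class ToyPreAggregatedEngine(ToyEngine):
    """Works on pre-aggregated data: one row per (privacy_id, partition_key)."""

    bounder = ToyNoOpContributionBounder()

    def _extract_columns(self, data):
        # Assumed pre-aggregated schema: user, day, count, sum
        return [((r["user"], r["day"]), (r["count"], r["sum"])) for r in data]


raw = [
    {"user": "u1", "day": "mon", "spent": 2.0},
    {"user": "u1", "day": "mon", "spent": 3.0},
    {"user": "u2", "day": "mon", "spent": 1.0},
]
pre_aggregated = [
    {"user": "u1", "day": "mon", "count": 2, "sum": 5.0},
    {"user": "u2", "day": "mon", "count": 1, "sum": 1.0},
]

raw_stats = ToyEngine().per_key_stats(raw)
pre_stats = ToyPreAggregatedEngine().per_key_stats(pre_aggregated)
```

In this sketch both engines produce the same per-key statistics, so any downstream step that consumes (count, sum) per key does not need to know which kind of input was supplied.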

Note: Tests will be fixed and new tests will be added in the following commits.