OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Utility report per partition size #448

Closed dvadym closed 1 year ago

dvadym commented 1 year ago

This PR implements the computation of a UtilityReport histogram per partition size.

Histogram description:

Histogram bucket bounds are [0, 1, 10, 20, 50, 100, 200, 500, 1000, ...]. The histogram contains a UtilityReport (aggregated utility analysis information) for the partitions in each size range, e.g. one UtilityReport for partitions of size [10, 20), another for partitions of size [20, 50), etc.
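To illustrate how a partition size maps to a bucket, here is a minimal sketch. It is not the code from this PR: `BUCKET_BOUNDS` and `find_bucket` are hypothetical names, only the bounds listed explicitly above are used, and treating everything at or above 1000 as one open-ended bucket is an assumption of the sketch.

```python
from bisect import bisect_right

# Only the bounds listed explicitly in the description. The real sequence
# continues past 1000; this sketch does not try to reproduce that part.
BUCKET_BOUNDS = [0, 1, 10, 20, 50, 100, 200, 500, 1000]


def find_bucket(partition_size):
    """Returns (lower, upper) such that lower <= partition_size < upper.

    upper is None for sizes at or above the last listed bound. That
    open-ended last bucket is an assumption of this sketch.
    """
    if partition_size < 0:
        raise ValueError("partition_size must be non-negative")
    # Index of the largest bound that is <= partition_size.
    idx = bisect_right(BUCKET_BOUNDS, partition_size) - 1
    lower = BUCKET_BOUNDS[idx]
    upper = BUCKET_BOUNDS[idx + 1] if idx + 1 < len(BUCKET_BOUNDS) else None
    return lower, upper
```

With these bounds, a partition of size 15 falls into [10, 20) and a partition of size 20 falls into [20, 50), since lower bounds are inclusive and upper bounds exclusive.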

Implementation details:

  1. A UtilityReportHistogram dataclass holding the histogram [partition_size_from, partition_size_to] -> UtilityReport.

  2. Computing UtilityReportHistogram in perform_utility_analysis. This is done by extending the existing computation to also compute the histogram.
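The two steps above can be sketched roughly as follows. This is an illustrative toy, not the PipelineDP implementation: `ToyReport`, `ToyReportHistogram` and `build_histogram` are hypothetical names, the report fields are invented stand-ins for whatever UtilityReport actually holds, and the histogram is keyed by the bucket's lower bound for brevity.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Dict, Iterable, Tuple

# Only the bounds listed explicitly in the description.
BOUNDS = [0, 1, 10, 20, 50, 100, 200, 500, 1000]


@dataclass
class ToyReport:
    """Hypothetical stand-in for UtilityReport; fields are illustrative."""
    num_partitions: int = 0
    sum_abs_error: float = 0.0

    def merge(self, other: "ToyReport") -> "ToyReport":
        # Aggregating two reports adds their counters.
        return ToyReport(self.num_partitions + other.num_partitions,
                         self.sum_abs_error + other.sum_abs_error)


@dataclass
class ToyReportHistogram:
    """Maps a bucket's lower bound to the aggregated report of that bucket."""
    buckets: Dict[int, ToyReport] = field(default_factory=dict)


def build_histogram(
        per_partition: Iterable[Tuple[int, float]]) -> ToyReportHistogram:
    """Aggregates (partition_size, abs_error) pairs into per-bucket reports."""
    hist = ToyReportHistogram()
    for size, abs_error in per_partition:
        if size < 0:
            raise ValueError("partition size must be non-negative")
        # Largest bound that is <= size selects the bucket.
        lower = BOUNDS[bisect_right(BOUNDS, size) - 1]
        current = hist.buckets.get(lower, ToyReport())
        hist.buckets[lower] = current.merge(ToyReport(1, abs_error))
    return hist
```

For example, `build_histogram([(12, 1.0), (15, 2.0), (30, 0.5)])` puts the first two partitions into the bucket starting at 10 and the third into the bucket starting at 20. In a real pipeline the merge step would presumably run as a distributed combine rather than a Python loop.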