OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Computing partition count histograms #359

Closed dvadym closed 1 year ago

dvadym commented 2 years ago

This PR implements the computation of two histograms over partitions: one for (partition_key, count), i.e. how many partitions have count=1, how many partitions have count=2, and so on, and one for (partition_key, privacy_id_count), i.e. how many partitions have privacy_id_count=1, how many partitions have privacy_id_count=2, and so on.
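
For illustration only, here is a minimal pure-Python sketch of what these two histograms contain. It is not the code in this PR and does not use the PipelineDP API: the function name `partition_histograms` and the input format (an iterable of `(privacy_id, partition_key)` pairs) are assumptions made for the example.

```python
from collections import Counter, defaultdict


def partition_histograms(rows):
    """Hypothetical helper: rows is an iterable of (privacy_id, partition_key)."""
    count_per_partition = Counter()
    privacy_ids_per_partition = defaultdict(set)
    for privacy_id, partition_key in rows:
        # count = number of rows in the partition.
        count_per_partition[partition_key] += 1
        # privacy_id_count = number of distinct privacy ids in the partition.
        privacy_ids_per_partition[partition_key].add(privacy_id)

    # Histogram: value -> number of partitions that have this value.
    count_histogram = Counter(count_per_partition.values())
    privacy_id_count_histogram = Counter(
        len(ids) for ids in privacy_ids_per_partition.values())
    return dict(count_histogram), dict(privacy_id_count_histogram)


# Toy input (made up for the example): partition "a" has 3 rows from 2 users,
# "b" has 1 row from 1 user, "c" has 2 rows from 1 user.
rows = [("u1", "a"), ("u1", "a"), ("u2", "a"),
        ("u1", "b"), ("u3", "c"), ("u3", "c")]
count_hist, privacy_id_count_hist = partition_histograms(rows)
print(count_hist)
print(privacy_id_count_hist)
```

On this toy input, each of the counts 1, 2 and 3 occurs in exactly one partition, while two partitions have privacy_id_count=1 and one partition has privacy_id_count=2.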

These are the same histograms that are used for computing cross-partition and per-partition contributions per privacy_id.

These histograms will be used in parameter tuning.