OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Sampling per partitions #358

Closed dvadym closed 1 year ago

dvadym commented 2 years ago

This PR implements sub-sampling of partitions for utility analysis. It works as follows:

  1. UtilityAnalysisEngine.aggregate has sampling_probability.
  2. The utility analysis ContributionBounder can sub-sample partitions using ValueSampler if the sampling probability is < 1.0.
  3. ValueSampler computes a hash of the partition key; if the hash (which is assumed to be uniformly distributed) is less than sampling_rate*max_hash_value, the partition_key is kept.
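
A minimal sketch of the hash-threshold rule described above, for illustration only. It is not the code from this PR: the class name HashSampler, the keep method, the choice of SHA-256, and the 64-bit hash range are all assumptions made for the example.

```python
import hashlib


class HashSampler:
    """Hypothetical stand-in for ValueSampler (not PipelineDP's actual class)."""

    # Assumption: hash values are 64-bit, i.e. in the range [0, 2**64).
    _MAX_HASH_VALUE = 2**64

    def __init__(self, sampling_rate: float):
        if not 0.0 <= sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be in [0, 1]")
        # A key is kept when its hash falls below this threshold.
        self._threshold = sampling_rate * self._MAX_HASH_VALUE

    def keep(self, partition_key) -> bool:
        """Returns True if the partition key survives sub-sampling."""
        # Assumption: SHA-256 of the key's repr serves as the uniform hash.
        digest = hashlib.sha256(repr(partition_key).encode("utf-8")).digest()
        hash_value = int.from_bytes(digest[:8], "big")
        return hash_value < self._threshold


# Usage: keep roughly 10% of partitions.
sampler = HashSampler(sampling_rate=0.1)
rows = [("pk%d" % i, i) for i in range(1000)]
sampled_rows = [(pk, value) for pk, value in rows if sampler.keep(pk)]
```

Because the decision depends only on the partition key, every row of a given partition is either kept or dropped together, and the result is the same on every worker without any coordination.
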
dvadym commented 1 year ago

Thanks for the review!