PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
This PR implements sub-sampling of partitions for utility analysis.
It works as follows:
1. `UtilityAnalysisEngine.aggregate` has `sampling_probability`.
2. The utility analysis `ContributionBounder` can sub-sample partitions using `ValueSampler` if the sampling probability is < 1.0.
3. `ValueSampler` computes a hash of the partition key. The hash is assumed to be uniformly distributed, so if it is less than `sampling_rate * max_hash_value`, the partition key is kept.
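
For reviewers, here is a minimal sketch of the hash-threshold idea described above. It is not the code in this PR: `HashSampler`, `MAX_HASH_VALUE`, and the choice of SHA-256 truncated to 64 bits are assumptions made only for illustration, and the real `ValueSampler` may hash differently.

```python
import hashlib


class HashSampler:
    """Hypothetical stand-in for ValueSampler (illustration only)."""

    # Assumption: hashes are 64-bit, so they lie in [0, 2**64).
    MAX_HASH_VALUE = 2**64

    def __init__(self, sampling_rate: float):
        if not 0.0 <= sampling_rate <= 1.0:
            raise ValueError("sampling_rate must be in [0, 1]")
        # A key is kept when its hash falls below this threshold.
        self._threshold = sampling_rate * self.MAX_HASH_VALUE

    def _hash(self, value) -> int:
        # SHA-256 is stable across processes, unlike Python's built-in
        # hash() for strings, so every worker makes the same decision.
        digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def keep(self, value) -> bool:
        # With a uniform hash, this is true with probability ~sampling_rate.
        return self._hash(value) < self._threshold


# Usage: keep roughly a quarter of the partition keys.
sampler = HashSampler(sampling_rate=0.25)
partition_keys = [f"pk{i}" for i in range(10000)]
kept_keys = [pk for pk in partition_keys if sampler.keep(pk)]
```

Because the decision depends only on the partition key, all rows of a given partition are either kept or dropped together, and repeated runs select the same partitions.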