PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark and Apache Beam, among others.
One part of the anonymization pipeline is contribution bounding, i.e. limiting the contributions from one privacy unit. A common way to specify the bounds is with `max_partitions_contributed` and `max_contribution_per_partition`. At the moment this is done with 2 samplings:
Sample `max_contributions_per_partition` contributions per (privacy_id, partition_key) (code)
Sample `max_partitions_contributed` partitions per privacy_id (code).
This is scalable, but it requires 2 shuffling sessions (each sampling requires a shuffle), which is expensive. Another way is to do a single group by `privacy_key` and perform the sampling in memory.
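To make the two-stage approach concrete, below is a minimal plain-Python sketch of what the two samplings do conceptually. The function names (`sample_per_partition`, `sample_partitions`) and the list-of-tuples data layout are illustrative assumptions, not PipelineDP's actual API; in the real pipeline each keyed grouping is a distributed group-by, hence a shuffle.

```python
import random
from collections import defaultdict

# Hypothetical illustration (not PipelineDP's actual API): the current
# approach conceptually runs two keyed samplings, each of which needs a
# shuffle in a distributed backend such as Beam or Spark.

def sample_per_partition(rows, max_contributions_per_partition):
    """Stage 1: group by (privacy_id, partition_key), keep at most
    max_contributions_per_partition values per pair."""
    groups = defaultdict(list)
    for privacy_id, partition_key, value in rows:
        groups[(privacy_id, partition_key)].append(value)
    out = []
    for (privacy_id, partition_key), values in groups.items():
        if len(values) > max_contributions_per_partition:
            values = random.sample(values, max_contributions_per_partition)
        out.extend((privacy_id, partition_key, v) for v in values)
    return out

def sample_partitions(rows, max_partitions_contributed):
    """Stage 2: group by privacy_id, keep contributions to at most
    max_partitions_contributed partitions per privacy unit."""
    groups = defaultdict(lambda: defaultdict(list))
    for privacy_id, partition_key, value in rows:
        groups[privacy_id][partition_key].append(value)
    out = []
    for privacy_id, per_partition in groups.items():
        keys = list(per_partition)
        if len(keys) > max_partitions_contributed:
            keys = random.sample(keys, max_partitions_contributed)
        for partition_key in keys:
            out.extend((privacy_id, partition_key, v)
                       for v in per_partition[partition_key])
    return out
```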
Goal
Implement contribution bounding with one group by `privacy_key`, doing the sampling in memory.
Note: since one privacy unit can contain a very large number of datapoints, we can cap the number of datapoints kept per privacy unit with some large constant, for example `10**7`.
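A minimal sketch of the proposed approach, under the same assumptions as the sketch above (plain Python, illustrative names such as `bound_contributions_in_memory`, not PipelineDP's actual API): after one group by the privacy key, all bounding happens in memory on a single privacy unit's datapoints, with the large cap applied first.

```python
import random
from collections import defaultdict

# Hypothetical sketch of the proposed single-group-by approach
# (not PipelineDP's actual API).

MAX_DATAPOINTS_PER_PRIVACY_UNIT = 10**7  # large safety cap from the note above

def bound_contributions_in_memory(datapoints,
                                  max_partitions_contributed,
                                  max_contributions_per_partition):
    """'datapoints' is the list of (partition_key, value) pairs for ONE
    privacy unit, obtained from a single group-by on the privacy key."""
    # Cap pathologically large privacy units before any further processing.
    if len(datapoints) > MAX_DATAPOINTS_PER_PRIVACY_UNIT:
        datapoints = random.sample(datapoints, MAX_DATAPOINTS_PER_PRIVACY_UNIT)

    # Bucket the unit's datapoints by partition key (in memory, no shuffle).
    per_partition = defaultdict(list)
    for partition_key, value in datapoints:
        per_partition[partition_key].append(value)

    # Cross-partition bounding: keep at most max_partitions_contributed partitions.
    keys = list(per_partition)
    if len(keys) > max_partitions_contributed:
        keys = random.sample(keys, max_partitions_contributed)

    # Per-partition bounding: keep at most max_contributions_per_partition values.
    out = []
    for partition_key in keys:
        values = per_partition[partition_key]
        if len(values) > max_contributions_per_partition:
            values = random.sample(values, max_contributions_per_partition)
        out.extend((partition_key, v) for v in values)
    return out
```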
Context
Prerequisites: PipelineDP terminology, especially privacy unit and partition key.
Code pointers