PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark and Apache Beam, among others.
One part of the anonymization pipeline is contribution bounding, i.e. limiting the contributions from one privacy unit. A common way to specify the bounds is with `max_partitions_contributed` and `max_contribution_per_partition`. At the moment this is done with 2 samplings:
Sample `max_contributions_per_partition` contributions per (privacy_id, partition_key) (code)
Sample `max_partitions_contributed` partitions per privacy_id (code).
This is scalable, but it requires 2 shuffling sessions (each sampling requires a shuffle), which is expensive. Another way is to do a single group by `privacy_key` and perform the sampling in memory.
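To make the two-stage approach concrete, below is a minimal plain-Python sketch of what the two samplings do conceptually. The function names (`sample_per_partition`, `sample_partitions`) and the list-of-tuples data layout are illustrative assumptions, not PipelineDP's actual API; in the real pipeline each keyed grouping is a distributed group-by, hence a shuffle.

```python
import random
from collections import defaultdict

# Hypothetical illustration (not PipelineDP's actual API): the current
# approach conceptually runs two keyed samplings, each of which needs a
# shuffle in a distributed backend such as Beam or Spark.

def sample_per_partition(rows, max_contributions_per_partition):
    """Stage 1: group by (privacy_id, partition_key), keep at most
    max_contributions_per_partition values per pair."""
    groups = defaultdict(list)
    for privacy_id, partition_key, value in rows:
        groups[(privacy_id, partition_key)].append(value)
    out = []
    for (privacy_id, partition_key), values in groups.items():
        if len(values) > max_contributions_per_partition:
            values = random.sample(values, max_contributions_per_partition)
        out.extend((privacy_id, partition_key, v) for v in values)
    return out

def sample_partitions(rows, max_partitions_contributed):
    """Stage 2: group by privacy_id, keep contributions to at most
    max_partitions_contributed partitions per privacy unit."""
    groups = defaultdict(lambda: defaultdict(list))
    for privacy_id, partition_key, value in rows:
        groups[privacy_id][partition_key].append(value)
    out = []
    for privacy_id, per_partition in groups.items():
        keys = list(per_partition)
        if len(keys) > max_partitions_contributed:
            keys = random.sample(keys, max_partitions_contributed)
        for partition_key in keys:
            out.extend((privacy_id, partition_key, v)
                       for v in per_partition[partition_key])
    return out
```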
Goal
Implement contribution bounding with one group by `privacy_key`, doing the sampling in memory.
Note: since one privacy unit can contain a very large number of datapoints, we can cap the number of datapoints kept per privacy unit with some large constant, for example `10**7`.
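A minimal sketch of the proposed approach, under the same assumptions as the sketch above (plain Python, illustrative names such as `bound_contributions_in_memory`, not PipelineDP's actual API): after one group by the privacy key, all bounding happens in memory on a single privacy unit's datapoints, with the large cap applied first.

```python
import random
from collections import defaultdict

# Hypothetical sketch of the proposed single-group-by approach
# (not PipelineDP's actual API).

MAX_DATAPOINTS_PER_PRIVACY_UNIT = 10**7  # large safety cap from the note above

def bound_contributions_in_memory(datapoints,
                                  max_partitions_contributed,
                                  max_contributions_per_partition):
    """'datapoints' is the list of (partition_key, value) pairs for ONE
    privacy unit, obtained from a single group-by on the privacy key."""
    # Cap pathologically large privacy units before any further processing.
    if len(datapoints) > MAX_DATAPOINTS_PER_PRIVACY_UNIT:
        datapoints = random.sample(datapoints, MAX_DATAPOINTS_PER_PRIVACY_UNIT)

    # Bucket the unit's datapoints by partition key (in memory, no shuffle).
    per_partition = defaultdict(list)
    for partition_key, value in datapoints:
        per_partition[partition_key].append(value)

    # Cross-partition bounding: keep at most max_partitions_contributed partitions.
    keys = list(per_partition)
    if len(keys) > max_partitions_contributed:
        keys = random.sample(keys, max_partitions_contributed)

    # Per-partition bounding: keep at most max_contributions_per_partition values.
    out = []
    for partition_key in keys:
        values = per_partition[partition_key]
        if len(values) > max_contributions_per_partition:
            values = random.sample(values, max_contributions_per_partition)
        out.extend((partition_key, v) for v in values)
    return out
```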
Context
Prerequisites: PipelineDP terminology, especially privacy unit and partition key.
Code pointers