OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Contribution bounding with Group By privacy unit #488

Open dvadym opened 10 months ago

dvadym commented 10 months ago

Context

Prerequisites: PipelineDP terminology, especially privacy unit and partition key.

One part of the anonymization pipeline is contribution bounding, i.e. limiting the contributions from a single privacy unit. A common way to specify the bounds is with max_partitions_contributed and max_contributions_per_partition. Currently this is done with 2 samplings:

  1. Sample max_contributions_per_partition values per (privacy_id, partition_key) (code).
  2. Sample max_partitions_contributed partitions per privacy_id (code).
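
To make the two stages concrete, here is a minimal in-memory sketch in plain Python. It does not use the PipelineDP API: the function name two_stage_bounding and the (privacy_id, partition_key, value) row layout are assumptions made for illustration. Each grouping step stands in for one shuffle in a distributed backend.

```python
import random
from collections import defaultdict


def two_stage_bounding(rows, max_partitions_contributed,
                       max_contributions_per_partition, seed=0):
    """Hypothetical helper. rows: iterable of (privacy_id, partition_key, value)."""
    rng = random.Random(seed)

    # Stage 1 (first shuffle): group by (privacy_id, partition_key) and keep
    # at most max_contributions_per_partition values per group.
    per_pair = defaultdict(list)
    for pid, pk, value in rows:
        per_pair[(pid, pk)].append(value)
    sampled_pairs = {}
    for key, values in per_pair.items():
        if len(values) > max_contributions_per_partition:
            values = rng.sample(values, max_contributions_per_partition)
        sampled_pairs[key] = values

    # Stage 2 (second shuffle): regroup by privacy_id and keep at most
    # max_partitions_contributed partitions per privacy unit.
    per_pid = defaultdict(list)
    for (pid, pk), values in sampled_pairs.items():
        per_pid[pid].append((pk, values))
    result = {}
    for pid, partitions in per_pid.items():
        if len(partitions) > max_partitions_contributed:
            partitions = rng.sample(partitions, max_partitions_contributed)
        for pk, values in partitions:
            result[(pid, pk)] = values
    return result
```

The two dictionaries are keyed differently, which is why a distributed implementation needs two separate shuffles.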

This is scalable, but it requires 2 shuffles (each sampling requires one), which is expensive. An alternative is to group by privacy_id and do the sampling in memory.

Goal

Implement contribution bounding with a single group by privacy_id, doing the sampling in memory.

Note: Since one privacy unit can contain too many data points, we can cap the number of data points per privacy unit with some large constant, for example 10**7.
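
A rough sketch of what the single group-by could look like, again in plain Python rather than against the PipelineDP API. The names group_by_privacy_id_bounding and bound_one_privacy_id are hypothetical. Applying the per-unit cap with reservoir sampling is an assumption: the note above only says the size should be limited.

```python
import random
from collections import defaultdict

MAX_ROWS_PER_PRIVACY_ID = 10**7  # the "large const" suggested in the note


def bound_one_privacy_id(pk_values, max_partitions_contributed,
                         max_contributions_per_partition, rng):
    """In-memory bounding for one privacy unit; pk_values: list of (partition_key, value)."""
    by_partition = defaultdict(list)
    for pk, value in pk_values:
        by_partition[pk].append(value)
    # Cross-partition bound: keep at most max_partitions_contributed partitions.
    partitions = list(by_partition)
    if len(partitions) > max_partitions_contributed:
        partitions = rng.sample(partitions, max_partitions_contributed)
    # Per-partition bound: keep at most max_contributions_per_partition values.
    bounded = {}
    for pk in partitions:
        values = by_partition[pk]
        if len(values) > max_contributions_per_partition:
            values = rng.sample(values, max_contributions_per_partition)
        bounded[pk] = values
    return bounded


def group_by_privacy_id_bounding(rows, max_partitions_contributed,
                                 max_contributions_per_partition,
                                 max_rows_per_privacy_id=MAX_ROWS_PER_PRIVACY_ID,
                                 seed=0):
    """Hypothetical helper. rows: iterable of (privacy_id, partition_key, value)."""
    rng = random.Random(seed)
    # The single group-by (one shuffle in a distributed backend). Reservoir
    # sampling (an assumption) keeps each group at max_rows_per_privacy_id rows.
    per_pid = defaultdict(list)
    seen = defaultdict(int)
    for pid, pk, value in rows:
        seen[pid] += 1
        bucket = per_pid[pid]
        if len(bucket) < max_rows_per_privacy_id:
            bucket.append((pk, value))
        else:
            j = rng.randrange(seen[pid])
            if j < max_rows_per_privacy_id:
                bucket[j] = (pk, value)
    # Everything after the group-by runs in memory, one privacy unit at a time.
    result = {}
    for pid, pk_values in per_pid.items():
        bounded = bound_one_privacy_id(pk_values, max_partitions_contributed,
                                       max_contributions_per_partition, rng)
        for pk, values in bounded.items():
            result[(pid, pk)] = values
    return result
```

Note that this sketch selects partitions first and then samples values within them, whereas the two-stage approach samples values first. Both orders enforce the same bounds.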

Code pointers

  1. ContributionBounder is the abstract base class for ContributionBounders.
  2. SamplingCrossAndPerPartitionContributionBounder is the class which does the current two-stage sampling.
  3. SamplingPerPrivacyIdContributionBounder is a class which samples a fixed number of contributions per privacy unit (it serves mostly as an example).
  4. Tests for contribution bounders
  5. Contribution bounder creation