OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Histogram error estimator #458

Closed dvadym closed 1 year ago

dvadym commented 1 year ago

This PR implements estimation of the RMSE from DatasetHistogram for given l0_bound and linf_bound values.

The algorithm is as follows:

  1. From l0_bound and l0_contributions_histogram, the ratio of data dropped by l0 contribution bounding (data_dropped_from_l0) is computed.
  2. From linf_bound and linf_contributions_histogram, the ratio of data dropped by linf contribution bounding (ratio_data_dropped_from_linf) is computed.
  3. The total ratio_data_dropped for contribution bounding is estimated from data_dropped_from_l0 and ratio_data_dropped_from_linf.
  4. Under the assumption that contribution bounding drops data uniformly across all partitions, a partition of size n is assumed to lose n*ratio_data_dropped data points to contribution bounding. The RMSE for this partition is computed as sqrt((n*ratio_data_dropped)**2 + noise_std**2).
  5. The per-partition RMSEs are averaged across all partitions.
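
The steps above can be sketched in plain Python. This is an illustration only, not the code in this PR: the histogram representation (a list of `(contribution_size, unit_count)` pairs), the clipping model in `ratio_dropped`, and the rule in `combine_ratios` (which assumes the l0 and linf bounding steps drop data independently) are all assumptions, since the description does not spell them out. Every function name here is hypothetical. Only the per-partition formula and the final averaging come directly from steps 4 and 5.

```python
import math
from typing import Sequence, Tuple

# Hypothetical histogram representation: (contribution_size, unit_count) pairs.
Histogram = Sequence[Tuple[int, int]]


def ratio_dropped(histogram: Histogram, bound: int) -> float:
    """Steps 1-2 (assumed model): share of contributions exceeding `bound`."""
    total = sum(size * count for size, count in histogram)
    if total == 0:
        return 0.0
    dropped = sum(max(size - bound, 0) * count for size, count in histogram)
    return dropped / total


def combine_ratios(ratio_l0: float, ratio_linf: float) -> float:
    """Step 3 (assumption): treat the two bounding steps as independent."""
    return 1.0 - (1.0 - ratio_l0) * (1.0 - ratio_linf)


def partition_rmse(n: float, ratio_data_dropped: float,
                   noise_std: float) -> float:
    """Step 4: sqrt((n*ratio_data_dropped)**2 + noise_std**2)."""
    return math.sqrt((n * ratio_data_dropped)**2 + noise_std**2)


def estimate_rmse(partition_sizes: Sequence[float], ratio_data_dropped: float,
                  noise_std: float) -> float:
    """Step 5: average the per-partition RMSEs over all partitions."""
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    rmses = [
        partition_rmse(n, ratio_data_dropped, noise_std)
        for n in partition_sizes
    ]
    return sum(rmses) / len(rmses)
```

For example, with `ratio_data_dropped = 0.5` and `noise_std = 4`, a partition of size 6 has the bias term 6 * 0.5 = 3, so its RMSE is sqrt(3**2 + 4**2) = 5, while an empty partition contributes only the noise term.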