OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Private contribution bounds computation #418

Closed RamSaw closed 1 year ago

RamSaw commented 1 year ago

Description

Implements a differentially private algorithm for choosing the max_partitions_contributed bound. Also introduces a common API that will support calculating max_contributions_per_partition as well. Closes #261
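For readers unfamiliar with the idea, here is a minimal sketch of how a bound like max_partitions_contributed could be chosen privately with the exponential mechanism. It is not necessarily the algorithm in this PR. The function name `choose_bound`, the `noise_cost` parameter, and the scoring function are all assumptions made for illustration, and the sketch uses only the Python standard library rather than PipelineDP APIs.

```python
import math
import random


def choose_bound(partitions_per_unit, candidates, epsilon, noise_cost=1.0, rng=None):
    """Hypothetical sketch: pick a contribution bound via the exponential mechanism.

    partitions_per_unit: number of partitions each privacy unit contributes to.
    candidates: candidate values for the bound.
    noise_cost: assumed, data-independent penalty per unit of bound (a larger
        bound means more noise); it does not affect the score sensitivity.
    """
    rng = rng or random.Random()
    b_max, b_min = max(candidates), min(candidates)

    scores = []
    for b in candidates:
        # Contributions lost by clipping to b. Counts are capped at b_max so
        # that a single privacy unit changes any score by at most b_max - b_min.
        dropped = sum(max(0, min(c, b_max) - b) for c in partitions_per_unit)
        scores.append(-(dropped + noise_cost * b))

    # Upper bound on the score sensitivity (overestimating keeps the privacy guarantee).
    sensitivity = max(b_max - b_min, 1)

    # Exponential mechanism: P(b) is proportional to exp(epsilon * score / (2 * sensitivity)).
    logits = [epsilon * s / (2 * sensitivity) for s in scores]
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(l - top) for l in logits]
    return rng.choices(candidates, weights=weights, k=1)[0]


if __name__ == "__main__":
    counts = [1, 1, 2, 2, 3, 50]
    print(choose_bound(counts, [1, 2, 3, 4], epsilon=1.0, noise_cost=1.5))
```

With a small epsilon the choice is close to uniform over the candidates; as epsilon grows, the mechanism concentrates on the candidate with the best trade-off between dropped contributions and the assumed noise penalty.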

Affected Dependencies

No new dependencies are required.

How has this been tested?

I implemented tests for all new classes and methods. They rely on probabilistic computations, but I ensured that they are not flaky. End-to-end examples were also written, and I checked that they work. The common CI setup can be used to test the changes (pytest tests analysis/tests

RamSaw commented 1 year ago

Hi!

I implemented the algorithm but haven't tested it thoroughly. Let's discuss the structure and agree on it; then I will add tests and comprehensive comments to the code.

To implement the algorithm in DPEngine I had to break circular dependencies, so I moved histograms.py and the PreAggregateExtractors class to the pipeline_dp module. I also had to extract DataExtractors into its own file. Does this make sense, or should we structure it differently?

This refactoring broke some tests in the analysis package and maybe in the utility_analysis package. I think they should be fixable, and I will fix them if the current structure LGTM.