OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Utility analysis performance optimization #415

Closed dvadym closed 1 year ago

dvadym commented 1 year ago

This PR improves the speed of per-partition analysis metric computations.

Before this PR: the accumulator for per-partition metrics could be in either sparse or dense mode, and the sparse -> dense conversion happened once the sparse representation grew large enough. In addition, when a sparse and a dense accumulator were merged, the sparse one was converted to dense. That conversion is expensive, and the sparse accumulator might contain only 1 data point, so many such conversions were performed.

After this PR: both the sparse and the dense parts are kept, and merge((sparse1, dense1), (sparse2, dense2)) is (merge(sparse1, sparse2), merge(dense1, dense2)). This avoids sparse -> dense conversions of accumulators that hold only a small number of elements.
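To make the new merge rule concrete, here is a minimal sketch of the idea. It is not the PipelineDP implementation: `Accumulator`, `to_dense`, `merge_dense`, `NUM_BUCKETS`, and `SPARSE_LIMIT` are hypothetical names, the dense form is assumed to be a fixed-size list of bucket counts, and the sketch assumes that conversion is still triggered once the merged sparse part grows past a threshold.

```python
from dataclasses import dataclass, field
from typing import List, Optional

NUM_BUCKETS = 8   # hypothetical size of the dense representation
SPARSE_LIMIT = 4  # hypothetical threshold for sparse -> dense conversion


def to_dense(points: List[int]) -> List[int]:
    """Converts raw data points into fixed-size bucket counts (the costly step)."""
    dense = [0] * NUM_BUCKETS
    for p in points:
        dense[p % NUM_BUCKETS] += 1
    return dense


def merge_dense(a: Optional[List[int]],
                b: Optional[List[int]]) -> Optional[List[int]]:
    """Adds two dense parts element-wise; None means 'no dense part yet'."""
    if a is None:
        return b
    if b is None:
        return a
    return [x + y for x, y in zip(a, b)]


@dataclass
class Accumulator:
    sparse: List[int] = field(default_factory=list)  # raw data points
    dense: Optional[List[int]] = None                # bucket counts


def merge(acc1: Accumulator, acc2: Accumulator) -> Accumulator:
    """Merges sparse with sparse and dense with dense, component-wise."""
    sparse = acc1.sparse + acc2.sparse
    dense = merge_dense(acc1.dense, acc2.dense)
    # Convert only when the sparse part is large, so a single data point
    # never forces a conversion on its own.
    if len(sparse) > SPARSE_LIMIT:
        dense = merge_dense(dense, to_dense(sparse))
        sparse = []
    return Accumulator(sparse, dense)


one_point = Accumulator(sparse=[1])
big = Accumulator(dense=to_dense([0, 1, 2, 3, 4]))
merged = merge(one_point, big)  # the single point stays sparse
```

With this layout, merging a one-element sparse accumulator into a dense one costs a list append instead of allocating and filling a dense structure, and the conversion cost is paid once per `SPARSE_LIMIT` points rather than once per merge.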

dvadym commented 1 year ago

Thanks for the review!