PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark and Apache Beam.
This PR improves the speed of per-partition analysis metric computations.
Before this PR: the accumulator for per-partition metrics could be in either sparse or dense mode, and the sparse -> dense conversion happened once the sparse representation became big enough. When a sparse and a dense accumulator were merged, the sparse one was converted to dense. The conversion is expensive, and the sparse accumulator might contain only 1 data point, so many such conversions were performed.
After this PR: the accumulator keeps both a sparse and a dense part, and merge((sparse1, dense1), (sparse2, dense2)) is (merge(sparse1, sparse2), merge(dense1, dense2)). This avoids sparse -> dense conversions when the sparse part holds only a few elements.
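A minimal sketch of the component-wise merge described above. It is not the actual PipelineDP code. The data layout (sparse as a list of (bucket index, value) pairs, dense as a list of per-bucket sums), the names `N_BUCKETS`, `SPARSE_LIMIT`, `to_dense` and `merge_dense`, and the choice to fold the sparse part into the dense part once it exceeds a threshold are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical sizes, chosen only for illustration.
N_BUCKETS = 4      # length of the dense representation
SPARSE_LIMIT = 3   # fold sparse into dense once it grows past this

Sparse = List[Tuple[int, float]]   # (bucket index, value) pairs
Dense = Optional[List[float]]      # per-bucket sums, or None if empty
Accumulator = Tuple[Sparse, Dense]


def merge_dense(d1: Dense, d2: Dense) -> Dense:
    """Element-wise sum of two dense parts; None means 'no dense data'."""
    if d1 is None:
        return d2
    if d2 is None:
        return d1
    return [a + b for a, b in zip(d1, d2)]


def to_dense(sparse: Sparse) -> List[float]:
    """The expensive step: allocates a full-size array for any input."""
    dense = [0.0] * N_BUCKETS
    for idx, value in sparse:
        dense[idx] += value
    return dense


def merge(acc1: Accumulator, acc2: Accumulator) -> Accumulator:
    """Merges sparse with sparse and dense with dense.

    A small sparse part is carried along unchanged, so merging it with a
    dense accumulator does not trigger a conversion.
    """
    sparse = acc1[0] + acc2[0]
    dense = merge_dense(acc1[1], acc2[1])
    if len(sparse) > SPARSE_LIMIT:
        # Convert only when the sparse part is large enough to justify it.
        dense = merge_dense(dense, to_dense(sparse))
        sparse = []
    return sparse, dense
```

In this sketch, merging `([(0, 1.0)], None)` with `([], [1.0, 2.0, 3.0, 4.0])` leaves the single sparse point as is. The dense array is only touched again once more than `SPARSE_LIMIT` sparse points have accumulated.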