OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0
270 stars 75 forks source link

Implement pure logarithmic grid for candidates search. #459

Closed RamSaw closed 1 year ago

RamSaw commented 1 year ago

Now we construct candidates set based on generate_possible_contribution_bounds method that generates numbers that have 3 left-most digits non-zero and others are zero. I.e., it will be [1, 2, 3, ..., 999, 1000, 1010, 1020, ..., 9990, 10000, 10100, 10200. This is not a pure logarithmic scale, it keeps relative difference between neighbouring almost the same across the candidates set (it varies from 0.1% to 1%). Such solution has a problem that beginning of the sequence can become very scarce due to sampling (we do sampling such sequence contains more than max_candidates values). It is bad because density in the beginning of the sequence should be higher, we need more elements there because a change there in absolute values affects the result much more than change in the right side of the sequence where values are big. For example, difference in result between l_0 = 1 and l_0 = 2 is much bigger than difference between l_0 = 1000 and l_0 = 1001.

Therefore in this PR we implement pure logarithmic grid. The code generates a sequence a_i, where a_i = max_value^(i / (max_candidates - 1)), i in [0..(max_candidates - 1)]