Closed dvadym closed 3 years ago
I would like to work on truncated geometric thresholding (it might take some time, as I'm not the sharpest tool on DP).
@tudorcebere sure, thank you!
It was done an analysis of using Python Wrappers of Google C++ library vs implementation of the algorithms fully in Python. Creating of wrappers is pretty a straightforward procedure, but writing, testing and maintaining of algorithms are not. And moreover having implementations of algorithms both in C++ and Python is additional maintenance cost.
So, the current approach is to use wrappers for C++ algorithms, whenever it's possible. Wrappers for Private Partition selection have not been implemented yet. I've checked how to do wrappers and now I have a clear understanding what needs to be done. It will be done soon (PR).
So the main thing that is left in this issue is to implement a pipeline function def _select_private_partitions(self, col)
(more details in the issue description).
@tudorcebere are you still interested in this task with these changes? If yes, do you have some estimates when you have time? I'm happy to help with any questions and if needed we can have pair programming.
Hey @dvadym! Still interested in doing this, but got a few rough weeks around the work on PySyft. You can open this up and if it's still available in ~2weeks, I might take a stab. Sorry for not answering earlier, it fell off my radar in github.
Hey, can I take this?
Sure, thanks, go ahead!
Context
Definition: The partition keys are called private if they are not known in advance but are determined based on the data contributed by the individuals in the datasets. More details.
The private partition selection is a procedure that ensures that the output partitions keys are selected in DP fashion. There are at least 2 methods for private partition selection. More details:
There will be implemented Python wrappers for [C++ implementation]((https://github.com/google/differential-privacy/blob/main/cc/algorithms/partition-selection.h).
Goals
To implement a part of pipeline which performs private partition selection.
API
A method of DPEngine class
def _select_private_partitions(self, col)
where col is collection of (partition_key, privacy_id_count, data) (the type of data is not important)
Output: a collection of (partition_key, data) only with selected partition keys.
Notes on implementation
Essentially truncated geometric algorithm specifies probability that a partition is kept in the output based on privacy_id_count in this partition.
Note: I'm happy to help with figuring out open questions.