OpenMined / PipelineDP

PipelineDP is a Python framework for applying differentially private aggregations to large datasets using batch processing systems such as Apache Spark, Apache Beam, and more.
https://pipelinedp.io/
Apache License 2.0

Evaluation of Python bindings vs. pure Python #13

Closed jspacek closed 3 years ago

jspacek commented 3 years ago

What?

We will consider the options for writing Python bindings to https://github.com/google/differential-privacy/blob/main/cc/algorithms/numerical-mechanisms.h vs. a pure Python implementation.

There is a discussion document listing the pros and cons 📝 https://docs.google.com/document/d/1PfQ6iJ858HcK4DkWuSpkIaOsc4_YMchTpG8vbcQy3Bk/edit?usp=sharing

Items to consider:

How long?

Perhaps a week?

Is your research related to a problem?

This relates to #11 Secure noise issue.

Additional Context

We could use the existing benchmark test for Laplace noise: https://github.com/google/differential-privacy/blob/f3e565a14b7d48869b650483de897eebc89ad494/go/noise/noise_test.go#L38
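For the pure Python side of the comparison, a timing baseline could look roughly like the sketch below. It is not a port of the linked Go test, and the sampler uses Python's standard `random` module, so it is not secure noise; the names `laplace_sample` and `benchmark` are made up for illustration.

```python
import random
import timeit


def laplace_sample(scale, rng):
    """Draw one Laplace(0, scale) sample (illustration only, NOT secure noise)."""
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution with location 0 and that scale.
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def benchmark(n_samples=100_000, scale=1.0):
    """Return the mean wall-clock time in seconds per generated sample."""
    rng = random.Random(0)  # fixed seed so runs are comparable
    elapsed = timeit.timeit(lambda: laplace_sample(scale, rng), number=n_samples)
    return elapsed / n_samples


if __name__ == "__main__":
    print(f"{benchmark() * 1e9:.0f} ns per sample")
```

The same `benchmark` shape could later wrap a call into a C++ binding, which would give a like-for-like number for the "fast enough" question.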

jspacek commented 3 years ago

@AbinavRavi @walexi @say-yam

dvadym commented 3 years ago

I've done some research on the different frameworks for building wrappers. It looks like PyBind11 is the best option available, and it's widely used in Google open source projects (e.g. in TensorFlow).

I've checked PyBind11 and tried some simple examples with it. It's pretty simple to use, so I think we should use PyBind11.

Moreover, PyDP already uses PyBind11, so we should probably build on PyDP. An additional plus of PyDP is that it already has the infrastructure in place: builds for all platforms and publishing of the package to PyPI.

So, taking into consideration the relative simplicity of building wrappers, I think we should start by trying wrappers, and if they work (e.g. are fast enough), that might be a better solution than reimplementing secure noise in Python.

I've checked PyDP, and there are some missing parts. For more context on the C++ library (it might be confusing at first :) ):

  1. There are distributions (Laplace, Gaussian) (source).
  2. And there are mechanisms (Laplace, Gaussian) (source).

Distributions generate samples from the corresponding distributions, while mechanisms provide a higher-level interface: managing epsilon and delta, computing the standard deviation from epsilon/delta, adding random samples to the value, and rounding correctly. The C++ building block library team strongly recommends using mechanisms, so we should use mechanisms.
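To make the distribution/mechanism split concrete, here is a minimal pure Python sketch. It is not the C++ library's API or PyDP's: the class names, the `l1_sensitivity` parameter, and the `add_noise` method are assumptions chosen for illustration, the sampler is not secure, and the rounding step that the C++ mechanisms perform is left out.

```python
import random


class LaplaceDistribution:
    """Low-level piece: only draws samples from Laplace(0, scale)."""

    def __init__(self, scale, rng=None):
        if scale <= 0:
            raise ValueError("scale must be positive")
        self.scale = scale
        self._rng = rng if rng is not None else random.Random()

    def sample(self):
        # Difference of two i.i.d. exponentials with mean `scale`.
        # Illustration only: NOT a secure noise implementation.
        rate = 1.0 / self.scale
        return self._rng.expovariate(rate) - self._rng.expovariate(rate)


class LaplaceMechanism:
    """Higher-level piece: owns epsilon and derives the noise scale from it."""

    def __init__(self, epsilon, l1_sensitivity=1.0, rng=None):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if l1_sensitivity <= 0:
            raise ValueError("l1_sensitivity must be positive")
        self.epsilon = epsilon
        self.l1_sensitivity = l1_sensitivity
        # Standard Laplace mechanism calibration: scale = sensitivity / epsilon.
        self._distribution = LaplaceDistribution(l1_sensitivity / epsilon, rng)

    @property
    def scale(self):
        return self._distribution.scale

    def add_noise(self, value):
        # Hypothetical method name; the rounding done by the C++ library is omitted.
        return value + self._distribution.sample()


if __name__ == "__main__":
    mechanism = LaplaceMechanism(epsilon=0.5, l1_sensitivity=2.0)
    print(mechanism.add_noise(10.0))
```

The caller of the mechanism never picks a noise scale directly; it supplies the privacy parameters, and the calibration lives in one place. That is the main reason to wrap mechanisms rather than distributions.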

At the moment PyDP contains wrappers for distributions (source), but there are no wrappers for mechanisms. So a good solution looks to be implementing wrappers for mechanisms in PyDP.

dvadym commented 3 years ago

Does anybody have C++ experience and would like to try to build wrappers for mechanisms?

AbinavRavi commented 3 years ago

Thanks for the insightful observations @dvadym. I can take a look at how to build the wrappers for the mechanisms.

dvadym commented 3 years ago

@AbinavRavi thank you!

In case of any problems or questions, please let me know (I can also help with C++ build problems if needed).

dvadym commented 3 years ago

The decision was made to use Python bindings to the C++ implementation.