OpenMined / PySyft

Perform data science on data that remains in someone else's server
https://www.openmined.org/
Apache License 2.0

Abstract Fixed Point Precision #12

Closed: iamtrask closed this issue 6 years ago

iamtrask commented 7 years ago

Description: All of our Homomorphic Encryption algorithms operate on integers under the hood. However, we want to be able to perform math on numbers of varying precision (most popularly, floats). Thus, we need someone to put together a generic fixed point precision wrapper that can provide this functionality around these integer-only encryption algorithms such that the following types are supported:

Acceptance Criteria:

samsontmr commented 7 years ago

Split this into sub-tasks for each type?

iamtrask commented 7 years ago

Maybe, though the best way to build it is probably to build the "Arbitrary" one first and then just instantiate the rest of them as specific configs of the arbitrary one.

siarez commented 7 years ago

How do you guys generally see this being implemented? I was imagining each encrypted tensor having its own self.precision. Then, when a new tensor is created from an array, the numbers in the array are multiplied by 10**self.precision before encryption and divided by 10**self.precision after decryption. For example, the encrypt method in paillier/keys.py would get a new line like x *= 10 ** precision:

    def encrypt(self, x, same_type=False, precision=4):
        """Encrypts x. x can be a plain int/float or a numpy
        vector/matrix/tensor."""
        # Scale so the fractional part survives the integer-only encryption;
        # decryption divides by the same factor (a real implementation would
        # also round the result to an int here).
        x *= 10 ** precision
        ...
        return self.pk.encrypt(x)

Then all math operations for PaillierTensor would have to be modified to take precision into account (see the sketch below). Does this sound right, or is there a better way to do this?
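For instance, multiplication is where that bookkeeping bites first: multiplying two values stored at precision p yields a result at precision 2p, which has to be rescaled (a plain-integer sketch, no encryption involved):

    # Two values at precision p are stored as round(value * 10**p).
    p = 4
    x_fixed = int(3.5 * 10 ** p)   # 35000
    y_fixed = int(2.0 * 10 ** p)   # 20000
    product = x_fixed * y_fixed    # now scaled by 10**(2p), not 10**p
    product = product // 10 ** p   # rebalance back to precision p
    assert product / 10 ** p == 7.0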

iamtrask commented 7 years ago

Man, great question. I spent a couple of hours yesterday working through a simple version of this just for the multi-party computation prototype, and I learned a few things about how we might like this to work. First, I found it's important to separate two pieces of configuration.

1) Tensor-Specific Config: each tensor (encrypted or not) will have its own configurable level of precision, which corresponds to its actual memory footprint for each number. However, the one thing these all seem to have in common is that the underlying representation can only be used to represent natural numbers (https://en.wikipedia.org/wiki/Natural_number). Thus, this tensor-specific precision is really about answering the question "what range of natural numbers can this tensor represent given this configuration?" Also, in the case of encryption, this configuration has a lot to do with both the speed and security of the encrypted operations.

2) Fixed Precision Abstraction: a separate issue is the upper and lower bounds, as well as the level of precision, of the tensor that the end user wants to use. In other words, most users won't care much about the underlying range from (1)... what they care about is the end result (Float, Double, Int, etc.). This tensor abstraction should allow one to know nothing about (1) and generally use the same interfaces for rebalancing (like you mentioned above with the multiplication).

Two useful resources:

python-paillier has implemented its own fixed precision logic (which we're currently using for PaillierTensor but want to move away from toward something more abstract), and the blog post has a simple implementation as well (which we're currently using for our multi-party computation prototype but also want to move away from toward something more abstract).
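For reference, the blog-post-style implementation is roughly the following (a minimal sketch with made-up parameters: Q plays the role of the natural-number range from (1), while BASE and PRECISION form the user-facing abstraction from (2)):

    # Minimal sketch of blog-post-style fixed point in a prime field.
    # Q, BASE, and PRECISION are illustrative values, not PySyft constants.
    Q = 2 ** 61 - 1
    BASE = 10
    PRECISION = 6

    def encode(rational):
        upscaled = int(round(rational * BASE ** PRECISION))
        return upscaled % Q   # negative values wrap into the top of the field

    def decode(field_element):
        # Elements above Q // 2 are interpreted as negative numbers.
        upscaled = field_element if field_element <= Q // 2 else field_element - Q
        return upscaled / BASE ** PRECISION

    assert decode(encode(-3.25)) == -3.25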

iamtrask commented 7 years ago

http://fabiensanglard.net/floating_point_visually_explained/

siarez commented 7 years ago

I think I get your points, but I'm still a bit confused.

  1. Do we want the user to explicitly define the upper and lower bounds and the precision of the tensors they create?
  2. Can you give an example of what the API would look like to the user when we have support of all these types?

Also, what is the purpose of the Float class in he/paillier/basic.py? Is it related to this issue?

iamtrask commented 7 years ago
  1. While this is definitely a feature we want, I think we can expect most users to simply rely on defaults and instead use familiar "named" tensors to infer the precision they're getting. E.g....

PaillierFloatTensor - a tensor with Float-level precision, configured to be optimal for Paillier encryption.
BVLongTensor - a tensor with Long-level precision, configured to be optimal for BV encryption.

While custom would be optional... I think most folks would simply grab one of these guys.

Thoughts?

“Also, what is the purpose of the Float class in he/paillier/basic.py? Is it related to this issue?”

That Float class would likely become deprecated in lieu of an "Integer" class, which would use our abstract fixed point precision stuff (generic to all tensors). We are only using that Float there because python-paillier has fixed precision baked in (which is not true for our other libs).

siarez commented 7 years ago

“While custom would be optional... I think most folks would simply grab one of these guys. Thoughts?” Oh, I don't think custom types are necessary either. PaillierFloatTensor and BVLongTensor... sound good. It's similar to PyTorch, so even better!

Now, what I understand is that PaillierFloatTensor and PaillierLongTensor will be wrappers around PaillierTensor (which is the integer class with our “abstract fixed point precision stuff”). For example, PaillierLongTensor will inherit from PaillierTensor, but override precision configuration and methods as necessary to behave like a proper LongTensor, correct?
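Something like the following, perhaps (hypothetical class and attribute names, sketching the inheritance rather than any actual PySyft API):

    # Hypothetical sketch of the hierarchy; names and config fields are
    # illustrative only.
    class PaillierTensor:
        """Integer tensor with abstract fixed point precision support."""
        base = 10
        precision = 4                 # digits to the right of the point

        def __init__(self, data):
            self.data = data          # already-scaled integers under the hood

    class PaillierLongTensor(PaillierTensor):
        """Behaves like a LongTensor: integers only, no fractional digits."""
        precision = 0

    class PaillierFloatTensor(PaillierTensor):
        """Float-level precision, tuned for Paillier's plaintext space."""
        precision = 7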

iamtrask commented 7 years ago

Couldn't have said it better myself!!!

siarez commented 7 years ago

Ok I'm gonna start working on this. (hands his beer to a bystander and starts stretching)

(If anyone is interested, please feel free to work on this issue. I'm not sure when (or if) I'll be able to offer a good solution. )

iamtrask commented 7 years ago

Hands @siarez 2 beers and a shot of Jack Daniels. You got this!!!

siarez commented 7 years ago

After some deep introspection, I realized my FixedPoint class (in my pull request) has issues. I found a fixed point module that seems pretty solid: SPFPM. It lends itself very well to what we are trying to do. However, it does not seem to be pip-installable, even though it is listed on PyPI. @iamtrask what is your position on using this module?

siarez commented 7 years ago

For now, I decided not to use SPFPM. So you can ignore my previous comment.

Here are a few thoughts that I have been grappling with while working on this issue:

  1. Why should we support non-Python data types (e.g. Byte, Half, Double, Short...) when the implementation of our encryption (e.g. phe) is in Python and uses Python numeric datatypes? It probably makes sense for PyTorch, because their arithmetic is implemented in C, but that is not the case for us. In other words, limiting the datatype by number of bits in Python seems like an unnecessary constraint to impose on ourselves.

  2. Also, as I understand it, phe has floating-point precision built in, not fixed-point. So I don't quite understand why we need to implement an abstract fixed-point datatype when our end goal is to support floating-point datatypes. Why not implement an abstract floating-point type from the get-go?

  3. Also (this could be due to my ignorance, but) I don't understand how we can support floating-point operations with a fixed-point datatype.

@iamtrask Thoughts?

iamtrask commented 7 years ago
  1. PyTorch supports them and we would like to interop with PyTorch.
  2. That's an oversight on my part... although there is some information leakage that might bother folks, since the exponent is left unencrypted. It seems like a good idea to offer both (but we can pick just one for now... your choice).
  3. You're correct.
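For what it's worth, that leakage in (2) is visible in python-paillier today: the exponent of an encrypted float is a plaintext attribute on the ciphertext (standard phe calls; the value is arbitrary):

    from phe import paillier

    # The mantissa is encrypted, but the exponent rides along in the clear.
    public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)
    enc = public_key.encrypt(3.14)
    print(enc.exponent)   # readable without the private key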
siarez commented 7 years ago
  1. In that case, I think we should think about the requirements for interoperability, and what kinds of interoperability are possible. For example, given that our ciphertexts are huge integers (that don't fit in 64 bits), I don't think it is possible to use PyTorch tensors, and therefore I don't think we can use PyTorch to do operations on our encrypted tensors. (It is a different story with NumPy, because it lets you store your own data types in an ndarray; see the sketch at the end of this comment.)
  2. Is it possible to encrypt the exponent if we are using fixed point? It's not obvious to me how. But I did a bit of number crunching and found out:

    • A 64-bit float has 2 * 2^52 * 2^11 == 2^64 == 18446744073709551616 distinct bit patterns (sign × mantissa × exponent)
    • The biggest positive number for a 64-bit float is ~1.79769e+308
    • The plaintext space of 1024-bit Paillier encrypted numbers has roughly 2^1022 usable elements
    • The biggest positive (integer) number for a 1024-bit Paillier encrypted number is 2^1021 ≈ 10^307 (I suspect that would not be very different for other encryption schemes either.)
    • For example, we can have a fixed point with 30 bits to the right of the binary point, 970 bits to the left, and one sign bit. This gives us a range of [-1e292, +1e292] and a precision of 1/2^30 ≈ 9.3e-10

    Until now, I had not noticed that we have 1022 bits available for our fixed point! I think that means it is possible to have a fixed-point number with a reasonable range and precision for deep learning tasks. Therefore, I think fixed point should be good enough for now.
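That bit budget is easy to sanity-check with a few lines (the 30/970 split above is just one possible choice):

    # Verify the range/precision trade-off for a fixed-point split of the
    # ~1022 bits a 1024-bit Paillier plaintext space gives us.
    TOTAL_BITS = 1022
    FRACTIONAL_BITS = 30    # to the right of the binary point
    INTEGER_BITS = 970      # to the left of the binary point
    SIGN_BITS = 1
    assert FRACTIONAL_BITS + INTEGER_BITS + SIGN_BITS <= TOTAL_BITS

    max_value = 2 ** INTEGER_BITS              # upper end of the range, ~1e292
    resolution = 1 / 2 ** FRACTIONAL_BITS      # smallest step, ~9.3e-10
    print(f"range: +/- {float(max_value):.3g}, resolution: {resolution:.3g}")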

@iamtrask If my points in (2) make sense to you, I can polish up my PR so it can be merged. Also, what do you think about (1)?
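And on (1), the NumPy point is quick to demonstrate (plain NumPy; the values are arbitrary stand-ins for ciphertexts):

    import numpy as np

    # An object-dtype ndarray can hold arbitrary-precision Python ints, so
    # ciphertexts far larger than 64 bits still get elementwise operations.
    ciphertexts = np.array([2 ** 1021, 2 ** 1021 + 1], dtype=object)
    print((ciphertexts + 1)[0].bit_length())   # 1022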

iamtrask commented 7 years ago
  1. Totally agree, but I think the best use case of PyTorch integration is the other way around: we want PyTorch's autograd and neural network implementations to use our tensors, so the high-level interfaces should mirror theirs as best we can.

  2. I think they do. That being said, it seems like there are tradeoffs to both. For floating point, in theory you could intentionally use a smaller range (which would make it faster, yes?) and let the decimal point movements do the heavy lifting. So in the long run it might be interesting to still offer both, but at this point I'm just playing devil's advocate.

siarez commented 7 years ago
  1. Ok, that sounds achievable :)
  2. That may be true for regular unencrypted floats, but in our case (1024-bit Paillier encryption) all numeric types, regardless of their size and type, turn into fixed-size ciphertexts. So I expect there will not be any performance difference, because all arithmetic operations are done on fixed-size ciphertexts, and the number of bits in the ciphertext is dictated by the encryption scheme/config, not the original datatype (quick check below).
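A quick check with python-paillier (standard phe calls; key size and values are arbitrary): ciphertexts live mod n**2, so they are the same size no matter what was encrypted.

    from phe import paillier

    # Ciphertext size depends on the key, not the plaintext's type or magnitude.
    public_key, _ = paillier.generate_paillier_keypair(n_length=1024)
    small = public_key.encrypt(0.001)
    large = public_key.encrypt(10 ** 50)
    # Both ciphertexts are ~2048-bit integers.
    print(small.ciphertext().bit_length(), large.ciphertext().bit_length())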

I'll update my code with the recent insights.