data61 / MP-SPDZ

Versatile framework for multi-party computation

Efficient dot product with Shamir secret sharing #474

Closed lu562 closed 2 years ago

lu562 commented 2 years ago

Hi,

May I know if there are efficient dot product protocols (communication independent of vector size) implemented for Shamir secret sharing? If so, are there some example codes for that?

The documentation mentions that SPDZ-wise has it implemented, but that is for replicated secret sharing.

Appreciate it!

lu562 commented 2 years ago

And one more follow up question:

I tried running some examples with atlas-party.x, as it seems to provide the efficient dot product I need (I'd appreciate it if you could confirm this). I separate the offline phase and the online phase by running Fake-Offline.x first, then running the online phase with the parameters "-F" and "-v".

The print output is as follows:

Communication details (rounds in parallel threads counted double):
Sending to all 6.4e-05 MB in 2 rounds, taking 0.00457724 seconds
Sending/receiving 0.21344 MB in 5 rounds, taking 0.0274816 seconds
CPU time = 0.0320301
The following timing is exclusive preprocessing.
Time = 0.0532217 seconds
Data sent = 0.213568 MB in ~7 rounds (party 0)
Global data sent = 0.640496 MB (all parties)

In the code, I just do one dot product and one open, so I was expecting 2 rounds. May I assume that the second line is for the online phase? I'm also not sure what causes the communication in the third line, since it stays the same when I change the code to do more dot products. Is it the communication for the offline phase?

The code I play with is as simple as this:

k = 10000
a = Array(k, sint)
b = Array(k, sint)
@for_range(k)
def f(i):
    a[i] = sint(2)
    b[i] = sint(2)

c = sint.dot_product(a, b)
c_plain = c.reveal()
print_ln("%s", c_plain)

Thank you!

mkskeller commented 2 years ago

May I know if there are efficient dot product protocols (communication independent of vector size) implemented for Shamir secret sharing? If so, are there some example codes for that?

I can't think of any write-up. The essential idea is to sum up the additive secret sharings that result from local multiplication before the resharing step. This way, the resharing step is only executed for one share instead of several.
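
In rough pseudocode (an illustrative sketch only, not the MP-SPDZ internals; reshare is a hypothetical stand-in for the protocol's degree-reduction step):

def dot_product_share(a_shares, b_shares, reshare):
    # a_shares[i], b_shares[i] are this party's degree-t Shamir shares.
    # Each local product is a degree-2t share, and so is their sum,
    # since Shamir sharing is linear.
    c_2t = sum(x * y for x, y in zip(a_shares, b_shares))
    # Reshare once for the whole sum instead of once per product,
    # so the communication is independent of the vector length.
    return reshare(c_2t)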

mkskeller commented 2 years ago

In the code, I just do one dot product and one open, so I was expecting 2 rounds. May I assume that the second line is for the online phase? I'm also not sure what causes the communication in the third line, since it stays the same when I change the code to do more dot products. Is it the communication for the offline phase?

Which version are you using? I'm getting fewer rounds with the latest version. There are a few things going on.

lu562 commented 2 years ago

I think I'm using the latest version, as I just pulled the repo today, and I used atlas-party.x.

Also, I realized that when I change the code (e.g. doing more dot products), the second line of the benchmark output ("Sending to all XXX MB in XXX rounds") changes, and it is consistent with the code change. However, the third line ("Sending/receiving XXX MB in XXX rounds") always shows the same numbers. I'm not sure if this is intended?

mkskeller commented 2 years ago

Sorry, I missed that you're using Atlas. Preprocessing from file isn't supported for double sharings, so the communication does include preprocessing.

lu562 commented 2 years ago

oh got it, thank you!

Then is there a way to benchmark offline phase and online phase separately for Atlas?

I had thought that "Sending/receiving 0.21344 MB in 5 rounds, taking 0.0274816 seconds" was for the offline phase and "Sending to all 6.4e-05 MB in 2 rounds, taking 0.00457724 seconds" was for the online phase, because I was expecting the online phase to take 2 rounds, so it seems that was just a coincidence :)

btw, may I know if there are other protocols besides Atlas that support a constant-communication dot product?

lu562 commented 2 years ago

Sorry for more questions :)

I tried running the code using sy-shamir-party.x, and the benchmark output has the same issue as Atlas, so I cannot separate the communication/time of the offline and online phases.

Besides, if I run 1000 dot products in a for loop, the benchmark shows it takes 1000 rounds. May I know if there is a way to run those dot products in parallel (in the same round)? The dot product inputs are independent of each other.

Thanks!

mkskeller commented 2 years ago

SPDZ-wise protocols don't use preprocessing for multiplication, so all computation is in the online phase.

Which loop are you using? @for_range_opt should parallelize to some extent.
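
For example, something along these lines (a sketch assuming the inputs are stored row-wise in matrices; exact container handling may need adjusting):

n = 1000
k = 100
a = sint.Matrix(n, k)
b = sint.Matrix(n, k)
res = sint.Array(n)

@for_range_opt(n)
def _(i):
    # the dot products are independent, so the optimizer can merge rounds
    res[i] = sint.dot_product(a[i], b[i])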

lu562 commented 2 years ago

Thank you! @for_range_opt indeed helps reduce the number of rounds. To understand the communication benchmark better, I compile the following code:

a = sint(10)
b = sint(10)
c = a * b

and run it with sy-shamir-party.x, and I get the following benchmark log:

Compiler: ./compile.py -F 64 play
Communication details (rounds in parallel threads counted double):
Partial broadcasting 1.6e-05 MB in 1 rounds, taking 0.00296882 seconds
Sending to all 1.6e-05 MB in 1 rounds, taking 0.0164529 seconds
Sending/receiving 0.160192 MB in 6 rounds, taking 0.00797508 seconds
CPU time = 0.0143699
The following timing is inclusive preprocessing.
Time = 0.0158316 seconds 
Data sent = 0.160256 MB in ~8 rounds (party 0)
Global data sent = 0.480704 MB (all parties)
Actual cost of program:
Command line: /home/lu/MP-SPDZ/MP-SPDZ/Scripts/../sy-shamir-party.x 0 -v play -pn 10836 -h localhost -N 3

Any idea why it is 6 rounds in total? If there is a one-time setup, how do I measure it?

mkskeller commented 2 years ago

SPDZ-wise uses a checking protocol to verify the multiplications at the end; see the original paper: https://eprint.iacr.org/2018/570. I've had a quick look and got the following overall tally: 1 round for setup, 2 rounds for the multiplication (it could be one, but two makes the code easier), and 5 rounds for checking, which adds up to the ~8 rounds in your output.

lu562 commented 2 years ago

Thank you for the clarification!

mkskeller commented 2 years ago

Then is there a way to benchmark offline phase and online phase separately for Atlas?

No.

btw, may I know if there are other protocols besides Atlas that support a constant-communication dot product?

Semi-honest Rep3, Rep4, all SY protocols, and semi-honest Shamir. Hemi and Temi support a more efficient matrix multiplication but not a dot product.

lu562 commented 2 years ago

Appreciate the information :)

lu562 commented 2 years ago

The dot product works amazingly fast for semi-honest Shamir, thanks!

btw, what protocol should I choose if I want the fastest matrix multiplication with Shamir sharing?

mkskeller commented 2 years ago

If there is an efficient dot product, it will be used for matrix multiplication.
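
For example, a sketch like the following should already benefit from the dot-product instruction, since each of the n^2 output entries is one dot product, giving O(n^2) communication overall:

n = 100
X = sint.Matrix(n, n)
Y = sint.Matrix(n, n)
# plain matrix multiplication via the * operator
Z = X * Y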

lu562 commented 2 years ago

oh good! that means semi-honest Shamir can also do n-by-n matrix multiplication with O(n^2) communication, right?

mkskeller commented 2 years ago

Yes

lu562 commented 2 years ago

Does semi-honest Shamir support benchmarking the offline and online phases separately?

mkskeller commented 2 years ago

Yes, but there isn't an offline phase for multiplication, so it would be mainly randomness used for non-linear functions.

lu562 commented 2 years ago

great, thanks!

lu562 commented 2 years ago

When I try initializing secret-shared 10000x10000 matrices, the compilation times out. Did I hit some limit, or should I change some config files to make it work?

mkskeller commented 2 years ago

Did you use Python for loops? If so, you need to use more specific loop instructions: https://mp-spdz.readthedocs.io/en/latest/troubleshooting.html#compile-py-takes-too-long-or-runs-out-of-memory
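
For example, an initialization along these lines keeps the compiled code size constant instead of unrolling n^2 Python-level assignments (a sketch using the vectorized assign_all):

n = 10000
A = sint.Matrix(n, n)
# one vectorized assignment instead of an unrolled Python loop
A.assign_all(2)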

lu562 commented 2 years ago

Hi,

Indeed, it was because of the loop. Thanks!

lu562 commented 2 years ago

Hi, when I try to multiply two 3000x3000 matrices, the compiler raises the error "vector too large". Is this a hard limit, or is there a way to get around it? Thanks!

mkskeller commented 2 years ago

Thank you for raising this. You should find that 6664de3f77 fixes it.

lu562 commented 2 years ago

awesome, thank you!

btw, I realized that the multiplication of large matrices is time-consuming. Is there a way to apply multi-threading or other optimizations to it? I'm thinking about using multi-threading for each dot product, but as the exposed API is simply something like Z = X * Y, I'm not sure how to do it.

mkskeller commented 2 years ago

With 60dd78797e, you can call X.dot(Y, n_threads=n) for multithreaded matrix multiplication.
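
For example (the thread count of 4 here is an arbitrary choice):

n = 3000
X = sint.Matrix(n, n)
Y = sint.Matrix(n, n)
# distribute the matrix multiplication over four threads
Z = X.dot(Y, n_threads=4)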

lu562 commented 2 years ago

that is great! thanks.

lu562 commented 2 years ago

Hi, I was testing the multi-threaded matrix multiplication and found some interesting behavior. When I multiply two large matrices (e.g. 3000x3000) with only one thread, I can see that the compiler compiles it into multiple smaller matrix multiplications. Below is the compilation output for a 3000x3000 matrix multiplication without multi-threading:

Program requires at most:
 27000000000 integer triples
           1 matrix multiplications (204x3000 * 3000x3000)
           2 matrix multiplications (1398x3000 * 3000x3000)

It looks like you are splitting the first matrix into multiple smaller ones, which makes sense to me. When I try n_threads = 2, I get the following result:

Program requires at most:
 27000000000 integer triples
           2 matrix multiplications (102x3000 * 3000x3000)
           2 matrix multiplications (1398x3000 * 3000x3000)

So it looks like the multi-threading only splits the small matrix (204x3000) into two smaller matrices and uses multi-threading to compute each of them, while leaving the large matrices (1398x3000) unchanged. This may leave the main part of the computation outside the effect of multi-threading. Could you help check whether this is the case? Thanks!

This problem does not happen when the matrix size is small. For small matrices, n_threads=x divides the first matrix into x pieces of the same size, so there is no problem.

mkskeller commented 2 years ago

Your observations are correct. The main reason for the split is that the system cannot handle output matrices larger than ~4 million entries, hence the 1398x3000 multiplication (1398 * 3000 = 4,194,000 entries), which is just below the threshold. The remaining multiplication is also distributed among the threads in the case of multi-threading. From an efficiency point of view, the way the split is made does not matter as long as the distribution is even among the threads.

lu562 commented 2 years ago

thanks for the explanation! It makes good sense.

lu562 commented 2 years ago

Just for confirmation: for the example of 3000x3000 matrix multiplication, both the matrix multiplications (204x3000 * 3000x3000) and the matrix multiplications (1398x3000 * 3000x3000) will be multi-threaded, right?

The reason I ask is that the compiler does not split the (1398x3000 * 3000x3000) multiplication into two (699x3000 * 3000x3000) multiplications when n_threads = 2 in the printed compilation output.

mkskeller commented 2 years ago

In the case of two threads, both execute a 1398x3000 and then a 102x3000 multiplication, as 2 * (1398 + 102) = 3000.

lu562 commented 2 years ago

Oh I see, so it is like the compiler first splits it into two 1500x3000 matrices, then splits 1500 into 1398 + 102.

mkskeller commented 2 years ago

Yes

lu562 commented 2 years ago

got it, thanks!