data61 / MP-SPDZ

Versatile framework for multi-party computation

Multiplication performance in various settings #154

Closed Vomvas closed 3 years ago

Vomvas commented 4 years ago

Hello,

I have a few questions regarding the implementation and performance of multiplications in MP-SPDZ. I am considering sint types for simplicity, and unless otherwise stated I am using MASCOT with preprocessed data.

Basic example

In the following snippet, c is a constant known to all parties, not a secret-shared input, yet it is treated as such when it comes to multiplication efficiency. My guess is that the compiler has no way to tell them apart and optimize the operation as if it were a known constant, so it's up to me to declare it as a clear-type value. Is this the intended design?

a = sint.get_input_from(0)
b = sint.get_input_from(0)
c = sint(3)
d = cint(3)
e = 3

library.start_timer(1)
res = a * b
library.stop_timer(1)

library.start_timer(2)
res = a * c
library.stop_timer(2)

library.start_timer(3)
res = a * d
library.stop_timer(3)

library.start_timer(4)
res = a * e
library.stop_timer(4)
Time = 11.0211 seconds            # Total time
Time1 = 1.00147 seconds           # Multiply secret inputs
Time2 = 1.00197 seconds           # Multiply secret input with secret constant
Time3 = 2.926e-06 seconds         # Multiply secret input with clear constant
Time4 = 8.181e-06 seconds         # Multiply secret input with public constant

Parallel communication

My understanding is that @for_range_parallel(n_parallel, n_loops) tries to parallelize the communication required for the multiplications in the loop, up to the given n_parallel. The compilation time increases noticeably as I increase n_parallel; is this in order to allocate the additional resources required for preprocessed data?

In the following example I expect the multiplications to be executed in 1 round of communication, whether nmults=1000 or nmults=5000. During compilation I see `Program requires: 1000 integer triples 1 virtual machine rounds`, but when I run the program, the output (please see below) shows multiple rounds of communication.

nmults = 1000
a = sint(114)
b = sint(124)
res = Array(nmults, sint)

@for_range_parallel(nmults, nmults)
def _(nmult):
    res[nmult] = (a * b)

The verbose output shows more than double the time for 5k multiplications, and the communication times are also higher, even though the number of rounds is the same. Does this have to do with the amount of transferred data, even though it's in the order of a few MBs?

Number of SPDZ gfp multiplications: 1000
SPDZ gfp Triples reading: 0.000631223
Broadcasting 0.000204 MB in 7 rounds, taking 0.150708 seconds
Receiving directly 0.256 MB in 4 rounds, taking 0.00341217 seconds
Sending directly 0.256 MB in 4 rounds, taking 0.000268782 seconds
Receiving took 0.00341472 seconds
Memory usage: sint=9192 cint=8192 sgf2n=8192 cgf2n=8192 regint=8192 
Join timer: 0 0.0264062
Finish timer: 0.117548
Process timer: 0.0191011
Time = 0.144312 seconds 
Data sent = 0.256816 MB
Global data sent = 0.51608 MB
Summed all shares at once
Full broadcast
Actual cost of program:
  Type int
          1000        Triples
End of prog
Number of SPDZ gfp multiplications: 5000
SPDZ gfp Triples reading: 0.00138983
Broadcasting 0.000204 MB in 7 rounds, taking 0.190703 seconds
Receiving directly 1.28 MB in 4 rounds, taking 0.147623 seconds
Sending directly 1.28 MB in 4 rounds, taking 0.000859722 seconds
Receiving took 0.147631 seconds
Memory usage: sint=13192 cint=8192 sgf2n=8192 cgf2n=8192 regint=8192 
Join timer: 0 0.182439
Finish timer: 0.162084
Process timer: 0.0476061
Time = 0.344803 seconds 
Data sent = 1.28082 MB
Global data sent = 2.56408 MB
Summed all shares at once
Full broadcast
Actual cost of program:
  Type int
          5000        Triples
End of prog

Thanks so much for your time!

mkskeller commented 4 years ago

In the following snippet, c is a constant known to all parties, not a secret-shared input, yet it is treated as such when it comes to multiplication efficiency. My guess is that the compiler has no way to tell them apart and optimize the operation as if it were a known constant, so it's up to me to declare it as a clear-type value. Is this the intended design?

Indeed it is. The compiler will never replace types.
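A minimal sketch of the resulting pattern, based on the snippet in the question (variable names are illustrative): declaring the public constant as a cint or a plain Python integer keeps the multiplication local, so no triple is consumed.

a = sint.get_input_from(0)

c = sint(3)    # secret-shared although the value is public: a * c uses a triple
d = cint(3)    # clear value: a * d only scales the shares locally
e = 3          # Python constant: handled like the cint case after compilation

uses_triple = a * c
local_only1 = a * d
local_only2 = a * e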

My understanding is that @for_range_parallel(n_parallel, n_loops) tries to parallelize the communication required for the multiplications in the loop, up to the given n_parallel. The compilation time increases noticeably as I increase n_parallel; is this in order to allocate the additional resources required for preprocessed data?

No, this is about unrolling and optimizing. The compiler will create bytecode that is linear in n_parallel.
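As an illustrative sketch of that trade-off (the chunk size of 100 is an arbitrary choice for this example, not a recommendation): reducing n_parallel bounds the unrolled bytecode while still merging the communication within each chunk.

nmults = 5000
a = sint(114)
b = sint(124)
res = Array(nmults, sint)

# 100 iterations are unrolled and merged per chunk, so the bytecode size grows
# with 100 rather than with nmults, at the cost of nmults / 100 merged rounds.
@for_range_parallel(100, nmults)
def _(i):
    res[i] = a * b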

In the following example I expect the multiplications to be executed in 1 round of communication, whether nmults=1000 or nmults=5000. During compilation I see `Program requires: 1000 integer triples 1 virtual machine rounds`, but when I run the program, the output (please see below) shows multiple rounds of communication.

The number of virtual machine rounds is different from the number of communication rounds. The broadcasting rounds are a one-off for setting up. The fact that the parties send and receive separately is the default star-shaped communication, which you can change with `mascot-party.x -d`. Finally, when I run your example, I get only 1 round of sending/receiving. What version, compiler options, and run-time options did you use?

The verbose output shows more than double time required for 5k multiplications, and the communication times are also higher, even though the number of rounds is the same. Does this have to do with the amount of transferred data even though it's in the order of a few MBs?

I'd say yes, but I don't know for sure.

* Yao's GC and parallel communication: When running the above snippet with Yao's GC I do not expect any gain in performance. One thing to note is that binary compilation for 5k multiplications explodes unless I decrease `n_parallel` to something like `50`. In this case, the performance is consistently worse than when simply using `@for_range`. Is this expected?

Yes because multiplication in binary circuits is much more involved than in arithmetic circuits.

Vomvas commented 4 years ago

What version, compiler options, and run-time options did you use?

I am using the latest release, `./compile.py -F 64`, and `run-online.sh` with 5 parties and `-lgp 256` at runtime. When I switch to 2 parties, I send and receive in 1 round but still broadcast in 7 rounds, which makes sense I suppose.

mkskeller commented 4 years ago

Indeed, the communication happens separately with every party, hence four times with five parties.

Vomvas commented 4 years ago
* Yao's GC and parallel communication: When running the above snippet with Yao's GC I do not expect any gain in performance. One thing to note is that binary compilation for 5k multiplications explodes unless I decrease `n_parallel` to something like `50`. In this case, the performance is consistently worse than when simply using `@for_range`. Is this expected?

Yes because multiplication in binary circuits is much more involved than in arithmetic circuits.

I am not sure if I was clear before. Multiplication in binary circuits is indeed more expensive than in arithmetic ones. However, in Yao's GC a loop using @for_range_parallel performs worse than the same loop using @for_range. As an example:

nmults = 1000                               # as in the runs discussed below
res = Array(nmults, sbitfix)
a = sbitfix(3.14)
b = sbitfix(3.14)
@for_range(nmults)
# @for_range_parallel(50, nmults)           # This is consistently slower
def _(nmult):
    res[nmult] = a * b

Is this expected? My thought was that parallel loops should have no effect in garbled circuits.

Regards.

mkskeller commented 4 years ago

Do you observe this during compilation or actually running the computation? @for_range_parallel unrolls the loop, which clearly has an impact on compilation, but that shouldn't be the case for the actual execution.

Vomvas commented 4 years ago

Compilation is indeed longer, as discussed, but I am referring to the actual runtime. I am adding artificial latency to the loopback interface to get a better sense of the time difference, but even with no added latency the outputs for the above snippet with nmults=1000 differ.

The most noticeable differences are the XOR time and the count in `Finished after ~ instructions`.

mkskeller commented 3 years ago

Thank you for clarifying this. I found that the reason for this is memory allocation. Using more parallelism results in fewer loop iterations with more work per iteration, and vice versa. If there are more loop iterations, the virtual machine reuses more memory that was allocated before and thus needs to call malloc/free less often. Such calls are relatively expensive, which explains the impact. The bottom line is that parallelism has limited benefit with garbled circuits, as you have mentioned, but it comes at a cost, which is justified when running secret-sharing protocols.
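To make the trade-off concrete, here is a sketch reusing the binary-circuit snippet from earlier in this thread (the chunk size of 50 is the one mentioned above); the comments paraphrase the explanation:

nmults = 1000
res = Array(nmults, sbitfix)
a = sbitfix(3.14)
b = sbitfix(3.14)

# Many small iterations: the virtual machine reuses memory allocated in
# earlier iterations, so malloc/free is called rarely.
@for_range(nmults)
def _(i):
    res[i] = a * b

# Fewer, larger iterations (50 multiplications merged each): more memory must
# be allocated per iteration, and with garbled circuits the saved rounds do
# not make up for the extra allocation cost.
@for_range_parallel(50, nmults)
def _(i):
    res[i] = a * b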

Vomvas commented 3 years ago

Broadcasting 0.000204 MB in 7 rounds, taking 0.190703 seconds

The broadcasting rounds are one-off for setting up.

Hello,

I looked for a description of this setup phase in the online phases of the MASCOT and SPDZ papers but didn't find anything. I assume it has nothing to do with the preprocessing-phase setup; please correct me if I am wrong.

I tried eliminating this by reusing old memory (`-m old`), but it seems unrelated.

Thank you very much!

mkskeller commented 3 years ago

I've had another look, and the broadcasting is only partially for setting up: of the seven rounds, four are broadcasts for the MAC check, and the rest are one-off setup.

Vomvas commented 3 years ago

Thank you, that makes sense; indeed, in an additions-only circuit the 4 broadcasts for the MAC check disappear.