Closed Vomvas closed 3 years ago
Hello,

I have a few questions regarding the implementation and performance of multiplications in MP-SPDZ. I am considering `sint` types for simplicity, and unless otherwise stated I am using MASCOT with preprocessed data.

In the following snippet, `c` is a constant known by all parties, not a secret-shared input, yet it is treated like one when it comes to multiplication efficiency. My guess is that the compiler has no way to tell the two apart and optimize the operation as if `c` were a known constant, so it is up to me to declare it as a clear-type value. Is this the purpose of the design?
Indeed it is. The compiler will never replace types.
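As a minimal sketch of the difference (illustrative only, not taken from the issue; names and values are made up), multiplying an `sint` by a clear `cint` constant is a local operation on the shares, whereas wrapping the same constant in `sint` forces a full Beaver-triple multiplication:

```python
# Illustrative MP-SPDZ snippet (hypothetical file const_mult.mpc)
a = sint(5)          # secret value
c_clear = cint(7)    # constant kept as a clear type
c_secret = sint(7)   # the same constant wrapped as a secret type

x = a * c_clear      # local multiplication of shares, no triple consumed
y = a * c_secret     # full secret multiplication, consumes one triple

print_ln('x=%s y=%s', x.reveal(), y.reveal())
```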
My understanding is that `@for_range_parallel(n_parallel, n_loops)` tries to parallelize the communication required for the multiplications in the loop, up to a given `n_parallel`. The compilation time increases noticeably as I increase `n_parallel`; is this in order to allocate the increased resources required for preprocessed data?
No, this is about unrolling and optimizing. The compiler will create bytecode that is linear in `n_parallel`.
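A sketch of what that unrolling means in practice (illustrative only; this is not the snippet from the question, and the sizes are made up): the loop body below is emitted `n_parallel` times per virtual-machine-level iteration, so bytecode size and compilation time grow with `n_parallel`, while the number of triples depends only on the total number of multiplications.

```python
n_loops = 1000
res = sint.Array(n_loops)
a = sint(2)
b = sint(3)

# The compiler unrolls the body 100 times per iteration of the residual loop
# and merges the openings of those 100 multiplications into one round.
@for_range_parallel(100, n_loops)
def _(i):
    res[i] = a * b
```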
In the following example I expect the multiplications to be executed using 1 round of communication, whether it's `n_mults=1000` or `n_mults=5000`. During compilation I see `Program requires: 1000 integer triples, 1 virtual machine rounds`, but when I run the program, the output (please see below) shows multiple rounds of communication.
The number of virtual machine rounds is different from the number of communication rounds. The broadcasting rounds are a one-off for setting up. The fact that both parties send and receive separately is the default star-shaped communication; you can change this with `mascot-party.x -d`. Finally, when I run your example, I get only 1 round of sending/receiving. What version, compiler options, and run-time options did you use?
The verbose output shows more than double the time required for 5k multiplications, and the communication times are also higher, even though the number of rounds is the same. Does this have to do with the amount of transferred data, even though it is on the order of a few MB?
I'd say yes, but I don't know for sure.
* Yao's GC and parallel communication: When running the above snippet with Yao's GC I do not expect any gain in performance. One thing to note is that binary compilation for 5k multiplications explodes unless I decrease `n_parallel` to something like `50`. In this case, the performance is consistently worse than when simply using `@for_range`. Is this expected?
Yes because multiplication in binary circuits is much more involved than in arithmetic circuits.
What version, compiler options, and run-time options did you use?
I am using the latest release, `./compile.py -F 64`, and `run-online.sh` with 5 parties and `-lgp 256` for the runtime. When I switch to 2 parties, I send and receive in 1 round, but still broadcast in 7 rounds, which makes sense I suppose.
Indeed, the communication happens separately with every party, hence four times with five parties.
* Yao's GC and parallel communication: When running the above snippet with Yao's GC I do not expect any gain in performance. One thing to note is that binary compilation for 5k multiplications explodes unless I decrease `n_parallel` to something like `50`. In this case, the performance is consistently worse than when simply using `@for_range`. Is this expected?
Yes because multiplication in binary circuits is much more involved than in arithmetic circuits.
I am not sure if I was clear before. Multiplication in binary circuits is indeed more expensive compared to arithmetic ones. However, running a loop in Yao's GC using `@for_range_parallel` performs worse than running the same loop in Yao's GC using `@for_range`. As an example:
```python
nmults = 1000  # e.g. 1000, as in the measurements below
res = Array(nmults, sbitfix)
a = sbitfix(3.14)
b = sbitfix(3.14)

@for_range(nmults)
# @for_range_parallel(50, nmults)  # This is consistently slower
def _(nmult):
    res[nmult] = a * b
```
Is this expected? My thought was that parallel loops should have no effect in garbled circuits.
Regards.
Do you observe this during compilation or when actually running the computation? `@for_range_parallel` unrolls the loop, which clearly has an impact on compilation, but that shouldn't be the case for the actual execution.
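One way to separate the two at run time (a sketch, not taken from the thread; the timer number is arbitrary) is to wrap the loop in a dedicated virtual-machine timer via `start_timer`/`stop_timer`, so the loop's execution time is reported separately from the default timer 0:

```python
nmults = 1000
res = Array(nmults, sbitfix)
a = sbitfix(3.14)
b = sbitfix(3.14)

start_timer(1)   # reported separately from timer 0 in the run-time output

@for_range_parallel(50, nmults)
def _(i):
    res[i] = a * b

stop_timer(1)
```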
Compilation is indeed longer, as discussed, but I am referring to the actual runtime. I am adding artificial latency to the loopback interface to get a better sense of the time difference, but even with no added latency, for the above snippet with `nmults=1000` I get:
Using `@for_range`:

```
PLAYERS="2" time Scripts/yao.sh multiplication_time
2 players
Player 0 is running on machine 127.0.0.1
Player 1 is running on machine 127.0.0.1
2 players
Player 0 is running on machine 127.0.0.1
Player 1 is running on machine 127.0.0.1
Compiler: ./compile.py -B 64 multiplication_time
Compiler: ./compile.py -B 64 multiplication_time
Number of AND gates: Receiving one-to-one 0.004224 MB in 1 rounds, taking 0.00925271 seconds
Sending one-to-one 3.6e-05 MB in 1 rounds, taking 6.575e-05 seconds
Receiving took 0.00925284 seconds
Stored 0 GB in 0 seconds and retrieved them in 0 seconds
Stored 0 GB in 0 seconds and retrieved them in 0 seconds
4131000
Receiving one-to-one 3.6e-05 MB in 1 rounds, taking 0.000329444 seconds
Sending one-to-one 0.004224 MB in 1 rounds, taking 5.9997e-05 seconds
Receiving took 0.000329632 seconds
Sending directly 132.192 MB in 16 rounds, taking 0.050451 seconds
XOR time: 0.0829575
Finished after 183010 instructions
Receiving directly 132.192 MB in 16 rounds, taking 0.0600376 seconds
Receiving took 0.0600467 seconds
XOR time: 0.0740987
Finished after 183010 instructions
Sending directly 132.192 MB
Time = 0.396507 seconds
Data sent = 132.196 MB
YaoGarbleWire timer 0 at end: 0.412916 seconds
Receiving directly 132.192 MB
Time = 0.386493 seconds
Data sent = 132.196 MB
YaoEvalWire timer 0 at end: 0.412526 seconds
0.56user 0.35system 0:00.96elapsed 94%CPU (0avgtext+0avgdata 58232maxresident)k
0inputs+0outputs (0major+71981minor)pagefaults 0swaps
```
Using `@for_range_parallel(50, nmults)`:

```
PLAYERS="2" time Scripts/yao.sh multiplication_time
2 players
Player 0 is running on machine 127.0.0.1
Player 1 is running on machine 127.0.0.1
2 players
Player 0 is running on machine 127.0.0.1
Player 1 is running on machine 127.0.0.1
Compiler: ./compile.py -B 64 multiplication_time
Compiler: ./compile.py -B 64 multiplication_time
Number of AND gates: Receiving one-to-one 0.004224 MB in 1 rounds, taking 0.00544337 seconds
Sending one-to-one 3.6e-05 MB in 1 rounds, taking 2.7097e-05 seconds
Receiving took 0.00544355 seconds
Stored 0 GB in 0 seconds and retrieved them in 0 seconds
Stored 0 GB in 0 seconds and retrieved them in 0 seconds
4131000
Receiving one-to-one 3.6e-05 MB in 1 rounds, taking 0.000531661 seconds
Sending one-to-one 0.004224 MB in 1 rounds, taking 8.3552e-05 seconds
Receiving took 0.000531841 seconds
Sending directly 132.192 MB in 14 rounds, taking 0.0521188 seconds
XOR time: 0.29273
Finished after 93870 instructions
Receiving directly 132.192 MB in 14 rounds, taking 0.0595208 seconds
Receiving took 0.0595286 seconds
XOR time: 0.270146
Finished after 93870 instructions
Receiving directly 132.192 MB
Time = 0.738048 seconds
Data sent = 132.196 MB
Sending directly 132.192 MB
Time = 0.746809 seconds
Data sent = 132.196 MB
YaoGarbleWire timer 0 at end: 0.781698 seconds
YaoEvalWire timer 0 at end: 0.780484 seconds
1.74user 0.30system 0:01.33elapsed 153%CPU (0avgtext+0avgdata 68484maxresident)k
0inputs+0outputs (0major+82194minor)pagefaults 0swaps
```
The most noticeable differences are in `XOR time` and in `Finished after ~ instructions`.
Thank you for clarifying this. I found that the reason for this is memory allocation. Using more parallelism results in fewer loop iterations with more work per iteration, and vice versa. If there are more loop iterations, the virtual machine reuses more memory that was allocated before and thus needs to call malloc/free less often. Such calls are relatively expensive, which explains the impact. The bottom line is that parallelism has limited benefit with garbled circuits, as you have mentioned, but it comes at a cost, which is justified when running secret-sharing protocols.
Broadcasting 0.000204 MB in 7 rounds, taking 0.190703 seconds
The broadcasting rounds are one-off for setting up.
Hello,
I looked for a description of this setup phase in the online phases of the MASCOT and SPDZ papers but didn't find anything. I assume it has nothing to do with the preprocessing-phase setup; please correct me if I am wrong. I tried eliminating it by reusing old memory (`-m old`), but that seems unrelated.
Thank you very much!
I've had another look, and the broadcasting is only partially for setting up. The seven rounds are composed as follows:
Thank you, that makes sense; indeed, in an additions-only circuit the 4 broadcasts for the MAC check disappear.
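For reference, an additions-only variant of the earlier benchmark could look like the sketch below (illustrative only, not taken from the issue): additions are local operations on the shares, so they consume no triples and require no openings of masked values, which is what removes the corresponding MAC-check broadcasts.

```python
n = 1000
res = sint.Array(n)
a = sint(1)
b = sint(2)

@for_range(n)
def _(i):
    res[i] = a + b   # local addition of shares: no triples, no communication

print_ln('%s', res[0].reveal())  # only the final output is opened and MAC-checked
```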