How to optimize the communication cost for matrix multiplication

zyl2016 commented 8 months ago

Hello, sorry for disturbing you.

I have observed from #751 that the algorithm in secureML has been implemented. So I wonder how to use the optimized matrix multiplication described in Section IV-A of secureML.

I currently use the * operator for conducting matmul. Here shows the code and the profiling. Source code:

x = sint.Matrix(4, 4)
x.assign_all(5)

y = sint.Matrix(4, 4)
y.assign_all(5)

result = x * y

Profiling:

./compile.py -R 64 matmultest

Default bit length for compilation: 63
Default security parameter for compilation: 40
Compiling file Programs/Source/matmultest.mpc
Writing to Programs/Schedules/matmultest.sch
Writing to Programs/Bytecode/matmultest-0.bc
Hash: 80c59c8a86d03237089eedcae0fc5533d085af8a23db10d730422d689e6eb924
Program requires at most:
          64 integer triples
           1 matrix multiplications (4x4 * 4x4)

Scripts/semi2k.sh -v matmultest

Running /home/MPCtest/MP-SPDZ/Scripts/../semi2k-party.x 0 -v matmultest -pn 13285 -h localhost -N 2
Running /home/MPCtest/MP-SPDZ/Scripts/../semi2k-party.x 1 -v matmultest -pn 13285 -h localhost -N 2
Using statistical security parameter 40
Trying to run 64-bit computation
Compiler: ./compile.py -R 64 matmultest
Spent 0.000159083 seconds (0.001024 MB, 2 rounds) on the online phase and 0.00487164 seconds (0.104848 MB, 41 rounds) on the preprocessing/offline phase.
Communication details:
Exchanging one-to-one 0.032864 MB in 9 rounds, taking 0.00101364 seconds
Receiving directly 0.001024 MB in 1 rounds, taking 6.6053e-05 seconds
Receiving one-to-one 0.071984 MB in 16 rounds, taking 0 seconds
Sending directly 0.001024 MB in 1 rounds, taking 2.7162e-05 seconds
Sending one-to-one 0.071984 MB in 16 rounds, taking 0 seconds
CPU time = 0.00761983
The following benchmarks are including preprocessing (offline phase).
Time = 0.00607781 seconds 
Data sent = 0.105872 MB in ~43 rounds (party 0 only)
Global data sent = 0.211744 MB (all parties)
Actual cost of program:
  Type int
            64        Triples

The data sent here for the online phase is 0.001024 MB for 4 4 matmul 4 4. I also tested out that the data sent for element-wise multiplication is 1.6e-5 MB for the online phase. There are 64 times of data sent for one multiplication, so I think the data sent for matrix multiplication here is linear to the number of multiplication. As described in Section IV-A of secureML, the data sent for matrix multiplication can be optimized to be linear to the shape of the input matrix, which is 16 times of data sent for one multiplication. How can I use this optimized matrix multiplication?

Looking forward to your reply! Thanks!

mkskeller commented 8 months ago

Semi2k doesn't implement the optimization in Section IV-A. The protocol in dealer-party.x does however by using a third party for the preprocessing, so you should see the desired effect there.

zyl2016 commented 8 months ago

Thank you for your response! I tested on the dealer-ring.sh, and it showed the desired effect! Here are some follow-up questions for the tested result below.

Scripts/dealer-ring.sh -v matmultest
Running /home/MPCtest/MP-SPDZ/Scripts/../dealer-ring-party.x 0 -v matmultest -pn 16812 -h localhost -N 3
Running /home/MPCtest/MP-SPDZ/Scripts/../dealer-ring-party.x 1 -v matmultest -pn 16812 -h localhost -N 3
Running /home/MPCtest/MP-SPDZ/Scripts/../dealer-ring-party.x 2 -v matmultest -pn 16812 -h localhost -N 3
Using statistical security parameter 40
Trying to run 64-bit computation
Compiler: ./compile.py -R 64 matmultest
        99 triples of matrix left
Significant amount of unused triples of matrix distorting the benchmark. This protocol has a large minimum batch size, which makes this unavoidable for small programs.
Spent 0.00115767 seconds (0.000256 MB, 3 rounds) on the online phase and 1.1664e-05 seconds on the preprocessing/offline phase.
Communication details:
Receiving directly 0.000256 MB in 1 rounds, taking 0.000116434 seconds
Sending directly 0.000256 MB in 1 rounds, taking 9.5412e-05 seconds
CPU time = 0.00171892
The following benchmarks are including preprocessing (offline phase).
Time = 0.00203508 seconds 
Data sent = 0.000256 MB in ~3 rounds (party 0 only)
Global data sent = 0.077312 MB (all parties)
Actual cost of program:
Command line: /home/MPCtest/MP-SPDZ/Scripts/../dealer-ring-party.x 0 -v matmultest -pn 16812 -h localhost -N 3

What is the meaning of 99 triples of matrix left in the output?
Why there are 3 communication rounds on the online phase instead of 2 shown in the semi2k?

I am also curious about why there is no corresponding implementation in the semi2k. I think the key feature to realize this optimization is on generating Matrix multiplication triples instead of normal multiplication triples. Is it because generating Matrix multiplication will cause more communication cost in offline phase?

mkskeller commented 8 months ago

The default is for the dealer to send 100 matrix triples, but the program only uses one.
The additional round is for the dealer to send the matrix triples

I am also curious about why there is no corresponding implementation in the semi2k. I think the key feature to realize this optimization is on generating Matrix multiplication triples instead of normal multiplication triples. Is it because generating Matrix multiplication will cause more communication cost in offline phase?

Yes. MP-SPDZ is more geared towards running the entire computation, and the cost of the offline phase generally dwarfs the online in Semi2k. That said, Hemi does use this optimization because the offline generation is relatively efficient in homomorphic encryption.

data61 / MP-SPDZ

How to optimize the communication cost for matrix multiplication #1321