data61 / MP-SPDZ

Versatile framework for multi-party computation

Outsourced computation runs much more slowly than non-outsourced #823

Closed. FriendlyN closed this issue 1 year ago.

FriendlyN commented 1 year ago

I have observed a major slowdown over “standard” MPC when performing a computation in the outsourced model, where clients secret share to workers and have them perform a computation on their behalf.

Timing various parts of the .mpc program and client code, via manual instrumentation and FlameGraph/perf, suggests that the bulk of the overhead is coming from client input sharing to workers.

Concretely, for the outsourced program (listed below as outsourced_input_sharing.mpc), the following is output by the workers:

Using security parameter 40
Start listening on thread 140142870247168
Party 1 is listening on port 14001 for external client connections.
Thread 140142870247168 found server.
Party 1 received external client connection from client id: 2
Thread 140142870247168 found server.
Party 1 received external client connection from client id: 0
Thread 140142870247168 found server.
Party 1 received external client connection from client id: 1
The following benchmarks are including preprocessing (offline phase).
Time = 18.2836 seconds
Data sent = 6.40128 MB in ~100 rounds (party 1)
Global data sent = 25.6038 MB (all parties)

And for the non-outsourced program (listed below as non_outsourced_input_sharing.mpc), where inputs were shared by workers and obtained from file (Player-Data/Input-P-0), the following was the output:

Using security parameter 40
The following benchmarks are including preprocessing (offline phase).
Time = 0.0626709 seconds
Data sent = 1.04858 MB in ~1 rounds (party 1)
Global data sent = 4.1943 MB (all parties)

Both experiments used the same number of workers (n=3) and protocol (Shamir), and the outsourced example used 3 clients. The client code is written in Python and adapted from the Banker’s Bonus example. (With client code written in C++, the worker time was ~10 seconds, also much slower than the non-outsourced case.)

All inputs are obtained as a matrix (Tensor) with size 256x256. I additionally started the clients before the workers to minimize any time that could be attributed to waiting for connections.

While there is certainly some latency between clients and workers and more data in the outsourced setting (due to the client input sharing semantics), they don’t seem able to account for the massive time differences in input sharing steps. Further it appears that client input sharing happens sequentially (one party at a time, based on client logging), while non-outsourced input sharing happens concurrently.

Do you know what might be causing the client input sharing step to take so much more time (both overall and relative to the non-outsourced setting)? From working through the networking code, it seems that client input sharing happens on the main thread within the Processor, whereas non-outsourced programs have separate threads via CryptoPlayer for handling socket I/O. Could the lack of dedicated I/O threads be the primary source of slowdown here?

Thanks!

Programs:

non_outsourced_input_sharing.mpc:

SIZE = int(program.args[1])

m1 = sint.input_tensor_from(0, [SIZE,SIZE])
m2 = sint.input_tensor_from(1, [SIZE,SIZE])
m3 = sint.input_tensor_from(2, [SIZE,SIZE])

outsourced_input_sharing.mpc:

from Compiler.types import sint, regint, Array, SubMultiArray
from Compiler.library import print_ln, do_while, for_range

PORTNUM = 14000
CLIENTS = int(program.args[1])
SIDELEN = int(program.args[2])

def accept_client(port):
    client_id = accept_client_connection(port)
    expected_clients = regint.read_from_socket(client_id)
    return client_id, expected_clients

def close_clients(client_ids):
    @for_range(CLIENTS)
    def _(i):
        closeclientconnection(client_ids[i])

def wait_for_clients(port):
    listen_for_clients(port)
    client_ids = Array(CLIENTS, regint)

    @for_range(CLIENTS)
    def _(index):
        client_id, expected_clients = accept_client(port)
        client_ids[index] = client_id

    return client_ids

def get_client_inputs(client_ids):
    client_inputs = sint.Tensor([CLIENTS, SIDELEN, SIDELEN])
    @for_range(CLIENTS)
    def _(i):
        client_inputs[i] = sint.input_tensor_from_client(client_ids[i], [SIDELEN,SIDELEN])
    return client_inputs

def main():
    client_ids = wait_for_clients(PORTNUM)
    client_inputs = get_client_inputs(client_ids)

main()

mkskeller commented 1 year ago

This is because the code doesn't use normal secret sharing but the multiplication-based input protocol in https://eprint.iacr.org/2015/1006 (Figure 2). It wouldn't technically be necessary for semi-honest protocols, but always using it simplifies the client code.
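
For intuition, here is a minimal, self-contained sketch of that triple-based input protocol (plain Python, not MP-SPDZ code; the additive sharing, modulus, and party count are illustrative assumptions rather than what the framework actually implements):

# Toy illustration of the client input protocol in Figure 2 of
# https://eprint.iacr.org/2015/1006, over additive sharing mod p.
import random

p = 2**127 - 1   # toy prime modulus, standing in for the 128-bit default
N = 3            # number of workers

def share(x):
    # additively share x among N workers
    shares = [random.randrange(p) for _ in range(N - 1)]
    shares.append((x - sum(shares)) % p)
    return shares

def reconstruct(shares):
    return sum(shares) % p

# Offline: workers hold shares of a random triple (a, b, c) with c = a*b.
a, b = random.randrange(p), random.randrange(p)
a_sh, b_sh, c_sh = share(a), share(b), share(a * b % p)

# Workers -> client: each worker sends its three triple shares
# (three field elements per input element).
a_rec, b_rec, c_rec = reconstruct(a_sh), reconstruct(b_sh), reconstruct(c_sh)
assert a_rec * b_rec % p == c_rec, "inconsistent triple"

# Client -> workers: the input masked with a (one field element per input element).
x = 42                  # the client's secret input
y = (x - a_rec) % p     # masked input, safe to send in the clear

# Workers locally convert their shares of a into shares of x
# (here one worker adds the public value y).
x_sh = [(y + a_sh[0]) % p] + a_sh[1:]
assert reconstruct(x_sh) == x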

FriendlyN commented 1 year ago

Thank you for the information.

I see that the secret sharing scheme from MP-SPDZ’s confidential benchmarking paper uses multiplication triples obtained from the worker nodes. Since these triples can be generated in the offline phase, does the -DINSECURE compilation flag affect when/how these triples get generated in the outsourced setting?

RHG101997 commented 1 year ago

> Thank you for the information.
>
> I see that the secret sharing scheme from MP-SPDZ’s confidential benchmarking paper uses multiplication triples obtained from the worker nodes. Since these triples can be generated in the offline phase, does the -DINSECURE compilation flag affect when/how these triples get generated in the outsourced setting?

I would wait for @mkskeller's answer just to be sure, but the -DINSECURE flag helps mainly by letting you re-use triples while running the computation. That way, you don't need to stop the computation to generate them.

Therefore I think it will have an impact, since the triples used for the outsourcing come from the same pool as those used for other operations.

Reference:

mkskeller commented 1 year ago

-DINSECURE allows reusing triples when reading them from files with the -F option. Without -F, I don't think there is any change.

FriendlyN commented 1 year ago

Thank you both for the clarification.

Is it possible to separate the offline and online time reporting for the outsourced MPC execution? If not, would the best approach to separating these times be to separately generate the triples needed for outsourced secret sharing and then run the online step with the -DINSECURE compilation flag and the -F flag on the machine binary?

Thanks again!

mkskeller commented 1 year ago

This is currently only possible using -F.

FriendlyN commented 1 year ago

When using triples generated from Fake-Offline.x with the -F flag, the outsourced input sharing of 256x256 inputs per client (3 clients, 3 workers) still takes roughly 15.8s, if I am reading the output from Party 1 correctly:

Using security parameter 40
…
The following benchmarks are excluding preprocessing (offline phase).
Time = 15.7894 seconds
Data sent = 0 MB in ~0 rounds (party 1)
Global data sent = 0 MB (all parties)

This means that 15.8s of the 18.3s in the prior example are in the online phase only.

When I scale up the number of client inputs to 512x512 per client (same setting, same flags/preprocessing), the input sharing step takes almost a minute:

Using security parameter 40
…
The following benchmarks are excluding preprocessing (offline phase).
Time = 53.5112 seconds
Data sent = 0 MB in ~0 rounds (party 1)
Global data sent = 0 MB (all parties)

In comparison, even adding a matmult and reconstruction to the non-outsourced setting (3 workers) doesn’t bring the total time above 2 seconds:

Using security parameter 40
The following benchmarks are including preprocessing (offline phase).
Time = 1.18061 seconds
Data sent = 12.5829 MB in ~4 rounds (party 1)
Global data sent = 46.1373 MB (all parties)

But a non-outsourced matmult + reconstruction should have more communication and computation than just outsourced input sharing when triples are used from file.

Profiling the Python send_private_inputs function suggests a non-negligible amount of time is being spent in socket I/O and serialization rather than in the input masking and triple relation check:

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.061    0.061   22.547   22.547 <string>:1(<module>)
  2359296    1.181    0.000    1.363    0.000 client.py:112(consume)
        1    2.182    2.182   19.449   19.449 client.py:32(receive_triples)
        1    0.137    0.137    0.137    0.137 client.py:33(<listcomp>)
        1    0.111    0.111   22.486   22.486 client.py:50(send_private_inputs)
        2    0.000    0.000    0.000    0.000 client.py:65(__init__)
        3    0.000    0.000    2.382    0.794 client.py:75(Send)
        3    3.537    1.179    8.322    2.774 client.py:79(Receive)
   262144    0.106    0.000    0.207    0.000 domains.py:15(__mul__)
   262144    0.032    0.000    0.032    0.000 domains.py:24(__eq__)
  2359296    1.192    0.000    2.913    0.000 domains.py:30(unpack)
   262144    0.091    0.000    0.120    0.000 domains.py:33(pack)
  5505024    2.622    0.000    2.622    0.000 domains.py:4(__init__)
  2359296    1.628    0.000    4.541    0.000 domains.py:54(unpack)
   262144    0.155    0.000    0.375    0.000 domains.py:58(pack)
  2621440    1.776    0.000    3.017    0.000 domains.py:8(__add__)
     4626    0.001    0.000    0.001    0.000 ssl.py:1109(_checkClosed)
     2307    0.005    0.000    4.779    0.002 ssl.py:1121(read)
        6    0.000    0.000    2.382    0.397 ssl.py:1199(send)
        6    0.000    0.000    2.382    0.397 ssl.py:1226(sendall)
     2307    0.005    0.000    4.784    0.002 ssl.py:1252(recv)
        3    0.000    0.000    0.000    0.000 {built-in method _struct.pack}
        3    0.000    0.000    0.000    0.000 {built-in method _struct.unpack}
        1    0.000    0.000   22.547   22.547 {built-in method builtins.exec}
  2363919    0.183    0.000    0.183    0.000 {built-in method builtins.len}
  2359296    0.358    0.000    0.358    0.000 {built-in method from_bytes}
       12    0.000    0.000    0.000    0.000 {method '__exit__' of 'memoryview' objects}
        6    0.000    0.000    0.000    0.000 {method 'cast' of 'memoryview' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
     2307    4.774    0.002    4.774    0.002 {method 'read' of '_ssl._SSLSocket' objects}
   262144    0.028    0.000    0.028    0.000 {method 'to_bytes' of 'int' objects}
        6    2.382    0.397    2.382    0.397 {method 'write' of '_ssl._SSLSocket' objects}

The above profiling is just for the first client to finish input sharing; the last client to finish input sharing reported ~60s in send_private_inputs.
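
(For reference, a profile like the one above can be collected with Python's standard cProfile module; send_inputs below is just a locally defined stand-in for whatever call drives send_private_inputs in the client script.)

# Collecting and printing a cProfile trace (standard library only).
import cProfile
import pstats

def send_inputs():
    # stand-in workload for the client's input-sharing call
    return sum(i * i for i in range(512 * 512))

cProfile.run("send_inputs()", "client_profile.out")
pstats.Stats("client_profile.out").sort_stats("cumulative").print_stats(20)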

And even using the C++ client, the overall input sharing step for 512x512 inputs (3 clients, 3 workers) takes ~28s with -F on:

Using security parameter 40
…
The following benchmarks are excluding preprocessing (offline phase).
Time = 27.8092 seconds
Data sent = 0 MB in ~0 rounds (party 1)
Global data sent = 0 MB (all parties)

It seems like the discrepancies in outsourced vs non-outsourced input sharing times are likely not coming from a long offline step, from differences in the amount of data sent among workers, or from the computation done (either by workers or on clients).

I’m not sure the time differences in the above programs can be attributed to different MPC semantics. Are there implementation differences between the client and non-client code paths (like the ones I mentioned earlier) that could be causing the slowness in client input sharing?

Thanks for your help!

mkskeller commented 1 year ago

Thanks for providing this detailed account. Which protocol are you using? The matrix multiplication can be quite efficient in some protocols. I suggest using a vector multiplication instead to get an idea: sint(0, size=512*512) ** 2. The fact that the time increases client by client is indeed linked to the fact that they are processed sequentially. Lastly, using -v when invoking the computation parties gives more details about the communication time, which might be helpful.
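
A minimal .mpc sketch along those lines might look like the following (the hard-coded size is just a placeholder):

# vector multiplication benchmark as suggested above: one element-wise
# multiplication of a secret vector of length 512*512
n = 512 * 512
x = sint(0, size=n)
y = x ** 2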

FriendlyN commented 1 year ago

The matrix multiplication example was just to highlight that even additional computation and communication in the non-outsourced setting does not approach the time required for only input sharing in the outsourced setting. All of the other examples are just the input sharing step for either the outsourced or non-outsourced setting. All examples use Shamir.

The matrix multiplication time suggests that the additional communication complexity of the client input sharing is likely not the source of the slowdown – otherwise the matrix multiplication (which has comparable communication) would see a major slowdown over the non-outsourced input sharing.

The matrix multiplication time also suggests that the worker side manipulation of client-generated secret shares is also not the source of slowdown – again, otherwise the matrix multiplication would be much slower than either of the experiments of just input sharing steps.

With the worker-side MPC steps and communication among workers controlled for, the only other sources of slowdown seem to be isolated to either the local client computation or the communication between clients and workers.

The Python profiling indicates that a non-trivial amount of time is spent in socket I/O and some time is spent in triple reconstruction and input masking. If the slowdowns were just by virtue of Python’s interpreter, then one would expect the times to drop when using the C++ client. They do indeed decrease with the C++ client, but not down to the ~1-2 seconds that the non-outsourced computation takes.

Moreover, the client input sharing time seems to grow much faster with input size than the resharing time of the non-outsourced case (being careful not to compare to the non-outsourced input sharing step, since that is only a PRG seed communication).

Of the above options, local client computation (triple reconstruction/checking and input masking) seems unlikely to take substantially longer as input size grows, since the math backend is already fairly efficient and consistent between C++ clients and workers.

The socket I/O seems to strangely take more time between client and workers than among workers themselves based on the Python profiling, and the networking implementation is different (I think) between client-worker connections and worker-worker connections. My guess is that the issue is somewhere here, but I was wondering if you knew of any immediate leads or had a hunch on what might otherwise be causing the slowdown in client input sharing.

Thanks!

mkskeller commented 1 year ago

It is certainly true that intra-worker communication is more optimized than worker-client communication. In particular, the clients are processed sequentially whereas intra-worker communication is done in parallel. I'm happy to have a closer look if you post the client-side code.

FriendlyN commented 1 year ago

Thank you. Here is the client code: Archive.zip

mkskeller commented 1 year ago

Thank you for providing this. There is a performance bug, which should be fixed in 7266f3b877d9ec6b2e59397bc24c566ed8263726.

FriendlyN commented 1 year ago

Thank you for the optimization.

It seems to have somewhat improved the outsourced input sharing times for Shamir, but not enough to bring the execution times down to those of comparable non-outsourced operations.

Here is a table of runtimes in seconds vs. client input size. The setting is the same as in the prior experiments (3 workers, 3 clients, preprocessing from file using -F). I also timed the same programs using MASCOT as the worker machine as a control, since it was not affected by the optimization:

| Protocol / Input Size | 256x256 | 512x512 |
| --- | --- | --- |
| Shamir | 12.0089 seconds | 42.9197 seconds |
| MASCOT | 11.5524 seconds | 43.5793 seconds |

The client input sharing times are roughly equal across the two protocols, but the input sharing step still grows roughly linearly with the size of the input. It seems that the source of the slowdown in client-worker communications is not specific to the Shamir protocol.

Would it be feasible for someone without significant knowledge of MP-SPDZ internals to port the worker-worker network implementation to the client-worker setting, or are there design issues that would make that challenging to do without breaking other things?

Thanks again!

mkskeller commented 1 year ago

There are probably a few issues slowing things down:

FriendlyN commented 1 year ago

  1. The slow times reported above (e.g., ~43 seconds to secret-share a single 512x512 input matrix) are for the first client, which shouldn't be affected by the sequential processing of the later clients.

  2. The times reported above are for the online phase only, so the extra cost from the client input protocol should only come from the communication and use (not the generation) of the triples for input sharing. However, the time to communicate and even multiply similar or larger data sizes to these triples in the non-outsourced setting is much faster (e.g., ~1 second).

  3. Thus, it seems that the implementation of the actual communication (socket I/O, serialization, etc.) is causing the (outsourced) client-worker communication to take more than an order of magnitude longer than (non-outsourced) worker-worker communication. Is there a way to address this issue?

mkskeller commented 1 year ago

Socket I/O and serialization are the same for the C++ client and the workers, so I cannot imagine this being the issue. What setup are you using? I can run both MASCOT (without preprocessing) and Shamir with 1000x1000 input in five seconds or less on a single machine.

mkskeller commented 1 year ago

Reading the thread again, I want to clarify whether you're concerned with the performance of the C++ client or the Python client. The latter is clearly much slower as it's not optimized for speed.

FriendlyN commented 1 year ago

The Python client will indeed be inherently slower. However, the C++ client still took ~28s to secret share 512x512 inputs in the 3 client, 3 worker setting. Since both clients were significantly slower than any non-client communication, I expected there to be a common source of slowdown for both Python and C++ clients.

I am using 3 distinct machines on LAN as workers (in all settings) and a single machine (WAN) as all 3 clients. The latency between the client machine and worker machines is ~19ms (RTT) on average.

mkskeller commented 1 year ago

Have you tried running the clients and workers in the same LAN, or splitting the workers over the WAN? I'm not sure it makes sense to compare LAN-only computation with WAN client input.

FriendlyN commented 1 year ago

Thank you for your help!

Moving all clients to the same LAN as the workers brought the times for client input sharing of the 256x256 and 512x512 matrices down to about 1.3s overall.

While the LAN latency is negligible (~1ms) and the WAN latency is not (~39ms), it seems surprising that latency impacted throughput that severely. If there is only one VM round for client input sharing (minus preprocessing), why is latency so impactful?

The bandwidth (measured via iperf) between the WAN and LAN machines is 35.1 Mbits/sec (roughly 4.4 MB/s), so that doesn’t seem like it should be an issue for the input sharing sizes.

Is there a way to see data sent metrics for the client input sharing step?

Thanks again.

mkskeller commented 1 year ago

4e97f203727f9cc5ddd102d2bfa7d535b0650eb9 outputs detailed client communication statistics with option -v. You should find that inputting 512x512 matrices requires 16.7 MB of communication per client and worker, so 150 MB overall. At 4 MB/s, this comes to over 30 seconds, which seems to be in line with your observations.
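
As a rough sanity check of those figures (assuming 16-byte field elements for the 128-bit default prime, three triple shares per element from worker to client plus one masked value back, over every client-worker pair):

# back-of-the-envelope communication estimate for sharing one 512x512 matrix
elements = 512 * 512
element_bytes = 128 // 8                        # 16 bytes per field element
per_pair = elements * element_bytes * (3 + 1)   # triple shares + masked input
pairs = 3 * 3                                   # 3 clients x 3 workers
total = per_pair * pairs
print(per_pair / 1e6, total / 1e6, total / 4e6)
# -> roughly 16.8 MB per pair, ~151 MB overall, ~38 s at 4 MB/s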

FriendlyN commented 1 year ago

I was previously assuming the use of a 64-bit prime when a prime is not specified, but that may not be the case.

With a 64-bit prime, I would expect roughly the following communication:

512x512 elements × 64 bits per element / 8 bits per byte / 1024 bytes per KB / 1024 KB per MB = 2 MB
Worker -> Client: 3 shares of triples × 2 MB = 6 MB
Client -> Worker: 2 MB of masked inputs = 2 MB
Total comm per client-worker input sharing = 8 MB
3 of the above = 24 MB
24 MB / 4 MB/s = 6 s

Doubling the prime bit length should roughly only double the above, though, which does not bring us up to the reported time of ~28s. Another factor of two would, however…

The default-prime-machine.x outputs 128 bits as the default prime length. Is this the prime length used when none is specified at compilation, and are MACs (of equivalent bit length) also being sent, even for non-maliciously secure protocols, in the outsourced case?

Thanks again

mkskeller commented 1 year ago
  1. You're missing that the communication happens between pairs of workers and clients, so it would be 98 MB, not 38 MB.
  2. The default is indeed 128 bits. You can change this with the -lgp option for the virtual machines.

FriendlyN commented 1 year ago

For the above analysis, I was only considering the first client (since every other client should follow the same analysis but sequentially and therefore a multiple of the above time).

In your response to my original analysis, you used a 9x multiplier rather than a 3x multiplier. The 9x is applicable only when considering 3 clients and 3 workers (all at once) rather than 1 client and 3 workers, correct? Revising my math above to use a 128-bit prime only doubles the communication, which doesn’t make the times consistent with the times recorded when the clients were on the WAN and the workers on the LAN. Are 128-bit MACs additionally being sent with the shares of triples and masked client shares?

Thanks for your time and patience!

mkskeller commented 1 year ago

All your prior messages were relating to three clients. No MACs are being sent.

FriendlyN commented 1 year ago

Sorry for any confusion. The overall setup is 3 clients and 3 workers, but as mentioned above (https://github.com/data61/MP-SPDZ/issues/823#issuecomment-1424243150 and https://github.com/data61/MP-SPDZ/issues/823#issuecomment-1440786438) the client times I reported (e.g., ~28s for the C++ client) are for the first of the 3 clients; the later clients take longer due to the sequential sharing.

I ran an execution with the -v option as you suggested, and the ~16 MB per client-worker pair it reports is consistent with the above analysis given the 128-bit prime.

Although a runtime of 28s does not seem to follow from that amount of communication at the measured bandwidth, the bandwidth is the only term that can vary, so the 4 MB/s number might be inconsistent or inaccurate.

Thanks for all your help!