data61 / MP-SPDZ

Versatile framework for multi-party computation

Tricks to speed up compilation #1481

Closed enricobottazzi closed 1 month ago

enricobottazzi commented 2 months ago

Hi Marcel,

As part of my circuit, I have this subroutine containing three nested loops (n is set to 100). Note that every value here is public, so there is no communication between the computing parties.

    for k in range(1, n + 1):
        for i in range(n):
            for j in range(n):
                @if_(viability_matrix[i][j] == 1)
                def f():
                    c = a_matrix[j][k] > a_matrix[i][k-1] + cost_matrix[i][j]
                    a_matrix[j][k] = c.if_else(a_matrix[i][k-1] + cost_matrix[i][j], a_matrix[j][k])
                    ep_matrix[j][k] = c.if_else(i, ep_matrix[j][k])
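For reference, here is what the snippet computes, rewritten in plain Python without MPC types (the matrices below are made-up toy data with 3 nodes, not from the actual circuit): a Bellman-Ford-style relaxation where a_matrix[j][k] is the cheapest cost of reaching node j in exactly k steps over edges allowed by viability_matrix, and ep_matrix[j][k] records the predecessor chosen.

```python
# Plain-Python sketch of the MPC snippet above, with toy data.
n = 3
INF = float("inf")
cost_matrix = [[1, 5, 2], [5, 1, 1], [2, 1, 1]]
viability_matrix = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
# column 0: cost 0 at the start node, unreachable elsewhere
a_matrix = [[0 if i == 0 else INF] + [INF] * n for i in range(n)]
ep_matrix = [[-1] * (n + 1) for _ in range(n)]

for k in range(1, n + 1):
    for i in range(n):
        for j in range(n):
            if viability_matrix[i][j] == 1:
                alt = a_matrix[i][k - 1] + cost_matrix[i][j]
                if a_matrix[j][k] > alt:  # the comparison c in the MPC version
                    a_matrix[j][k] = alt
                    ep_matrix[j][k] = i

print([a_matrix[j][n] for j in range(n)])  # cheapest n-step walk costs
```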

Isolating just this component of the circuit, compilation with the flags --edabit -Z 3 -R 64 -l produces a huge number of instructions (I stopped the compilation after the warning: Compilation produced more than 10 million instructions. Consider using './compile.py -l' or replacing for loops with @for_range_opt: https://mp-spdz.readthedocs.io/en/latest/Compiler.html#Compiler.library.for_range_opt)

As soon as I convert the for loops to @for_range_opt, compilation is very fast (it doesn't even reach "Compiled 100000 lines").
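For concreteness, a sketch of that rewrite (assuming the matrices are MP-SPDZ Matrix containers that accept runtime indices, which the @if_ in the original already requires):

```python
# MP-SPDZ compiler DSL, not standalone Python: @for_range_opt turns the
# Python loops into runtime loops that the compiler unrolls only up to
# its budget, so the instruction count no longer grows with n**3.
from Compiler.library import for_range_opt, if_

@for_range_opt(1, n + 1)
def _(k):
    @for_range_opt(n)
    def _(i):
        @for_range_opt(n)
        def _(j):
            @if_(viability_matrix[i][j] == 1)
            def _():
                alt = a_matrix[i][k - 1] + cost_matrix[i][j]
                c = a_matrix[j][k] > alt
                a_matrix[j][k] = c.if_else(alt, a_matrix[j][k])
                ep_matrix[j][k] = c.if_else(i, ep_matrix[j][k])
```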

However, comparing the runtime of the online phase (for example with n=15), I notice a slowdown in execution time.

Do you know if there's any trick to speed up the compilation without sacrificing runtime?

Thanks

aaallami commented 2 months ago

Hi @enricobottazzi, that's a really interesting observation! While this might not be directly related to your original question, I noticed that you have both online and offline time reported. Is there a way to benchmark the performance of both phases simultaneously?

enricobottazzi commented 2 months ago

@aaallami yes, you can do that by passing the -v flag as a runtime argument:
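For example (hypothetical binary and program names; substitute your protocol's party executable and compiled program):

```shell
# Run party 0 of a two-party computation with verbose output; -v prints
# timing and communication statistics for both the offline
# (preprocessing) and the online phase.
./semi2k-party.x -v -N 2 -p 0 my_program
```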

OPTIONS:

-b, --batch-size ARG             Size of preprocessing batches (default: 10000)

-B, --bucket-size ARG            Batch size for sacrifice (3-5, default: 4)

-d, --direct                     Direct communication instead of star-shaped
                                 (only for dishonest-majority protocols)

-D, --disk-memory ARG            Use directory on disk for memory (container
                                 data structures) instead of RAM

-ext-server, --external-server   Use external server. Default is to coordinate
                                 through player 0.

-E, --trunc-error ARG            Probabilistic truncation error (2^-x, default:
                                 40)

-f, --file-prep-per-thread       Preprocessing from files by thread (use with
                                 pipes)

-F, --file-preprocessing         Preprocessing from files

-h, --hostname ARG               Host where Server.x or party 0 is running to
                                 coordinate startup (default: localhost).
                                 Ignored if --ip-file-name is used.

-ip, --ip-file-name ARG          Filename containing list of party ip addresses.
                                 Alternative to --hostname and running Server.x
                                 for startup coordination.

-I, --interactive                Interactive mode in the main thread (default:
                                 disabled)

-IF, --input-file ARG            Prefix for input file path (default:
                                 Player-Data/Input). Text input will be read
                                 from {prefix}-P{id}-{thread_id} and binary
                                 input from {prefix}-Binary-P{id}-{thread_id}

-lg2, --lg2 ARG                  Bit length of GF(2^n) field (default: 128;
                                 options are 4, 5, 6, 7, 8, 9, 10, 11, 12, 14,
                                 15, 16, 28, 40, 63, 64, 128)

-m, --memory ARG                 Where to obtain memory, old|empty (default:
                                 empty)
                                        old: reuse previous memory in
                                 Memory-<type>-P<i>
                                        empty: create new empty memory

-mp, --my-port ARG               Port to listen on (default: port number base +
                                 player number)

-o, --options ARG1[,ARGn]        Further options

-OF, --output-file ARG           Prefix for output file path (default: output to
                                 stdout for party 0 (silent otherwise unless
                                 interactive mode is active). Output will be
                                 written to {prefix}-P{id}-{thread_id}. Use '.'
                                 for stdout on all parties.

-p, --player ARG                 This player's number (required if not given
                                 before program name)

-pn, --portnumbase ARG           Port number base to attempt to start
                                 connections from (default: 5000)

-Q, --bits-from-squares          Compute random bits from squares

-R, --ring ARG                   Number of integer bits (default: 64)

-S, --security ARG               Statistical security parameter (default: 40)

-u, --unencrypted                Unencrypted communication.

-v, --verbose                    Verbose output, in particular more data on
                                 communication
aaallami commented 2 months ago

This is fantastic. Thank you so much for your response!

mkskeller commented 1 month ago

Regarding the original question: the running time of your benchmarks is so low that the noise might be as big as the signal, i.e., the run-to-run differences are as large as the overall time. Do you also observe this in longer benchmarks?

enricobottazzi commented 1 month ago

Hi Marcel, yes, with larger values of n I still observe a slowdown in runtime when using @for_range_opt or @for_range compared to a plain Python for loop, but this comes with a much smaller compiled circuit.

Is this expected behaviour? Or is it possible to get the best of both worlds?

enricobottazzi commented 1 month ago

Here is a more detailed overview of the comparative performance in terms of runtime and compiled circuit size:

[chart: benches]

mkskeller commented 1 month ago

Have you tried adjusting the optimization with for_range_opt (either via --budget or the budget argument to for_range_opt)? The default is 1000, but 100,000 usually still leads to an acceptable compilation time. This lets you adjust the trade-off in the optimization. I would also appreciate a stand-alone example for further analysis.
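A sketch of both knobs (the budget value and program name are illustrative):

```shell
# Global: raise the unrolling budget for all optimized loops at compile time.
./compile.py --budget 100000 --edabit -Z 3 -R 64 -l my_program
# Per-loop alternative: pass the budget directly in the source, e.g.
#   @for_range_opt(n, budget=100000)
```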