encryptogroup / MOTION

An efficient, user-friendly, modular, and extensible framework for mixed-protocol secure multi-party computation with two or more parties
MIT License
85 stars 40 forks source link

Arithmetic operations in boolean GMW -- expected running time? #22

Open Isweet opened 2 years ago

Isweet commented 2 years ago

Hey,

Apologies for the long issue. We are using MOTION in research that we are getting ready to publish and we don't want to misrepresent MOTION.

I wonder what your experience is with the execution time of MOTION using Boolean GMW to execute arithmetic operations? For example, I wrote the following small benchmark which computes Hamming distance. Am I doing anything incorrectly which would significantly slow things down?

void EvaluateProtocol(mo::PartyPointer& party, std::uint32_t value) {
  mo::SecureUnsignedInteger zero(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 0), 0));
  mo::SecureUnsignedInteger one(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 1), 0));

  std::vector<mo::SecureUnsignedInteger> inputA(value);
  std::vector<mo::SecureUnsignedInteger> inputB(value);

  for (uint32_t i = 0; i < value / 4; i++) {
    inputA[4 * i]     = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 1), 0));
    inputA[4 * i + 1] = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 0), 0));
    inputA[4 * i + 2] = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 1), 0));
    inputA[4 * i + 3] = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 1), 0));
  }

  for (uint32_t i = 0; i < value / 4; i++) {
    inputB[4 * i]     = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 1), 1));
    inputB[4 * i + 1] = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 0), 1));
    inputB[4 * i + 2] = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 1), 1));
    inputB[4 * i + 3] = mo::SecureUnsignedInteger(party->In<mo::MpcProtocol::kBooleanGmw>(mo::ToInput((uint32_t) 0), 1));
  }

  auto distance = zero;

  for (uint32_t i = 0; i < value; i++) {
    distance += (inputA[i] == inputB[i]).Mux(zero.Get(), one.Get());
  }

  // construct an output gate
  auto output = distance.Out();

  // run the protocol
  party->Run();
  party->Finish();

  // retrieve the result
  auto result = output.As<std::uint32_t>();
  // print the result into the terminal
  std::cout << "Distance " << result << std::endl;
}

A rough measurement of this program with two parties (hardcoded), on an input size of 1000, gives an end-to-end execution time of about 8 seconds. I am running a 2019 Intel MacBook Pro w/ 16g of physical RAM. It was run locally, so the network latency is very low (RTT to localhost appears to be about 50 microseconds). I was also careful to ensure that the two MOTION processes did not cause exhaustion of physical RAM.

For a comparison, I ran the same benchmark with an input size of 10000 using EMP's sh2pc library and the elapsed time is about 1 second.

I should mention that I'm aware there are ways to improve this particular benchmark. I'm only using this as a representative example of how we are using MOTION to do arithmetic, etc. We have some benchmarks that will only work with Boolean GMW, and require arithmetic. So, using conversions to Arithmetic GMW and/or using Boolean operations are not an option.

Does this agree with your expectation? Obviously, there are a lot of differences between using MOTION w/ Boolean GMW and EMP w/ Boolean Yao. But, this is a larger difference (8x slower on an input size which is 1 order of magnitude smaller) than I expected.

My understanding of the main performance concerns when comparing these two frameworks are...

  1. Since MOTION supports N parties (and therefore needs to use secret sharing for GMW and BMR), it needs to build up a circuit for optimal communication round complexity. This causes a significant amount of memory allocation / deallocation since each gate and wire object has (at least) 1 - 2 pointers which take up 1 - 2 words of memory. In contrast, EMP uses a streaming version of Yao's protocol in which gates are garbled and send over the wire on-demand. Therefore, EMP's memory requirements are much lower.
  2. Since MOTION uses secret sharing, the communication round complexity is proportional to the depth of the circuit. In contrast, EMP's communication round complexity is constant. Even with streaming gates, communication is unidirectional.

Does that sound right?

Oleksandr-Tkachenko commented 2 years ago

I'll try to answer this more exhaustively later.

A short answer is: A high level of abstraction definitely comes with a cost. That's why we advise to use SIMD operations where possible, which run one operation on many elements, as one gate. For that, you would input vectors instead of single elements and do operations as if on a single element.

GCs are often faster than GMW but that's likely not the main problem here.

What also may help is the use of other memory allocators (e.g., see how MOTION_USE_TCMALLOC is used in the docker script).

Maybe trivial but better double-check that you compiled in Release.

Oleksandr-Tkachenko commented 2 years ago

Did you try to optimize your code @Isweet? If yes, how much faster did it get for you?

Also, I know that the code vectorization is a little annoying because the developer needs to do more than just writing the usual code, so we are working on automating the vectorization. If you're interested, keep an eye on this repo.