Possible race condition in CiphertextTallySelection.elgamal_accumulate()

JohnLCaron commented 3 years ago

An audit of mutability in the Java library turns up one potential trouble spot.

    def elgamal_accumulate(
        self, elgamal_ciphertext: ElGamalCiphertext
    ) -> ElGamalCiphertext:
        """
        Homomorphically add the specified value to the message
        """
        new_value = elgamal_add(self.ciphertext, elgamal_ciphertext)  (1)
        self.ciphertext = new_value                                                         (2)
        return self.ciphertext

This method is called from multiple threads by CiphertextTallyContest._accumulate_selections() and CiphertextTally._execute_accumulate(). As it stands, it is susceptible to a race condition if the thread is interrupted between lines 1 and 2.

Its possible that the current logic of Tally prevents this from happening. I cant completely follow the logic, and its a notoriously hard thing to do in multi-threaded code. But even if so, the danger is that future maintainers might accidentally change the logic without realizing its effect. OTOH, I have not actually tried to generate an error with this existing code. And Im not familiar with python's memory model or concurrency model.

But I think its a problem in my Java port, and im thinking of making the method synchronized, which would solve the problem (and I think does not destroy the parellelism, since synchronizing on a method just uses the object's lock, and separate selections have separate objects (I think)):

    /** Homomorphically add the specified value to the message. */
    private synchronized ElGamal.Ciphertext elgamal_accumulate(ElGamal.Ciphertext elgamal_ciphertext) {
      this.ciphertext_accumulate = ElGamal.elgamal_add(this.ciphertext_accumulate, elgamal_ciphertext);
      return this.ciphertext();
    }

Im guessing you have already thought about this, so Im looking for any thoughts or input.

danwallach commented 3 years ago

Python guarantees single-threaded access without preemption. Which is to say, ElectionGuard-Python mostly never has to worry about it.

If you look at arlo-e2e, which was built for concurrency, it does this by being engineered for a map-reduce sort of parallelism, which would work great with Java streams.

JohnLCaron commented 3 years ago

How do you get parallelization? I assumed thats what Scheduler does?

I have been wondering how this all scales up. Presumably the BigInteger / GMPY code is the bottleneck?

Ill have to have a look at arlo-e2e when I can.

I have used Google Flume a fair amount for massive parellelism inside of Google. Called Cloud DataFlow on Google Cloud I think. I suppose the trick is to make it portable across cloud providers.

danwallach commented 3 years ago

It was a long process to get things humming along at scale. The gist is that I built everything around a map-reduce design (so no mutation), first getting it running with Python list comprehensions, then multiprocessing, and eventually Ray.io (which knows how to manage and autoscale a cluster, supposedly portable across Amazon, Azure, etc, although I only ever used it on Amazon).

danwallach commented 3 years ago

If you care about parallelism, you're going to want to do more than just translate from Python to Java. It's going to require changing the abstractions, and thinking about the level of parallelism you're hoping to achieve for your application.

JohnLCaron commented 3 years ago

Looks like Flume has been open sourced as Apache Beam, and can work across cloud providers. Sounds promising for me, since I am already familiar with that.

What is the roadmap for electionguard scalability and cloud enabling? Should I start another issue, or is there already a forum for discussing?

keithrfung commented 3 years ago

@JohnLCaron As @danwallach put it, this is not an issue currently in python as it currently stands thus I'm going to close this issue.

As for roadmap, we have been looking internally at changes to python to meet scaling as arlo does. The group is working on a roadmap that will show up under the ElectionGuard (https://www.electionguard.vote/ aka github.com/microsoft/electionguard).

Regarding the forum for discussing, we recently opened up the discussions section specifically so we can start conversations like this. I know Dan would probably love to have a discourse about this. I started one for you

JohnLCaron commented 3 years ago

If its not a problem in python, Im good with closing. For my own understanding, could you explain what Scheduler is being used for in Tally?

It looks like its being called with _with_sharedresource = false, which apparently means use the process_pool instead of the thread_pool. Using a global lock? Does that still give you parallel computation?

If called _with_sharedresources = true is it still synchronous? If so, what does it mean to be multi-threaded?

Election-Tech-Initiative / electionguard-python

Possible race condition in CiphertextTallySelection.elgamal_accumulate() #273