data61 / MP-SPDZ

Versatile framework for multi-party computation

Custom preprocessing #1305

Closed waamm closed 3 hours ago

waamm commented 4 months ago

Hello!

In order to properly benchmark the preprocessing and online phases of a certain comparison protocol for linear secret sharing scheme-based MPC protocols (see old draft here), I believe we need to customise some preprocessing code, but I have not been able to find out from the current documentation how to do that.

For example, it would be very useful to know how to produce, during preprocessing, triples ([a], [b], [ab]) where [a] and [b] are both bits in the arithmetic domain.

Judging by this line, it seems that I would need to create a new function in Fake-Offline.cpp to generate such triples. But I suspect I also need to make the compiler understand this new "data type"/"instruction"; any suggestion where to make the required edits?

Thank you for your help!

mkskeller commented 4 months ago

I can see two easy ways of achieving this:

  1. Compute the triples in the online phase and separate out the benchmarking using timers (see the sketch after this list).
  2. Just change the fake offline generation of triples to generate triples with bits. This shouldn't break anything elsewhere (other than the security) because the triples are still valid.
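
For option 1, a minimal sketch in the high-level language (hedged; n and the use of the triples are placeholders) could generate the bit triples before the timed section:

n = 1000
a = [sint.get_random_bit() for _ in range(n)]   # [a], a bit in the arithmetic domain
b = [sint.get_random_bit() for _ in range(n)]   # [b]
c = [x * y for x, y in zip(a, b)]               # [ab], consumes ordinary triples

start_timer(1)
# ... online phase consuming the bit triples (a, b, c) ...
stop_timer(1)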

If you really want to add an instruction, you have to add it to at least the following places:

waamm commented 4 months ago

Many thanks, this was very helpful! It seems to me that (2) would not work since ordinary triples are also needed for this protocol, but (1) sounds perfect. That option did not occur to me since I was thinking the compiler would make an effort (for such MPC protocols) to reduce circuit depth as much as possible. So I guess you're saying that this should work:

def preprocessing():
  ...   

start_timer(1)

def online():
  ...

stop_timer(1)

Is the only way to check whether two "independent" multiplication gates are indeed executed in the same communication round to inspect the compiled program?

mkskeller commented 4 months ago

Many thanks, this was very helpful! It seems to me that (2) would not work since ordinary triples are also needed for this protocol, but (1) sounds perfect. That option did not occur to me since I was thinking the compiler would make an effort (for such MPC protocols) to reduce circuit depth as much as possible. So I guess you're saying that this should work:

Yes, because timer operations imply a break; that is, no circuit optimization is done between the code before and the code after the timer.
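
To illustrate the break (a sketch; values are arbitrary): without the timer calls, the two independent multiplications below could be merged into a single communication round, whereas the timer forces them into separate rounds.

a, b, c, d = sint(1), sint(2), sint(3), sint(4)

x = (a * b).reveal()   # before the timer

start_timer(1)
y = (c * d).reveal()   # after the break, hence a separate round
stop_timer(1)

print_ln("%s %s", x, y)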

Is the only way to check whether two "independent" multiplication gates are indeed executed in the same communication round to inspect the compiled program?

For a set of specific gates, yes. For a more bird's-eye view, you can look at the number of virtual machine rounds output by the compiler and see if it roughly matches your expectations. A virtual machine round is one round of any operation optimized by the compiler, which includes any sort of multiplication.

waamm commented 4 months ago

A .reveal() in shamir takes 2 rounds but in mal-shamir it takes 1 round; is that correct?

x = sint(5)

start_timer(1)

x.reveal()

stop_timer(1)
mkskeller commented 4 months ago

In the default configuration, yes. shamir uses a star-based opening protocol that scales better with the number of parties at the expense of an extra round. There is the option to use direct communication, more similar to mal-shamir, by using the --direct command-line argument.

waamm commented 4 months ago

Oh, that sounds like the "king node" approach of Damgård-Nielsen? I'm curious why it would be limited to the opening part of the protocol, and to dishonest-majority protocols - is there a reference for this particular approach?

mkskeller commented 4 months ago

It's certainly related, but the approach is generic. With any secret sharing scheme, you can do reconstruction by sending all shares to one party and then sending the result back to all parties (O(n) messages) instead of every party sending their share to every other party (O(n^2) messages). However, there might be issues with a malicious king node, which in some protocols can be solved in other ways, see for example: https://eprint.iacr.org/2012/642 Regarding generalisation, some multiplication protocols are based on opening (like Damgård-Nielsen), so the properties of the opening protocol filter through to the multiplication protocol, but this isn't the case for all protocols. Regarding the restriction to dishonest-majority protocols, it's just that it wasn't always implemented for Shamir secret sharing; the documentation is simply outdated. Thank you for bringing this up.
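
To make the message counts concrete, a toy calculation (plain Python, not MP-SPDZ code):

n = 10

# direct opening: every party sends its share to every other party
direct_messages = n * (n - 1)        # O(n^2) messages, one round

# star/king opening: all shares to one party, result back to everyone
star_messages = (n - 1) + (n - 1)    # O(n) messages, two rounds

print(direct_messages, star_messages)  # 90 18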

waamm commented 4 months ago

Inside a large circuit, I now have an array arr of secret-shared sint values which are used in multiplications and additions later in the circuit. Depending on some random values of an array arr2 that are revealed (becoming cint), some of the secret-shared values in arr must be zero, and hence the corresponding multiplications can be skipped. I was hoping that inserting this code might work, but I realise that might be naive?

for i in range(len(arr)):
    @if_(arr2[i] == val)
    def _():
        arr[i] = cint(0)

Bandwidth indeed drops, but by an amount that does not appear to be random (though it should be), and the round complexity goes up (with both shamir and mascot).

I'll try to isolate the problem better, but I thought perhaps you already have some idea of what is going on? This at least reproduces the bandwidth phenomenon:

x = [sint.get_random_bit().reveal() for i in range(10)]
y = [sint(i) for i in range(10)]
z = [sint(i) for i in range(10)]

start_timer(1)

for i in range(10):
    @if_(x[i] == 1)
    def _():
        y[i] = cint(0)

for i in range(10):
    (y[i] * z[i]).reveal()

stop_timer(1)
mkskeller commented 4 months ago

You cannot mix run-time branching and Python lists. The example will set all values in y to 0 independently of the condition. As a general rule, only use Array with run-time branching.
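
A hedged rewrite of the snippet above along these lines, with all containers as Array so that the run-time branch updates memory rather than a Python list:

x = Array(10, cint)
y = Array(10, sint)
z = Array(10, sint)
for i in range(10):
    x[i] = sint.get_random_bit().reveal()
    y[i] = sint(i)
    z[i] = sint(i)

start_timer(1)

for i in range(10):
    @if_(x[i] == 1)
    def _():
        y[i] = sint(0)

for i in range(10):
    (y[i] * z[i]).reveal()

stop_timer(1)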

waamm commented 4 months ago

I made some changes, but I still don't understand why the round complexity with shamir is larger than 7 for the following code. (The line y[i] = cint(0) is probably incorrect, but the result is the same with y[i] = sint(0).)

size = 40

x = Array(size, sint)
y = Array(size, sint)
x2 = Array(size, cint)
for i in range(size):
    x[i] = sint.get_random_bit()
    x2[i] = x[i].reveal()
    y[i] = sint(2)
zeroes = Array(size, cint)

for i in range(size):
    @if_(x2[i] == 0)
    def _():
        zeroes[i] = cint(1)

start_timer(1)

while size > 1:
    size = size // 2
    @for_range_parallel(size, size)
    def _(i):
        @if_e(zeroes[2*i] == 1)
        def _():
            y[i] = cint(0)
        @else_
        def _():
            y[i] = y[i] + y[i] * y[2*i]

z = y[20].reveal()

print_ln("%s", z)

stop_timer(1)
mkskeller commented 4 months ago

The conditional @if_e(zeroes[2*i] == 1) prevents the parallelization, so the multiplication y[i] * y[2*i] is executed size times consecutively.

waamm commented 4 months ago

Is there a straightforward way around that?

mkskeller commented 4 months ago

A straightforward way, at some bandwidth expense, is to use if_else instead of @if_e: https://mp-spdz.readthedocs.io/en/latest/Compiler.html#Compiler.types.sint.if_else A more involved way that saves some bandwidth is to determine a reasonable upper bound on the number of multiplications in every round, set up an array of operands to be multiplied using @if_e, run the multiplications in parallel, and then post-process with more conditionals. The crux is that the multiplications cannot be inside the conditional.
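
Applied to the loop above, the if_else variant might look as follows (a sketch; if_else on the revealed clear condition is assumed to select like the documented sint.if_else). The multiplication now runs unconditionally, trading bandwidth for parallelism:

while size > 1:
    size = size // 2
    @for_range_parallel(size, size)
    def _(i):
        prod = y[i] + y[i] * y[2*i]                # always computed, so parallelizable
        y[i] = zeroes[2*i].if_else(sint(0), prod)  # select by the revealed condition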

waamm commented 3 months ago

Many thanks for those suggestions, but I'm not entirely sure I follow so here's a simplified version of the problem:

y = [sint.get_random() for i in range(size)]
x_clear = [sint.get_random_bit().reveal() for i in range(size)]

start_timer(1)

for i in range(size):
  y[i] = y[i] * y[i] * x_clear[i]

stop_timer(1)

Thus there's a 50% chance that y[i] is 0 (or rather, [0]) and the multiplication y[i] * y[i] does not have to be executed.

But I would say there is no "reasonable upper bound" available, other than size itself; is it possible to obtain a 50% reduction in bandwidth here, using the methods you just described?

mkskeller commented 3 months ago

I think you can only get the bandwidth reduction at the cost of more rounds as in the earlier examples.

waamm commented 2 months ago

That's unfortunate, but many thanks again for your effort.

Following this code, I'd now like to separately benchmark the online and offline phases, running a protocol f(x, y, prep_material) thousands or millions of times in multiple threads. Here prep_material is an array of (arrays of) preprocessed secret values (random values, random bits, edaBits, values obtained by multiplying or adding some of these to each other, etc.). Something like this:

n = 1024

n_threads = 8

l = 1

prep_materials = []

for i in range(n):
    prep_materials.append(preprocessing())

res = sint.Array(n)

start_timer(1)
@multithread(n_threads, n)
def _(base, m):
    @for_range(l)
    def _(i):
        f(sint(1, size=m), sint(2, size=m), prep_materials[base:base+m]).store_in_mem(base)

stop_timer(1)

One immediate problem here is: TypeError: slice indices must be integers or None or have an __index__ method

What would I need to change to make such code work?

mkskeller commented 2 months ago

You need to use an Array for prep_materials and get_vector() instead of Python slicing.
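
A sketch of this, assuming for illustration that each invocation consumes a single preprocessed sint:

prep_materials = sint.Array(n)
for i in range(n):
    prep_materials[i] = preprocessing()   # hypothetical: one sint per invocation

start_timer(1)
@multithread(n_threads, n)
def _(base, m):
    chunk = prep_materials.get_vector(base, m)   # sint vector of length m
    f(sint(1, size=m), sint(2, size=m), chunk).store_in_mem(base)

stop_timer(1)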

waamm commented 2 months ago

By which you mean a MultiArray or Tensor? For an f requiring 2 edabits, I just tried something like this:

edabit_values = sint.Tensor([n,2,1])
edabit_bits = sint.Tensor([n,2,bit_length])

def preprocessing():
    edabit0, edabit1 = [sint.get_edabit(edabits_size, True) for i in range(2)]
    return [edabit0[0], edabit1[0]], [edabit0[1], edabit1[1]]

for i in range(n):
    edabit_values[i], edabit_bits[i] = preprocessing()

start_timer(1)
@multithread(n_threads, n)
def _(base, m):
    print("m = ", m)
    @for_range(l)
    def _(i):
        f(sint(1, size=m), sint(2, size=m), edabit_values.get_vector(base, m), edabit_bits.get_vector(base, m)).store_in_mem(base)

stop_timer(1)

@vectorize
def f(x, y, edabit_values, edabit_bits):
    a = edabit_values[1]

But here the final line a = edabit_values[1] produces an IndexError: list index out of range

mkskeller commented 2 months ago

I think the easiest would be to only use Array.

waamm commented 2 months ago

Not sure I follow - each instance of this f requires 2 edabits, so that's $2 \cdot bitlength$ sbits and 2 sints; are you saying I should produce $2 \cdot bitlength + 2$ separate Arrays?

mkskeller commented 2 months ago

That's what I meant but you can actually use get_part() as well: https://mp-spdz.readthedocs.io/en/latest/Compiler.html#Compiler.types.MultiArray.get_part

waamm commented 2 months ago

Ah I think you mean get_part_vector()? That part now seems to work (though I just realised that I wrote edabit_values = sint.Tensor([n,2,1]) above but that should've probably been edabit_values = sint.Tensor([n,2]) instead, and I'm not sure the compiler noticed?). However, a bit further down some code analogous to

@vectorize
def f(x, y, edabit_values, edabit_bits):
    a = edabit_values[1]
    for i in range(bit_length):
        b = edabit_bits[0][i]

fails to compile, due to another IndexError in the final line (referring to the [i] portion).

mkskeller commented 2 months ago

No, I mean get_part(), because it returns a MultiArray of the same number of dimensions, just restricted along the first dimension.
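
Continuing the snippet above (assuming base + m <= n), the shapes would be:

part_values = edabit_values.get_part(base, m)   # shape (m, 2)
part_bits = edabit_bits.get_part(base, m)       # shape (m, 2, bit_length)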

waamm commented 2 months ago

That yields raise CompilerError('index out of range') in the a = edabit_values[1] line again.

mkskeller commented 2 months ago

Please post the full code.

waamm commented 2 months ago

Here's a shortened version (which I hope is easier to work with, otherwise I'll post a longer version):

bit_length = 64
edabits_size = bit_length

def preprocessing():
    edabit0, edabit1 = [sint.get_edabit(edabits_size, True) for i in range(2)]
    return [edabit0[0], edabit1[0]], [edabit0[1], edabit1[1]]

@vectorize
def f(x, y, edabit_values, edabit_bits):
    a = edabit_values[1] - x
    b = a.reveal()
    return (a, f2(b, edabit_bits[0]))

@vectorize
def f2(b, eda):
    b_bits = cint.bit_decompose(b)
    return [eda[i].bit_xor(b_bits[i]) for i in range(bit_length)]

n = 1024

n_threads = 8

l = 1

res = sint.Array(n)

edabit_values = sint.Tensor([n,2])
edabit_bits = sint.Tensor([n,2,bit_length])

for i in range(n):
    edabit_values[i], edabit_bits[i] = preprocessing()

start_timer(2)
@multithread(n_threads, n)
def _(base, m):
    print("m = ", m)
    @for_range(l)
    def _(i):
        f(sint(1, size=m), sint(2, size=m), edabit_values.get_part(base, m), edabit_bits.get_part(base, m)).store_in_mem(base)

stop_timer(2)
mkskeller commented 2 months ago

Because m=1, edabit_values.get_part(base, m) has dimension (1,2), so 1 is out of bounds.

waamm commented 2 months ago

Hmm but the point is to increase m and "vectorise" this? How should I write that?

I don't understand the relevant documentation: it says "Distribute the computation of n_items to n_threads threads", then sets n_threads = 8 but then it says "in three different threads"?

mkskeller commented 2 months ago

That is indeed a typo, but my previous comment referred to the code example you posted originally. The changed code example doesn't produce the out-of-range error.

waamm commented 2 months ago

Yes, instead now there's a Compiler.exceptions.VectorMismatch: Different vector sizes of operands: 2/128

So in that code $m = 128 = 1024 / 8$? Does that mean that edabit_values.get_part(base, m) has dimension (m,2)? Doesn't that conflict with the attempt to "vectorise" this code?

mkskeller commented 2 months ago

So in that code m=128=1024/8? Does that mean that edabit_values.get_part(base, m) has dimension (m,2)?

Yes.

Doesn't that conflict with the attempt to "vectorise" this code?

What do you mean?

waamm commented 2 months ago

Sorry, I phrased that badly. What I meant is that when the vectorised f receives the two vectors sint(1, size=m) and sint(2, size=m) of size m, it seems to act on them entrywise, as if only two individual elements were passed along?

Then the edabit_values input should have dimension (m, 2) and the edabit_bits input have dimension (m, 2, bit_length), but inside f they should appear as arrays of dimension 2 and (2, bit_length)?

mkskeller commented 2 months ago

I see what you mean. @vectorize is a relatively simple approach to make sure that code that works with single sint etc. also works with vectors thereof. It does not handle arrays or tensors in any way.
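
One possible workaround (a sketch; it assumes a get_column method that returns one column of a two-dimensional part as a vector of length m): drop @vectorize and index the parts explicitly inside f.

def f(x, y, edabit_values_part, edabit_bits_part):
    # second value of each of the m instances, as a single sint vector
    a = edabit_values_part.get_column(1) - x
    return a.reveal()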

waamm commented 1 month ago

I can see two easy ways of achieving this:

  1. Compute the triples in the online phase and separate out the benchmarking using timers.

Is there a straightforward way to make that approach work with, say, the EzPC ResNet code? I was thinking of overloading/replacing the comparison calls inside non_linear.py and then executing such code, but I believe the issue of separating the offline and online phases remains.

mkskeller commented 1 month ago

You could just start and stop a timer within the comparison call replacement, so I still don't see an issue.

waamm commented 1 month ago

I probably wasn't clear - the idea would be to record the change in the total online time (and preprocessing bandwidth) required to execute something like ResNet50, by changing the comparison protocols in say this line.

I still don't understand how to make that work - where should I put the timer(s)? It seems to me that what you're describing would yield many tiny measurements instead? (Also I would need to figure out how to store and reference the preprocessing material?)

Also, do you have any other benchmark suggestions? One of the primary aims behind measuring the required online time of such a heavy workload is to determine whether any change in computational complexity has a significant impact.

mkskeller commented 1 month ago

I don't think I misunderstood. You could (and probably should) do the preprocessing in batches just like it's done in the virtual machines, roughly as follows:

if not preprocessing left:
  start timer
  do preprocessing of a batch of triples
  stop timer
use preprocessing

That way you can reduce the costs associated with the preprocessing, even those of the timer calls. I don't see an issue in storing the preprocessing; just use the usual containers.

All that said, I don't want to stop you from implementing it in C++ like the rest; I just think it might be easier if you can implement the preprocessing in Python.
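
A compile-time Python sketch of this pattern in the high-level language (the names BATCH, stock, make_batch and get_prep are illustrative, not library API; bit triples stand in for the actual preprocessing):

BATCH = 1000
stock = []

def make_batch(size):
    a = [sint.get_random_bit() for _ in range(size)]
    b = [sint.get_random_bit() for _ in range(size)]
    return list(zip(a, b, (x * y for x, y in zip(a, b))))

def get_prep():
    # replenish with a whole batch when empty; only the generation
    # runs on the preprocessing timer
    if not stock:
        start_timer(2)
        stock.extend(make_batch(BATCH))
        stop_timer(2)
    return stock.pop()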

waamm commented 1 month ago

I am very glad to hear that this seems doable, but my background is not that technical and consequently I'm not yet familiar with the virtual machines at this level - are you aware of a similar MP-SPDZ coding example somewhere?

So instead of LtzRing(a, k) I want to use NewLtz(a, k, preprocessing_for_one_invocation), probably by switching them inside non_linear.py. Now given some program like ResNet50 which has comparisons in it, how am I supposed to feed NewLtz its required preprocessing material?

(Yes the Python code is ready, e.g. this function involving some multiplications is involved. Similarly, the Rabbit protocol can be significantly improved in this regard by moving the bit addition protocol of edaBits to preprocessing.)

mkskeller commented 1 month ago

I am very glad to hear that this seems doable, but my background is not that technical and consequently I'm not yet familiar with the virtual machines at this level - are you aware of a similar MP-SPDZ coding example somewhere?

An example of what?

So instead of LtzRing(a, k) I want to use NewLtz(a, k, preprocessing_for_one_invocation), probably by switching them inside non_linear.py. Now given some program like ResNet50 which has comparisons in it, how am I supposed to feed NewLtz its required preprocessing material?

What do you mean by feed? The pseudo-code above involves keeping a batch of preprocessing in store as well as a counter, and replenishing it whenever it's empty. The same principle is used in the C++ code.

waamm commented 1 month ago

I meant that the pseudocode you wrote probably should not go into non_linear.py or the .mpc file itself; where in the codebase is this C++ preprocessing code, and in particular the preprocessing counter? I presume that is where I should put (an appropriate version of) the Python preprocessing code?

mkskeller commented 1 month ago

No, that's not what I meant. It's just a design principle that can be anywhere. One application in C++ is for triples here where the counter is simply the size of the C++ vector: https://github.com/data61/MP-SPDZ/blob/a44132e5095f84ed5fda3e27c100bf2d6e462243/Protocols/ReplicatedPrep.hpp#L221C1-L226C6

waamm commented 1 month ago

I'm thinking now that instead of adding new preprocessing material from scratch, it's probably easier for me and will suffice (for now) to extend the existing edaBit generation protocol, as follows: instead of returning an edaBit, i.e. a value in Z/mZ together with sharings of its bits in Z/2Z, it would additionally return certain products of those bits.

I'll try that tomorrow - following the manual, I could time the online phase of ResNet50 using "insecure preprocessing"? For this, would modifying plain_edabits suffice? For the impact on preprocessing bandwidth, I would create one program which retrieves a bunch of edaBits, and another which subsequently performs those bit multiplications.

mkskeller commented 1 month ago

I don't see a reason why it wouldn't work.

waamm commented 1 month ago

Adding program.use_edabit(True) to tf.mpc doesn't seem to have an effect for SqueezeNet; is that because edaBits are already used by default or because this SqueezeNet does not have operations like ReLUs?

Also, for testing I added some print_ln statements to non_linear.py. The command ./compile.py -R 64 tf EzPC/Athos/Networks/SqueezeNetImgNet/graphDef.bin 1 trunc_pr split then yields the message Compile with '-O', but that doesn't seem to work regardless of where I place -O?

mkskeller commented 1 month ago

Adding program.use_edabit(True) to tf.mpc doesn't seem to have an effect for SqueezeNet; is that because edaBits are already used by default or because this SqueezeNet does not have operations like ReLUs?

If you compile split, it uses local share conversion instead of edaBits as described in https://eprint.iacr.org/2018/403

Also, for testing I added some print_ln statements to non_linear.py. The command ./compile.py -R 64 tf EzPC/Athos/Networks/SqueezeNetImgNet/graphDef.bin 1 trunc_pr split then yields the message Compile with '-O', but that doesn't seem to work regardless of where I place -O?

Thank you for raising this. You should find that ef82a68aa9 fixes it.

waamm commented 3 weeks ago

Thanks!

When I put something like this in a .mpc file, it seems to work:

T = cint(1)
b = cint(2)
c = cbit(T < b)

But when I put it inside non_linear.py (and import cbit), I get errors like these:

  File ".../Compiler/non_linear.py", line 228, in LTS
    c = cbit(T < b)
         ^^^^^^^^^^^
  File ".../Compiler/GC/types.py", line 143, in __init__
    self.load_other(value)
  File ".../Compiler/GC/types.py", line 162, in load_other
    n_convs = min(other.size, n_units)
              ^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'int' and 'NoneType'

Also, print(T < b) yields ciinf, is that intended?

mkskeller commented 3 weeks ago

This is probably due to the same optimization causing issues earlier. It tries to generate code without concrete vector lengths, hence the appearance of None and inf. You can try with -O or remove the @instructions_base.cisc decorator from LTZ in comparison.py.

waamm commented 3 weeks ago

The previous code worked after your fix, but now, similarly, this code

a = sint.get_edabit(64, True)[1][0]
b = (0 < 1)
c = a + b

compiles when put in a .mpc file, but not when similarly placed inside non_linear.py.

  File ".../MP-SPDZ/Compiler/non_linear.py", line 199, in ltz
    return LtzRing(c, k)
           ^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/comparison.py", line 100, in LtzRing
    tmp = a - r_prime
          ~~^~~~~~~~~
  File ".../MP-SPDZ/Compiler/types.py", line 220, in read_mem_operation
    return operation(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/GC/types.py", line 521, in __add__
    other = self.conv(other)
            ^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/GC/types.py", line 54, in conv
    res.load_other(other)
  File ".../MP-SPDZ/Compiler/GC/types.py", line 514, in load_other
    super(sbits, self).load_other(other)
  File ".../MP-SPDZ/Compiler/GC/types.py", line 178, in load_other
    self.mov(self, sbitvec(other, self.n).elements()[0])
                   ^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/GC/types.py", line 882, in __init__
    c = ((elements - r) << (l - length)).reveal()
         ~~~~~~~~~~~~~~~^^~~~~~~~~~~~~~
  File ".../MP-SPDZ/Compiler/types.py", line 2820, in __lshift__
    return self * util.pow2_value(other, bit_length, security)
           ~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
TypeError: unsupported operand type(s) for *: 'sint' and 'float'
mkskeller commented 3 weeks ago

First, the code above is somewhat trivial because the comparison is between two Python integers (0 < 1). Second, it might be that (l - length) is negative because the -R parameter is too low, so I would recommend giving a higher number there, but it's hard to be sure without seeing the actual code.

waamm commented 3 weeks ago

I'm getting the same error for (cint(0) < cint(1)).

The actual code to be inserted is here (which does compile inside of a .mpc file); the above snippet is a simplified version of a part of LTS. But the error is now very different indeed:

Writing to Programs/Bytecode/tf-EzPC_Athos_Networks_SqueezeNetImgNet_graphDef.bin-1-trunc_pr-multithread-2.bc
Traceback (most recent call last):
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 975, in check_args
    ArgFormats[f].check(arg)
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 764, in check
    raise ArgumentError(arg, "Wrong register type '%s', expected '%s'" % \
Compiler.exceptions.ArgumentError: (sb47398529(12769)(817216), "Wrong register type 'sb', expected 's'")

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File ".../MP-SPDZ/./compile.py", line 41, in <module>
    main(compiler)
  File ".../MP-SPDZ/./compile.py", line 36, in main
    compilation(compiler)
  File ".../MP-SPDZ/./compile.py", line 19, in compilation
    prog = compiler.compile_file()
           ^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/compilerLib.py", line 454, in compile_file
    exec(compile(infile.read(), infile.name, "exec"), self.VARS)
  File "Programs/Source/tf.mpc", line 36, in <module>
    opt.forward(1, keep_intermediate=False)
  File ".../MP-SPDZ/Compiler/../Compiler/ml.py", line 200, in wrapper
    res = function(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/../Compiler/ml.py", line 2278, in forward
    layer.forward(batch=self.batch_for(layer, batch),
  File ".../MP-SPDZ/Compiler/../Compiler/ml.py", line 265, in forward
    self._forward(batch)
  File ".../MP-SPDZ/Compiler/../Compiler/ml.py", line 1048, in _forward
    @multithread(self.n_threads, len(batch) * n_per_item)
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/library.py", line 1084, in decorator
    tape = prog.new_tape(f, (0,), 'multithread')
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/program.py", line 315, in new_tape
    function(*args)
  File ".../MP-SPDZ/Compiler/library.py", line 1066, in f
    return loop_body(base, thread_rounds + inc)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/../Compiler/ml.py", line 1050, in _
    self.Y.assign_vector(self.f_part(base, size), base)
                         ^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/../Compiler/ml.py", line 1099, in f_part
    c = x > 0
        ^^^^^
  File ".../MP-SPDZ/Compiler/types.py", line 141, in vectorized_operation
    res = operation(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/types.py", line 4475, in __gt__
    return self.v.greater_than(other.v, self.k, self.kappa)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/types.py", line 220, in read_mem_operation
    return operation(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/types.py", line 228, in type_check
    return operation(self, other, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/types.py", line 141, in vectorized_operation
    res = operation(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/types.py", line 2725, in __gt__
    comparison.LTZ(res, other - self,
  File ".../MP-SPDZ/Compiler/comparison.py", line 84, in LTZ
    movs(s, program.non_linear.ltz(a, k, kappa))
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 408, in maybe_gf2n_instruction
    return instruction(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 317, in maybe_vectorized_instruction
    return Vectorized_Instruction(size, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 281, in __init__
    super(Vectorized_Instruction, self).__init__(*args, **kwargs)
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 930, in __init__
    self.check_args()
  File ".../MP-SPDZ/Compiler/instructions_base.py", line 977, in check_args
    raise CompilerError('Invalid argument %d "%s" to instruction: %s'
Compiler.exceptions.CompilerError: Invalid argument 1 "sb47398529(12769)(817216)" to instruction: vmovs 817216, s1634432(817216), sb47398529(12769)(817216)
Wrong register type 'sb', expected 's'