data61 / MP-SPDZ

Versatile framework for multi-party computation

More clarity on running MP-SPDZ on 3 VMs for 3 parties. #1213

Closed sandy9999 closed 11 months ago

sandy9999 commented 12 months ago

I'm looking to run Programs/Source/breast_tree.mpc using the 3 party Replicated Secret Sharing protocol in the semi-honest setting. This is what I'm currently trying at the root of the MP-SPDZ repo.

Scripts/compile-run.py -HOSTS HOSTS -E ring breast_tree -Z 3 -R 64

My HOSTS file has the following information:

myname@xxx.xxx.xxx.xxx/home/myname
myname@xxx.xxx.xxx.xxx/home/myname
myname@xxx.xxx.xxx.xxx/home/myname

I also tried the following format:

[myname@]xxx.xxx.xxx.xxx[/home/myname]
[myname@]xxx.xxx.xxx.xxx[/home/myname]
[myname@]xxx.xxx.xxx.xxx[/home/myname]

The HOSTS file is not getting parsed correctly, which I'm guessing is because I've given IP address instead of HOSTNAME. How can I make it work with IP address?

mkskeller commented 12 months ago

The argument should be -H HOSTS.
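
For readers unsure about the HOSTS line format itself, here is a minimal sketch of how a line of the form user@host/path can be split. It assumes that the square brackets in the second attempt above are documentation notation for optional parts and are not meant to be typed literally. The regular expression and the names HostEntry and parse_host_line are hypothetical and not taken from MP-SPDZ; the IP address in the usage note is a placeholder.

```python
import re
from typing import NamedTuple, Optional


class HostEntry(NamedTuple):
    user: Optional[str]
    host: str
    path: Optional[str]


# Hypothetical pattern for "[user@]host[/path]"; literal brackets are rejected.
_LINE = re.compile(
    r"^(?:(?P<user>[^@/\s\[\]]+)@)?(?P<host>[^@/\s\[\]]+)(?P<path>/\S*)?$"
)


def parse_host_line(line: str) -> HostEntry:
    """Split one HOSTS line into its user, host and path parts."""
    match = _LINE.match(line.strip())
    if match is None:
        raise ValueError(f"cannot parse host line: {line!r}")
    return HostEntry(match.group("user"), match.group("host"), match.group("path"))
```

With this sketch, parse_host_line("myname@10.0.0.1/home/myname") yields the user, the IP address and the path as separate fields, so an IP address in place of a hostname is not a problem in itself, while a line with literal brackets raises ValueError.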

sandy9999 commented 12 months ago

Thank you, that helped. Following up. I am running:

Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64

My HOSTS file has the following format:

myname@xxx.xxx.xxx.xxx/home/myname
myname@xxx.xxx.xxx.xxx/home/myname
myname@xxx.xxx.xxx.xxx/home/myname

I am getting the following error.

/usr/bin/ld: cannot open output file static/replicated-ring-party.x: No such file or directory
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:150: static/replicated-ring-party.x] Error 1

I realized there's no proper authentication mechanism provided to ssh into the 3 VMs mentioned in HOSTS. Is that what's leading to this error? If yes, how should we provide that?

mkskeller commented 11 months ago

Thank you for raising this. You should find that d4c96c61bd fixes it.

What do you mean no proper authentication mechanism being provided to ssh?

sandy9999 commented 11 months ago

Thank you for fixing! About the authentication mechanism query, here is the stack trace I'm getting:

Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Exception in thread Thread-2 (run):
Exception in thread Thread-3 (run):
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 946, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 520, in run
    connection.run(
  File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
    self.run()
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
  File "/usr/local/lib/python3.10/threading.py", line 946, in run
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 946, in run
    self.open()
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
    self._target(*self._args, **self._kwargs)
  File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 520, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 520, in run
    result = self.client.connect(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
    sock.connect(addr)
    connection.run(
    connection.run(
TimeoutError: [Errno 110] Connection timed out
  File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
  File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
    self.open()
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
    result = self.client.connect(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
    sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
    self.open()
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
    result = self.client.connect(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
    sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out

Exception in thread Thread-6 (<lambda>):
Exception in thread Thread-5 (<lambda>):
Traceback (most recent call last):
Traceback (most recent call last):
Exception in thread Thread-4 (<lambda>):
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
  File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 946, in run
    self.run()
    self._target(*self._args, **self._kwargs)
    self.run()
  File "/usr/local/lib/python3.10/threading.py", line 946, in run
  File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
    run = lambda i: connections[i].run(
  File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
  File "/usr/local/lib/python3.10/threading.py", line 946, in run
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
    self._target(*self._args, **self._kwargs)
  File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
    self.open()
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
    run = lambda i: connections[i].run(
  File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
    result = self.client.connect(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
    sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
    self._target(*self._args, **self._kwargs)
  File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
    self.open()
    run = lambda i: connections[i].run(
  File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
    result = self.client.connect(**kwargs)
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
    self.open()
  File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
    sock.connect(addr)
    result = self.client.connect(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
TimeoutError: [Errno 110] Connection timed out
    sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out

I basically meant to say that we are not providing any details about a password, key, etc. Or is it expected that the machines are on the same VNET? If the 3 parties are 3 VMs, should I be running Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 from one of these VMs? (Currently I'm running it on a 4th VM that doesn't correspond to any of these parties.)

mkskeller commented 11 months ago

The assumption is that you have password-less authentication set up, either via ssh-agent or a password-less authentication key. See for example https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/

sandy9999 commented 11 months ago

Thanks, that information helps. Also, are the following scenarios allowed?

a) Run Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 from a 4th VM that doesn't correspond to any of the 3 VMs in HOSTS.
b) Run Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 from one of the 3 VMs mentioned in HOSTS.

Knowing this would help debug the NSG rule and connection issues that I'm facing.

mkskeller commented 11 months ago

Both should work if the setup is correct.
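
Since the earlier traceback ends in sock.connect timing out, a plain TCP reachability test can help separate firewall or NSG problems from SSH authentication problems. The following is a minimal sketch, assuming SSH listens on the default port 22; the function name can_connect is hypothetical.

```python
import socket


def can_connect(host: str, port: int = 22, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable networks.
        return False
```

Running this from the machine that launches compile-run.py against every address in HOSTS shows whether the SSH port is reachable at all. The same check may be worth repeating between the party VMs on the port passed via -pn, since the parties also have to reach each other.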

sandy9999 commented 11 months ago

This is the error I'm getting now when I run Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 on one of the 3 VMs in HOSTS.

[screenshot of the error output, not preserved]

I've built from source and installed the necessary packages on all 3 VMs in HOSTS. I've also tried the following on each VM, and this works:

./compile.py -Z 3 -R 64 breast_tree
(Open 3 sessions using tmux)
./replicated-ring-party.x 0 custom_data_dt -v --batch-size 1
./replicated-ring-party.x 1 custom_data_dt -v --batch-size 1
./replicated-ring-party.x 2 custom_data_dt -v --batch-size 1

Any ideas?

mkskeller commented 11 months ago

This probably still relates to SSH issues. For every host in HOSTS, ssh host should immediately give you a shell.
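
The "ssh host should immediately give you a shell" condition can be checked without any interactive prompt. The sketch below assumes an OpenSSH client on the launching machine and uses its BatchMode and ConnectTimeout options; the helper names are hypothetical, and target is the user@host part of a HOSTS line without the trailing path.

```python
import subprocess
from typing import List


def ssh_check_command(target: str, timeout_s: int = 5) -> List[str]:
    """Build an ssh invocation that fails instead of prompting for a password."""
    return [
        "ssh",
        "-o", "BatchMode=yes",               # never ask for a password or passphrase
        "-o", f"ConnectTimeout={timeout_s}",  # give up quickly on unreachable hosts
        target,
        "true",                               # trivial remote command
    ]


def passwordless_ok(target: str, timeout_s: int = 5) -> bool:
    """Return True if the trivial remote command ran without any prompt."""
    try:
        result = subprocess.run(
            ssh_check_command(target, timeout_s),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return False  # no ssh client installed
    return result.returncode == 0
```

Calling passwordless_ok for every entry in HOSTS, including the machine the script runs on, covers both scenarios discussed above.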

sandy9999 commented 11 months ago

ssh host is giving me a shell. That's happening. But of course ssh host when host is the VM I'm on wouldn't make sense, that's why I was clarifying scenarios earlier. Would the format of HOSTS change then?

mkskeller commented 11 months ago

I think it's possible to ssh to the host you're already on.

sandy9999 commented 11 months ago

Thank you! All sorted, and I got the 3 VM setup working. :)

sandy9999 commented 11 months ago

I am reopening this thread as I've got more questions.

I created a custom_data_dt.mpc file in Programs/Source that feeds a custom dataset instead of the breast_tree dataset to use the Decision Tree protocol. Contents:

import pandas as pd
import random
import numpy as np

m = 30
n = 2**19

data_x = np.random.uniform(0, 10, (n, m))
data_y = np.random.randint(2, size=(1, n))
df_x = pd.DataFrame(data_x)
df_y = pd.DataFrame(data_y)

df_x = sfix.input_tensor_via(0, df_x)
df_y = sint.input_tensor_via(0, df_y)
df_y = Array.create_from(df_y[0])

program.set_bit_length(32)
sfix.set_precision(16, 31)

from Compiler.decision_tree import TreeClassifier

tree = TreeClassifier(max_depth=5, n_threads=2)

tree.fit(df_x, df_y)

I'm running this using Scripts/compile-run.py -H HOSTS -E ring custom_data_dt -Z 3 -R 64 and getting the following exception:

terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  read_some: stream truncated
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  read_some: stream truncated
bash: line 1: 491024 Killed

I used pktstat and found that just before the crash, 99% of the bandwidth was in use. How is this happening? Suppose we have a very large sint array of, say, 2^25 elements. Are the shares of this array sent all at once, so that the bandwidth gets clogged? If they are, wouldn't it be better to send them sequentially, at least when bandwidth consumption is nearing the limit, so that the protocol just runs for a very long time instead of getting killed?

Note: I am up to date with master, including the libOTE memory leak fix.

mkskeller commented 11 months ago

This is most likely due to memory issues. Scripts/memory-usage.py custom_data_dt gives an estimate of 7 GB RAM per party, but that has to be doubled because of the secret sharing, so you're looking at 14 GB of RAM. I don't think it has anything to do with bandwidth.

sandy9999 commented 11 months ago

I'm using 3 VMs of 32 GB RAM, so shouldn't 14 GB RAM be accommodated?

mkskeller commented 11 months ago

The estimate doesn't take into account protocol states, but I would be surprised if that doubled the memory usage. What memory usage do you observe before it crashes? And what is the largest input size that doesn't crash?

sandy9999 commented 11 months ago

Tried now for N = 2^19, m = 20, h = 5

This is the memory usage I observe on VM 1:

[screenshot of the memory usage on VM 1, not preserved]

And this crashed.

As for the largest input size that doesn't crash, I'm still binary searching for that, but I can tell you it worked for N = 2^19, m = 11, h = 4.

sandy9999 commented 11 months ago

Narrowing down the issue a little more with a few observations.

The following line inside the Sort function looks like one of the bottlenecks.

bs = Matrix.create_from(
        sum([k.get_vector().bit_decompose(nb)
             for k, nb in reversed(list(zip(keys, n_bits)))], []))

I rewrote this as:

bs_sum = sum(
        (k.get_vector().bit_decompose(nb) for k, nb in zip(reversed(keys), n_bits)),
        []
    )
bs = Matrix.create_from(bs_sum)

Up to the initialization of bs_sum, it's fine. The Matrix.create_from call leads to the process getting killed. Is there any known inherent issue with Matrix.create_from when scaling up input datasets?
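
One detail worth checking in the rewrite, independent of the memory question: zip(reversed(keys), n_bits) does not pair elements the same way as reversed(list(zip(keys, n_bits))) unless n_bits reads the same in both directions. A plain-Python sketch with placeholder names and bit widths (not values from the actual Sort function):

```python
# Placeholder data; the real keys are secret-shared arrays.
keys = ["k0", "k1", "k2"]
n_bits = [32, 20, 8]

# Original form: pair first, then reverse the pairs.
original = list(reversed(list(zip(keys, n_bits))))
# Rewritten form from the post: only the keys are reversed.
rewritten = list(zip(reversed(keys), n_bits))
# Reversing both sequences restores the original pairing.
equivalent = list(zip(reversed(keys), reversed(n_bits)))

print(original)    # [('k2', 8), ('k1', 20), ('k0', 32)]
print(rewritten)   # [('k2', 32), ('k1', 20), ('k0', 8)]
print(equivalent)  # [('k2', 8), ('k1', 20), ('k0', 32)]
```

If the bit widths differ between keys, the rewritten form decomposes each key with the wrong width, which changes both the result and the amount of data produced.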

mkskeller commented 11 months ago

The most likely explanation I can think of is the way memory allocation works. The operating system might allow more allocation than is actually available. Only when the memory actually gets filled does the OS start looking for space, at which point the program is killed.
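
On Linux, the overcommit behaviour described here can be inspected through /proc. The following is a minimal sketch, assuming a Linux VM; the function names are hypothetical. It reads the fields MemAvailable, CommitLimit and Committed_AS from /proc/meminfo and the policy from /proc/sys/vm/overcommit_memory (0 = heuristic, 1 = always allow, 2 = strict accounting).

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field name: value}, sizes in kB."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def overcommit_snapshot() -> Dict[str, int]:
    """Read the current policy and commit figures (Linux only)."""
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    with open("/proc/sys/vm/overcommit_memory") as f:
        policy = int(f.read().strip())
    return {
        "policy": policy,
        "MemAvailable_kB": info.get("MemAvailable", 0),
        "CommitLimit_kB": info.get("CommitLimit", 0),
        "Committed_AS_kB": info.get("Committed_AS", 0),
    }
```

A Committed_AS value well above the physical memory while a party is starting up would be consistent with the explanation above, although it does not prove it.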

I assume you have observed this using print_ln output? If so, can you surround with break_point() calls to pin it down during the compiler optimization?

sandy9999 commented 11 months ago

I don't completely follow. I didn't observe this using print_ln, I checked for the function separately by commenting out code that I didn't want to check for. I thought even using print_ln could eat CPU capacity and wanted to avoid that.

Could you explain how I can pin it down using break_point() during compiler optimization? Does this mean, adding the break point will give me more information during the compilation phase itself?

sandy9999 commented 11 months ago

Also, in addition to the above, if I were to analytically compute the space bs_sum occupies for N = 2^20, m = 5, h = 1, it would be (32 + 20) * 2^20 * 5 bits, which is roughly 55 MB. Why should handling 55 MB cause the process to get killed?

mkskeller commented 11 months ago

I don't completely follow. I didn't observe this using print_ln, I checked for the function separately by commenting out code that I didn't want to check for. I thought even using print_ln could eat CPU capacity and wanted to avoid that.

I wouldn't worry about CPU capacity when debugging the memory usage.

Could you explain how I can pin it down using break_point() during compiler optimization? Does this mean, adding the break point will give me more information during the compilation phase itself?

I meant using the following:

[code A]
break_point()
print_ln('A successful')
break_point()
[code B]
break_point()
print_ln('B successful')

This way you know exactly where the computation fails. Just commenting out code might have side effects that obscure the actual issue.

mkskeller commented 11 months ago

Also, in addition to the above, if I were to analytically compute the space bs_sum occupies for N = 2^20, m = 5, h = 1, it would be (32 + 20) * 2^20 * 5 bits, which is roughly 55 MB. Why should handling 55 MB cause the process to get killed?

It could be the proverbial straw that breaks a camel's back. For example, if the actual available resource is 2 GB and you're already using 1.999 GB, 55 MB is enough to go over the limit.
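
As a rough cross-check of the size estimate, the sketch below evaluates the same expression under three storage models. Only the first corresponds to the calculation in the question. The other two are assumptions: that each decomposed bit is held as a full 64-bit ring element (as suggested by -R 64) and that each party holds two shares per value (in line with the earlier remark that the estimate has to be doubled because of the secret sharing). Whether MP-SPDZ actually stores bs this way has not been verified here.

```python
# Parameters from the post.
N = 2**20
m = 5
bits_per_row = 32 + 20

n_bits_total = bits_per_row * N * m
MIB = 2**20

packed = n_bits_total // 8   # one bit of storage per secret bit
ring = n_bits_total * 8      # ASSUMPTION: one 64-bit element (8 bytes) per bit
replicated = 2 * ring        # ASSUMPTION: two shares per party

print(f"packed bits:        {packed / MIB:.1f} MiB")      # 32.5 MiB
print(f"64-bit elements:    {ring / MIB:.1f} MiB")        # 2080.0 MiB
print(f"with two shares:    {replicated / MIB:.1f} MiB")  # 4160.0 MiB
```

Evaluated as written, the packed figure comes out at about 34 MB. Under the two assumptions the same data would take roughly 4 GiB per party, which would make the matrix far more than a last straw.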

sandy9999 commented 11 months ago

I meant using the following:

[code A]
break_point()
print_ln('A successful')
break_point()
[code B]
break_point()
print_ln('B successful')

This way you know exactly where the computation fails. Just commenting out code might have side effects that obscure the actual issue.

It doesn't seem to work that way for me. Nothing gets printed at all, the process just gets killed. Even if I place the print_ln line as the starting line in my source .mpc file, nothing gets printed.

The traceback also includes

Exit code: 137

Stdout: already printed

Stderr: already printed

mkskeller commented 11 months ago

It seems that exit code 137 indicates a lack of memory (according to Google results). The fact that even a print_ln output at the beginning doesn't appear hints that the initial memory allocation fails. What is the full output?
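
The number 137 follows the common shell convention of reporting 128 plus the signal number when a process is terminated by a signal, so it corresponds to signal 9, SIGKILL. A minimal sketch of that decoding (the function name is hypothetical):

```python
import signal
from typing import Optional


def signal_from_exit_code(code: int) -> Optional[str]:
    """Map a shell exit code above 128 to the name of the terminating signal."""
    if code <= 128:
        return None  # ordinary exit status, not a signal
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        return None  # no signal with that number on this platform
```

SIGKILL by itself does not say who sent the signal. On Linux the kernel's out-of-memory killer uses it, so the kernel log on each VM is a reasonable place to look for confirmation.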

sandy9999 commented 11 months ago

Full output with some sensitive details edited.

Setting up players...
Using security parameter 40
Using security parameter 40
Using security parameter 40
Trying to run 64-bit computation
Trying to run 64-bit computation
Trying to run 64-bit computation
bash: line 1: 815975 Killed                  ./replicated-ring-party.x -p 1 custom_data_dt -h <IP> -pn <port>
Exception in thread Thread-20:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sandhya/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
    run = lambda i: connections[i].run(
  File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 23, in opens
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 763, in run
    return self._run(self._remote_runner(), command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/context.py", line 113, in _run
    return runner.run(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fabric/runners.py", line 83, in run
    return super().run(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 395, in run
    return self._run_body(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 451, in _run_body
    return self.make_promise() if self._asynchronous else self._finish()
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 518, in _finish
    raise UnexpectedExit(result)
invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cd <dir>; ./replicated-ring-party.x -p 1 custom_data_dt -h <IP> -pn <port> '

Exit code: 137

Stdout: already printed

Stderr: already printed

bash: line 1: 814724 Killed                  ./replicated-ring-party.x -p 2 custom_data_dt -h <IP> -pn <port>
Exception in thread Thread-21:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sandhya/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
    run = lambda i: connections[i].run(
  File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 23, in opens
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 763, in run
    return self._run(self._remote_runner(), command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/context.py", line 113, in _run
    return runner.run(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fabric/runners.py", line 83, in run
    return super().run(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 395, in run
    return self._run_body(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 451, in _run_body
    return self.make_promise() if self._asynchronous else self._finish()
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 518, in _finish
    raise UnexpectedExit(result)
invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cd <dir>; ./replicated-ring-party.x -p 2 custom_data_dt -h <IP> -pn <port> '

Exit code: 137

Stdout: already printed

Stderr: already printed

bash: line 1: 902510 Killed                  ./replicated-ring-party.x -p 0 custom_data_dt -h <IP> -pn <port>
Exception in thread Thread-19:
Traceback (most recent call last):
  File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sandhya/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
    run = lambda i: connections[i].run(
  File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 23, in opens
    return method(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 763, in run
    return self._run(self._remote_runner(), command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/context.py", line 113, in _run
    return runner.run(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/fabric/runners.py", line 83, in run
    return super().run(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 395, in run
    return self._run_body(command, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 451, in _run_body
    return self.make_promise() if self._asynchronous else self._finish()
  File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 518, in _finish
    raise UnexpectedExit(result)
invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!

Command: 'cd <dir>; ./replicated-ring-party.x -p 0 custom_data_dt -h <IP> -pn <port> '

Exit code: 137

Stdout: already printed

Stderr: already printed

mkskeller commented 11 months ago

Thank you for providing this. However, I don't think I can add anything to what I've already said. All signs point to a lack of RAM. Maybe there's an unexpected quota on the virtual machines. For reference, I have run the program on a single machine with 72 GB RAM, and it finished in about an hour. The memory usage per party remained below 20 GB throughout.

sandy9999 commented 11 months ago

Thank you, that helped me and I've been able to overcome the RAM issue. I'm now getting a "Connection timed out" error after the program has been running for 30 minutes to an hour.

terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
  what():  write_some: Connection timed out
bash: line 1: 831867 Aborted                 (core dumped) .

I even tried increasing the ClientAliveInterval and ClientAliveCountMax values on all 3 VMs, but I still get the same issue. I also have passwordless authentication set up. Any other ideas that I can explore to debug this?

mkskeller commented 11 months ago

This might be related to an issue fixed in this fork: https://github.com/ParallelogramPal/MP-SPDZ/tree/TCP-Keepalive
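
For context, ClientAliveInterval and ClientAliveCountMax are sshd settings for the SSH session, whereas the write_some error comes from a socket opened by the party process itself. The sketch below shows what enabling TCP keepalive on such a socket generally looks like. It is written in Python for illustration only and does not reproduce the change made in the linked fork; the function name and the timing values are assumptions.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive so that idle connections keep exchanging probes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained options are platform-specific; set only those available.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

Keepalive probes matter when a connection between two parties stays silent for a long time, for example while one party is busy computing, and a firewall or NAT in between drops idle connections.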

sandy9999 commented 11 months ago

Thank you! This was exactly what I was looking for!