Closed sandy9999 closed 11 months ago
The argument should be -H HOSTS.
Thank you, that helped. Following up. I am running:
Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64
My HOSTS file has the following format:
myname@xxx.xxx.xxx.xxx/home/myname
myname@xxx.xxx.xxx.xxx/home/myname
myname@xxx.xxx.xxx.xxx/home/myname
I am getting the following error.
/usr/bin/ld: cannot open output file static/replicated-ring-party.x: No such file or directory
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [Makefile:150: static/replicated-ring-party.x] Error 1
I realized there's no proper authentication mechanism provided to ssh into the 3 VMs mentioned in HOSTS. Is that what's leading to this error? If yes, how should we provide that?
Thank you for raising this. You should find that d4c96c61bd fixes it.
What do you mean by no proper authentication mechanism being provided to ssh?
Thank you for fixing! About the authentication mechanism query, here is a stack trace to support it:
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Exception in thread Thread-2 (run):
Exception in thread Thread-3 (run):
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/threading.py", line 946, in run
self._target(*self._args, **self._kwargs)
File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 520, in run
connection.run(
File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
self.run()
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
File "/usr/local/lib/python3.10/threading.py", line 946, in run
self.run()
File "/usr/local/lib/python3.10/threading.py", line 946, in run
self.open()
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
self._target(*self._args, **self._kwargs)
File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 520, in run
self._target(*self._args, **self._kwargs)
File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 520, in run
result = self.client.connect(**kwargs)
File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
sock.connect(addr)
connection.run(
connection.run(
TimeoutError: [Errno 110] Connection timed out
File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
self.open()
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
result = self.client.connect(**kwargs)
File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
self.open()
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
result = self.client.connect(**kwargs)
File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
Exception in thread Thread-6 (<lambda>):
Exception in thread Thread-5 (<lambda>):
Traceback (most recent call last):
Traceback (most recent call last):
Exception in thread Thread-4 (<lambda>):
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
Traceback (most recent call last):
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
File "/usr/local/lib/python3.10/threading.py", line 1009, in _bootstrap_inner
self.run()
File "/usr/local/lib/python3.10/threading.py", line 946, in run
self.run()
self._target(*self._args, **self._kwargs)
self.run()
File "/usr/local/lib/python3.10/threading.py", line 946, in run
File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
run = lambda i: connections[i].run(
File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
File "/usr/local/lib/python3.10/threading.py", line 946, in run
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
self._target(*self._args, **self._kwargs)
File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
self.open()
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
run = lambda i: connections[i].run(
File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
result = self.client.connect(**kwargs)
File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
self._target(*self._args, **self._kwargs)
File "/usr/src/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
self.open()
run = lambda i: connections[i].run(
File "/usr/local/lib/python3.10/site-packages/decorator.py", line 232, in fun
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
result = self.client.connect(**kwargs)
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 22, in opens
self.open()
File "/usr/local/lib/python3.10/site-packages/fabric/connection.py", line 665, in open
sock.connect(addr)
result = self.client.connect(**kwargs)
File "/usr/local/lib/python3.10/site-packages/paramiko/client.py", line 386, in connect
TimeoutError: [Errno 110] Connection timed out
sock.connect(addr)
TimeoutError: [Errno 110] Connection timed out
I basically meant to say that we are not providing any details about a password, key, etc. Or is it expected that the machines are on the same VNET? If the 3 parties are basically 3 VMs, should I be running Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 from one of these VMs? (Currently I'm running on a 4th VM that doesn't correspond to any of these parties.)
The assumption is that you have password-less authentication set up, either by ssh-agent or password-less authentication key. See for example https://www.tecmint.com/ssh-passwordless-login-using-ssh-keygen-in-5-easy-steps/
Thanks, that information helps. Also, are the following scenarios allowed?
a) Run Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 from a 4th VM that doesn't correspond to any of the 3 VMs in HOSTS.
b) Run Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 from one of the 3 VMs mentioned in HOSTS.
Knowing this would help me debug the NSG rule and connection issues that I'm facing.
Both should work if the setup is correct.
This is the error I'm getting now when I run Scripts/compile-run.py -H HOSTS -E ring breast_tree -Z 3 -R 64 on one of the 3 VMs in HOSTS.
I've built from source, and installed necessary packages in all 3 VMs in HOSTS. I've also tried
./compile.py -Z 3 -R 64 breast_tree
(Open 3 sessions using tmux)
./replicated-ring-party.x 0 custom_data_dt -v --batch-size 1
./replicated-ring-party.x 1 custom_data_dt -v --batch-size 1
./replicated-ring-party.x 2 custom_data_dt -v --batch-size 1
I ran these on each VM, and this works. Any ideas?
This probably still relates to SSH issues. For every host in HOSTS, ssh host should immediately give you a shell.
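One way to check this for every entry at once is sketched below. It is only an illustration: parse_hosts_line and can_login_without_password are hypothetical helper names, the user@address/path format is taken from the HOSTS example earlier in this thread, and this is not how MP-SPDZ itself parses the file. BatchMode=yes makes ssh fail instead of prompting, so a host that still needs a password or passphrase shows up as FAILED.

```python
import subprocess


def parse_hosts_line(line):
    """Split 'user@address/path' into ('user@address', '/path').

    The path part is optional. This follows the HOSTS format quoted in
    the thread and is an assumption, not MP-SPDZ's own parser.
    """
    target, slash, path = line.strip().partition("/")
    return target, (slash + path) if slash else ""


def can_login_without_password(target, timeout=10):
    """Return True if 'ssh target true' succeeds without any prompt."""
    cmd = [
        "ssh",
        "-o", "BatchMode=yes",               # never ask for a password
        "-o", f"ConnectTimeout={timeout}",   # give up on unreachable hosts
        target,
        "true",
    ]
    try:
        return subprocess.run(cmd, capture_output=True).returncode == 0
    except FileNotFoundError:
        # The ssh client is not installed on this machine.
        return False


def check_hosts_file(filename="HOSTS"):
    """Print one ok/FAILED line per non-empty entry in the HOSTS file."""
    with open(filename) as f:
        for line in f:
            if line.strip():
                target, _path = parse_hosts_line(line)
                ok = can_login_without_password(target)
                print(f"{target}: {'ok' if ok else 'FAILED'}")
```

Calling check_hosts_file("HOSTS") from the machine that runs Scripts/compile-run.py should report ok for all three parties before the script is attempted.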
ssh host is giving me a shell. That's happening. But of course ssh host when host is the VM I'm on wouldn't make sense, that's why I was clarifying scenarios earlier. Would the format of HOSTS change then?
I think it's possible to ssh to the host you're already on.
Thank you! All sorted, and I got the 3 VM setup working. :)
I am reopening this thread as I've got more questions.
I created a custom_data_dt.mpc file in Programs/Source that feeds a custom dataset instead of the breast_tree dataset to use the Decision Tree protocol. Contents:
import pandas as pd
import random
import numpy as np
m = 30
n = 2**19
data_x = np.random.uniform(0, 10, (n, m))
data_y = np.random.randint(2, size=(1, n))
df_x = pd.DataFrame(data_x)
df_y = pd.DataFrame(data_y)
df_x = sfix.input_tensor_via(0, df_x)
df_y = sint.input_tensor_via(0, df_y)
df_y = Array.create_from(df_y[0])
program.set_bit_length(32)
sfix.set_precision(16, 31)
from Compiler.decision_tree import TreeClassifier
tree = TreeClassifier(max_depth=5, n_threads=2)
tree.fit(df_x, df_y)
Running this using: Scripts/compile-run.py -H HOSTS -E ring custom_data_dt -Z 3 -R 64. I'm getting the following exception:
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
what(): read_some: stream truncated
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
what(): read_some: stream truncated
bash: line 1: 491024 Killed
I used pktstat and found that just before it crashed, bandwidth consumption was at 99%. How is this happening? Suppose we have a very large sint array of, say, 2^25 elements: are the shares of this array sent all at once, clogging the bandwidth? If they are, wouldn't it be better to send them sequentially, at least when bandwidth consumption is nearing the limit, so that the protocol just runs for a very long time instead of getting killed?
Note: I am up to date with master, including the libOTE memory leak fix.
This is most likely due to memory issues. Scripts/memory-usage.py custom_data_dt gives an estimate of 7 GB RAM per party, but that has to be doubled because of the secret sharing, so you're looking at 14 GB of RAM. I don't think it has anything to do with bandwidth.
I'm using 3 VMs of 32 GB RAM, so shouldn't 14 GB RAM be accommodated?
The estimate doesn't take into account protocol states, but I would be surprised if that doubled the memory usage. What memory usage do you observe before it crashes? And what is the largest input size that doesn't crash?
Tried now for N = 2^19, m = 20, h = 5
This is the memory usage I observe on VM 1:
And this crashed.
As for the largest input size that doesn't crash, I'm still binary searching for that, but I can tell you it worked for N = 2^19, m = 11, h = 4.
Scoping down the issue a little more with a few observations.
The following line inside the Sort function looks like one of the bottlenecks.
bs = Matrix.create_from(
sum([k.get_vector().bit_decompose(nb)
for k, nb in reversed(list(zip(keys, n_bits)))], []))
I rewrote this as:
bs_sum = sum(
(k.get_vector().bit_decompose(nb) for k, nb in zip(reversed(keys), n_bits)),
[]
)
bs = Matrix.create_from(bs_sum)
Up to and including the initialization of bs_sum, it's fine. The Matrix.create_from call leads to the process getting killed. Is there any known inherent issue with Matrix.create_from when scaling up input datasets?
The most likely explanation I can think of is the way memory allocation works. The operating system might allow more allocation than is actually available. Only when the memory gets filled, the OS starts looking for space at which point the program is killed.
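To see how much memory is really left on a VM and how the kernel is set up to hand it out, a small Linux-only sketch like the following can be run just before and during the computation. The helper names parse_meminfo and memory_report are made up for this illustration; the files read are the standard /proc/meminfo and /proc/sys/vm/overcommit_memory, where mode 0 is heuristic overcommit, 1 always overcommits, and 2 is strict accounting.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values.

    Most entries are reported in kB; a few counters carry no unit.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def memory_report(meminfo_path="/proc/meminfo",
                  overcommit_path="/proc/sys/vm/overcommit_memory"):
    """Collect the figures relevant to overcommit on a Linux machine."""
    with open(meminfo_path) as f:
        info = parse_meminfo(f.read())
    with open(overcommit_path) as f:
        mode = int(f.read().strip())
    return {
        "MemAvailable_kB": info.get("MemAvailable"),
        "Committed_AS_kB": info.get("Committed_AS"),
        "CommitLimit_kB": info.get("CommitLimit"),
        "overcommit_mode": mode,
    }
```

If Committed_AS grows far beyond MemAvailable while the mode is 0 or 1, allocations have succeeded on paper that the machine may not be able to back once the pages are actually touched.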
I assume you have observed this using print_ln output? If so, can you surround it with break_point() calls to pin it down during the compiler optimization?
I don't completely follow. I didn't observe this using print_ln, I checked for the function separately by commenting out code that I didn't want to check for. I thought even using print_ln could eat CPU capacity and wanted to avoid that.
Could you explain how I can pin it down using break_point() during compiler optimization? Does this mean that adding the break point will give me more information during the compilation phase itself?
Also, in addition to the above, if I were to analytically compute the space bs_sum occupies for N = 2^20, m = 5, h = 1, it would be (32 + 20)(2^20)5 bits, which is roughly 55 MB. Why should handling 55 MB cause the process to get killed?
I don't completely follow. I didn't observe this using print_ln, I checked for the function separately by commenting out code that I didn't want to check for. I thought even using print_ln could eat CPU capacity and wanted to avoid that.
I wouldn't worry about CPU capacity when debugging the memory usage.
Could you explain how I can pin it down using break_point() during compiler optimization? Does this mean that adding the break point will give me more information during the compilation phase itself?
I meant using the following:
[code A]
break_point()
print_ln('A successful')
break_point()
[code B]
break_point()
print_ln('B successful')
This way you know exactly where the computation fails. Just commenting out code might have side effects that obscure the actual issue.
Also, in addition to the above, if I were to analytically compute the space bs_sum occupies for N = 2^20, m = 5, h = 1, it would be (32 + 20)(2^20)5 bits, which is roughly 55 MB. Why should handling 55 MB cause the process to get killed?
It could be the proverbial straw that breaks a camel's back. For example, if the actual available resource is 2 GB and you're already using 1.999 GB, 55 MB is enough to go over the limit.
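For a rough sense of scale, the sketch below evaluates the formula from the question as written, first with the bits packed eight to a byte. It then shows what the same number of bits would occupy under an assumption that is not confirmed anywhere in this thread: that each decomposed bit is kept as a full 64-bit ring element, with two replicated shares held per party. All function names are hypothetical and exist only for this illustration.

```python
def bits_in_sort_keys(n_rows, n_columns, bits_per_value):
    """Number of single bits produced by decomposing every value."""
    return n_rows * n_columns * bits_per_value


def packed_bytes(n_bits):
    """Size if the bits were packed tightly, eight per byte."""
    return n_bits // 8


def expanded_bytes(n_bits, bytes_per_element=8, shares_per_party=2):
    """Size if every bit occupies its own secret-shared ring element.

    ASSUMPTION: 8 bytes per element (64-bit ring) and 2 shares per
    party (three-party replicated sharing). The thread does not state
    how the bits are stored.
    """
    return n_bits * bytes_per_element * shares_per_party


# Parameters from the question: N = 2^20, m = 5, (32 + 20) bits each.
n_bits = bits_in_sort_keys(2**20, 5, 32 + 20)
print(f"bits:           {n_bits}")
print(f"packed:         {packed_bytes(n_bits) / 1e6:.1f} MB")
print(f"one element/bit: {expanded_bytes(n_bits) / 1e9:.2f} GB")
```

Under that assumption a single intermediate of this kind would already be in the gigabyte range, so comparing the two figures against the observed memory usage is one way to test whether the assumption matches what the virtual machine does.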
I meant using the following:
[code A]
break_point()
print_ln('A successful')
break_point()
[code B]
break_point()
print_ln('B successful')
This way you know exactly where the computation fails. Just commenting out code might have side effects that obscure the actual issue.
It doesn't seem to work that way for me. Nothing gets printed at all, the process just gets killed. Even if I place the print_ln line as the starting line in my source .mpc file, nothing gets printed.
The traceback also includes
Exit code: 137
Stdout: already printed
Stderr: already printed
It seems that exit code 137 indicates a lack of memory (according to Google results). The fact that even a print_ln output at the beginning is missing hints that the initial memory allocation fails. What is the full output?
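The number itself can be decoded: shells report 128 plus the signal number when a process is terminated by a signal, so 137 corresponds to signal 9, SIGKILL. That is the signal the Linux out-of-memory killer uses, although SIGKILL can also come from other sources, so the exit code alone is a strong hint rather than proof. A minimal sketch, with describe_exit_code being a hypothetical helper name:

```python
import signal


def describe_exit_code(code):
    """Translate a shell exit status into a short description."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            return f"killed by unknown signal {signum}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

On Linux, running dmesg on the affected VM after the crash usually shows an "Out of memory: Killed process" entry if the kernel's OOM killer was responsible.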
Full output with some sensitive details edited.
Setting up players...
Using security parameter 40
Using security parameter 40
Using security parameter 40
Trying to run 64-bit computation
Trying to run 64-bit computation
Trying to run 64-bit computation
bash: line 1: 815975 Killed ./replicated-ring-party.x -p 1 custom_data_dt -h <IP> -pn <port>
Exception in thread Thread-20:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/sandhya/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
run = lambda i: connections[i].run(
File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 23, in opens
return method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 763, in run
return self._run(self._remote_runner(), command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/context.py", line 113, in _run
return runner.run(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fabric/runners.py", line 83, in run
return super().run(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 395, in run
return self._run_body(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 451, in _run_body
return self.make_promise() if self._asynchronous else self._finish()
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 518, in _finish
raise UnexpectedExit(result)
invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'cd <dir>; ./replicated-ring-party.x -p 1 custom_data_dt -h <IP> -pn <port> '
Exit code: 137
Stdout: already printed
Stderr: already printed
bash: line 1: 814724 Killed ./replicated-ring-party.x -p 2 custom_data_dt -h <IP> -pn <port>
Exception in thread Thread-21:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/sandhya/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
run = lambda i: connections[i].run(
File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 23, in opens
return method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 763, in run
return self._run(self._remote_runner(), command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/context.py", line 113, in _run
return runner.run(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fabric/runners.py", line 83, in run
return super().run(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 395, in run
return self._run_body(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 451, in _run_body
return self.make_promise() if self._asynchronous else self._finish()
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 518, in _finish
raise UnexpectedExit(result)
invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'cd <dir>; ./replicated-ring-party.x -p 2 custom_data_dt -h <IP> -pn <port> '
Exit code: 137
Stdout: already printed
Stderr: already printed
bash: line 1: 902510 Killed ./replicated-ring-party.x -p 0 custom_data_dt -h <IP> -pn <port>
Exception in thread Thread-19:
Traceback (most recent call last):
File "/usr/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/usr/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/sandhya/MP-SPDZ/Scripts/../Compiler/compilerLib.py", line 561, in <lambda>
run = lambda i: connections[i].run(
File "/usr/local/lib/python3.8/dist-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 23, in opens
return method(self, *args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fabric/connection.py", line 763, in run
return self._run(self._remote_runner(), command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/context.py", line 113, in _run
return runner.run(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/fabric/runners.py", line 83, in run
return super().run(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 395, in run
return self._run_body(command, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 451, in _run_body
return self.make_promise() if self._asynchronous else self._finish()
File "/usr/local/lib/python3.8/dist-packages/invoke/runners.py", line 518, in _finish
raise UnexpectedExit(result)
invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
Command: 'cd <dir>; ./replicated-ring-party.x -p 0 custom_data_dt -h <IP> -pn <port> '
Exit code: 137
Stdout: already printed
Stderr: already printed
Thanks for providing this. However, I don't think I can add anything to what I've already said. All signs point to a lack of RAM. Maybe there's an unexpected quota on the virtual machines. For reference, I have run the program on a single machine with 72 GB RAM, and it finished in about an hour. The memory usage per party remained below 20 GB throughout.
Thank you, that helped me and I've been able to overcome the RAM issue. I'm now getting a Connection timed out error after it has been running for 30 minutes to an hour.
terminate called after throwing an instance of 'boost::wrapexcept<boost::system::system_error>'
what(): write_some: Connection timed out
bash: line 1: 831867 Aborted (core dumped) .
I even tried increasing the ClientAliveInterval and ClientAliveCountMax values on all 3 VMs, but I still get the same issue. I also have passwordless authentication set up. Any other ideas that I can explore to debug this?
This might be related to an issue fixed in this fork: https://github.com/ParallelogramPal/MP-SPDZ/tree/TCP-Keepalive
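For readers unfamiliar with the mechanism the branch name refers to: TCP keepalive makes the kernel send small probes on an otherwise idle connection, so that firewalls or NAT devices between the parties do not silently drop it during long local computations. The sketch below only illustrates the socket options involved, in Python. It is not the code from the linked fork, and the function name and timing values are arbitrary assumptions chosen for the example.

```python
import socket


def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive probes for an existing socket.

    idle:     seconds of silence before the first probe
    interval: seconds between probes
    count:    failed probes before the connection is dropped
    The three tuning options are Linux names; they are skipped on
    platforms where Python does not expose them.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPINTVL", interval),
                        ("TCP_KEEPCNT", count)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock
```

Note that the sshd settings ClientAliveInterval and ClientAliveCountMax only affect the SSH sessions used to launch the parties, not the separate TCP connections the parties open between themselves, which may be why changing them had no effect.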
Thank you! This was exactly what I was looking for!
I'm looking to run Programs/Source/breast_tree.mpc using the 3 party Replicated Secret Sharing protocol in the semi-honest setting. This is what I'm currently trying at the root of the MP-SPDZ repo.
My HOSTS file has the following information:
I also tried the following format:
The HOSTS file is not getting parsed correctly, which I'm guessing is because I've given an IP address instead of a hostname. How can I make it work with an IP address?