data61 / MP-SPDZ

Versatile framework for multi-party computation

Segmentation fault when running mascot protocol. #983

Closed BeStrongok closed 1 year ago

BeStrongok commented 1 year ago

Hi @mkskeller, I'm running the MASCOT protocol and occasionally hit a "Segmentation fault" error. The gdb debug info is: [image]

The mpc script I use is:

sfix.set_precision(30,50)

print_float_prec(12)
a = sfix.Matrix(100, 1)
a.input_from(0)
b = sfix.Matrix(100, 1)
b.input_from(1)

logit_all = sfix.Array(100)
logit_all.assign_all(0.0)
column_0_0 = a.get_column(0)
column_0_0_arr = sfix.Array(100)
column_0_0_arr.assign_vector(column_0_0)
weight_0_0 = sfix.Array(100)
weight_0_0.assign_all(1.2058442444794362)
@for_range_opt_multithread(10, 100)
def _(i):
    logit_all[i] = logit_all[i] + column_0_0_arr[i] * weight_0_0[i]

column_1_0 = b.get_column(0)
column_1_0_arr = sfix.Array(100)
column_1_0_arr.assign_vector(column_1_0)
weight_1_0 = sfix.Array(100)
weight_1_0.assign_all(0.6917793130980812)
@for_range_opt_multithread(10, 100)
def _(i):
    logit_all[i] = logit_all[i] + column_1_0_arr[i] * weight_1_0[i]

Bias = sfix.Array(100).assign_all(1)
A = sfix.Array(100).assign_all(481.8)
B = sfix.Array(100).assign_all(28.5)
logit_all[:] += Bias[:]
logit_all[:] = A[:] - B[:] * logit_all[:]
@for_range_opt(100)
def _(i):
    print_ln('%s', logit_all[i].reveal())

Can you help me check this error? It looks like a bug. I use version 0.3.0 of the source code. The command is ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS & ./mascot-party.x -pn 18909 -N 2 1 10tFloatABS

mkskeller commented 1 year ago

I cannot reproduce this.

BeStrongok commented 1 year ago

I cannot reproduce this.

  • How often does it occur?
  • What's your CONFIG/CONFIG.mine?
  • Have you recompiled after make clean?

About once in 5 runs. The CONFIG file is:

ROOT = .

OPTIM= -O0
#PROF = -pg
#DEBUG = -DDEBUG
GDEBUG = -g
PROME = -DPROME

# set this to your preferred local storage directory
PREP_DIR = '-DPREP_DIR="Player-Data/"'

# directory to store SSL keys
SSL_DIR = '-DSSL_DIR="Player-Data/"'

# set for SHE preprocessing (SPDZ and Overdrive)
USE_NTL = 0

# set for using GF(2^128)
# unset for GF(2^40)
USE_GF2N_LONG = 1

# set to -march=<architecture> for optimization
# SSE4.2 is required homomorphic encryption in GF(2^n) when compiling with clang
# AES-NI and PCLMUL are not required
# AVX is required for oblivious transfer (OT)
# AVX2 support (Haswell or later) is used to optimize OT
# AVX/AVX2 is required for replicated binary secret sharing
# BMI2 is used to optimize multiplication modulo a prime
# ADX is used to optimize big integer additions
# delete the second line to compile for a platform that supports everything
ARCH = -mtune=native -msse4.1 -msse4.2 -maes -mpclmul -mavx -mavx2 -mbmi2 -madx
ARCH = -march=native

MACHINE := $(shell uname -m)
OS := $(shell uname -s)
ifeq ($(MACHINE), x86_64)
# set this to 0 to avoid using AVX for OT
ifeq ($(OS), Linux)
CHECK_AVX := $(shell grep -q avx /proc/cpuinfo; echo $$?)
ifeq ($(CHECK_AVX), 0)
AVX_OT = 1
else
AVX_OT = 0
endif
else
AVX_OT = 1
endif
else
ARCH =
AVX_OT = 0
endif

# allow to set compiler in CONFIG.mine
CXX = g++

# use CONFIG.mine to overwrite DIR settings
-include CONFIG.mine

ifeq ($(USE_GF2N_LONG),1)
GF2N_LONG = -DUSE_GF2N_LONG
endif

ifeq ($(AVX_OT), 0)
CFLAGS += -DNO_AVX_OT
endif

# MAX_MOD_SZ (for FHE) must be least and GFP_MOD_SZ (for computation)
# must be exactly ceil(len(p)/len(word)) for the relevant prime p
# GFP_MOD_SZ only needs to be set for primes of bit length more that 256.
# Default for MAX_MOD_SZ is 10, which suffices for all Overdrive protocols
# MOD = -DMAX_MOD_SZ=10 -DGFP_MOD_SZ=5

LDLIBS = -lmpirxx -lmpir -lsodium $(MY_LDLIBS)
LDLIBS += -lboost_system
# LDLIBS += -lboost_system -lssl -lcrypto

ifeq ($(USE_NTL),1)
CFLAGS += -DUSE_NTL
LDLIBS := -lntl $(LDLIBS)
endif

ifeq ($(OS), Linux)
LDLIBS += -lrt
endif

ifeq ($(OS), Darwin)
BOOST = -lboost_thread-mt $(MY_BOOST)
else
BOOST = -lboost_thread $(MY_BOOST)
endif

PROME_LIB = -lprometheus-cpp-pull -lprometheus-cpp-core
INC_DIR = /root/pkgs/lib_r/include
SSL_LIB = /root/pkgs/lib_r/lib/libssl.so
CRYPTO_LIB = /root/pkgs/lib_r/lib/libcrypto.so

CFLAGS += -I$(INC_DIR) 
CFLAGS += $(ARCH) $(MY_CFLAGS) $(GDEBUG) -Wextra -Wall $(OPTIM) -I$(ROOT) -pthread $(PROF) $(DEBUG) $(MOD) $(GF2N_LONG) $(PREP_DIR) $(SSL_DIR) $(SECURE) $(PROME) -std=c++11
CPPFLAGS = $(CFLAGS)
LD = $(CXX)

ifeq ($(OS), Darwin)
# for boost with OpenSSL 3
CFLAGS += -Wno-error=deprecated-declarations
ifeq ($(USE_NTL),1)
CFLAGS += -Wno-error=unused-parameter -Wno-error=deprecated-copy
endif
endif

Yes, I recompiled after make clean and the error still occurs.

mkskeller commented 1 year ago

Did you change the C++ code?

BeStrongok commented 1 year ago

Did you change the C++ code?

Yes, I added Prometheus metrics to monitor the communication, but the changes are only in Networking, so I don't think they are related to this error. I don't see any related functions in the debug information.

mkskeller commented 1 year ago

That might be, but any code change leading to a bug I cannot reproduce limits my ability to do anything about it. I have changed the relevant functionality several times since version 0.3.0; maybe one of these changes solves the problem.

BeStrongok commented 1 year ago

I have changed the relevant functionality several times since version 0.3.0; maybe one of these changes solves the problem.

I ran the same script with the original source code of version 0.3.0, and the error still occurred on the 11th run. The error is "Segmentation fault". [image]

The code of large_test.py is:

import subprocess

def run_cmd(cmd):
    # Run the command line through the shell and capture stdout and stderr together.
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          shell=True,
                          encoding="utf-8")

# Party 0 runs in the background, so the shell's exit status is that of party 1.
cmd = "./mascot-party.x -pn 18909 -N 2 0 10tFloatABS & ./mascot-party.x -pn 18909 -N 2 1 10tFloatABS"

for i in range(30):
    result = run_cmd(cmd)
    if result.returncode != 0:
        error_msg = result.stdout
        print(error_msg)
        break
    print("test {} success".format(i))
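
Because the command joins the two parties with &, the shell only reports the exit status of the last party, so a crash in party 0 may not show up in returncode. The following is a minimal sketch, not part of MP-SPDZ, that starts each party as its own process and checks every exit code. The names run_parties, failures, and repeat are hypothetical helpers for this illustration; the party arguments are copied from the command above.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_one(argv):
    # Run one party without a shell and capture stdout and stderr together.
    return subprocess.run(argv, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT, encoding="utf-8")


def run_parties(commands):
    # Run all parties concurrently, one thread per process, so that
    # no party blocks while waiting for another to be started or read.
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(run_one, commands))


def failures(results):
    # Return (party index, return code) for every party that did not exit with 0.
    return [(party, r.returncode)
            for party, r in enumerate(results) if r.returncode != 0]


def repeat(commands, runs):
    # Repeat the computation until some party fails or `runs` is reached.
    # Returns the index of the failing run, or None if all runs succeeded.
    for i in range(runs):
        results = run_parties(commands)
        bad = failures(results)
        if bad:
            for party, code in bad:
                print("run {}: party {} exited with {}".format(i, party, code))
                print(results[party].stdout)
            return i
        print("test {} success".format(i))
    return None


if __name__ == "__main__":
    binary = "./mascot-party.x"
    if os.path.isfile(binary):
        parties = [[binary, "-pn", "18909", "-N", "2", str(p), "10tFloatABS"]
                   for p in range(2)]
        repeat(parties, 30)
    else:
        print("{} not found; run this from the MP-SPDZ directory".format(binary))
```

When a process is started directly like this on POSIX, a negative return code -N means it was killed by signal N, so a segmentation fault on Linux appears as -11 for the party that crashed.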

mkskeller commented 1 year ago

The script runs fine on my machine. Do you get any meaningful output when running with valgrind, i.e., valgrind ./mascot-party.x ...?

BeStrongok commented 1 year ago

Yes, I ran valgrind ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS, and the output is:

==32038== Memcheck, a memory error detector
==32038== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32038== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==32038== Command: ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS
==32038== 
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==32038== valgrind: Unrecognised instruction at address 0x190639.
==32038==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==32038==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==32038==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==32038==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==32038==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==32038==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==32038==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==32038==    by 0x1904A1: main (mascot-party.cpp:9)
==32038== Your program just tried to execute an instruction that Valgrind
==32038== did not recognise.  There are two possible reasons for this.
==32038== 1. Your program has a bug and erroneously jumped to a non-code
==32038==    location.  If you are running Memcheck and you just saw a
==32038==    warning about a bad jump, it's probably your program's fault.
==32038== 2. The instruction is legitimate but Valgrind doesn't handle it,
==32038==    i.e. it's Valgrind's fault.  If you think this is the case or
==32038==    you are not sure, please let us know and we'll try to fix it.
==32038== Either way, Valgrind will now raise a SIGILL signal which will
==32038== probably kill your program.
==32038== 
==32038== Process terminating with default action of signal 4 (SIGILL)
==32038==  Illegal opcode at address 0x190639
==32038==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==32038==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==32038==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==32038==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==32038==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==32038==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==32038==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==32038==    by 0x1904A1: main (mascot-party.cpp:9)
==32038== 
==32038== HEAP SUMMARY:
==32038==     in use at exit: 3,360 bytes in 29 blocks
==32038==   total heap usage: 30 allocs, 1 frees, 76,064 bytes allocated
==32038== 
==32038== LEAK SUMMARY:
==32038==    definitely lost: 0 bytes in 0 blocks
==32038==    indirectly lost: 0 bytes in 0 blocks
==32038==      possibly lost: 0 bytes in 0 blocks
==32038==    still reachable: 3,360 bytes in 29 blocks
==32038==         suppressed: 0 bytes in 0 blocks
==32038== Rerun with --leak-check=full to see details of leaked memory
==32038== 
==32038== For lists of detected and suppressed errors, rerun with: -s
==32038== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction

mkskeller commented 1 year ago

It seems that the installed version of valgrind doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?

BeStrongok commented 1 year ago

It seems that the installed version of valgrind doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?

The valgrind version is 3.15.0, the OS is Ubuntu 20.04.4 LTS, and the CPU info is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

BeStrongok commented 1 year ago

It seems that the installed version of valgrind doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?

I updated valgrind to 3.20.0 and got the same output.

==45909== Memcheck, a memory error detector
==45909== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==45909== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==45909== Command: ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS
==45909== 
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==45909== valgrind: Unrecognised instruction at address 0x190639.
==45909==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==45909==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==45909==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==45909==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==45909==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==45909==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==45909==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==45909==    by 0x1904A1: main (mascot-party.cpp:9)
==45909== Your program just tried to execute an instruction that Valgrind
==45909== did not recognise.  There are two possible reasons for this.
==45909== 1. Your program has a bug and erroneously jumped to a non-code
==45909==    location.  If you are running Memcheck and you just saw a
==45909==    warning about a bad jump, it's probably your program's fault.
==45909== 2. The instruction is legitimate but Valgrind doesn't handle it,
==45909==    i.e. it's Valgrind's fault.  If you think this is the case or
==45909==    you are not sure, please let us know and we'll try to fix it.
==45909== Either way, Valgrind will now raise a SIGILL signal which will
==45909== probably kill your program.
==45909== 
==45909== Process terminating with default action of signal 4 (SIGILL)
==45909==  Illegal opcode at address 0x190639
==45909==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==45909==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==45909==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==45909==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==45909==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==45909==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==45909==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==45909==    by 0x1904A1: main (mascot-party.cpp:9)
==45909== 
==45909== HEAP SUMMARY:
==45909==     in use at exit: 76,064 bytes in 30 blocks
==45909==   total heap usage: 30 allocs, 0 frees, 76,064 bytes allocated
==45909== 
==45909== LEAK SUMMARY:
==45909==    definitely lost: 0 bytes in 0 blocks
==45909==    indirectly lost: 0 bytes in 0 blocks
==45909==      possibly lost: 0 bytes in 0 blocks
==45909==    still reachable: 76,064 bytes in 30 blocks
==45909==         suppressed: 0 bytes in 0 blocks
==45909== Rerun with --leak-check=full to see details of leaked memory
==45909== 
==45909== For lists of detected and suppressed errors, rerun with: -s
==45909== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction

mkskeller commented 1 year ago

It might be that valgrind doesn't understand AVX-512 instructions supported by your CPU. Try compiling with ARCH = -march=skylake in CONFIG.mine.
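
One way to check the AVX-512 hypothesis is to look at the instruction bytes that valgrind printed. In 64-bit mode, a leading 0x62 byte is the EVEX prefix used by AVX-512 encodings, while 0xC4 and 0xC5 introduce VEX (AVX/AVX2) encodings. The sketch below is only an illustration and not part of MP-SPDZ or valgrind; the helper names are made up for this example, and the vector-length lookup assumes an instruction without embedded rounding.

```python
def parse_bytes(text):
    # Turn a string such as "0x62 0xF1 0xFD" into a list of integers.
    return [int(token, 16) for token in text.split()]


def classify_prefix(code):
    # Classify an x86-64 instruction by its first byte.
    first = code[0]
    if first == 0x62:
        return "EVEX"      # AVX-512 style encoding
    if first in (0xC4, 0xC5):
        return "VEX"       # AVX/AVX2 style encoding
    return "other"


def evex_vector_bits(code):
    # Bits 6-5 of the fourth EVEX byte hold the vector length field
    # (assumption: no embedded rounding, where these bits mean something else).
    if classify_prefix(code) != "EVEX" or len(code) < 4:
        return None
    length_field = (code[3] >> 5) & 0b11
    return {0: 128, 1: 256, 2: 512}.get(length_field)


if __name__ == "__main__":
    # Bytes copied from the "unhandled instruction bytes" line in the valgrind log.
    code = parse_bytes("0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83")
    print(classify_prefix(code), evex_vector_bits(code))  # prints "EVEX 256"
```

For the bytes in the valgrind log, this reports an EVEX prefix with a 256-bit vector length, which is consistent with the compiler emitting AVX-512 encodings under -march=native. GCC's -march=skylake target does not enable AVX-512, so code built with it should not contain EVEX-encoded instructions.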

BeStrongok commented 1 year ago

ARCH = -march=skylake

I recompiled with this flag and it ran fine. The output of valgrind is: [image]

But the error is intermittent. It may take several runs to hit it and capture the valgrind output.

mkskeller commented 1 year ago

It might help to find a shorter program that triggers the error. Have you checked whether the error also appears with shorter versions of the program?