Segmentation fault when running mascot protocol.

BeStrongok commented 1 year ago

Hi @mkskeller, I'm running the mascot protocol and occasionally hit a "Segmentation fault" error. The gdb debug info is:

The mpc script I use is:

sfix.set_precision(30,50)

print_float_prec(12)
a = sfix.Matrix(100, 1)
a.input_from(0)
b = sfix.Matrix(100, 1)
b.input_from(1)

logit_all = sfix.Array(100)
logit_all.assign_all(0.0)
column_0_0 = a.get_column(0)
column_0_0_arr = sfix.Array(100)
column_0_0_arr.assign_vector(column_0_0)
weight_0_0 = sfix.Array(100)
weight_0_0.assign_all(1.2058442444794362)
@for_range_opt_multithread(10, 100)
def _(i):
    logit_all[i] = logit_all[i] + column_0_0_arr[i] * weight_0_0[i]

column_1_0 = b.get_column(0)
column_1_0_arr = sfix.Array(100)
column_1_0_arr.assign_vector(column_1_0)
weight_1_0 = sfix.Array(100)
weight_1_0.assign_all(0.6917793130980812)
@for_range_opt_multithread(10, 100)
def _(i):
    logit_all[i] = logit_all[i] + column_1_0_arr[i] * weight_1_0[i]

Bias = sfix.Array(100).assign_all(1)
A = sfix.Array(100).assign_all(481.8)
B = sfix.Array(100).assign_all(28.5)
logit_all[:] += Bias[:]
logit_all[:] = A[:] - B[:] * logit_all[:]
@for_range_opt(100)
def _(i):
    print_ln('%s', logit_all[i].reveal())

Can you help me check this error? It looks like a bug. The version of the source code I use is 0.3.0. The command is ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS & ./mascot-party.x -pn 18909 -N 2 1 10tFloatABS
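For reference, here is a minimal cleartext sketch of what the script above computes per element, which can help check the revealed values once a run completes. It is plain Python, not MP-SPDZ code; the function name `reference_scores` and its parameter names are made up for illustration, and floating-point results will differ slightly from the fixed-point `sfix` outputs.

```python
# Weights copied from the mpc script above.
W0 = 1.2058442444794362
W1 = 0.6917793130980812


def reference_scores(a, b, bias=1.0, offset=481.8, factor=28.5):
    """Cleartext version of the script: offset - factor * (a*W0 + b*W1 + bias).

    `a` and `b` stand for the single input column of party 0 and party 1.
    """
    return [offset - factor * (x * W0 + y * W1 + bias) for x, y in zip(a, b)]


# Example with made-up inputs (the real inputs come from the parties' input files).
scores = reference_scores([0.0, 1.0], [0.0, 1.0])
for s in scores:
    print(s)
```

Only approximate agreement with the secure computation should be expected, since `sfix.set_precision(30, 50)` fixes the fractional precision.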

mkskeller commented 1 year ago

I cannot reproduce this.

How often does it occur?
What's your CONFIG/CONFIG.mine?
Have you recompiled after make clean?

BeStrongok commented 1 year ago

I cannot reproduce this.

How often does it occur?

What's your CONFIG/CONFIG.mine?

Have you recompiled after make clean?

About once in 5 runs. The CONFIG file is:

ROOT = .

OPTIM= -O0
#PROF = -pg
#DEBUG = -DDEBUG
GDEBUG = -g
PROME = -DPROME

# set this to your preferred local storage directory
PREP_DIR = '-DPREP_DIR="Player-Data/"'

# directory to store SSL keys
SSL_DIR = '-DSSL_DIR="Player-Data/"'

# set for SHE preprocessing (SPDZ and Overdrive)
USE_NTL = 0

# set for using GF(2^128)
# unset for GF(2^40)
USE_GF2N_LONG = 1

# set to -march=<architecture> for optimization
# SSE4.2 is required for homomorphic encryption in GF(2^n) when compiling with clang
# AES-NI and PCLMUL are not required
# AVX is required for oblivious transfer (OT)
# AVX2 support (Haswell or later) is used to optimize OT
# AVX/AVX2 is required for replicated binary secret sharing
# BMI2 is used to optimize multiplication modulo a prime
# ADX is used to optimize big integer additions
# delete the second line to compile for a platform that supports everything
ARCH = -mtune=native -msse4.1 -msse4.2 -maes -mpclmul -mavx -mavx2 -mbmi2 -madx
ARCH = -march=native

MACHINE := $(shell uname -m)
OS := $(shell uname -s)
ifeq ($(MACHINE), x86_64)
# set this to 0 to avoid using AVX for OT
ifeq ($(OS), Linux)
CHECK_AVX := $(shell grep -q avx /proc/cpuinfo; echo $$?)
ifeq ($(CHECK_AVX), 0)
AVX_OT = 1
else
AVX_OT = 0
endif
else
AVX_OT = 1
endif
else
ARCH =
AVX_OT = 0
endif

# allow to set compiler in CONFIG.mine
CXX = g++

# use CONFIG.mine to overwrite DIR settings
-include CONFIG.mine

ifeq ($(USE_GF2N_LONG),1)
GF2N_LONG = -DUSE_GF2N_LONG
endif

ifeq ($(AVX_OT), 0)
CFLAGS += -DNO_AVX_OT
endif

# MAX_MOD_SZ (for FHE) must be at least and GFP_MOD_SZ (for computation)
# must be exactly ceil(len(p)/len(word)) for the relevant prime p
# GFP_MOD_SZ only needs to be set for primes of bit length more than 256.
# Default for MAX_MOD_SZ is 10, which suffices for all Overdrive protocols
# MOD = -DMAX_MOD_SZ=10 -DGFP_MOD_SZ=5

LDLIBS = -lmpirxx -lmpir -lsodium $(MY_LDLIBS)
LDLIBS += -lboost_system
# LDLIBS += -lboost_system -lssl -lcrypto

ifeq ($(USE_NTL),1)
CFLAGS += -DUSE_NTL
LDLIBS := -lntl $(LDLIBS)
endif

ifeq ($(OS), Linux)
LDLIBS += -lrt
endif

ifeq ($(OS), Darwin)
BOOST = -lboost_thread-mt $(MY_BOOST)
else
BOOST = -lboost_thread $(MY_BOOST)
endif

PROME_LIB = -lprometheus-cpp-pull -lprometheus-cpp-core
INC_DIR = /root/pkgs/lib_r/include
SSL_LIB = /root/pkgs/lib_r/lib/libssl.so
CRYPTO_LIB = /root/pkgs/lib_r/lib/libcrypto.so

CFLAGS += -I$(INC_DIR) 
CFLAGS += $(ARCH) $(MY_CFLAGS) $(GDEBUG) -Wextra -Wall $(OPTIM) -I$(ROOT) -pthread $(PROF) $(DEBUG) $(MOD) $(GF2N_LONG) $(PREP_DIR) $(SSL_DIR) $(SECURE) $(PROME) -std=c++11
CPPFLAGS = $(CFLAGS)
LD = $(CXX)

ifeq ($(OS), Darwin)
# for boost with OpenSSL 3
CFLAGS += -Wno-error=deprecated-declarations
ifeq ($(USE_NTL),1)
CFLAGS += -Wno-error=unused-parameter -Wno-error=deprecated-copy
endif
endif

Yes, I recompiled after make clean and the error still occurs.

mkskeller commented 1 year ago

Did you change the C++ code?

BeStrongok commented 1 year ago

Did you change the C++ code?

Yes, I added Prometheus metrics to monitor the communication, but the changes are only in Networking, so I don't think they are relevant to this error. I don't see any related functions in the debug information.

mkskeller commented 1 year ago

That might be, but any code change leading to a bug I cannot reproduce limits my ability to do anything about it. I have changed the relevant functionality several times since version 0.3.0, so one of those changes may solve the problem.

BeStrongok commented 1 year ago

I have changed the relevant functionality several times since version 0.3.0, so one of those changes may solve the problem.

I used the original source code of version 0.3.0 to run the same script, and the error still occurred, this time on the 11th run. The error is "Segmentation fault". The code of large_test.py is:

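import subprocess

def run_cmd(cmd):
    # Run both parties through the shell and capture stdout and stderr together.
    # The shell's exit status is that of the last (foreground) command, i.e. party 1.
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          shell=True,
                          encoding="utf-8")

cmd = "./mascot-party.x -pn 18909 -N 2 0 10tFloatABS & ./mascot-party.x -pn 18909 -N 2 1 10tFloatABS"

# Repeat the run up to 30 times and stop at the first failure.
for i in range(30):
    result = run_cmd(cmd)
    if result.returncode != 0:
        error_msg = result.stdout
        print(error_msg)
        break
    print("test {} success".format(i))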
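Because the command joins the two parties with `&`, the return code seen by this script is only that of the last command (party 1), so a crash in party 0 alone could go unnoticed. A minimal sketch of a variant that starts each party as its own process and reports per-party status is shown below. The helper names `describe`, `run_parties` and `stress_test` are hypothetical, and the negative-return-code convention for signals applies to POSIX systems only.

```python
import signal
import subprocess
import tempfile


def describe(returncode):
    """Turn a process return code into a readable status string."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        return "killed by " + signal.Signals(-returncode).name
    return "exit code {}".format(returncode)


def run_parties(commands):
    """Start every command as its own process and wait for all of them.

    Output goes to temporary files rather than pipes, so a party that prints
    a lot cannot block while we are waiting for another one.
    Returns a list of (returncode, output) pairs, one per command.
    """
    logs = [tempfile.TemporaryFile(mode="w+", encoding="utf-8") for _ in commands]
    procs = [
        subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        for cmd, log in zip(commands, logs)
    ]
    results = []
    for proc, log in zip(procs, logs):
        proc.wait()
        log.seek(0)
        results.append((proc.returncode, log.read()))
        log.close()
    return results


def stress_test(runs=30):
    """Repeat the two-party run and stop at the first failing party."""
    base = ["./mascot-party.x", "-pn", "18909", "-N", "2"]
    for i in range(runs):
        commands = [base + [str(party), "10tFloatABS"] for party in (0, 1)]
        results = run_parties(commands)
        failed = [(p, rc, out) for p, (rc, out) in enumerate(results) if rc != 0]
        for party, rc, out in failed:
            print("run {}: party {} {}".format(i, party, describe(rc)))
            print(out)
        if failed:
            return False
        print("test {} success".format(i))
    return True
```

Calling `stress_test()` from the MP-SPDZ directory would then show which party crashed and with which signal, which may narrow down where to attach gdb.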

mkskeller commented 1 year ago

The script runs fine on my machine. Do you get any meaningful output by running with valgrind, i.e., valgrind ./mascot-party.x ...?

BeStrongok commented 1 year ago

Yes, I ran valgrind ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS. The output is:

==32038== Memcheck, a memory error detector
==32038== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32038== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==32038== Command: ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS
==32038== 
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==32038== valgrind: Unrecognised instruction at address 0x190639.
==32038==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==32038==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==32038==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==32038==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==32038==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==32038==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==32038==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==32038==    by 0x1904A1: main (mascot-party.cpp:9)
==32038== Your program just tried to execute an instruction that Valgrind
==32038== did not recognise.  There are two possible reasons for this.
==32038== 1. Your program has a bug and erroneously jumped to a non-code
==32038==    location.  If you are running Memcheck and you just saw a
==32038==    warning about a bad jump, it's probably your program's fault.
==32038== 2. The instruction is legitimate but Valgrind doesn't handle it,
==32038==    i.e. it's Valgrind's fault.  If you think this is the case or
==32038==    you are not sure, please let us know and we'll try to fix it.
==32038== Either way, Valgrind will now raise a SIGILL signal which will
==32038== probably kill your program.
==32038== 
==32038== Process terminating with default action of signal 4 (SIGILL)
==32038==  Illegal opcode at address 0x190639
==32038==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==32038==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==32038==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==32038==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==32038==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==32038==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==32038==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==32038==    by 0x1904A1: main (mascot-party.cpp:9)
==32038== 
==32038== HEAP SUMMARY:
==32038==     in use at exit: 3,360 bytes in 29 blocks
==32038==   total heap usage: 30 allocs, 1 frees, 76,064 bytes allocated
==32038== 
==32038== LEAK SUMMARY:
==32038==    definitely lost: 0 bytes in 0 blocks
==32038==    indirectly lost: 0 bytes in 0 blocks
==32038==      possibly lost: 0 bytes in 0 blocks
==32038==    still reachable: 3,360 bytes in 29 blocks
==32038==         suppressed: 0 bytes in 0 blocks
==32038== Rerun with --leak-check=full to see details of leaked memory
==32038== 
==32038== For lists of detected and suppressed errors, rerun with: -s
==32038== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction

mkskeller commented 1 year ago

It seems that the version of valgrind used doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?

BeStrongok commented 1 year ago

It seems that the version of valgrind used doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?

The version of valgrind is 3.15.0, the OS is Ubuntu 20.04.4 LTS, and the CPU info is:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          64
On-line CPU(s) list:             0-63
Thread(s) per core:              2
Core(s) per socket:              16
Socket(s):                       2
NUMA node(s):                    2
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           85
Model name:                      Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz

BeStrongok commented 1 year ago

It seems that the version of valgrind used doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?

I updated valgrind to 3.20.0 and get the same output.

==45909== Memcheck, a memory error detector
==45909== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==45909== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==45909== Command: ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS
==45909== 
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==45909== valgrind: Unrecognised instruction at address 0x190639.
==45909==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==45909==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==45909==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==45909==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==45909==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==45909==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==45909==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==45909==    by 0x1904A1: main (mascot-party.cpp:9)
==45909== Your program just tried to execute an instruction that Valgrind
==45909== did not recognise.  There are two possible reasons for this.
==45909== 1. Your program has a bug and erroneously jumped to a non-code
==45909==    location.  If you are running Memcheck and you just saw a
==45909==    warning about a bad jump, it's probably your program's fault.
==45909== 2. The instruction is legitimate but Valgrind doesn't handle it,
==45909==    i.e. it's Valgrind's fault.  If you think this is the case or
==45909==    you are not sure, please let us know and we'll try to fix it.
==45909== Either way, Valgrind will now raise a SIGILL signal which will
==45909== probably kill your program.
==45909== 
==45909== Process terminating with default action of signal 4 (SIGILL)
==45909==  Illegal opcode at address 0x190639
==45909==    at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==45909==    by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==45909==    by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==45909==    by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==45909==    by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==45909==    by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==45909==    by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==45909==    by 0x1904A1: main (mascot-party.cpp:9)
==45909== 
==45909== HEAP SUMMARY:
==45909==     in use at exit: 76,064 bytes in 30 blocks
==45909==   total heap usage: 30 allocs, 0 frees, 76,064 bytes allocated
==45909== 
==45909== LEAK SUMMARY:
==45909==    definitely lost: 0 bytes in 0 blocks
==45909==    indirectly lost: 0 bytes in 0 blocks
==45909==      possibly lost: 0 bytes in 0 blocks
==45909==    still reachable: 76,064 bytes in 30 blocks
==45909==         suppressed: 0 bytes in 0 blocks
==45909== Rerun with --leak-check=full to see details of leaked memory
==45909== 
==45909== For lists of detected and suppressed errors, rerun with: -s
==45909== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction

mkskeller commented 1 year ago

It might be that valgrind doesn't understand AVX-512 instructions supported by your CPU. Try compiling with ARCH = -march=skylake in CONFIG.mine.
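The instruction bytes that valgrind reports as unhandled start with 0x62, which in 64-bit mode is the EVEX escape byte used by AVX-512 encodings. A minimal sketch that decodes the prefix fields of those reported bytes is given below. The function name `decode_evex_prefix` is hypothetical, the field layout follows the usual EVEX description (only the low two opcode-map bits are read), and this is not a full disassembler.

```python
def decode_evex_prefix(code):
    """Decode the 4-byte EVEX prefix at the start of an x86-64 instruction.

    Returns None if the bytes do not start with the EVEX escape byte 0x62.
    """
    if len(code) < 5 or code[0] != 0x62:
        return None
    p0, p1, p2, opcode = code[1], code[2], code[3], code[4]
    return {
        # Low two bits of P0 select the opcode map: 1 = 0F, 2 = 0F38, 3 = 0F3A.
        "opcode_map": p0 & 0b11,
        # Top bit of P1 is EVEX.W (operand size / element width selector).
        "w": p1 >> 7,
        # Low two bits of P1 encode the implied SIMD prefix.
        "simd_prefix": {0: None, 1: 0x66, 2: 0xF3, 3: 0xF2}[p1 & 0b11],
        # Bits 6:5 of P2 (L'L) give the vector length.
        "vector_bits": {0: 128, 1: 256, 2: 512}.get((p2 >> 5) & 0b11),
        "opcode": opcode,
    }


# Bytes copied from the "unhandled instruction bytes" line of the valgrind output.
reported = bytes([0x62, 0xF1, 0xFD, 0x28, 0x7F, 0x44, 0x24, 0x01, 0x48, 0x83])
info = decode_evex_prefix(reported)
print(info)
```

The decoded fields (map 0F, 66 prefix, W=1, opcode 7F, 256-bit length) correspond to an EVEX-encoded vmovdqa64 store, which is consistent with code built by -march=native on an AVX-512 capable CPU and with -march=skylake avoiding it, since that target does not enable AVX-512.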

BeStrongok commented 1 year ago

ARCH = -march=skylake

I recompiled with this flag and it ran fine. The output of valgrind is:

But this is an occasional error. It may take several runs to hit it and see the relevant valgrind output.

mkskeller commented 1 year ago

It might help to find a shorter program that triggers the error. Have you checked whether the error also appears with shorter versions of the program?

data61 / MP-SPDZ

Segmentation fault when running mascot protocol. #983