Closed BeStrongok closed 1 year ago
I cannot reproduce this.
- How often does it occur?
- What's your CONFIG/CONFIG.mine?
- Have you recompiled after make clean?
About once in every 5 runs. The CONFIG file is:
ROOT = .
OPTIM= -O0
#PROF = -pg
#DEBUG = -DDEBUG
GDEBUG = -g
PROME = -DPROME
# set this to your preferred local storage directory
PREP_DIR = '-DPREP_DIR="Player-Data/"'
# directory to store SSL keys
SSL_DIR = '-DSSL_DIR="Player-Data/"'
# set for SHE preprocessing (SPDZ and Overdrive)
USE_NTL = 0
# set for using GF(2^128)
# unset for GF(2^40)
USE_GF2N_LONG = 1
# set to -march=<architecture> for optimization
# SSE4.2 is required homomorphic encryption in GF(2^n) when compiling with clang
# AES-NI and PCLMUL are not required
# AVX is required for oblivious transfer (OT)
# AVX2 support (Haswell or later) is used to optimize OT
# AVX/AVX2 is required for replicated binary secret sharing
# BMI2 is used to optimize multiplication modulo a prime
# ADX is used to optimize big integer additions
# delete the second line to compile for a platform that supports everything
ARCH = -mtune=native -msse4.1 -msse4.2 -maes -mpclmul -mavx -mavx2 -mbmi2 -madx
ARCH = -march=native
MACHINE := $(shell uname -m)
OS := $(shell uname -s)
ifeq ($(MACHINE), x86_64)
# set this to 0 to avoid using AVX for OT
ifeq ($(OS), Linux)
CHECK_AVX := $(shell grep -q avx /proc/cpuinfo; echo $$?)
ifeq ($(CHECK_AVX), 0)
AVX_OT = 1
else
AVX_OT = 0
endif
else
AVX_OT = 1
endif
else
ARCH =
AVX_OT = 0
endif
# allow to set compiler in CONFIG.mine
CXX = g++
# use CONFIG.mine to overwrite DIR settings
-include CONFIG.mine
ifeq ($(USE_GF2N_LONG),1)
GF2N_LONG = -DUSE_GF2N_LONG
endif
ifeq ($(AVX_OT), 0)
CFLAGS += -DNO_AVX_OT
endif
# MAX_MOD_SZ (for FHE) must be least and GFP_MOD_SZ (for computation)
# must be exactly ceil(len(p)/len(word)) for the relevant prime p
# GFP_MOD_SZ only needs to be set for primes of bit length more that 256.
# Default for MAX_MOD_SZ is 10, which suffices for all Overdrive protocols
# MOD = -DMAX_MOD_SZ=10 -DGFP_MOD_SZ=5
LDLIBS = -lmpirxx -lmpir -lsodium $(MY_LDLIBS)
LDLIBS += -lboost_system
# LDLIBS += -lboost_system -lssl -lcrypto
ifeq ($(USE_NTL),1)
CFLAGS += -DUSE_NTL
LDLIBS := -lntl $(LDLIBS)
endif
ifeq ($(OS), Linux)
LDLIBS += -lrt
endif
ifeq ($(OS), Darwin)
BOOST = -lboost_thread-mt $(MY_BOOST)
else
BOOST = -lboost_thread $(MY_BOOST)
endif
PROME_LIB = -lprometheus-cpp-pull -lprometheus-cpp-core
INC_DIR = /root/pkgs/lib_r/include
SSL_LIB = /root/pkgs/lib_r/lib/libssl.so
CRYPTO_LIB = /root/pkgs/lib_r/lib/libcrypto.so
CFLAGS += -I$(INC_DIR)
CFLAGS += $(ARCH) $(MY_CFLAGS) $(GDEBUG) -Wextra -Wall $(OPTIM) -I$(ROOT) -pthread $(PROF) $(DEBUG) $(MOD) $(GF2N_LONG) $(PREP_DIR) $(SSL_DIR) $(SECURE) $(PROME) -std=c++11
CPPFLAGS = $(CFLAGS)
LD = $(CXX)
ifeq ($(OS), Darwin)
# for boost with OpenSSL 3
CFLAGS += -Wno-error=deprecated-declarations
ifeq ($(USE_NTL),1)
CFLAGS += -Wno-error=unused-parameter -Wno-error=deprecated-copy
endif
endif
Yes, I recompiled after make clean and the error still occurs.
Did you change the C++ code?
Yes, I added Prometheus metrics to monitor the communication, but the changed code is only in Networking, so I don't think it is relevant to this error. I don't see any related functions in the debug information.
That might be, but any code change leading to a bug I cannot reproduce limits my ability to do anything about it. I have changed the relevant functionality several times since version 0.3.0, so one of these changes may solve the problem.
I used the original source code of version 0.3.0 to run the same script, and the error still occurred on the 11th run.
The error is Segmentation fault.
The code of large_test.py is:
import subprocess

def run_cmd(cmd):
    # Run the whole command line through the shell, merging stderr into stdout
    return subprocess.run(cmd, stdout=subprocess.PIPE,
                          stderr=subprocess.STDOUT,
                          shell=True,
                          encoding="utf-8")

# Party 0 is started in the background, party 1 in the foreground
cmd = "./mascot-party.x -pn 18909 -N 2 0 10tFloatABS & ./mascot-party.x -pn 18909 -N 2 1 10tFloatABS"

for i in range(30):
    result = run_cmd(cmd)
    if result.returncode != 0:
        error_msg = result.stdout
        print(error_msg)
        break
    print("test {} success".format(i))
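One caveat with the script above: because the two parties are joined with & in a single shell command, result.returncode reflects only the foreground process (party 1), so a crash of the backgrounded party 0 would not change it by itself. Below is a minimal sketch that starts each party as its own process and reports which one was killed by a signal. It assumes a POSIX system, and run_parties, describe and first_failure are hypothetical helper names, not part of MP-SPDZ.

```python
import signal
import subprocess
import tempfile


def run_parties(commands):
    """Start every command as its own process and wait for all of them.

    Output goes to temporary files rather than pipes so that a chatty
    party cannot block on a full pipe while the others are waited on.
    Returns a list of (returncode, combined_output), one per command.
    """
    procs = []
    for cmd in commands:
        log = tempfile.TemporaryFile(mode="w+", encoding="utf-8")
        proc = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)
        procs.append((proc, log))
    results = []
    for proc, log in procs:
        proc.wait()
        log.seek(0)
        results.append((proc.returncode, log.read()))
        log.close()
    return results


def describe(returncode):
    """On POSIX, a negative return code -N means 'killed by signal N'."""
    if returncode < 0:
        return "killed by " + signal.Signals(-returncode).name
    return "exit code {}".format(returncode)


def first_failure(commands, runs=30):
    """Repeat the run; return (run, party, description, output) or None."""
    for i in range(runs):
        for party, (rc, out) in enumerate(run_parties(commands)):
            if rc != 0:
                return i, party, describe(rc), out
    return None


# Example invocation (commented out because it needs the compiled binary):
# base = ["./mascot-party.x", "-pn", "18909", "-N", "2"]
# print(first_failure([base + [str(p), "10tFloatABS"] for p in range(2)]))
```

A segmentation fault in either party would then show up as "killed by SIGSEGV" together with the party number, which narrows down which side to run under gdb or valgrind.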
The script runs fine on my machine. Do you get any meaningful output by running with valgrind, i.e., valgrind ./mascot-party.x ...?
Yes, I ran valgrind ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS, and the output is:
==32038== Memcheck, a memory error detector
==32038== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==32038== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==32038== Command: ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS
==32038==
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==32038== valgrind: Unrecognised instruction at address 0x190639.
==32038== at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==32038== by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==32038== by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==32038== by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==32038== by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==32038== by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==32038== by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==32038== by 0x1904A1: main (mascot-party.cpp:9)
==32038== Your program just tried to execute an instruction that Valgrind
==32038== did not recognise. There are two possible reasons for this.
==32038== 1. Your program has a bug and erroneously jumped to a non-code
==32038== location. If you are running Memcheck and you just saw a
==32038== warning about a bad jump, it's probably your program's fault.
==32038== 2. The instruction is legitimate but Valgrind doesn't handle it,
==32038== i.e. it's Valgrind's fault. If you think this is the case or
==32038== you are not sure, please let us know and we'll try to fix it.
==32038== Either way, Valgrind will now raise a SIGILL signal which will
==32038== probably kill your program.
==32038==
==32038== Process terminating with default action of signal 4 (SIGILL)
==32038== Illegal opcode at address 0x190639
==32038== at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==32038== by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==32038== by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==32038== by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==32038== by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==32038== by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==32038== by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==32038== by 0x1904A1: main (mascot-party.cpp:9)
==32038==
==32038== HEAP SUMMARY:
==32038== in use at exit: 3,360 bytes in 29 blocks
==32038== total heap usage: 30 allocs, 1 frees, 76,064 bytes allocated
==32038==
==32038== LEAK SUMMARY:
==32038== definitely lost: 0 bytes in 0 blocks
==32038== indirectly lost: 0 bytes in 0 blocks
==32038== possibly lost: 0 bytes in 0 blocks
==32038== still reachable: 3,360 bytes in 29 blocks
==32038== suppressed: 0 bytes in 0 blocks
==32038== Rerun with --leak-check=full to see details of leaked memory
==32038==
==32038== For lists of detected and suppressed errors, rerun with: -s
==32038== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction
It seems that the version of valgrind used doesn't fully support the platform. You would need to install a newer version. What CPU and OS are you using?
The version of valgrind is 3.15.0, the OS is Ubuntu 20.04.4 LTS, and the CPU info is:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Gold 5218 CPU @ 2.30GHz
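The CPU listing above stops before the flags line, so it does not show which vector extensions the machine reports. A small sketch for listing the AVX-related flags from /proc/cpuinfo on Linux, the same file the CONFIG above greps for avx, follows; avx_flags and read_cpuinfo are hypothetical helper names.

```python
def avx_flags(cpuinfo_text):
    """Return the sorted AVX-related CPU flags found in /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # All cores normally report the same flags, so the first
            # "flags" line is taken as representative.
            return sorted(f for f in value.split() if f.startswith("avx"))
    return []


def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description (Linux only)."""
    with open(path) as f:
        return f.read()


# Example invocation on a Linux machine:
# print(avx_flags(read_cpuinfo()))
```

If the list contains entries such as avx512f, a build with -march=native may emit AVX-512 instructions.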
I updated valgrind to 3.20.0 and got the same output.
==45909== Memcheck, a memory error detector
==45909== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==45909== Using Valgrind-3.20.0 and LibVEX; rerun with -h for copyright info
==45909== Command: ./mascot-party.x -pn 18909 -N 2 0 10tFloatABS
==45909==
vex amd64->IR: unhandled instruction bytes: 0x62 0xF1 0xFD 0x28 0x7F 0x44 0x24 0x1 0x48 0x83
vex amd64->IR: REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR: VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=NONE
vex amd64->IR: PFX.66=0 PFX.F2=0 PFX.F3=0
==45909== valgrind: Unrecognised instruction at address 0x190639.
==45909== at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==45909== by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==45909== by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==45909== by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==45909== by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==45909== by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==45909== by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==45909== by 0x1904A1: main (mascot-party.cpp:9)
==45909== Your program just tried to execute an instruction that Valgrind
==45909== did not recognise. There are two possible reasons for this.
==45909== 1. Your program has a bug and erroneously jumped to a non-code
==45909== location. If you are running Memcheck and you just saw a
==45909== warning about a bad jump, it's probably your program's fault.
==45909== 2. The instruction is legitimate but Valgrind doesn't handle it,
==45909== i.e. it's Valgrind's fault. If you think this is the case or
==45909== you are not sure, please let us know and we'll try to fix it.
==45909== Either way, Valgrind will now raise a SIGILL signal which will
==45909== probably kill your program.
==45909==
==45909== Process terminating with default action of signal 4 (SIGILL)
==45909== Illegal opcode at address 0x190639
==45909== at 0x190639: avx_memzero(void*, unsigned long) (avx_memcpy.h:63)
==45909== by 0x19D5CA: modp_<2>::modp_() (modp.h:38)
==45909== by 0x19C5A1: gfp_<0, 2>::gfp_() (gfp.h:155)
==45909== by 0x19B373: SemiShare<gfp_<0, 2> >::SemiShare() (SemiShare.h:97)
==45909== by 0x198B55: Share_<SemiShare<gfp_<0, 2> >, SemiShare<gfp_<0, 2> > >::Share_() (Share.h:91)
==45909== by 0x1952F3: Share<gfp_<0, 2> >::Share() (Share.h:187)
==45909== by 0x192F1D: DishonestMajorityFieldMachine<Share, Share, gf2n_long, DishonestMajorityMachine>::DishonestMajorityFieldMachine(int, char const**, ez::ezOptionParser&, bool) (FieldMachine.h:42)
==45909== by 0x1904A1: main (mascot-party.cpp:9)
==45909==
==45909== HEAP SUMMARY:
==45909== in use at exit: 76,064 bytes in 30 blocks
==45909== total heap usage: 30 allocs, 0 frees, 76,064 bytes allocated
==45909==
==45909== LEAK SUMMARY:
==45909== definitely lost: 0 bytes in 0 blocks
==45909== indirectly lost: 0 bytes in 0 blocks
==45909== possibly lost: 0 bytes in 0 blocks
==45909== still reachable: 76,064 bytes in 30 blocks
==45909== suppressed: 0 bytes in 0 blocks
==45909== Rerun with --leak-check=full to see details of leaked memory
==45909==
==45909== For lists of detected and suppressed errors, rerun with: -s
==45909== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
Illegal instruction
It might be that valgrind doesn't understand AVX-512 instructions supported by your CPU. Try compiling with ARCH = -march=skylake in CONFIG.mine.
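The unhandled instruction bytes in the valgrind output are consistent with this: they start with 0x62, which in 64-bit mode is the EVEX prefix used by AVX-512 encodings. Below is a minimal sketch that extracts a few EVEX fields from the reported bytes. It is a partial field decoder based on the standard EVEX layout, not a disassembler, and decode_evex is a hypothetical helper name.

```python
def decode_evex(code):
    """Extract a few fields from an EVEX-prefixed x86-64 instruction.

    Only the 4-byte EVEX prefix and the opcode byte are inspected;
    operands are not decoded.
    """
    if len(code) < 5 or code[0] != 0x62:
        raise ValueError("not an EVEX-encoded instruction")
    p0, p1, p2, opcode = code[1], code[2], code[3], code[4]
    return {
        # Opcode map: 1 = 0F, 2 = 0F 38, 3 = 0F 3A
        "opcode_map": p0 & 0b111,
        # W bit: selects the 64-bit element form for many instructions
        "w": p1 >> 7,
        # Implied legacy SIMD prefix
        "simd_prefix": {0: None, 1: 0x66, 2: 0xF3, 3: 0xF2}[p1 & 0b11],
        # L'L vector length (assumes no embedded rounding is in use)
        "vector_bits": {0: 128, 1: 256, 2: 512}.get((p2 >> 5) & 0b11),
        "opcode": opcode,
    }


# Bytes copied from the "unhandled instruction bytes" line above
reported = bytes([0x62, 0xF1, 0xFD, 0x28, 0x7F, 0x44, 0x24, 0x01, 0x48, 0x83])
fields = decode_evex(reported)
print(fields)
```

For the reported bytes this gives opcode map 0F, prefix 66, W=1, opcode 0x7F and a 256-bit vector length, which matches the store form of VMOVDQA64 on a 256-bit register, an AVX-512VL encoding. That points to an instruction valgrind does not handle rather than to a bad jump in the program.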
I recompiled with this flag and it ran fine. The output of valgrind is:
But this is an occasional error. It may take several runs to hit it and to see the corresponding valgrind output.
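Because the failure is intermittent, a single clean run says little. As a rough guide, assume independent runs with a constant failure probability p. This is an assumption; the thread reports roughly one failure in five runs. Under it, the chance that a still-broken build passes n runs in a row is (1 - p)^n. The sketch below computes how many clean runs are needed to push that chance below a chosen threshold; clean_runs_needed is a hypothetical helper name.

```python
import math


def clean_runs_needed(failure_prob, max_false_pass=0.01):
    """Smallest n with (1 - failure_prob) ** n <= max_false_pass."""
    if not 0 < failure_prob < 1:
        raise ValueError("failure_prob must be strictly between 0 and 1")
    if not 0 < max_false_pass < 1:
        raise ValueError("max_false_pass must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_pass) / math.log(1 - failure_prob))


# Failure roughly once in 5 runs, at most a 1% chance of a false "all clear"
print(clean_runs_needed(0.2))  # prints 21
```

With a failure rate near one in five, about 21 consecutive clean runs would be needed before a change can reasonably be considered to have removed the crash.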
It might help to find a shorter program that triggers the error. Have you checked whether the error also appears with shorter versions of the program?
Hi @mkskeller, I'm running the MASCOT protocol and hit an occasional error, "Segmentation fault". The gdb debug info is:
The mpc script I use is:
Can you help me check this error? It seems like a bug. The version of the source code I use is 0.3.0. The command is
./mascot-party.x -pn 18909 -N 2 0 10tFloatABS & ./mascot-party.x -pn 18909 -N 2 1 10tFloatABS