FLAME-HPC / xparser

The FLAME 1 xparser

Parallel FLAME stops at random iterations, MBI_CommRoutine_HANDSHAKE fails #14

Closed by zauster 6 years ago

zauster commented 6 years ago

Hi all, I am writing a simple economic model with firms and consumers. When I use xparser -s, everything works fine. When I use xparser -p -f, the code compiles and runs, but it sometimes stops at a seemingly random iteration (352, 954, ...) and hangs indefinitely.

The hang always occurs before the first agent function of the iteration is carried out (that function would print something):

$ xparser -p -f bl_model.xml && make && mpirun -np 2 ./main 2500 its/0_20_200.xml -r

...

Cons 199 @ 20: dem 0.324464 | plan 0.263973 | price 0.813566 | have 56.481562 | req 0.324464 
Cons 199 @ 14: dem 0.324464 | plan 0.333777 | price 1.028703 | have 56.147785 | req 0.324464 
Cons 199 @ 18: dem 0.324464 | plan 0.281282 | price 0.866913 | have 55.866504 | req 0.324464 
------- P1:303  
P0:303  ------- 

^C^C% 
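Since the hang appears at a different iteration each time, it can help to run the binary repeatedly under a timeout and count how often it finishes, fails, or hangs. The following is a minimal sketch in Python 3, not part of FLAME; the helper names `run_with_timeout` and `count_outcomes` and the timeout value are assumptions, and only the `mpirun` command line is taken from this report.

```python
import subprocess


def run_with_timeout(cmd, timeout_s):
    """Run cmd once and classify the outcome (hypothetical helper).

    Returns "finished" (exit code 0), "failed" (non-zero exit code)
    or "timeout" (still running after timeout_s seconds).
    """
    try:
        result = subprocess.run(
            cmd,
            timeout=timeout_s,
            stdout=subprocess.DEVNULL,  # the model output is not needed here
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the process it started before raising.
        # MPI ranks launched by mpirun may still need manual cleanup.
        return "timeout"
    return "finished" if result.returncode == 0 else "failed"


def count_outcomes(cmd, attempts, timeout_s):
    """Repeat the run and tally the outcomes."""
    counts = {"finished": 0, "failed": 0, "timeout": 0}
    for _ in range(attempts):
        counts[run_with_timeout(cmd, timeout_s)] += 1
    return counts


# Example call with the command line from this report
# (the 600 s timeout is a guess and depends on the model size):
# count_outcomes(
#     ["mpirun", "-np", "2", "./main", "2500", "its/0_20_200.xml", "-r"],
#     attempts=10,
#     timeout_s=600,
# )
```

A run counted as "timeout" only means that the process was still alive when the limit expired, so the limit should be well above the duration of a normal run.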

Now, when I omit the -f parameter, I get the following error:

$ xparser -p bl_model.xml && make && mpirun -np 2 ./main 2 its/0_20_200.xml -r 
xparser (Version 0.17.1)

***Info: Evnironment variable FLAME_XPARSER_DIR set - looking in /usr/include/xparser for Templates

Code type       : Parallel (DEBUG)
Input XMML file : bl_model.xml
Model root dir  : 
Template dir    : /usr/include/xparser/

Reading XMML file (bl_model.xml)
- Model name     : ACE Baseline Model
- Functions file : firm_functions.c
- Functions file : consumer_functions.c
- xagent   : Firm
- xagent   : Consumer
- Message  : askCurrentPrices
- Message  : CurrentPriceList
- Message  : OrderForm
- Message  : GoodsPackage
- Message  : WagePayment
- Message  : ProfitPayment
- Message  : FiringNotice
- Message  : OpenVacancyRequest
- Message  : JobOffer
- Message  : AcceptedJobOffer
- Message  : LeaveNotice
End of XMML file

Creating dependency graph
Finished dependency loop check
Total communication sync lengths = 17
Ordering functions in process layers
New communication sync lengths = 25

Writing file : stategraph.dot
Writing file : stategraph_colour.dot
Writing file : process_order_graph.dot
Writing file : latex.tex

Generating Makefile using /usr/include/xparser/Makefile.tmpl
Generating xml.c using /usr/include/xparser/xml.tmpl
Generating main.c using /usr/include/xparser/main.tmpl
Generating header.h using /usr/include/xparser/header.tmpl
Generating memory.c using /usr/include/xparser/memory.tmpl
Generating low_primes.h using /usr/include/xparser/low_primes.tmpl
Generating messageboards.c using /usr/include/xparser/messageboards.tmpl
Generating partitioning.c using /usr/include/xparser/partitioning.tmpl
Generating timing.c using /usr/include/xparser/timing.tmpl
Generating Doxyfile using /usr/include/xparser/Doxyfile.tmpl
Generating rules.c using /usr/include/xparser/rules.tmpl

Writing header file : Firm_agent_header.h
Writing header file : Consumer_agent_header.h

--- xparser finished ---

To compile and run the generated code, you will need:
 * libmboard (version 0.3.0 or newer)
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g firm_functions.c -o firm_functions.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g consumer_functions.c -o consumer_functions.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g main.c -o main.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g memory.c -o memory.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g xml.c -o xml.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g messageboards.c -o messageboards.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g partitioning.c -o partitioning.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g rules.c -o rules.o
mpicc -c -I/usr/local/include  -std=c99 -Wall -D_DEBUG_MODE -g timing.c -o timing.o
mpif77 -L/usr/local/lib firm_functions.o consumer_functions.o main.o memory.o xml.o messageboards.o partitioning.o rules.o timing.o -o main -lmboard_pd  -lm
[libmboard] Version        : 0.3.1 (PARALLEL)
[libmboard] Build date     : Sun Mar 18 23:42:24 CET 2018
[libmboard] Config options :  '--prefix=/usr' '--disable-tests' 'CFLAGS=-march=native -O2 -pipe -fstack-protector-strong' 'LDFLAGS=-Wl,-O1,--sort-common,--as-needed,-z,relro' 'CPPFLAGS=-D_FORTIFY_SOURCE=2'

[libmboard] +++ This is a DEBUG version +++
[libmboard] <settings> MBOARD_MEMPOOL_RECYCLE = 0 (default)
[libmboard] <settings> MBOARD_MEMPOOL_BLOCKSIZE = 512 (default)
[libmboard] <settings> MBOARD_COMM_PROTOCOL = HANDSHAKE (default)
[libmboard] <settings> MBOARD_MEMPOOL_RECYCLE = 0 (default)
[libmboard] <settings> MBOARD_MEMPOOL_BLOCKSIZE = 512 (default)
[libmboard] <settings> MBOARD_COMM_PROTOCOL = HANDSHAKE (default)
MPI FLAME Application: ACE Baseline Model 
Debug mode enabled 
Iterations: 2
0> xml: Round-robin partitioning
Reading initial data file: its/0_20_200.xml
Debug mode enabled 
Reading environment data from: its/0_20_200.xml
Reading agent data from: its/0_20_200.xml
output: type='snapshot' format='xml' location='its/' period='20' phase='2'
0> xdiv=2 ydiv=1
0> Round-robin partitioning
0> Partition 0 : 0.000000, 0.500000, 0.000000, 1.000000
0> Partition 1 : 0.500000, 1.000000, 0.000000, 1.000000
Node 0 found its partition data :  0.000000, 0.500000, 0.000000, 1.000000
Node 1 found its partition data :  0.500000, 1.000000, 0.000000, 1.000000
Reading initial data file: its/0_20_200.xml
Reading initial data file: its/0_20_200.xml
Reading environment data from: its/0_20_200.xml
Reading environment data from: its/0_20_200.xml
Reading agent data from: its/0_20_200.xml
Reading agent data from: its/0_20_200.xml
output: type='snapshot' format='xml' location='its/' period='20' phase='2'
output: type='snapshot' format='xml' location='its/' period='20' phase='2'
output: type='snapshot' format='xml' location='its/' period='20' phase='2'
0> Processor name: irene
0> No of agents on node: 110
0> Firm agents on node: 10
0> Consumer agents on node: 100
1> Processor name: irene
1> No of agents on node: 110
1> Firm agents on node: 10
1> Consumer agents on node: 100
0> Agent total check: 220
------- P1:1    
P0:1    ------- 
main: comm_routines_HANDSHAKE.c:301: MBI_CommRoutine_HANDSHAKE_AgreeBufSizes: Assertion `node->board->tt != NULL' failed.
main: comm_routines_HANDSHAKE.c:301: MBI_CommRoutine_HANDSHAKE_AgreeBufSizes: Assertion `node->board->tt != NULL' failed.
[irene:08489] *** Process received signal ***
[irene:08490] *** Process received signal ***
[irene:08490] Signal: Aborted (6)
[irene:08490] Signal code:  (-6)
[irene:08489] Signal: Aborted (6)
[irene:08489] Signal code:  (-6)
[irene:08490] [ 0] [irene:08489] [ 0] /usr/lib/libpthread.so.0(+0x11dd0)[0x7f09a415cdd0]
[irene:08489] [ 1] /usr/lib/libpthread.so.0(+0x11dd0)[0x7f80b0375dd0]
[irene:08490] [ 1] /usr/lib/libc.so.6(gsignal+0x110)[0x7f09a3dc8860]
[irene:08489] [ 2] /usr/lib/libc.so.6(gsignal+0x110)[0x7f80affe1860]
[irene:08490] [ 2] /usr/lib/libc.so.6(abort+0x1c9)[0x7f09a3dc9ec9]
[irene:08489] [ 3] /usr/lib/libc.so.6(abort+0x1c9)[0x7f80affe2ec9]
[irene:08490] [ 3] /usr/lib/libc.so.6(+0x2d0bc)[0x7f09a3dc10bc]
[irene:08489] [ 4] /usr/lib/libc.so.6(+0x2d0bc)[0x7f80affda0bc]
[irene:08490] [ 4] /usr/lib/libc.so.6(+0x2d133)[0x7f09a3dc1133]
[irene:08489] [ 5] ./main(+0x2bc9d)[0x56488d501c9d]
[irene:08489] [ 6] /usr/lib/libc.so.6(+0x2d133)[0x7f80affda133]
[irene:08490] [ 5] ./main(+0x2bc9d)[0x55e8d6ddcc9d]
[irene:08490] [ 6] ./main(+0x2b148)[0x56488d501148]
[irene:08489] [ 7] ./main(+0x2af85)[0x56488d500f85]
[irene:08489] [ 8] ./main(+0x2b148)[0x55e8d6ddc148]
[irene:08490] [ 7] ./main(+0x2af85)[0x55e8d6ddbf85]
[irene:08490] [ 8] /usr/lib/libpthread.so.0(+0x708c)[0x7f80b036b08c]
[irene:08490] [ 9] /usr/lib/libpthread.so.0(+0x708c)[0x7f09a415208c]
[irene:08489] [ 9] /usr/lib/libc.so.6(clone+0x3f)[0x7f80b00a2e7f]
[irene:08490] *** End of error message ***
/usr/lib/libc.so.6(clone+0x3f)[0x7f09a3e89e7f]
[irene:08489] *** End of error message ***
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node irene exited on signal 6 (Aborted).
--------------------------------------------------------------------------
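The relevant line in the output above is the failed assertion `node->board->tt != NULL` in `comm_routines_HANDSHAKE.c:301`, which both ranks print before aborting. For longer logs, a small filter can pull such lines out of the surrounding output. This is only a sketch in Python 3: the function name `failed_assertions` is made up, and the pattern assumes the `program: file:line: function: Assertion ... failed.` layout seen in the two lines above.

```python
import re

# Pattern for lines such as:
#   main: comm_routines_HANDSHAKE.c:301: MBI_..._AgreeBufSizes: Assertion `...' failed.
# The layout is assumed from the log in this report.
ASSERT_RE = re.compile(
    r"^(?P<prog>[^:\s]+): (?P<file>[^:\s]+):(?P<line>\d+): "
    r"(?P<func>\w+): Assertion `(?P<expr>.*)' failed\.$"
)


def failed_assertions(log_text):
    """Return the unique (file, line, function, expression) tuples, in order."""
    seen = []
    for raw in log_text.splitlines():
        match = ASSERT_RE.match(raw.strip())
        if match is None:
            continue
        entry = (
            match["file"],
            int(match["line"]),
            match["func"],
            match["expr"],
        )
        if entry not in seen:  # every rank prints the same line
            seen.append(entry)
    return seen
```

Applied to the log above, this should reduce the two identical assertion messages to a single entry, which is the piece of information worth quoting in a libmboard bug report.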

I stripped the model of everything except the first function and the message that is sent between firm and consumer, but the error still occurs. I have attached this minimal model so that you can reproduce the error. I appreciate any help or recommendations!

FLAME_Handshake.zip

zauster commented 6 years ago

I'm on a Linux machine with the most recent OpenMPI version and xparser/libmboard from the GitHub repos:

% uname -srv
Linux 4.15.11-1-ARCH #1 SMP PREEMPT Mon Mar 19 18:21:03 UTC 2018

% mpicc -v
Using built-in specs.
COLLECT_GCC=/usr/bin/gcc
COLLECT_LTO_WRAPPER=/usr/lib/gcc/x86_64-pc-linux-gnu/7.3.1/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /build/gcc/src/gcc/configure --prefix=/usr --libdir=/usr/lib --libexecdir=/usr/lib --mandir=/usr/share/man --infodir=/usr/share/info --with-bugurl=https://bugs.archlinux.org/ --enable-languages=c,c++,ada,fortran,go,lto,objc,obj-c++ --enable-shared --enable-threads=posix --enable-libmpx --with-system-zlib --with-isl --enable-__cxa_atexit --disable-libunwind-exceptions --enable-clocale=gnu --disable-libstdcxx-pch --disable-libssp --enable-gnu-unique-object --enable-linker-build-id --enable-lto --enable-plugin --enable-install-libiberty --with-linker-hash-style=gnu --enable-gnu-indirect-function --enable-multilib --disable-werror --enable-checking=release --enable-default-pie --enable-default-ssp
Thread model: posix
gcc version 7.3.1 20180312 (GCC) 

% pacman -Ss openmpi
extra/openmpi 3.0.0-1 [installed]
    High performance message passing library (MPI)

svdhoog commented 6 years ago

zauster: Does this also occur with MPICH2? I'm not an expert on parallel computing, but I recall that on some architectures OpenMPI gave different results than MPICH, so it may be worth a try.

zauster commented 6 years ago

@svdhoog thanks for the suggestion!

I installed MPICH and recompiled libmboard, but unfortunately to no effect. The very same errors occur (tested again with the minimal example).

zauster commented 6 years ago

As this is an issue with libmboard rather than xparser, I am closing it in favour of an issue at libmboard: https://github.com/FLAME-HPC/libmboard/issues/7