UG4 / ugcore

The core functionality of UG4. Includes sources, build-scripts, and utility scripts.
https://github.com/UG4/ugcore

Regarding running ugshell with openmpi 4.0.4 #30

Closed: stephanmg closed this issue 3 years ago

stephanmg commented 4 years ago

Dear all,

Compiling the ug4 HEAD revision (23ee853503b21d12836281b3b98e7640452780ee) and trying to run the Laplace 3D example in parallel with Open MPI 4.0.4 yields the following error for me:

[app_Examples]$ mpirun --mca btl vader,self -np 2 ../ug4_2/bin/ugshell -ex laplace.lua -dim 3 -numRefs 6
********************************************************************************
* ugshell - ug4.0.2, head revision 'unknown',                                  *
*                    compiled 'Thu Sep 10 11:23:37 EDT 2020'                   *
*                    on 'login2'.                                              *
*                                                                              *
* arguments:                                                                   *
*   -outproc id:         Sets the output-proc to id. Default is 0.             *
*   -ex scriptname:      Executes the specified script.                        *
*   -noquit:             Runs the interactive shell after specified script.    *
*   -quiet:              Disables printing of header and trailer.              *
*   -help:               Print this help message and exit.                     *
*   -noterm:             Terminal logging will be disabled.                    *
*   -logtofile filename: Output will be written to the specified file.         *
*   -call:               Combines all following arguments to one lua command   *
*                        and executes it. Ignored if it follows '-ex'.         *
*                        '(', ')', and '"' have to be escaped, e.g.: '\('      *
* Additional parameters are passed to the script through ugargc and ugargv.    *
*                                                                              *
* Initializing: paths... done, bridge... done, plugins... done                 *
********************************************************************************
Loading Domain grids/laplace_sphere_3d.ugx ... done.
Performing integrity check on domain ... done.
refining...
  util.balancer: creating partitioner...
  util.balancer: creating process hierarchy...
  util.balancer: done

new prochess hierarchy:
  lvl:     0
  procs:   1

NOTE: skipping rebalance.
util.refinement: - refining level 0
new prochess hierarchy:
  lvl:     0   1
  procs:   1   2

Redistributing...
[login2:24248] *** Process received signal ***
[login2:24248] Signal: Segmentation fault (11)
[login2:24248] Signal code: Address not mapped (1)
[login2:24248] Failing at address: 0x2aaaa2c09010
[login2:24248] [ 0] /lib64/libpthread.so.0(+0xf630)[0x2aaab8946630]
[login2:24248] [ 1] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(_ZN2ug22MultiGridSubsetHandler18assign_subset_implINS_6VertexEEEvPT_i+0x1e6)[0x2aaaaf2c41a6]
[login2:24248] [ 2] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(_ZN2ug14DistributeGridERNS_9MultiGridERNS_17GridSubsetHandlerERNS_28GridDataSerializationHandlerEbPKSt6vectorIiSaIiEERKN3pcl19ProcessCommunicatorE+0x1b77)[0x2aaaaf33a087]
[login2:24248] [ 3] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(_ZN2ug12LoadBalancer9rebalanceEv+0x1b6)[0x2aaaaf3905e6]
[login2:24248] [ 4] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(_ZN2ug6bridge11MethodProxyINS_12LoadBalancerEMS2_FbvEbE5applyERKNS0_16MethodPtrWrapperEPvRKNS0_14ParameterStackERSA_+0x24)[0x2aaab04fe104]
[login2:24248] [ 5] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x22feb9f)[0x2aaaaf6c5b9f]
[login2:24248] [ 6] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x230169a)[0x2aaaaf6c869a]
[login2:24248] [ 7] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x2328bec)[0x2aaaaf6efbec]
[login2:24248] [ 8] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x234905f)[0x2aaaaf71005f]
[login2:24248] [ 9] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x23298bd)[0x2aaaaf6f08bd]
[login2:24248] [10] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x2327cac)[0x2aaaaf6eecac]
[login2:24248] [11] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(+0x2329b2a)[0x2aaaaf6f0b2a]
[login2:24248] [12] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(lua_pcall+0x46)[0x2aaaaf6e34a6]
[login2:24248] [13] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(_ZN2ug6script21ParseAndExecuteBufferEPKcS2_+0x219)[0x2aaaaf69bb59]
[login2:24248] [14] /home/tug41634/Code/git/ug4_2/ugcore/cmake/../../lib/libug4.so(_ZN2ug6script12LoadUGScriptEPKcbb+0x30b)[0x2aaaaf69c94b]
[login2:24248] [15] ../ug4_2/bin/ugshell(_Z12ugshell_mainiPPc+0x484)[0x4069e4]
[login2:24248] [16] /lib64/libc.so.6(__libc_start_main+0xf5)[0x2aaab8b75555]
[login2:24248] [17] ../ug4_2/bin/ugshell[0x405152]
[login2:24248] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 1 with PID 0 on node login2 exited on signal 11 (Segmentation fault).

Note that with Open MPI from the v3.1 series the example runs perfectly fine, even for a high number of processes.

I also noticed a discrepancy in how I have to invoke mpirun. With the v3.1 series I run ugshell via mpirun -np 2 ugshell, but with the v4.0 series I need to select the transports explicitly, i.e. mpirun --mca btl vader,self -np 2 (vader and self). So maybe I am not using it as intended. Has anybody else experienced this kind of problem?
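For reference, this is roughly how I would check which BTL components the Open MPI 4.0.4 module actually provides and which transport gets selected at runtime (standard Open MPI tooling; the ugshell path and arguments are taken from the invocation above):

# list the BTL components available in the loaded Open MPI installation
ompi_info | grep "MCA btl"

# rerun with verbose BTL output to see which transport is actually chosen
mpirun --mca btl vader,self --mca btl_base_verbose 100 -np 2 \
    ../ug4_2/bin/ugshell -ex laplace.lua -dim 3 -numRefs 6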

The OS is CentOS 7.6 and the ug4 revision is 23ee853503b21d12836281b3b98e7640452780ee.

mpicxx --version
g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
mlampe commented 4 years ago

I cannot reproduce this on my CentOS 7 system. I used the system gcc 4.8.5 20150623 (Red Hat 4.8.5-39), as you indicated, and Open MPI 4.0.4 built from source with default options. I also don't have to specify a byte transport.
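For completeness, a sketch of the Open MPI build I used (standard configure/make sequence for the tarball from open-mpi.org; the install prefix is just an example):

# build Open MPI 4.0.4 from source with default options
tar xf openmpi-4.0.4.tar.gz
cd openmpi-4.0.4
./configure --prefix=$HOME/opt/openmpi-4.0.4
make -j4 && make install

# make sure this installation is the one picked up when building/running ug4
export PATH=$HOME/opt/openmpi-4.0.4/bin:$PATH
export LD_LIBRARY_PATH=$HOME/opt/openmpi-4.0.4/lib:$LD_LIBRARY_PATH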

stephanmg commented 4 years ago

@mlampe thanks for testing this.

I did not build openmpi 4.0.4 from source myself; it is provided on the compute cluster as a module. Since it is not the default module, I presume its use might not be recommended anyway.
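To double-check what the module actually provides, something along these lines should work (module names are cluster-specific, so treat them as placeholders):

# inspect the available and currently loaded MPI modules
module avail openmpi
module list

# confirm which mpirun is picked up and report its version
which mpirun
mpirun --version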

However, as suggested by @bsumirak, I'll try to provide a stack trace with symbols if time allows. This might allow us to track down the problem.
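Roughly what I have in mind for that (the DEBUG cmake switch is an assumption about ugcore's build scripts; the core-dump and demangling steps are standard gdb/binutils usage):

# rebuild libug4/ugshell with debug symbols; DEBUG=ON is assumed to be the
# ugcore cmake switch, otherwise -DCMAKE_BUILD_TYPE=Debug should also add symbols
cd ~/Code/git/ug4_2/ugcore/cmake && cmake -DDEBUG=ON . && make

# allow core dumps and reproduce the crash (run from app_Examples as above);
# core file name/location depends on the system's core_pattern (CentOS 7 may route it to abrt)
ulimit -c unlimited
mpirun --mca btl vader,self -np 2 ../ug4_2/bin/ugshell -ex laplace.lua -dim 3 -numRefs 6

# load the core file and print the backtrace with symbols
gdb --batch -ex bt ../ug4_2/bin/ugshell core.<pid>

# the mangled frames in the trace above can also be demangled directly
echo _ZN2ug22MultiGridSubsetHandler18assign_subset_implINS_6VertexEEEvPT_i | c++filt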

I assume by byte transport you refer to the Byte Transfer Layer (BTL), which I had to specify in my mpirun call?

The hardware might also be of interest: 720x Intel Xeon E5-2690 v4 @ 2.6 GHz.