GeoscienceAustralia / eqrm

Automatically exported from code.google.com/p/eqrm

eqrm crashing when using openmpi-1.6.1 #75

Closed: GoogleCodeExporter closed this issue 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. In /nas/gemd/georisk_models/earthquake/sandpits/duncan/EQRM/trunk/eqrm_core, run:
/usr/local/openmpi-1.6.1/bin/mpirun -np 1 python2.7 implementation_tests/scenarios/TS_risk20.py

The result is:
[rhe-compute1.ga.gov.au:23023] [[61052,1],0] ORTE_ERROR_LOG: Data unpack would 
read past end of buffer in file util/nidmap.c at line 371
[rhe-compute1.ga.gov.au:23023] [[61052,1],0] ORTE_ERROR_LOG: Data unpack would 
read past end of buffer in file base/ess_base_nidmap.c at line 62
[rhe-compute1.ga.gov.au:23023] [[61052,1],0] ORTE_ERROR_LOG: Data unpack would 
read past end of buffer in file ess_env_module.c at line 173
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_build_nidmap failed
  --> Returned value Data unpack would read past end of buffer (-26) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[rhe-compute1.ga.gov.au:23023] [[61052,1],0] ORTE_ERROR_LOG: Data unpack would 
read past end of buffer in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Data unpack would read past end of buffer (-26) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: orte_init failed
  --> Returned "Data unpack would read past end of buffer" (-26) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** before MPI was initialized
*** MPI_ERRORS_ARE_FATAL (your MPI job will now abort)
[rhe-compute1.ga.gov.au:23023] Abort before MPI_INIT completed successfully; 
not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 23023 on
node rhe-compute1.ga.gov.au exiting improperly. There are two reasons this 
could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Original issue reported on code.google.com by duncan.g...@gmail.com on 28 Aug 2012 at 11:53

GoogleCodeExporter commented 9 years ago
/usr/local/openmpi-1.6.1/bin/mpirun -np 23 python2.7 test_parallel.py

This also causes the error.
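
For reference, a bare pypar script along the following lines (not part of EQRM; the file name is made up) takes the EQRM code out of the picture entirely. Since the crash above happens inside MPI_Init, which pypar triggers on import, it should fail the same way under the 1.6.1 mpirun:

# mpi_check.py -- hypothetical file, only here to exercise pypar/MPI startup
import pypar                      # importing pypar calls MPI_Init

print 'rank %d of %d started OK' % (pypar.rank(), pypar.size())
pypar.finalize()                  # call MPI_Finalize before exiting

# Run with, for example:
#   /usr/local/openmpi-1.6.1/bin/mpirun -np 2 python2.7 mpi_check.py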

Original comment by duncan.g...@gmail.com on 28 Aug 2012 at 11:58

GoogleCodeExporter commented 9 years ago
ICT IS is going to install a different version of mpirun.  So I'll stop looking 
at this version.

Original comment by duncan.g...@gmail.com on 29 Aug 2012 at 2:11

GoogleCodeExporter commented 9 years ago
Here are some results from openmpi-1.4.5.

They indicate that EQRM exits without calling MPI finalize.
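
A minimal sketch of the kind of guard that would avoid this, assuming the driver scripts use pypar directly (run_scenario here is a placeholder, not an actual EQRM function):

import pypar                 # importing pypar calls MPI_Init

try:
    run_scenario()           # placeholder for the EQRM scenario/test driver
finally:
    pypar.finalize()         # always call MPI_Finalize so mpirun sees a clean exit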

dgray@rhe-compute1:/nas/gemd/georisk_models/earthquake/sandpits/duncan/EQRM/trunk/eqrm_core$ /usr/local/openmpi-1.4.5/bin/mpirun -np 1 python2.7 implementation_tests/scenarios/TS_risk20.py
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--------------------------------------------------------------------------
[[42100,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: rhe-compute1.ga.gov.au

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Pypar (version 2.1.4) initialised MPI OK with 1 processors
Logfile is './implementation_tests/current/TS_risk20/log-0.txt' with logging 
level of DEBUG, console logging level is INFO
*******1224**********
event_set_handler = generate
P0: Generating event set
P0: Saving event set to ./implementation_tests/current/TS_risk20/newc_event_set
P0: Event set created. Number of events=2
P0: Sites set created. Number of sites=7
P0: do site 1 of 7
P0: do site 2 of 7
P0: do site 3 of 7
P0: do site 4 of 7
P0: do site 5 of 7
P0: do site 6 of 7
P0: do site 7 of 7
time_pre_site_loop_fraction 0.532710280374
event_loop_time (excluding file saving) 0:00:01.070000 hr:min:sec
On node 0, rhe-compute1.ga.gov.au clock (processor) time taken overall 
0:00:01.130000 hr:min:sec.
On node 0, rhe-compute1.ga.gov.au wall time taken overall 0:00:03.131542 
hr:min:sec.
wall_time_taken_overall_seconds = 3.13154196739
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 5097 on
node rhe-compute1.ga.gov.au exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

dgray@rhe-compute1:/nas/gemd/georisk_models/earthquake/sandpits/duncan/EQRM/trunk/eqrm_core$ /usr/local/openmpi-1.4.5/bin/mpirun -np 1 python2.7 eqrm_code/test_parallel.py
.librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--------------------------------------------------------------------------
[[42660,1],0]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: rhe-compute1.ga.gov.au

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
Pypar (version 2.1.4) initialised MPI OK with 1 processors
.
----------------------------------------------------------------------
Ran 2 tests in 0.107s

OK
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 4409 on
node rhe-compute1.ga.gov.au exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

Original comment by duncan.g...@gmail.com on 29 Aug 2012 at 4:59