GeoscienceAustralia / eqrm

Automatically exported from code.google.com/p/eqrm

Segmentation fault crash #25

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. Use the run_rhe.sh script in 
EQRM\sandpits\dburbidg\python_eqrm\EQRM\trunk\case_studies\national\regional
2. Wait several hours

What is the expected output? What do you see instead?
I saw a seg fault instead of a successful run. The seg fault error message is as follows:

*** Process received signal ***
[rhe-compute1:03412] Signal: Segmentation fault (11)
[rhe-compute1:03412] Signal code: Address not mapped (1)
[rhe-compute1:03412] Failing at address: 0xffffffffb416b8f0
[rhe-compute1:03412] [ 0] /lib64/libpthread.so.0 [0x33e140eb10]
[rhe-compute1:03412] [ 1] /lib64/libc.so.6(memcpy+0xd2) [0x33e087c312]
[rhe-compute1:03412] [ 2] 
/usr/local/lib/libpython2.5.so.1.0(PyString_FromStringAndSize+0xef) 
[0x2aac07ed573f]
[rhe-compute1:03412] [ 3] 
/usr/local/lib/python2.5/lib-dynload/_ctypes.so(ffi_call_unix64+0x4c) 
[0x2aac0f733880]
[rhe-compute1:03412] [ 4] 
/usr/local/lib/python2.5/lib-dynload/_ctypes.so(ffi_call+0x223) [0x2aac0f732fe3]
[rhe-compute1:03412] [ 5] 
/usr/local/lib/python2.5/lib-dynload/_ctypes.so(_CallProc+0x322) 
[0x2aac0f72dca2]
[rhe-compute1:03412] [ 6] /usr/local/lib/python2.5/lib-dynload/_ctypes.so 
[0x2aac0f72734f]
[rhe-compute1:03412] [ 7] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x13) [0x2aac07e94b03]
[rhe-compute1:03412] [ 8] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x44a5) [0x2aac07f19415]
[rhe-compute1:03412] [ 9] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x821) [0x2aac07f1b9c1]
[rhe-compute1:03412] [10] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x55cf) [0x2aac07f1a53f]
[rhe-compute1:03412] [11] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x821) [0x2aac07f1b9c1]
[rhe-compute1:03412] [12] /usr/local/lib/libpython2.5.so.1.0 [0x2aac07eb5aa3]
[rhe-compute1:03412] [13] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x13) [0x2aac07e94b03]
[rhe-compute1:03412] [14] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_CallFunctionObjArgs+0x176) 
[0x2aac07e986c6]
[rhe-compute1:03412] [15] /usr/local/lib/python2.5/lib-dynload/_ctypes.so 
[0x2aac0f72737f]
[rhe-compute1:03412] [16] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x13) [0x2aac07e94b03]
[rhe-compute1:03412] [17] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x44a5) [0x2aac07f19415]
[rhe-compute1:03412] [18] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x821) [0x2aac07f1b9c1]
[rhe-compute1:03412] [19] /usr/local/lib/libpython2.5.so.1.0 [0x2aac07eb5a3d]
[rhe-compute1:03412] [20] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x13) [0x2aac07e94b03]
[rhe-compute1:03412] [21] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x3e87) [0x2aac07f18df7]
[rhe-compute1:03412] [22] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x821) [0x2aac07f1b9c1]
[rhe-compute1:03412] [23] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalFrameEx+0x55cf) [0x2aac07f1a53f]
[rhe-compute1:03412] [24] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_EvalCodeEx+0x821) [0x2aac07f1b9c1]
[rhe-compute1:03412] [25] /usr/local/lib/libpython2.5.so.1.0 [0x2aac07eb5aa3]
[rhe-compute1:03412] [26] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x13) [0x2aac07e94b03]
[rhe-compute1:03412] [27] /usr/local/lib/libpython2.5.so.1.0 [0x2aac07e9d01f]
[rhe-compute1:03412] [28] 
/usr/local/lib/libpython2.5.so.1.0(PyObject_Call+0x13) [0x2aac07e94b03]
[rhe-compute1:03412] [29] 
/usr/local/lib/libpython2.5.so.1.0(PyEval_CallObjectWithKeywords+0x6f) 
[0x2aac07f1428f]
[rhe-compute1:03412] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 3412 on node rhe-compute1.ga.gov.au 
exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

What version of the product are you using? On what operating system?

Revision 1887 on rhe-compute1

Please provide any additional information below.

Log file and output files attached.

Original issue reported on code.google.com by David.Bu...@ga.gov.au on 6 Mar 2012 at 5:02


GoogleCodeExporter commented 9 years ago

Original comment by b...@girorosso.com on 6 Mar 2012 at 5:06

GoogleCodeExporter commented 9 years ago
Running the same data in 'load' mode gets past the crash point. There is, however, a pretty huge jump in memory use per node once it gets into the site loop:

e.g.
2012-03-06 17:10:33,396 INFO                      analysis:352 |P0: do site 1 
of 556
2012-03-06 17:10:33,396 DEBUG                     analysis:354 |Memory: site 1
2012-03-06 17:10:33,397 DEBUG                     analysis:355 |Resource usage: 
memory=659.7MB resident=202.3MB stacksize=0.3MB
2012-03-06 17:19:45,833 INFO                      analysis:352 |P0: do site 2 
of 556
2012-03-06 17:19:45,834 DEBUG                     analysis:354 |Memory: site 2
2012-03-06 17:19:45,835 DEBUG                     analysis:355 |Resource usage: 
memory=2268.6MB resident=1827.7MB stacksize=0.3MB
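
As a point of reference (not EQRM's actual logging helper, which isn't shown in this issue), per-process figures like the memory/resident/stacksize values above can be read on Linux from /proc/self/status; the mapping to the VmSize/VmRSS/VmStk fields is an assumption here:

def memory_usage_mb():
    # Parse the kernel's per-process status file; values are reported in kB.
    usage = {}
    for line in open('/proc/self/status'):
        key, _, value = line.partition(':')
        if key in ('VmSize', 'VmRSS', 'VmStk'):
            usage[key] = float(value.split()[0]) / 1024.0  # kB -> MB
    return usage

print(memory_usage_mb())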

The memory use focus so far has been on the event set. I will look into the 
temporary arrays that the site loop uses.

The theory is that one of the nodes eventually runs out of memory to map to and 
so crashes. While a lot of memory use has been reduced so far, it looks like 
there is still some work to do in this area.
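
A minimal sketch of the kind of change being considered for the site loop (hypothetical code, not EQRM's actual analysis.py): drop references to the large per-site temporaries at the end of each iteration and force a collection, so that peak memory does not accumulate from site to site.

import gc
import numpy

def do_site(site_number):
    # Stand-in for the large temporary arrays built for each site.
    temp = numpy.zeros((2000, 5000))   # roughly 80 MB of float64
    summary = temp.sum()               # only a small result is kept
    del temp                           # drop the big temporary explicitly
    return summary

results = []
for i in range(1, 11):
    results.append(do_site(i))
    gc.collect()  # encourage prompt reclamation between sites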

Original comment by b...@girorosso.com on 6 Mar 2012 at 6:35

GoogleCodeExporter commented 9 years ago
Some more analysis:

Reproduced the problem on 4 nodes on rhe-compute1. This means that we're not 
hitting the issue described here - 
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5086614

The behaviour on rhe-compute1 is different from that on tornado with the same event set in load mode (tornado has yet to crash and may not).

rhe-compute1 node 0:

2012-03-07 12:11:04,697 INFO                      analysis:352 |P0: do site 13 
of 4445
2012-03-07 12:11:04,697 DEBUG                     analysis:354 |Memory: site 13
2012-03-07 12:11:04,698 DEBUG                     analysis:355 |Resource usage: 
memory=4320.1MB resident=3870.4MB stacksize=0.3MB

tornado node 0:

2012-03-07 12:03:29,792 INFO                      analysis:352 |P0: do site 13 
of 4445
2012-03-07 12:03:29,793 DEBUG                     analysis:354 |Memory: site 13
2012-03-07 12:03:29,793 DEBUG                     analysis:355 |Resource usage: 
memory=2360.0MB resident=1832.9MB stacksize=0.3MB

With the same event set I would expect the same memory usage, but rhe-compute1 uses almost double.

Noting:
- The Python version used to run the simulation is the same on both (2.5.2)

- OpenMPI is different (rhe-compute1 is 1.4, tornado is 1.2.3)

- OpenMPI on rhe-compute1 gives a warning about interfaces when run

librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--------------------------------------------------------------------------
[[54677,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:

Module: OpenFabrics (openib)
  Host: rhe-compute1.ga.gov.au

Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------

- rhe-compute1 and tornado are running slightly different kernel versions and 
different distros

rhe-compute1
$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
$ uname -a
Linux rhe-compute1.ga.gov.au 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 
2011 x86_64 x86_64 x86_64 GNU/Linux

tornado
$ cat /etc/redhat-release 
CentOS release 5 (Final)
$ uname -a
Linux tornado.agso.gov.au 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 
x86_64 x86_64 x86_64 GNU/Linux

Original comment by b...@girorosso.com on 7 Mar 2012 at 1:27

GoogleCodeExporter commented 9 years ago
The version of Open MPI on rhe-compute1 is 1.4

$ mpirun --version
mpirun (Open MPI) 1.4

Report bugs to http://www.open-mpi.org/community/help/

This is the stable branch and the latest version is 1.4.5.

While the release notes show a lot of bug fixes, it is not clear that any of them is the source of this issue. It is worthwhile to upgrade anyway, simply because of the number of bug fixes it would bring in.

http://svn.open-mpi.org/svn/ompi/branches/v1.4/NEWS

Original comment by b...@girorosso.com on 7 Mar 2012 at 2:01

GoogleCodeExporter commented 9 years ago
I'm putting together an email to Sam about upgrading Open MPI.
I found this link:
http://crocea.mednet.ucla.edu/log/hpc-cmb

Search for 'Signal code: Address not mapped' to get to the 2.2.2 tips section, which describes a similar seg fault issue.

Original comment by duncan.g...@gmail.com on 7 Mar 2012 at 2:39

GoogleCodeExporter commented 9 years ago
The memory usage is a red herring. rhe-compute1 has a lot more memory available than tornado. Python garbage collection is kicking in a lot earlier on tornado than on rhe-compute1, and hence the process sizes are smaller. tornado processes seem to stabilise at around 2GB whereas rhe-compute1 processes stabilise at around 6GB.
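
One way to check whether the cyclic garbage collector really is behaving differently on the two machines (a hedged sketch; gc only covers reference cycles, so it may not explain all of the difference) is to log its thresholds and per-generation counts alongside the resident size:

import gc

print(gc.get_threshold())  # allocation thresholds that trigger each generation
print(gc.get_count())      # current counts compared against those thresholds
print(gc.collect())        # unreachable objects found by a full manual pass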

The complication here is that rhe-compute1 and tornado have a different mix of software. The version of Open MPI is a lot newer on rhe-compute1, so I am trying a newer version of Python. Python 2.6 appears to be running without issue for this simulation. It has not completed yet, so it may still fail, but it looks promising so far.

Original comment by b...@girorosso.com on 7 Mar 2012 at 2:45

GoogleCodeExporter commented 9 years ago
Using Python 2.6, the same simulation with 32 nodes completed successfully in 13h 36m with event_set_handler='load' on rhe-compute1.

Given these results, Python 2.6 appears to be the version to use with Open MPI on rhe-compute1.

Original comment by b...@girorosso.com on 7 Mar 2012 at 10:18

GoogleCodeExporter commented 9 years ago
Noting that the command I ran was:

mpirun -np 32 -x PYTHONPATH /usr/local/bin/python2.6 runhaz.py
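
If there is any doubt about which interpreter each rank actually picks up, a tiny check script (check_interpreter.py is a hypothetical name, not part of EQRM) can be launched with the same mpirun incantation, e.g. mpirun -np 4 -x PYTHONPATH /usr/local/bin/python2.6 check_interpreter.py:

# check_interpreter.py
import sys
print(sys.executable)  # path of the interpreter this rank is running under
print(sys.version)     # full version string, e.g. 2.6.x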

Original comment by b...@girorosso.com on 7 Mar 2012 at 10:19