Original comment by b...@girorosso.com on 6 Mar 2012 at 5:06
Running the same data in 'load' mode gets past the crash point. There is a very large jump in memory use per node, however, once it gets into the site loop. For example:
2012-03-06 17:10:33,396 INFO analysis:352 |P0: do site 1 of 556
2012-03-06 17:10:33,396 DEBUG analysis:354 |Memory: site 1
2012-03-06 17:10:33,397 DEBUG analysis:355 |Resource usage: memory=659.7MB resident=202.3MB stacksize=0.3MB
2012-03-06 17:19:45,833 INFO analysis:352 |P0: do site 2 of 556
2012-03-06 17:19:45,834 DEBUG analysis:354 |Memory: site 2
2012-03-06 17:19:45,835 DEBUG analysis:355 |Resource usage: memory=2268.6MB resident=1827.7MB stacksize=0.3MB
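For context, log lines of that shape can be produced from /proc/self/status on Linux. The following is only an illustrative sketch of such a reporter, not the actual analysis.py code; the VmSize/VmRSS/VmStk field names are Linux-specific:

import logging

log = logging.getLogger('analysis')

def log_resource_usage(tag):
    # Parse /proc/self/status (Linux-specific); sizes there are in kB.
    fields = {}
    f = open('/proc/self/status')
    for line in f:
        key, _, value = line.partition(':')
        fields[key] = value.strip()
    f.close()

    def mb(key):
        # e.g. "VmRSS:  1234 kB" -> 1.2 (MB)
        return float(fields[key].split()[0]) / 1024.0

    log.debug('Memory: %s', tag)
    log.debug('Resource usage: memory=%.1fMB resident=%.1fMB stacksize=%.1fMB',
              mb('VmSize'), mb('VmRSS'), mb('VmStk'))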
The memory-reduction work so far has focused on the event set; I will look into the temporary arrays that the site loop uses.
The theory is that one of the nodes eventually runs out of memory to map and so crashes. While memory use has already been reduced considerably, it looks like there is still some work to do in this area.
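Since the theory is that a single node exhausts its memory, one way to test it would be to have every rank report its peak resident size and gather the figures on rank 0. A minimal sketch, using mpi4py purely for illustration (the simulation itself may use a different MPI binding):

import resource
from mpi4py import MPI

comm = MPI.COMM_WORLD

# ru_maxrss is the peak resident set size, reported in kB on Linux.
resident_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# Gather every rank's figure on rank 0 so the rank nearing exhaustion stands out.
all_resident = comm.gather(resident_mb, root=0)
if comm.rank == 0:
    for rank, mb in enumerate(all_resident):
        print('P%d: peak resident=%.1fMB' % (rank, mb))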
Original comment by b...@girorosso.com on 6 Mar 2012 at 6:35
Some more analysis:
Reproduced the problem on 4 nodes on rhe-compute1. This means that we're not hitting the issue described here:
http://www-947.ibm.com/support/entry/portal/docdisplay?lndocid=MIGR-5086614
The behaviour on rhe-compute1 is different from that on tornado with the same event set in load mode (tornado has yet to crash and may not).
rhe-compute1 node 0:
2012-03-07 12:11:04,697 INFO analysis:352 |P0: do site 13 of 4445
2012-03-07 12:11:04,697 DEBUG analysis:354 |Memory: site 13
2012-03-07 12:11:04,698 DEBUG analysis:355 |Resource usage: memory=4320.1MB resident=3870.4MB stacksize=0.3MB
tornado node 0:
2012-03-07 12:03:29,792 INFO analysis:352 |P0: do site 13 of 4445
2012-03-07 12:03:29,793 DEBUG analysis:354 |Memory: site 13
2012-03-07 12:03:29,793 DEBUG analysis:355 |Resource usage: memory=2360.0MB resident=1832.9MB stacksize=0.3MB
With the same event set I would expect the same memory usage, but rhe-compute1's is almost double.
Noting:
- The Python version used to run the simulation is the same (2.5.2)
- The Open MPI version is different (rhe-compute1 has 1.4, tornado has 1.2.3)
- Open MPI on rhe-compute1 gives a warning about interfaces when run:
librdmacm: couldn't read ABI version.
librdmacm: assuming: 4
CMA: unable to get RDMA device list
--------------------------------------------------------------------------
[[54677,1],1]: A high-performance Open MPI point-to-point messaging module
was unable to find any relevant network interfaces:
Module: OpenFabrics (openib)
Host: rhe-compute1.ga.gov.au
Another transport will be used instead, although this may result in
lower performance.
--------------------------------------------------------------------------
- rhe-compute1 and tornado are running slightly different kernel versions and
different distros
rhe-compute1
$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 5.6 (Tikanga)
$ uname -a
Linux rhe-compute1.ga.gov.au 2.6.18-238.12.1.el5 #1 SMP Sat May 7 20:18:50 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux
tornado
$ cat /etc/redhat-release
CentOS release 5 (Final)
$ uname -a
Linux tornado.agso.gov.au 2.6.18-53.el5 #1 SMP Mon Nov 12 02:14:55 EST 2007 x86_64 x86_64 x86_64 GNU/Linux
Original comment by b...@girorosso.com on 7 Mar 2012 at 1:27
The version of Open MPI on rhe-compute1 is 1.4
$ mpirun --version
mpirun (Open MPI) 1.4
Report bugs to http://www.open-mpi.org/community/help/
This is the stable branch, and the latest version is 1.4.5. While the release notes show a lot of bug fixes, it is not clear that any of them is the source of this issue. It is worthwhile to upgrade anyway, simply because of the number of bug fixes in place.
http://svn.open-mpi.org/svn/ompi/branches/v1.4/NEWS
Original comment by b...@girorosso.com on 7 Mar 2012 at 2:01
I'm putting together an email to Sam about upgrading Open MPI.
I found this link:
http://crocea.mednet.ucla.edu/log/hpc-cmb
Search for 'Signal code: Address not mapped' to get to the 2.2.2 tips section, which describes a similar segfault issue.
Original comment by duncan.g...@gmail.com on 7 Mar 2012 at 2:39
The memory usage is a red herring. rhe-compute1 has a lot more memory available than tornado, so Python garbage collection kicks in much earlier on tornado and the process sizes stay smaller: tornado processes seem to stabilise at around 2GB, whereas rhe-compute1's stabilise at around 6GB.
The complication here is that rhe-compute1 and tornado have a different mix of software. The version of Open MPI is a lot newer on rhe-compute1, so I am trying a newer version of Python. Python 2.6 appears to be running this simulation without issue. It has not completed yet, so it may still fail, but it looks promising so far.
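A cheap way to test the garbage-collection theory would be to force a full collection at the end of every site iteration and see whether the rhe-compute1 processes then stabilise at tornado-like sizes. A sketch, where sites and do_site stand in for the real site-loop code:

import gc

def process_sites(sites, do_site):
    for i, site in enumerate(sites):
        do_site(site)
        # Force a full collection every iteration instead of waiting for
        # the GC thresholds to trigger; if process size now stabilises,
        # the growth was delayed collection rather than a real leak.
        unreachable = gc.collect()
        if unreachable:
            print('site %d: gc found %d unreachable objects' % (i, unreachable))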
Original comment by b...@girorosso.com on 7 Mar 2012 at 2:45
Using Python 2.6, the same simulation on 32 nodes completed successfully in 13h 36m with event_set_handler='load' on rhe-compute1.
Given these results, Python 2.6 seems to be the version to use with Open MPI on rhe-compute1.
Original comment by b...@girorosso.com on 7 Mar 2012 at 10:18
For the record, the command I ran was:
mpirun -np 32 -x PYTHONPATH /usr/local/bin/python2.6 runhaz.py
Original comment by b...@girorosso.com on 7 Mar 2012 at 10:19
Original issue reported on code.google.com by David.Bu...@ga.gov.au on 6 Mar 2012 at 5:02