NCPP / ocgis

OpenClimateGIS is a set of geoprocessing and calculation tools for CF-compliant climate datasets.

"Cannot allocate memory" error #475

Closed aaschwanden closed 6 years ago

aaschwanden commented 6 years ago

Hi,

I'm trying to run OCGIS on an HPC cluster that uses SLURM and has dedicated post-processing nodes (http://www.gi.alaska.edu/research-computing-systems/hpc).

This sometimes works, but more often it doesn't. Running my script on the login node, I get the following warning:

--------------------------------------------------------------------------
An MPI process has executed an operation involving a call to the
"fork()" system call to create a child process.  Open MPI is currently
operating in a condition that could result in memory corruption or
other system errors; your MPI job may hang, crash, or produce silent
data corruption.  The use of fork() (or system() or other calls that
create child processes) is strongly discouraged.  

The process that invoked fork was:

  Local host:          chinook01 (PID 19077)
  MPI_COMM_WORLD rank: 0

If you are *absolutely sure* that your application will successfully
and correctly survive a call to fork(), you may disable this warning
by setting the mpi_warn_on_fork MCA parameter to 0.

but running the same on the post-processing node, it bails most of the time with:

Traceback (most recent call last):
  File "/u1/uaf/aaschwanden/base/gris-analysis/basins/extract_basins.py", line 6, in <module>
    import ocgis
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/ocgis-2.1.0.dev1-py2.7.egg/ocgis/__init__.py", line 8, in <module>
    from ocgis.vmachine.core import vm, OcgVM
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/ocgis-2.1.0.dev1-py2.7.egg/ocgis/vmachine/core.py", line 5, in <module>
    from ocgis.base import AbstractOcgisObject
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/ocgis-2.1.0.dev1-py2.7.egg/ocgis/base.py", line 10, in <module>
    from ocgis.util.helpers import get_iter
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/ocgis-2.1.0.dev1-py2.7.egg/ocgis/util/helpers.py", line 18, in <module>
    from shapely.geometry import Point
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/Shapely-1.6.3-py2.7-linux-x86_64.egg/shapely/geometry/__init__.py", line 4, in <module>
    from .base import CAP_STYLE, JOIN_STYLE
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/Shapely-1.6.3-py2.7-linux-x86_64.egg/shapely/geometry/base.py", line 17, in <module>
    from shapely.coords import CoordinateSequence
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/Shapely-1.6.3-py2.7-linux-x86_64.egg/shapely/coords.py", line 8, in <module>
    from shapely.geos import lgeos
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/Shapely-1.6.3-py2.7-linux-x86_64.egg/shapely/geos.py", line 75, in <module>
    _lgeos = load_dll('geos_c', fallbacks=alt_paths)
  File "/u1/uaf/aaschwanden/.local/lib/python2.7/site-packages/Shapely-1.6.3-py2.7-linux-x86_64.egg/shapely/geos.py", line 28, in load_dll
    lib = find_library(libname)
  File "/usr/local/pkg/lang/Python/2.7.12-pic-intel-2016b/lib/python2.7/ctypes/util.py", line 237, in find_library
    return _findSoname_ldconfig(name) or _get_soname(_findLib_gcc(name))
  File "/usr/local/pkg/lang/Python/2.7.12-pic-intel-2016b/lib/python2.7/ctypes/util.py", line 226, in _findSoname_ldconfig
    f = os.popen('LC_ALL=C LANG=C /sbin/ldconfig -p 2>/dev/null')
OSError: [Errno 12] Cannot allocate memory

At first I considered this a glitch in our HPC system that we were not able to track down, but now I've come across some posts that relate this error to the use of "fork" in Python:

https://stackoverflow.com/questions/20111242/how-to-avoid-errno-12-cannot-allocate-memory-errors-caused-by-using-subprocess

so I've decided to post it here because it happens during the initialization of OCGIS.
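The last frames of the traceback are in the standard library (`ctypes.util.find_library`), not in OCGIS or Shapely, so the failing step can be exercised on its own. A minimal sketch, assuming only the standard library; `probe_find_library` is a hypothetical helper name, and newer Python versions may handle the `ldconfig` call differently from the 2.7.12 build in the traceback:

```python
import ctypes.util
import errno


def probe_find_library(name="geos_c"):
    """Call ctypes.util.find_library and classify the outcome.

    Returns a (status, detail) tuple where status is one of
    "ok", "enomem" or "error". On "ok", detail is the library
    name that was found, or None if nothing was found.
    """
    try:
        found = ctypes.util.find_library(name)
    except OSError as exc:
        # Errno 12 (ENOMEM) is the error shown in the traceback.
        if exc.errno == errno.ENOMEM:
            return "enomem", str(exc)
        return "error", str(exc)
    return "ok", found


if __name__ == "__main__":
    print(probe_find_library("geos_c"))
```

If this reports `enomem` on the post-processing node without OCGIS being imported at all, the problem would seem to lie with process creation on that node rather than with OCGIS itself.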

I understand that the information provided here may be insufficient to diagnose the problem, and I'd be happy to share additional debugging info.

Thanks.

bekozi commented 6 years ago

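Hi @aaschwanden. A few quick questions before digging into this further.

• Does this happen immediately when ocgis is first imported? Or does this happen after some other processes are run?
• Is this job being run in parallel or on a single process?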

Thanks!

aaschwanden commented 6 years ago

Hi,

I’m also cc’ing our HPC experts, as the idea has come up that a glitch in the configuration of the nodes allows too many post-processing jobs to run on a single node. We currently don’t know whether this is related or a red herring, though.

On Jan 2, 2018, at 6:56 AM, Ben Koziol notifications@github.com wrote:

Hi @aaschwanden. A few quick questions before digging into this further.

• Does this happen immediately when ocgis is first imported? Or does this happen after some other processes are run?

It happens immediately when ocgis is first imported

• Is this job being run in parallel or on a single process?

Serial. I currently don’t know how to run ocgis in parallel for my problem. I will contact you with a separate issue to explore how to use ocgis most efficiently for extracting data from large (>0.5TB) files.

Thanks!


bekozi commented 6 years ago

Thanks @aaschwanden. Let me know what you find out! Reading through the SO post you linked to, it sounds like popen/fork in Python can use a non-negligible amount of memory. This could help explain the out-of-memory error if memory usage on the node is too high.
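One way to check the node-usage hypothesis is to look at the kernel's memory accounting just before the import. The following is a minimal sketch, assuming a Linux node that exposes `/proc/meminfo`; `parse_meminfo` and `commit_headroom_kb` are hypothetical helper names, and the commit limit is only enforced when the kernel runs in strict overcommit mode, which is not known for this cluster:

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most fields are reported in kB; a few (e.g. HugePages_Total)
    are plain counts.
    """
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def commit_headroom_kb(info):
    """Return CommitLimit - Committed_AS in kB, or None if either is missing."""
    if "CommitLimit" in info and "Committed_AS" in info:
        return info["CommitLimit"] - info["Committed_AS"]
    return None


if __name__ == "__main__":
    try:
        with open("/proc/meminfo") as handle:
            info = parse_meminfo(handle.read())
    except IOError:
        # Not Linux, or /proc is not mounted.
        info = {}
    print("MemAvailable (kB): %s" % info.get("MemAvailable"))
    print("Commit headroom (kB): %s" % commit_headroom_kb(info))
```

Logging these two numbers at the top of the job script, on both the login node and the post-processing node, would show whether the failures coincide with low available memory.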

Serial. I currently don’t know how to run ocgis in parallel for my problem. I will contact you with a separate issue to explore how to use ocgis most efficiently for extracting data from large (>0.5TB) files.

Sounds good!

bekozi commented 6 years ago

Hi @aaschwanden. I wanted to check in and see if you've made any progress on this issue. How's it going?

aaschwanden commented 6 years ago

I’ve been traveling for the past four weeks and haven’t had a chance to look into it. Back in the office next week to revisit the issue.

On Jan 31, 2018, at 10:07, Ben Koziol notifications@github.com wrote:

Hi @aaschwanden. I wanted to check in and see if you've made any progress on this issue. How's it going?


bekozi commented 6 years ago

Ahhh, I hope you are having a good trip. No rush of course.

bekozi commented 6 years ago

@aaschwanden I'm going to close this for now. Let me know if there is still an issue on your end.