MPAS-Dev / MPAS

Repository for private MPAS development prior to the MPAS v6.0 release.

Insufficient virtual memory error #1490

Closed pwolfram closed 6 years ago

pwolfram commented 6 years ago

The current version of ocean/develop fails with

forrtl: severe (41): insufficient virtual memory
Image              PC                Routine            Line        Source             
ocean_model        0000000000E5A07B  for_allocate          Unknown  Unknown
ocean_model        0000000000B1AC35  Unknown               Unknown  Unknown
ocean_model        00000000007C7F9A  Unknown               Unknown  Unknown
ocean_model        00000000007C6731  Unknown               Unknown  Unknown
ocean_model        0000000000807E52  Unknown               Unknown  Unknown
ocean_model        0000000000977E3D  Unknown               Unknown  Unknown
ocean_model        0000000000983424  Unknown               Unknown  Unknown
ocean_model        000000000074BE08  Unknown               Unknown  Unknown
ocean_model        000000000066D7A6  Unknown               Unknown  Unknown
ocean_model        000000000040F8A5  Unknown               Unknown  Unknown
ocean_model        000000000040F83E  Unknown               Unknown  Unknown
ocean_model        000000000040F7EE  Unknown               Unknown  Unknown
libc-2.17.so       00002B566F764C05  __libc_start_main     Unknown  Unknown
ocean_model        000000000040F6E9  Unknown               Unknown  Unknown

errors when running the default SOMA test cases across resolutions (4, 8, 16, 32), targeting 200-300 cells per processor.

This is on grizzly, built with make ifort CORE=ocean AUTOCLEAN=true DEBUG=false and run with mpirun.

Currently Loaded Modules:
  1) python/anaconda-2.7-climate   3) openmpi/1.10.5   5) parallel-netcdf/1.5.0
  2) intel/17.0.1                  4) netcdf/4.4.1     6) pio/1.7.2

Linux gr-fe2.lanl.gov 3.10.0-693.11.6.1chaos.ch6.x86_64 #1 SMP Wed Jan 3 18:19:50 PST 2018 x86_64 x86_64 x86_64 GNU/Linux
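
For reference, the "severe (41)" abort above is what the Intel Fortran runtime raises when an ALLOCATE request cannot be satisfied (the for_allocate frame in the traceback). A minimal standalone sketch, not MPAS code, showing how such a failure can be trapped with a stat= check so the failing request is reported instead of aborting:

program alloc_check
   implicit none
   real, allocatable :: work(:)   ! illustrative work array, not an MPAS variable
   integer :: ierr, n

   n = huge(0)                    ! a very large request, used only to exercise the check

   ! With stat=, a failed allocation returns a nonzero code instead of
   ! aborting with "forrtl: severe (41): insufficient virtual memory".
   allocate(work(n), stat=ierr)
   if (ierr /= 0) then
      print *, 'allocation of', n, 'reals failed with stat =', ierr
      stop 1
   end if

   work = 0.0
   deallocate(work)
end program alloc_check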

The issue appears to be related to the following commit:

commit 1cd59184880c14dfc9b80292453d37f66fa792d6
Merge: 6b15cda9 ab9ae329
Author: Mark Petersen <mpetersen@lanl.gov>
Date:   Thu Jan 18 06:47:08 2018

    Merge PR #1458 'bill/exchange_reuse_ocean_core' into ocean/develop

    Primary changes here are the use of mpas_dmpar reuse calls during the
    barotropic subcycle. This prevents the subcycle from creating and destroying
    the same data structure during every subcycle iteration.

    There is also a lesser change where I tracked the subcycle data dependencies by
    hand, and modified to remove some "not sure why but an extra exchange here
    works" code. I originally did this to enable using larger halos which would not
    need to be exchanged during every subcycle, but the gains were not there so I
    moved on.
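
For context, the reuse change described above amounts to building the communication structure once and reusing it across the barotropic subcycle, rather than creating and destroying it every iteration; if a per-iteration destroy is ever skipped, memory grows with every time step, which is the kind of behavior reported here. A rough, self-contained sketch of the two patterns follows; exchange_list_type and the create/do/destroy routines are hypothetical stand-ins for illustration, not the actual mpas_dmpar interfaces.

module exchange_sketch
   implicit none
   type :: exchange_list_type
      integer, allocatable :: sendIndices(:)   ! placeholder for real exchange metadata
   end type
contains
   subroutine create_exchange(list, n)
      type(exchange_list_type), intent(out) :: list
      integer, intent(in) :: n
      allocate(list%sendIndices(n))            ! setup cost paid on every call
   end subroutine
   subroutine do_exchange(list)
      type(exchange_list_type), intent(inout) :: list
      list%sendIndices = 0                     ! stands in for the actual halo exchange
   end subroutine
   subroutine destroy_exchange(list)
      type(exchange_list_type), intent(inout) :: list
      deallocate(list%sendIndices)
   end subroutine
end module exchange_sketch

program subcycle_reuse
   use exchange_sketch
   implicit none
   type(exchange_list_type) :: list
   integer :: iter

   ! Old pattern: create and destroy inside the loop, once per subcycle
   ! iteration; if the destroy is missed, memory grows each iteration.
   do iter = 1, 100
      call create_exchange(list, 1000)
      call do_exchange(list)
      call destroy_exchange(list)
   end do

   ! Reuse pattern from the commit: create once, reuse every iteration,
   ! destroy once at the end.
   call create_exchange(list, 1000)
   do iter = 1, 100
      call do_exchange(list)
   end do
   call destroy_exchange(list)
end program subcycle_reuse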

Testing again (in consultation with @mark-petersen) using the parent hash:

*   6b15cda97 (remove_restart_variables) Merge 'develop' into ocean/develop (reusable halo exchanges)

Running at that hash has not yet yielded any errors and appears to work, suggesting a possible bug (as identified by @mark-petersen) introduced in 1cd59184880c14dfc9b80292453d37f66fa792d6.

pwolfram commented 6 years ago

@mark-petersen, did this get fixed? This may be related to the memory leak I'm observing locally.

pwolfram commented 6 years ago

I'm getting a similar issue with the Delaware wetting / drying test case:

App launch reported: 1 (out of 1) daemons - 36 (out of 36) procs
Insufficient memory to allocate Fortran RTL message buffer, message #41 = hex 00000029.
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61678,1],1]
  Exit code:    41
--------------------------------------------------------------------------
mark-petersen commented 6 years ago

We fixed a memory leak just after that. Of course, I don't know whether it is related to the problem you are seeing now. The fix PRs are https://github.com/MPAS-Dev/MPAS/pull/1501, https://github.com/MPAS-Dev/MPAS/pull/1502, and https://github.com/MPAS-Dev/MPAS/pull/1515, but they are not worth reading in any detail.

pwolfram commented 6 years ago

Thanks @mark-petersen, this is great. There may be another issue that has cropped up, but the key problem highlighted here has been resolved, so I'm closing this issue.