COSIMA / ocean-ic

Create MOM or NEMO initial conditions from GODAS or ORAS4. Uses ESMF for regridding.

Error: ESMF_RegridWeightGen failed return code 139 #3

Open FanghuaWu opened 6 years ago

FanghuaWu commented 6 years ago

Hi Nic,

I am working on interpolating WOA observational data onto the MOM grids (0.1, 0.25 and 1.0 degree). The ocean-ic code (from Aidan’s interpolation scripts: https://github.com/aidanheerdegen/initial_conditions_WOA) works well for 0.25 and 1.0 degree.

However, for the tenth-degree version, I got the following error message:

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libintlc.so.5      00007F3B59524961  Unknown               Unknown  Unknown
libintlc.so.5      00007F3B595230B7  Unknown               Unknown  Unknown
libifcoremt.so.5   00007F3B57862942  Unknown               Unknown  Unknown
libifcoremt.so.5   00007F3B57862796  Unknown               Unknown  Unknown
libifcoremt.so.5   00007F3B577CC2A6  Unknown               Unknown  Unknown
libifcoremt.so.5   00007F3B577DD533  Unknown               Unknown  Unknown
libpthread.so.0    00007F3B5930A7E0  Unknown               Unknown  Unknown
libopen-pal.so.6   00007F3B56B654E5  Unknown               Unknown  Unknown
libmpi.so.1        00007F3B57D6030C  Unknown               Unknown  Unknown
libmpi.so.1        00007F3B57D6083B  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007F3B4964E046  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007F3B4964E512  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007F3B496417E9  Unknown               Unknown  Unknown
libmpi.so.1        00007F3B57D7A3F0  Unknown               Unknown  Unknown
libesmf.so         00007F3B5C7EB53D  c_esmc_vmbarrier_         309  ESMCI_VM_F.C
libesmf.so         00007F3B5CD8FA4F  esmf_vmmod_mp_esm        2790  ESMF_VM.F90
libesmf.so         00007F3B5CC05EE3  esmf_ioscripmod_m        2259  ESMF_IOScrip.F90
libesmf.so         00007F3B5CD2D5F4  esmf_regridweight        1398  ESMF_RegridWeightGen.F90
ESMF_RegridWeight  000000000040551C  MAIN__                    728  ESMF_RegridWeightGen.F90
ESMF_RegridWeight  00000000004021FE  Unknown               Unknown  Unknown
libc.so.6          00007F3B570E1D1D  Unknown               Unknown  Unknown
ESMF_RegridWeight  0000000000402109  Unknown               Unknown  Unknown
Error: ESMF_RegridWeightGen failed return code 139
b' Starting weight generation with these inputs: \n   
Source File: /jobfs/local/9103526.r-man2/tmpv6ri03vt.nc\n   
Destination File: /jobfs/local/9103526.r-man2/tmpg6bbr25t.nc\n   
Weight File: /jobfs/local/9103526.r-man2/tmpolfwg0fn.nc\n   
Source File is in SCRIP format\n   
Source Grid is a global grid\n   
Source Grid is a logically rectangular grid\n   
Destination File is in SCRIP format\n   
Destination Grid is a global grid\n   
Destination Grid is a logically rectangular grid\n   
Regrid Method: bilinear\n   
Pole option: ALL\n   
Norm Type: dstarea\n \n--------------------------------------------------------------------------\n
mpirun noticed that process rank 0 with PID 25242 on node r2652 exited on signal 11 (Segmentation fault).\n
--------------------------------------------------------------------------\n'
Contents of PET0.RegridWeightGen.Log:
20170926 135914.240 INFO             PET0 Running with ESMF Version 6.3.0rp1

My work directory is: /short/x77/fw4078/test/WOA2013/initial_conditions_WOA/. The only difference between the three grid versions is the ocean_hgrid.nc and ocean_vgrid.nc files. I guess the configuration file for the tenth-degree version should be changed. Could you please give me some suggestions? Thanks a lot.

nichannah commented 6 years ago

Hi @FanghuaWu, does this work with the default ocean_hgrid.nc and ocean_vgrid.nc referenced by Aidan's script?

P.S. I have run Aidan's script with 16 CPUs and 32 GB of memory and it seems to be working. I will try to reproduce using your input files.

FanghuaWu commented 6 years ago

Hi @nicjhan, if you mean the 01 version using Aidan's script with the default ocean_hgrid.nc and ocean_vgrid.nc, it doesn't work for me. I also tested with 16 CPUs and 32 GB of memory and got the same error.

nichannah commented 6 years ago

OK, we need to figure out what's different between your setup and mine.

Can you please post more of your output? For example, the contents of these files (I don't have permission to read them):

-rw------- 1 fw4078 x77 15051 Oct 6 15:52 make_ic.e9448292
-rw------- 1 fw4078 x77 1044 Oct 6 15:52 make_ic.o9448292

FanghuaWu commented 6 years ago

I have opened read permission on those two files and have also posted them here.

BTW, I ran with 16 CPUs. However, there are only 8 log files. There should be 16 log files, right?

make_ic.o9448292

input.nc: OK
ocean_hgrid.nc: OK
ocean_vgrid.nc: OK
/apps/esmf/6.3.0rp1-intel/bin/binO/Linux.intel.64.openmpi.default/ESMF_RegridWeightGen
/apps/openmpi/wrapper/mpirun
global_src_grid_scrip /jobfs/local/9448292.r-man2/tmpcw_upx53.nc
dest_grid_scrip /jobfs/local/9448292.r-man2/tmp2mm9ia1a.nc

======================================================================================
                  Resource Usage on 2017-10-06 15:52:05:
   Job Id:             9448292.r-man2
   Project:            x77
   Exit Status:        1
   Service Units:      10.39
   NCPUs Requested:    16                     NCPUs Used: 16
                                           CPU Time Used: 00:29:17
   Memory Requested:   32.0GB                Memory Used: 17.94GB
   Walltime requested: 01:00:00            Walltime Used: 00:12:59
   JobFS requested:    400.0GB                JobFS used: 3.27GB
======================================================================================

make_ic.e9448292

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libintlc.so.5      00007FE2FAA06961  Unknown               Unknown  Unknown
libintlc.so.5      00007FE2FAA050B7  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FE2F8D44942  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FE2F8D44796  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FE2F8CAE2A6  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FE2F8CBF533  Unknown               Unknown  Unknown
libpthread.so.0    00007FE2FA7EC7E0  Unknown               Unknown  Unknown
mca_btl_vader.so   00007FE2ECF39DD7  Unknown               Unknown  Unknown
mca_btl_vader.so   00007FE2ECF3B6B8  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FE2F80474FC  Unknown               Unknown  Unknown
libmpi.so.1        00007FE2F924230C  Unknown               Unknown  Unknown
libmpi.so.1        00007FE2F924283B  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007FE2EAB30046  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007FE2EAB30512  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007FE2EAB237E9  Unknown               Unknown  Unknown
libmpi.so.1        00007FE2F925C3F0  Unknown               Unknown  Unknown
libesmf.so         00007FE2FDCCD53D  c_esmc_vmbarrier_         309  ESMCI_VM_F.C
libesmf.so         00007FE2FE271A4F  esmf_vmmod_mp_esm        2790  ESMF_VM.F90
libesmf.so         00007FE2FE0E7EE3  esmf_ioscripmod_m        2259  ESMF_IOScrip.F90
libesmf.so         00007FE2FE20F5F4  esmf_regridweight        1398  ESMF_RegridWeightGen.F90
ESMF_RegridWeight  000000000040551C  MAIN__                    728  ESMF_RegridWeightGen.F90
ESMF_RegridWeight  00000000004021FE  Unknown               Unknown  Unknown
libc.so.6          00007FE2F85C3D1D  Unknown               Unknown  Unknown
ESMF_RegridWeight  0000000000402109  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
libintlc.so.5      00007FAEAB7D0961  Unknown               Unknown  Unknown
libintlc.so.5      00007FAEAB7CF0B7  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FAEA9B0E942  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FAEA9B0E796  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FAEA9A782A6  Unknown               Unknown  Unknown
libifcoremt.so.5   00007FAEA9A89533  Unknown               Unknown  Unknown
libpthread.so.0    00007FAEAB5B67E0  Unknown               Unknown  Unknown
mca_btl_vader.so   00007FAE9DD056B8  Unknown               Unknown  Unknown
libopen-pal.so.6   00007FAEA8E114FC  Unknown               Unknown  Unknown
libmpi.so.1        00007FAEAA00C30C  Unknown               Unknown  Unknown
libmpi.so.1        00007FAEAA00C83B  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007FAE9B8FA046  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007FAE9B8FA512  Unknown               Unknown  Unknown
mca_coll_tuned.so  00007FAE9B8ED7E9  Unknown               Unknown  Unknown
libmpi.so.1        00007FAEAA0263F0  Unknown               Unknown  Unknown
libesmf.so         00007FAEAEA9753D  c_esmc_vmbarrier_         309  ESMCI_VM_F.C
libesmf.so         00007FAEAF03BA4F  esmf_vmmod_mp_esm        2790  ESMF_VM.F90
libesmf.so         00007FAEAEEB1EE3  esmf_ioscripmod_m        2259  ESMF_IOScrip.F90
libesmf.so         00007FAEAEFD95F4  esmf_regridweight        1398  ESMF_RegridWeightGen.F90
ESMF_RegridWeight  000000000040551C  MAIN__                    728  ESMF_RegridWeightGen.F90
ESMF_RegridWeight  00000000004021FE  Unknown               Unknown  Unknown
libc.so.6          00007FAEA938DD1D  Unknown               Unknown  Unknown
ESMF_RegridWeight  0000000000402109  Unknown               Unknown  Unknown
.
.
.
Error: ESMF_RegridWeightGen failed return code 139
b' Starting weight generation with these inputs: \n   Source File: /jobfs/local/9448292.r-man2/tmpcw_upx53.nc\n   Destination File: /jobfs/local/9448292.r-man2/tmp2mm9ia1a.nc\n   Weight File: /jobfs/local/9448292.r-man2/tmp9s5keb8o.nc\n   Source File is in SCRIP format\n   Source Grid is a global grid\n   Source Grid is a logically rectangular grid\n   Destination File is in SCRIP format\n   Destination Grid is a global grid\n   Destination Grid is a logically rectangular grid\n   Regrid Method: bilinear\n   Pole option: ALL\n   Norm Type: dstarea\n \n--------------------------------------------------------------------------\nmpirun noticed that process rank 0 with PID 7088 on node r302 exited on signal 11 (Segmentation fault).\n--------------------------------------------------------------------------\n'
Contents of PET0.RegridWeightGen.Log:
20170926 135914.240 INFO             PET0 Running with ESMF Version 6.3.0rp1
20171001 130203.828 INFO             PET0 Running with ESMF Version 6.3.0rp1
20171001 163829.691 INFO             PET0 Running with ESMF Version 6.3.0rp1
20171001 165020.945 INFO             PET0 Running with ESMF Version 6.3.0rp1
20171006 154545.381 INFO             PET0 Running with ESMF Version 6.3.0rp1

nichannah commented 6 years ago

Hi @FanghuaWu, can you please make sure you have something like this in your ~/.bashrc? Notice the ulimit -s unlimited bit.

if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi

ulimit -s unlimited

I think that might be your problem. If this fixes things, we need to document that requirement somewhere.
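
To see which stack limit a process actually inherits, something like the following could be run in the same environment as the job. This is only a minimal sketch using Python's standard resource module; the helper name describe_stack_limit is made up for this illustration and is not part of ocean-ic.

```python
import resource


def describe_stack_limit():
    """Return the soft stack limit of this process as a readable string."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    if soft == resource.RLIM_INFINITY:
        return "unlimited"
    # getrlimit reports bytes, while `ulimit -s` reports kilobytes.
    return "{} kB".format(soft // 1024)


if __name__ == "__main__":
    print("stack soft limit:", describe_stack_limit())
```

If this prints anything other than "unlimited" inside the batch job, the ulimit line in ~/.bashrc is probably not being picked up.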

FanghuaWu commented 6 years ago

Yes, @nicjhan, after setting stack size unlimited, the problem was solved. Thank you very much.

nichannah commented 6 years ago

@aidanheerdegen do you have any ideas about how we can avoid this kind of thing in the future? How many hours of pain has this one line caused our group over time!?
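
One possible guard, sketched here under assumptions rather than taken from ocean-ic, is for the calling script to raise the stack limit itself before launching ESMF_RegridWeightGen, and to translate return codes like 139 into a signal name so the failure is easier to recognise. The function names below are hypothetical, and a limit raised this way may not reach MPI ranks started on other nodes.

```python
import resource
import signal
import subprocess


def raise_stack_limit():
    """Try to raise the soft stack limit to the hard limit (best effort)."""
    _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
    try:
        resource.setrlimit(resource.RLIMIT_STACK, (hard, hard))
    except (ValueError, OSError):
        # Some platforms refuse the change; keep the existing limit.
        pass


def describe_return_code(code):
    """Turn a return code into a signal name where one applies."""
    if code < 0:
        # subprocess convention: child was killed by signal -code.
        signum = -code
    elif code > 128:
        # Shell convention (also seen here via mpirun): 128 + signal number.
        signum = code - 128
    else:
        return "exit status {}".format(code)
    try:
        return signal.Signals(signum).name
    except ValueError:
        return "unknown signal {}".format(signum)


def run_with_raised_stack(cmd):
    """Run cmd in a child process whose stack limit was raised before exec."""
    proc = subprocess.run(cmd, preexec_fn=raise_stack_limit)
    return proc.returncode, describe_return_code(proc.returncode)
```

With this, the return code 139 from the logs above decodes as 128 + 11, i.e. SIGSEGV, which matches the "exited on signal 11 (Segmentation fault)" line reported by mpirun.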