Open FanghuaWu opened 6 years ago
Hi @FanghuaWu, does this work with the default ocean_hgrid.nc and ocean_vgrid.nc referenced by Aidan's script?
P.S. I have run Aidan's script with 16 CPUs and 32 GB of memory and it seems to be working. I will try to reproduce the problem using your input files.
Hi @nicjhan, if you mean the 01 version, using Aidan's script and the default ocean_hgrid.nc and ocean_vgrid.nc, it doesn't work for me. I also tested with 16 CPUs and 32 GB of memory and got the same error.
OK, we need to figure out what's different between your setup and mine.
Can you please post more of your output? For example, the contents of these files (I don't have permission to read them):
-rw------- 1 fw4078 x77 15051 Oct 6 15:52 make_ic.e9448292
-rw------- 1 fw4078 x77 1044 Oct 6 15:52 make_ic.o9448292
I have enabled read permission on those two files and also posted their contents here.
BTW, I ran with 16 CPUs. However, there are only 8 log files. There should be 16 log files, right?
make_ic.o9448292
input.nc: OK
ocean_hgrid.nc: OK
ocean_vgrid.nc: OK
/apps/esmf/6.3.0rp1-intel/bin/binO/Linux.intel.64.openmpi.default/ESMF_RegridWeightGen
/apps/openmpi/wrapper/mpirun
global_src_grid_scrip /jobfs/local/9448292.r-man2/tmpcw_upx53.nc
dest_grid_scrip /jobfs/local/9448292.r-man2/tmp2mm9ia1a.nc
======================================================================================
Resource Usage on 2017-10-06 15:52:05:
Job Id: 9448292.r-man2
Project: x77
Exit Status: 1
Service Units: 10.39
NCPUs Requested: 16 NCPUs Used: 16
CPU Time Used: 00:29:17
Memory Requested: 32.0GB Memory Used: 17.94GB
Walltime requested: 01:00:00 Walltime Used: 00:12:59
JobFS requested: 400.0GB JobFS used: 3.27GB
======================================================================================
make_ic.e9448292
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libintlc.so.5 00007FE2FAA06961 Unknown Unknown Unknown
libintlc.so.5 00007FE2FAA050B7 Unknown Unknown Unknown
libifcoremt.so.5 00007FE2F8D44942 Unknown Unknown Unknown
libifcoremt.so.5 00007FE2F8D44796 Unknown Unknown Unknown
libifcoremt.so.5 00007FE2F8CAE2A6 Unknown Unknown Unknown
libifcoremt.so.5 00007FE2F8CBF533 Unknown Unknown Unknown
libpthread.so.0 00007FE2FA7EC7E0 Unknown Unknown Unknown
mca_btl_vader.so 00007FE2ECF39DD7 Unknown Unknown Unknown
mca_btl_vader.so 00007FE2ECF3B6B8 Unknown Unknown Unknown
libopen-pal.so.6 00007FE2F80474FC Unknown Unknown Unknown
libmpi.so.1 00007FE2F924230C Unknown Unknown Unknown
libmpi.so.1 00007FE2F924283B Unknown Unknown Unknown
mca_coll_tuned.so 00007FE2EAB30046 Unknown Unknown Unknown
mca_coll_tuned.so 00007FE2EAB30512 Unknown Unknown Unknown
mca_coll_tuned.so 00007FE2EAB237E9 Unknown Unknown Unknown
libmpi.so.1 00007FE2F925C3F0 Unknown Unknown Unknown
libesmf.so 00007FE2FDCCD53D c_esmc_vmbarrier_ 309 ESMCI_VM_F.C
libesmf.so 00007FE2FE271A4F esmf_vmmod_mp_esm 2790 ESMF_VM.F90
libesmf.so 00007FE2FE0E7EE3 esmf_ioscripmod_m 2259 ESMF_IOScrip.F90
libesmf.so 00007FE2FE20F5F4 esmf_regridweight 1398 ESMF_RegridWeightGen.F90
ESMF_RegridWeight 000000000040551C MAIN__ 728 ESMF_RegridWeightGen.F90
ESMF_RegridWeight 00000000004021FE Unknown Unknown Unknown
libc.so.6 00007FE2F85C3D1D Unknown Unknown Unknown
ESMF_RegridWeight 0000000000402109 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libintlc.so.5 00007FAEAB7D0961 Unknown Unknown Unknown
libintlc.so.5 00007FAEAB7CF0B7 Unknown Unknown Unknown
libifcoremt.so.5 00007FAEA9B0E942 Unknown Unknown Unknown
libifcoremt.so.5 00007FAEA9B0E796 Unknown Unknown Unknown
libifcoremt.so.5 00007FAEA9A782A6 Unknown Unknown Unknown
libifcoremt.so.5 00007FAEA9A89533 Unknown Unknown Unknown
libpthread.so.0 00007FAEAB5B67E0 Unknown Unknown Unknown
mca_btl_vader.so 00007FAE9DD056B8 Unknown Unknown Unknown
libopen-pal.so.6 00007FAEA8E114FC Unknown Unknown Unknown
libmpi.so.1 00007FAEAA00C30C Unknown Unknown Unknown
libmpi.so.1 00007FAEAA00C83B Unknown Unknown Unknown
mca_coll_tuned.so 00007FAE9B8FA046 Unknown Unknown Unknown
mca_coll_tuned.so 00007FAE9B8FA512 Unknown Unknown Unknown
mca_coll_tuned.so 00007FAE9B8ED7E9 Unknown Unknown Unknown
libmpi.so.1 00007FAEAA0263F0 Unknown Unknown Unknown
libesmf.so 00007FAEAEA9753D c_esmc_vmbarrier_ 309 ESMCI_VM_F.C
libesmf.so 00007FAEAF03BA4F esmf_vmmod_mp_esm 2790 ESMF_VM.F90
libesmf.so 00007FAEAEEB1EE3 esmf_ioscripmod_m 2259 ESMF_IOScrip.F90
libesmf.so 00007FAEAEFD95F4 esmf_regridweight 1398 ESMF_RegridWeightGen.F90
ESMF_RegridWeight 000000000040551C MAIN__ 728 ESMF_RegridWeightGen.F90
ESMF_RegridWeight 00000000004021FE Unknown Unknown Unknown
libc.so.6 00007FAEA938DD1D Unknown Unknown Unknown
ESMF_RegridWeight 0000000000402109 Unknown Unknown Unknown
.
.
.
Error: ESMF_RegridWeightGen failed return code 139
b' Starting weight generation with these inputs: \n Source File: /jobfs/local/9448292.r-man2/tmpcw_upx53.nc\n Destination File: /jobfs/local/9448292.r-man2/tmp2mm9ia1a.nc\n Weight File: /jobfs/local/9448292.r-man2/tmp9s5keb8o.nc\n Source File is in SCRIP format\n Source Grid is a global grid\n Source Grid is a logically rectangular grid\n Destination File is in SCRIP format\n Destination Grid is a global grid\n Destination Grid is a logically rectangular grid\n Regrid Method: bilinear\n Pole option: ALL\n Norm Type: dstarea\n \n--------------------------------------------------------------------------\nmpirun noticed that process rank 0 with PID 7088 on node r302 exited on signal 11 (Segmentation fault).\n--------------------------------------------------------------------------\n'
Contents of PET0.RegridWeightGen.Log:
20170926 135914.240 INFO PET0 Running with ESMF Version 6.3.0rp1
20171001 130203.828 INFO PET0 Running with ESMF Version 6.3.0rp1
20171001 163829.691 INFO PET0 Running with ESMF Version 6.3.0rp1
20171001 165020.945 INFO PET0 Running with ESMF Version 6.3.0rp1
20171006 154545.381 INFO PET0 Running with ESMF Version 6.3.0rp1
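For readers puzzling over the "return code 139" in the output above: by the usual shell convention, a return code above 128 means the process was terminated by signal number (code - 128), so 139 corresponds to signal 11, which matches the segmentation fault reported by mpirun. The helper below is a minimal sketch of that arithmetic; the function name `describe_return_code` is hypothetical and is not part of Aidan's script or of ESMF.

```python
import signal


def describe_return_code(code):
    """Describe a shell-style return code (hypothetical helper).

    Assumes the common convention that codes above 128 encode
    termination by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_return_code(139))
```

A segmentation fault that appears only on the largest grid is often worth checking against the stack size limit, since large automatic arrays in Fortran code are placed on the stack.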
Hi @FanghuaWu, can you please make sure you have something like this in your ~/.bashrc? Note the ulimit -s unlimited line.
if [ -f /etc/bashrc ]; then
    . /etc/bashrc
fi
ulimit -s unlimited
I think that might be your problem. If this fixes things, we need to document that requirement somewhere.
Yes, @nicjhan, after setting stack size unlimited, the problem was solved. Thank you very much.
@aidanheerdegen do you have any ideas about how we can avoid this kind of thing in the future? How many hours of pain has this one line caused our group over time!?
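One possible way to catch this earlier, sketched here as an assumption rather than something the existing script does, is to have the Python driver check the stack size limit before launching ESMF_RegridWeightGen and warn if it is not unlimited. The function names below are hypothetical. The sketch uses only the standard `resource` module and relies on the launched processes inheriting the limit of the calling shell, which may not hold on every MPI setup.

```python
import resource


def stack_limit_warning(soft_limit):
    """Return a warning string if the soft stack limit is finite, else None.

    Hypothetical helper: soft_limit is in bytes, as returned by
    resource.getrlimit(resource.RLIMIT_STACK).
    """
    if soft_limit == resource.RLIM_INFINITY:
        return None
    return (
        f"Stack size is limited to {soft_limit} bytes; "
        "ESMF_RegridWeightGen may segfault on large grids. "
        "Try adding 'ulimit -s unlimited' to your ~/.bashrc or job script."
    )


def check_stack_limit():
    """Read the current soft stack limit and print a warning if it is finite."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_STACK)
    message = stack_limit_warning(soft)
    if message is not None:
        print("Warning:", message)
    return message is None


if __name__ == "__main__":
    check_stack_limit()
```

Calling `check_stack_limit()` at the start of the regridding script would turn an obscure segmentation fault into a readable hint. It only warns and does not change the limit itself.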
Hi Nic,
I am working on interpolating WOA observational data onto the MOM grids (0.1, 0.25 and 1.0 degree). The ocean-ic code (from Aidan's interpolation scripts: https://github.com/aidanheerdegen/initial_conditions_WOA) works well for the 0.25 and 1.0 degree grids.
However, for the tenth-degree version, I got the following error message:
My work directory is:
/short/x77/fw4078/test/WOA2013/initial_conditions_WOA/
The only difference between the three grid versions is the ocean_hgrid.nc and ocean_vgrid.nc files. I guess the configuration file for the tenth-degree version should be changed. Could you please give me some suggestions? Thanks a lot.