debbardet closed this issue 5 years ago
According to @ehouarn that could be a memory problem since you are adding vertical levels. Could you try with normal 32-level starts and tell me if it works?
I discussed with @ehouarn and, even when he allocated a lot of memory, the model crashed at the same step. He thinks the problem may be in the physics part, which cannot handle 64 levels.
I tried running with normal 32-level starts, and it doesn't work either, but not for the same reason:
0479: USING DEFAULTS : xios_output = T
0469: USING DEFAULTS : enable_io = T
0469: USING DEFAULTS : xios_output = T
srun: error: n1277: task 0: Exited with exit code 174
srun: Terminating job step 5208584.0
0000: slurmstepd: STEP 5208584.0 ON n1277 CANCELLED AT 2018-09-28T11:34:56
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
Maybe I forgot something.
It is strange, the model really should work with 32-level starts. Is the model not sending any error message?
No, in the output file (icosa_lmdz_270.out) there are only the lines from my previous message. I don't know what error n1277 means.
Did you raise the info level? By the way, I made a commit to make the model verbose by default.
Anyhow, this is strange, the model should work with 32-level starts. Maybe we recently introduced an error in the XML and DEF files; you can try with previous commits.
The obvious next step is to try in debug mode
OK, I tried with the reference settings file:
git checkout 2722cedec094f9d019e595b36fd4d8f64f3c9673
Running the model in makestart (better, because it only requests two nodes) does not work either.
The error is:
00: forrtl: severe (174): SIGSEGV, segmentation fault occurred
00: Image PC Routine Line Source
00: icosa_lmdz.exe 00000000013A39E1 Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000013A1B1B Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000013439D4 Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000013437E6 Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000012E9B57 Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000012F0780 Unknown Unknown Unknown
00: libpthread-2.17.s 00002AB7B1C4E5E0 Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000004C1B33 write_field_mod_m 1318 write_field.f90
00: icosa_lmdz.exe 00000000004BC576 write_field_mod_m 393 write_field.f90
00: icosa_lmdz.exe 00000000004B7CA3 write_field_mod_m 105 write_field.f90
00: icosa_lmdz.exe 00000000004234AA icosa_init_mod_mp 48 icosa_init.f90
00: libiomp5.so 00002AB7B1F00413 __kmp_invoke_micr Unknown Unknown
00: libiomp5.so 00002AB7B1ED060D __kmp_fork_call Unknown Unknown
00: libiomp5.so 00002AB7B1EA8EE8 __kmpc_fork_call Unknown Unknown
00: icosa_lmdz.exe 0000000000422CAE icosa_init_mod_mp 39 icosa_init.f90
00: icosa_lmdz.exe 000000000041A5F7 MAIN__ 4 icosa_lmdz.f90
00: icosa_lmdz.exe 000000000041A59E Unknown Unknown Unknown
00: libc-2.17.so 00002AB7B21C3C05 __libc_start_main Unknown Unknown
00: icosa_lmdz.exe 000000000041A4A9 Unknown Unknown Unknown
So I think it is an error introduced recently in the HEAD version of one of the components of the code.
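To read a traceback like the one above quickly, it can help to keep only the frames that resolve to a source file and line. A minimal sketch: the helper name frames_with_source is hypothetical, and the sample lines are copied from the traceback in this thread.

```shell
# Hypothetical helper: keep only traceback frames whose last column
# is a Fortran source file (frames shown as "Unknown" are dropped).
frames_with_source() {
    grep -E '\.(f90|F90|f|F)$'
}

# Sample frames copied from the traceback posted in this thread.
sample='00: icosa_lmdz.exe 00000000013A39E1 Unknown Unknown Unknown
00: icosa_lmdz.exe 00000000004C1B33 write_field_mod_m 1318 write_field.f90
00: icosa_lmdz.exe 00000000004234AA icosa_init_mod_mp 48 icosa_init.f90'

# Filter the sample and print the frames that carry a file and line number.
resolved=$(printf '%s\n' "$sample" | frames_with_source)
printf '%s\n' "$resolved"
```

On a real run, the same filter could be applied to the job output file instead of the sample string. The topmost resolved frame (here write_field.f90, line 1318) is usually the first place to look.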
And on my side, the winner is:
000: forrtl: severe (408): fort: (3): Subscript #2 of the array F2D_ARR has value -858993460 which is less than the lower bound of 1
000:
000: Image PC Routine Line Source
000: icosa_lmdz.exe 0000000002B949C6 Unknown Unknown Unknown
000: icosa_lmdz.exe 00000000014F0913 bilinearbig_ 95 bilinearbig.f90
000: icosa_lmdz.exe 00000000014C021C interpolateh2h2_ 125 interpolateH2H2.f90
000: icosa_lmdz.exe 00000000012B378E optcv_ 200 optcv.f90
000: icosa_lmdz.exe 000000000109752E callcorrk_ 807 callcorrk.f90
000: icosa_lmdz.exe 0000000000DB980E physiq_mod_mp_phy 825 physiq_mod.f90
Ha! You got this with or without debug mode? Looks like an error introduced in LMDZ.GENERIC
I don't get what is going on. bilinearbig has not changed recently, or the few recent changes have nothing to do with f2d_arr. And bilinearbig is called only in routines where the arguments named nX and nY in bilinearbig are actually hardcoded. So there is no reason for this error to occur. I am clueless.
Try just running the 1D model (with 64 layers) in debug mode; in my case:
forrtl: severe (408): fort: (2): Subscript #1 of the array TAUCUMV has value 130 which is greater than the upper bound of 129
Image PC Routine Line Source
libifcoremt.so.5 00002B5597C9E0E9 for_emit_diagnost Unknown Unknown
rcm1d_64_phystd_s 0000000000A09AED optcv 375 optcv.f90
rcm1d_64_phystd_s 00000000007F4A3E callcorrk 807 callcorrk.f90
rcm1d_64_phystd_s 000000000056BE53 physiq_mod_mp_phy 812 physiq_mod.f90
rcm1d_64_phystd_s 000000000041D5B2 MAIN__ 2739 rcm1d.f
I tried going back to LMDZ.GENERIC version 1984 and it still did not work.
This is really strange, I tried to run in makestart (from a profile) using a model compiled with these versions:
##############
### CONFIG ###
ver_dyn=687 # ICOSAGCM
ver_phys=1911 # ARCH ICOSA_LMDZ LMDZ.COMMON LMDZ.GENERIC
ver_xios=1459 # XIOS
ver_ioipsl=302 # IOIPSL
##############
and it still did not work...?
I went back to LMDZ.GENERIC version 1984 and Saturn1D is running (for 13 minutes so far). I will try in 3D with the start files that I modified.
Despite going back to version 1984, I get the same error as before with the 3D model:
0609: USING DEFAULTS : enable_io = T
0609: USING DEFAULTS : xios_output = T
srun: error: n1072: task 0: Exited with exit code 174
srun: Terminating job step 5209112.0
0000: slurmstepd: STEP 5209112.0 ON n1072 CANCELLED AT 2018-09-28T15:06:26
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: n1683: tasks 1176-1198: Killed
OK so this means that it is coming from DYNAMICO, or the interface
Yes, with version 1984 the problem is still:
000: forrtl: severe (408): fort: (3): Subscript #2 of the array F2D_ARR has value -858993460 which is less than the lower bound of 1
000:
000: Image PC Routine Line Source
000: icosa_lmdz.exe 0000000002B95E66 Unknown Unknown Unknown
000: icosa_lmdz.exe 00000000014F1DBB bilinearbig_ 95 bilinearbig.f90
000: icosa_lmdz.exe 00000000014C16C4 interpolateh2h2_ 125 interpolateH2H2.f90
000: icosa_lmdz.exe 00000000012B419A optcv_ 193 optcv.f90
000: icosa_lmdz.exe 0000000001097C99 callcorrk_ 812 callcorrk.f90
000: icosa_lmdz.exe 0000000000DB955E physiq_mod_mp_phy 825 physiq_mod.f90
This is so weird!
I have a segmentation fault error in my case:
0262: USING DEFAULTS : xios_output = T
0000: forrtl: severe (174): SIGSEGV, segmentation fault occurred
0000: Image PC Routine Line Source
0000: icosa_lmdz.exe 0000000002C253B1 Unknown Unknown Unknown
0000: icosa_lmdz.exe 0000000002C234EB Unknown Unknown Unknown
0000: icosa_lmdz.exe 0000000002BDC174 Unknown Unknown Unknown
0000: icosa_lmdz.exe 0000000002BDBF86 Unknown Unknown Unknown
0000: icosa_lmdz.exe 0000000002B82437 Unknown Unknown Unknown
0000: icosa_lmdz.exe 0000000002B89060 Unknown Unknown Unknown
0000: libpthread-2.17.s 00002B7856C025E0 Unknown Unknown Unknown
0000: icosa_lmdz.exe 000000000064A91A write_field_mod_m 1318 write_field.f90
0000: icosa_lmdz.exe 000000000061A927 write_field_mod_m 393 write_field.f90
0000: icosa_lmdz.exe 000000000061A08D write_field_mod_m 105 write_field.f90
0000: icosa_lmdz.exe 000000000042453B icosa_init_mod_mp 48 icosa_init.f90
0000: libiomp5.so 00002B7856EB4413 __kmp_invoke_micr Unknown Unknown
0000: libiomp5.so 00002B7856E8460D __kmp_fork_call Unknown Unknown
0000: libiomp5.so 00002B7856E5CEE8 __kmpc_fork_call Unknown Unknown
0000: icosa_lmdz.exe 00000000004243F6 icosa_init_mod_mp 39 icosa_init.f90
0000: icosa_lmdz.exe 000000000041BCE0 Unknown Unknown Unknown
0000: icosa_lmdz.exe 000000000041BC9E Unknown Unknown Unknown
0000: libc-2.17.so 00002B7857177C05 __libc_start_main Unknown Unknown
0000: icosa_lmdz.exe 000000000041BBA9 Unknown Unknown Unknown
srun: error: n1018: task 0: Exited with exit code 174
srun: Terminating job step 5209290.0
0000: slurmstepd: STEP 5209290.0 ON n1018 CANCELLED AT 2018-09-28T15:38:51
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
OK, I have something interesting. @scabanes just told me that his model no longer runs and stops early, as is the case for everyone, but he did not recompile the model. So an executable that used to work no longer works. This must therefore be a machine-related problem.
After some iterations with CINES IT people, a suggestion from them: Can you try adding "ulimit -s unlimited" after the "source ../code/ARCH/arch-X64_OCCIGEN.env" line in your job and check if it solves the problem?
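For reference, the CINES suggestion amounts to something like the following near the top of the job script. This is a minimal sketch: only the env file path and the ulimit command come from the thread. The existence check, the warning message and the variable names env_file and stack_limit are assumptions added so the fragment also runs outside the cluster.

```shell
# Env file path as quoted in the thread; sourced only if it exists
# (assumption: lets the sketch run on a machine without the code tree).
env_file=../code/ARCH/arch-X64_OCCIGEN.env
if [ -f "$env_file" ]; then
    . "$env_file"
fi

# Lift the soft stack limit right after sourcing the env file.
# Report instead of aborting if the hard limit forbids it.
if ! ulimit -s unlimited 2>/dev/null; then
    echo "warning: could not set the stack limit to unlimited" >&2
fi

# Print the effective limit so it shows up in the job output.
stack_limit=$(ulimit -s)
echo "stack limit: ${stack_limit}"
```

Printing the effective limit in the job output makes it easy to confirm later whether the setting was actually applied on the compute node.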
OK, I tried, and this apparently solved the problem: I have no more seg fault! Now I am hitting problem #5 filed by @alboiss, so meet you there to continue the debugging process.
By the way, something must have changed in the way the cluster works, because we used to have ulimit -s unlimited in our env file and we never ran into problems before.
Anyhow this is good news
I have an issue (one more time...). The model stops after ~3 minutes of calculation, without any error message. In icosa_lmdz_270.out, I can see that it stops at the point where it checks the value of "q" to use aerosols:
0200: naerkind= back2lay 1
0200: Warning: no variance range in aeroptproperties
0200: Tracers found in aeropacity:
0200: If you would like to use aerosols, make sure any old
0200: start files are updated in newstart using the option
0200: q=0
0200: Active aerosols found in aeropacity:
0200: iaero_back2lay= 1
I checked the value of "q" in my start file (start_icosa_270.nc) and all the values of q are zero.
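One way to repeat this check from the command line is with the ncdump utility from the NetCDF tools. A minimal sketch, assuming ncdump is installed and that the tracer variable in the start file is literally named "q"; the function name dump_q is hypothetical.

```shell
# Hypothetical helper: print variable "q" from a NetCDF start file.
# Assumes the variable name is "q"; adjust if the file uses another name.
dump_q() {
    start_file=$1
    # Fail early with a clear message if the file is missing.
    if [ ! -f "$start_file" ]; then
        echo "no such file: $start_file" >&2
        return 1
    fi
    # ncdump ships with the NetCDF utilities; it may need a module load.
    if ! command -v ncdump >/dev/null 2>&1; then
        echo "ncdump not found in PATH" >&2
        return 2
    fi
    # -v restricts the data section of the dump to the named variable.
    ncdump -v q "$start_file"
}
```

Usage would be dump_q start_icosa_270.nc, with the output scanned by eye or piped to a pager.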