aymeric-spiga / dynamico-giant

Stopped model after 3 minutes of calculations #4

Closed debbardet closed 5 years ago

debbardet commented 5 years ago

I have an issue (one more time...). The model stops after ~3 minutes of calculations, without any error message. In icosa_lmdz_270.out, I can see that it stops at the point where it checks the value of "q" to use aerosols:

0200: naerkind= back2lay 1
0200: Warning: no variance range in aeroptproperties
0200: Tracers found in aeropacity:
0200: If you would like to use aerosols, make sure any old
0200: start files are updated in newstart using the option
0200: q=0
0200: Active aerosols found in aeropacity:
0200: iaero_back2lay= 1

I checked the value of "q" in my start file (start_icosa_270.nc) and all the values of q are zero.

aymeric-spiga commented 5 years ago

According to @ehouarn that could be a memory problem since you are adding vertical levels. Could you try with normal 32-level starts and tell me if it works?

debbardet commented 5 years ago

I discussed this with @ehouarn: even when he requested a lot of memory, the model crashed at the same step. He thinks the problem may be in the physics part, which may not be able to use 64 levels.

I tried to run with normal 32-level starts, and it doesn't work either, but not for the same reason:

0479: USING DEFAULTS : xios_output = T
0469: USING DEFAULTS : enable_io = T
0469: USING DEFAULTS : xios_output = T
srun: error: n1277: task 0: Exited with exit code 174
srun: Terminating job step 5208584.0
0000: slurmstepd: STEP 5208584.0 ON n1277 CANCELLED AT 2018-09-28T11:34:56
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

Maybe I forgot something.

aymeric-spiga commented 5 years ago

This is strange; the model really should work with 32-level starts. Is the model not printing any error message?

debbardet commented 5 years ago

No, in the output file (icosa_lmdz_270.out) there are only the lines from my previous message. I don't know what error n1277 means.
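In the srun line quoted above, n1277 is the name of the compute node on which task 0 ran, and 174 is the exit code of that task; the same number shows up in the Intel Fortran runtime message `forrtl: severe (174)` quoted later in this thread. A minimal sketch for pulling both kinds of lines out of a job log, assuming bash and grep are available; the helper name `scan_job_log` is hypothetical and not part of the model's tooling:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list srun task failures and Intel Fortran
# runtime errors found in a job log (works even if the log lines
# have been run together on one line).
scan_job_log() {
    local log="$1"
    # srun failures look like:
    #   srun: error: <node>: task <id>: Exited with exit code <n>
    #   srun: error: <node>: tasks <ids>: Killed
    grep -oE 'srun: error: [^:]+: tasks? [0-9,-]+: (Exited with exit code [0-9]+|Killed)' "$log" || true
    # Intel Fortran runtime errors look like: forrtl: severe (<n>)
    grep -oE 'forrtl: severe \([0-9]+\)' "$log" || true
}
```

It could be called as `scan_job_log icosa_lmdz_270.out`. If only the srun line is found, the Fortran runtime message may have gone to another file or been lost when the job step was cancelled.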

aymeric-spiga commented 5 years ago

Did you raise the info level? By the way, I made a commit to make the model verbose by default.

aymeric-spiga commented 5 years ago

Anyhow, this is strange; the model should work with 32-level starts. Maybe we recently introduced an error in the XML and DEF files; you can try with previous commits.

ehouarn commented 5 years ago

The obvious next step is to try in debug mode

aymeric-spiga commented 5 years ago

OK, I tried with the reference settings files:

git checkout 2722cedec094f9d019e595b36fd4d8f64f3c9673

Trying to run the model in makestart mode (better, because it only requests two nodes) does not work.

The error is

00: forrtl: severe (174): SIGSEGV, segmentation fault occurred
00: Image              PC                Routine            Line        Source
00: icosa_lmdz.exe     00000000013A39E1  Unknown               Unknown  Unknown
00: icosa_lmdz.exe     00000000013A1B1B  Unknown               Unknown  Unknown
00: icosa_lmdz.exe     00000000013439D4  Unknown               Unknown  Unknown
00: icosa_lmdz.exe     00000000013437E6  Unknown               Unknown  Unknown
00: icosa_lmdz.exe     00000000012E9B57  Unknown               Unknown  Unknown
00: icosa_lmdz.exe     00000000012F0780  Unknown               Unknown  Unknown
00: libpthread-2.17.s  00002AB7B1C4E5E0  Unknown               Unknown  Unknown
00: icosa_lmdz.exe     00000000004C1B33  write_field_mod_m        1318  write_field.f90
00: icosa_lmdz.exe     00000000004BC576  write_field_mod_m         393  write_field.f90
00: icosa_lmdz.exe     00000000004B7CA3  write_field_mod_m         105  write_field.f90
00: icosa_lmdz.exe     00000000004234AA  icosa_init_mod_mp          48  icosa_init.f90
00: libiomp5.so        00002AB7B1F00413  __kmp_invoke_micr     Unknown  Unknown
00: libiomp5.so        00002AB7B1ED060D  __kmp_fork_call       Unknown  Unknown
00: libiomp5.so        00002AB7B1EA8EE8  __kmpc_fork_call      Unknown  Unknown
00: icosa_lmdz.exe     0000000000422CAE  icosa_init_mod_mp          39  icosa_init.f90
00: icosa_lmdz.exe     000000000041A5F7  MAIN__                      4  icosa_lmdz.f90
00: icosa_lmdz.exe     000000000041A59E  Unknown               Unknown  Unknown
00: libc-2.17.so       00002AB7B21C3C05  __libc_start_main     Unknown  Unknown
00: icosa_lmdz.exe     000000000041A4A9  Unknown               Unknown  Unknown

Thus I think it is an error introduced recently in the HEAD version of one of the components of the code.

ehouarn commented 5 years ago

And on my side, the winner is:

000: forrtl: severe (408): fort: (3): Subscript #2 of the array F2D_ARR has value -858993460 which is less than the lower bound of 1
000: 
000: Image              PC                Routine            Line        Source             
000: icosa_lmdz.exe     0000000002B949C6  Unknown               Unknown  Unknown
000: icosa_lmdz.exe     00000000014F0913  bilinearbig_               95  bilinearbig.f90
000: icosa_lmdz.exe     00000000014C021C  interpolateh2h2_          125  interpolateH2H2.f90
000: icosa_lmdz.exe     00000000012B378E  optcv_                    200  optcv.f90
000: icosa_lmdz.exe     000000000109752E  callcorrk_                807  callcorrk.f90
000: icosa_lmdz.exe     0000000000DB980E  physiq_mod_mp_phy         825  physiq_mod.f90

aymeric-spiga commented 5 years ago

Ha! Did you get this with or without debug mode? It looks like an error introduced in LMDZ.GENERIC.

aymeric-spiga commented 5 years ago

I don't get what is going on. bilinearbig has not changed recently, or the few recent changes have nothing to do with f2d_arr

aymeric-spiga commented 5 years ago

And bilinearbig is only called from routines that pass hardcoded values for the arguments it names nX and nY. So there is no reason for this error to occur. I am clueless.
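One detail worth noting about the reported subscript: read as a 32-bit pattern, -858993460 is a single byte repeated four times. Fill patterns of that kind are what some compilers' debug options write into uninitialized stack variables, so the index may never have been set, or may have been overwritten, rather than computed wrongly. That interpretation is an assumption and is not established anywhere in this thread; the arithmetic itself can be checked with a small sketch (the helper name `to_hex32` is hypothetical, and bash is assumed):

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a signed 32-bit integer as its
# 8-digit hexadecimal bit pattern.
to_hex32() {
    printf '%08X\n' $(( $1 & 0xFFFFFFFF ))
}

# Subscript value taken from the forrtl message above.
to_hex32 -858993460   # prints CCCCCCCC
```

By contrast, the value 130 reported for the 1D model later in this thread is an ordinary off-by-one index, which suggests the two messages may not share a cause.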

ehouarn commented 5 years ago

Try just running the 1D model (with 64 layers) in debug mode; in my case:

forrtl: severe (408): fort: (2): Subscript #1 of the array TAUCUMV has value 130 which is greater than the upper bound of 129

Image              PC                Routine            Line        Source
libifcoremt.so.5   00002B5597C9E0E9  for_emit_diagnost     Unknown  Unknown
rcm1d_64_phystd_s  0000000000A09AED  optcv_                    375  optcv.f90
rcm1d_64_phystd_s  00000000007F4A3E  callcorrk_                807  callcorrk.f90
rcm1d_64_phystd_s  000000000056BE53  physiq_mod_mp_phy         812  physiq_mod.f90
rcm1d_64_phystd_s  000000000041D5B2  MAIN__                   2739  rcm1d.f

aymeric-spiga commented 5 years ago

I tried going back to LMDZ.GENERIC version 1984 and it still did not work.

aymeric-spiga commented 5 years ago

This is really strange. I tried to run in makestart mode (from a profile) using a model compiled with these versions:

##############
### CONFIG ###
ver_dyn=687 # ICOSAGCM
ver_phys=1911 # ARCH ICOSA_LMDZ LMDZ.COMMON LMDZ.GENERIC
ver_xios=1459 # XIOS
ver_ioipsl=302 # IOIPSL
##############

and it still did not work...?

debbardet commented 5 years ago

I went back to LMDZ.GENERIC version 1984 and Saturn1D is running (for 13 minutes so far). I will try in 3D with the start files that I modified.

debbardet commented 5 years ago

Even after going back to version 1984, I still get the same error as before with the 3D model:

0609: USING DEFAULTS : enable_io = T
0609: USING DEFAULTS : xios_output = T
srun: error: n1072: task 0: Exited with exit code 174
srun: Terminating job step 5209112.0
0000: slurmstepd: STEP 5209112.0 ON n1072 CANCELLED AT 2018-09-28T15:06:26
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: n1683: tasks 1176-1198: Killed

aymeric-spiga commented 5 years ago

OK so this means that it is coming from DYNAMICO, or the interface

ehouarn commented 5 years ago

Yes, with version 1984 the problem is still:

000: forrtl: severe (408): fort: (3): Subscript #2 of the array F2D_ARR has value -858993460 which is less than the lower bound of 1
000: 
000: Image              PC                Routine            Line        Source
000: icosa_lmdz.exe     0000000002B95E66  Unknown               Unknown  Unknown
000: icosa_lmdz.exe     00000000014F1DBB  bilinearbig_               95  bilinearbig.f90
000: icosa_lmdz.exe     00000000014C16C4  interpolateh2h2_          125  interpolateH2H2.f90
000: icosa_lmdz.exe     00000000012B419A  optcv_                    193  optcv.f90
000: icosa_lmdz.exe     0000000001097C99  callcorrk_                812  callcorrk.f90
000: icosa_lmdz.exe     0000000000DB955E  physiq_mod_mp_phy         825  physiq_mod.f90

aymeric-spiga commented 5 years ago

This is so weird!

debbardet commented 5 years ago

I have a segmentation fault error in my case:

0262: USING DEFAULTS : xios_output = T
0000: forrtl: severe (174): SIGSEGV, segmentation fault occurred
0000: Image              PC                Routine            Line        Source
0000: icosa_lmdz.exe     0000000002C253B1  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     0000000002C234EB  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     0000000002BDC174  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     0000000002BDBF86  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     0000000002B82437  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     0000000002B89060  Unknown               Unknown  Unknown
0000: libpthread-2.17.s  00002B7856C025E0  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     000000000064A91A  write_field_mod_m        1318  write_field.f90
0000: icosa_lmdz.exe     000000000061A927  write_field_mod_m         393  write_field.f90
0000: icosa_lmdz.exe     000000000061A08D  write_field_mod_m         105  write_field.f90
0000: icosa_lmdz.exe     000000000042453B  icosa_init_mod_mp          48  icosa_init.f90
0000: libiomp5.so        00002B7856EB4413  __kmp_invoke_micr     Unknown  Unknown
0000: libiomp5.so        00002B7856E8460D  __kmp_fork_call       Unknown  Unknown
0000: libiomp5.so        00002B7856E5CEE8  __kmpc_fork_call      Unknown  Unknown
0000: icosa_lmdz.exe     00000000004243F6  icosa_init_mod_mp          39  icosa_init.f90
0000: icosa_lmdz.exe     000000000041BCE0  Unknown               Unknown  Unknown
0000: icosa_lmdz.exe     000000000041BC9E  Unknown               Unknown  Unknown
0000: libc-2.17.so       00002B7857177C05  __libc_start_main     Unknown  Unknown
0000: icosa_lmdz.exe     000000000041BBA9  Unknown               Unknown  Unknown
srun: error: n1018: task 0: Exited with exit code 174
srun: Terminating job step 5209290.0
0000: slurmstepd: STEP 5209290.0 ON n1018 CANCELLED AT 2018-09-28T15:38:51
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.

aymeric-spiga commented 5 years ago

OK, I have something interesting. @scabanes just told me that his model no longer runs and stops quickly, as is the case for everyone, but he did not recompile the model. So an executable that used to work no longer works. This must be a machine-related problem.

ehouarn commented 5 years ago

After some iterations with CINES IT people, a suggestion from them: Can you try adding "ulimit -s unlimited" after the "source ../code/ARCH/arch-X64_OCCIGEN.env" line in your job and check if it solves the problem?
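In a job script, the suggestion above could look like the fragment below. Only the `source` line and the `ulimit -s unlimited` command come from the suggestion; the final `echo` is an added check, and the scheduler header and the launch command are deliberately left out because they are not given here.

```shell
#!/bin/bash
# Fragment of a job script: scheduler directives and the launch
# command are omitted.

# Load the machine environment (path as quoted in the suggestion).
source ../code/ARCH/arch-X64_OCCIGEN.env

# Raise the soft stack-size limit for this shell and for the
# processes it starts afterwards.
ulimit -s unlimited

# Added check (not part of the suggestion): record the limit that is
# actually in effect in the job output.
echo "stack size limit: $(ulimit -s)"
```

Placing the `ulimit` call after the `source` line means that nothing in the environment file can reset the limit afterwards, which may be why that position was suggested.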

aymeric-spiga commented 5 years ago

OK, I tried it and this apparently solved the problem: no more seg fault! I am now hitting the problem reported in #5 by @alboiss, so let's meet there to continue debugging.

By the way, something must have changed in the way the cluster works, because we used to have ulimit -s unlimited in our env file and we never ran into problems before.
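To see where a limit set in the env file stops being effective, one option is to print the soft and hard stack limits at several points in the job, for example right after sourcing the env file and again just before the model is launched. A minimal sketch, assuming bash; the helper name `report_stack_limits` is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the soft and hard stack-size limits of
# the current shell (values are in kbytes, or "unlimited").
report_stack_limits() {
    echo "soft stack limit: $(ulimit -S -s)"
    echo "hard stack limit: $(ulimit -H -s)"
}

report_stack_limits
```

If the soft limit is finite at a point where the env file has already been sourced, the setting is being overridden or is not reaching that shell; a finite hard limit would mean `ulimit -s unlimited` cannot succeed there at all.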

Anyhow this is good news