CICE-Consortium / CICE

Development repository for the CICE sea-ice model

More use of uninitialized arrays / variables (running 'base_suite' with `-init=snan,arrays`) #599

Open phil-blain opened 3 years ago

phil-blain commented 3 years ago

Following #579, and at my request, the machine files for daley were updated to use signalling NaN initialization for debug builds.

I ran the 'base_suite' on master yesterday for the first time since then and noticed several more failures in debug cases caused by the code using arrays or variables before they are initialized. @apcraig I know you did not activate the flag on other machines, so I guess you must be aware of some of these issues, but I spent a little time documenting the failures I got and thought it would be worth listing them here.

So here are my findings:

Problems in ice_init_column.F90

daley_intel_restart_gx3_6x2_alt01_debug_short

 (calc_timesteps) modified npt from        10 d with dt=       3600.00
 (calc_timesteps)                to       240 1 with dt=       3600.00
 (calc_timesteps) start time is  2005-01-01:00000
 (calc_timesteps)   end time is  2005-01-11:00000

forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
cice               00000000022DF5E4  Unknown               Unknown  Unknown
cice               0000000001B73480  Unknown               Unknown  Unknown
cice               00000000012B5FEF  ice_init_column_m         433  ice_init_column.F90
cice               0000000000403821  cice_initmod_mp_c         220  CICE_InitMod.F90
cice               0000000000401EB3  cice_initmod_mp_c          52  CICE_InitMod.F90
cice               000000000040168B  MAIN__                     43  CICE.F90
cice               00000000004015F2  Unknown               Unknown  Unknown
cice               00000000023C014F  Unknown               Unknown  Unknown
cice               00000000004014DA  Unknown               Unknown  Unknown

array apeffn is used but is not initialized if shortwave= 'ccsm3'

daley_intel_restart_gx3_8x2_alt02_debug_short

 (calc_timesteps)   end time is  2005-01-11:00000

  Initial forcing data year =         2005
  Final   forcing data year =         2005

 Atmospheric data files:
 /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
  Set current forcing data year =         2005
 (JRA55_data) reading forcing file 1st ts = /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
cice               00000000022DF5E4  Unknown               Unknown  Unknown
cice               0000000001B73480  Unknown               Unknown  Unknown
cice               00000000012B4306  ice_init_column_m         429  ice_init_column.F90
cice               0000000000403821  cice_initmod_mp_c         220  CICE_InitMod.F90
cice               0000000000401EB3  cice_initmod_mp_c          52  CICE_InitMod.F90
cice               000000000040168B  MAIN__                     43  CICE.F90
cice               00000000004015F2  Unknown               Unknown  Unknown
cice               00000000023C014F  Unknown               Unknown  Unknown
cice               00000000004014DA  Unknown               Unknown  Unknown

array albpndn is used but is not initialized if shortwave= 'ccsm3'

daley_intel_restart_gx3_4x2_alt03_debug_short

Same backtrace as above, but at line 425.

array albicen is used but is not initialized if calc_Tsfc = .false.


Problems in ice_history_bgc.F90

daley_intel_smoke_gx3_4x4_alt04_debug_short, daley_intel_smoke_gx3_4x1_debug_isotope

 /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
  Set current forcing data year =         2005
 (JRA55_data) reading forcing file 1st ts = /home/ords/cmdd/cmde/sice500//CICE_data/forcing/gx3/JRA55/8XDAILY/JRA55_gx3_03hr_forcing_2005.nc
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
*** longjmp causes uninitialized stack frame ***: ./cice terminated
forrtl: error (65): floating invalid
Image              PC                Routine            Line        Source
cice               00000000022DF5E4  Unknown               Unknown  Unknown
cice               0000000001B73480  Unknown               Unknown  Unknown
cice               00000000011E60C6  ice_history_share         908  ice_history_shared.F90
cice               00000000010B61D9  ice_history_bgc_m        2343  ice_history_bgc.F90
cice               0000000000F8E089  ice_history_mp_ac        3035  ice_history.F90
cice               000000000204E5E3  Unknown               Unknown  Unknown
cice               0000000001FF888A  Unknown               Unknown  Unknown
cice               0000000001FF78E1  Unknown               Unknown  Unknown
cice               000000000204E9CA  Unknown               Unknown  Unknown
cice               00000000020E25A9  Unknown               Unknown  Unknown

In GDB:

(gdb) bt
#0  raise (sig=...) at ../sysdeps/unix/sysv/linux/raise.c:51
#1  0x00000000023c7631 in abort () at abort.c:79
#2  0x00000000022df641 in for.signal_handler ()
#3  <signal handler called>
#4  0x00000000011e60c6 in ice_history_shared::accum_hist_field_2d (id=..., iblk=..., field_accum=..., field=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history_shared.F90:908
#5  0x00000000010b61d9 in ice_history_bgc::accum_hist_bgc (iblk=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history_bgc.F90:2343
#6  0x0000000000f8e089 in ice_history::L_ice_history_mp_accum_hist__1878__par_loop0_2_9 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history.F90:3035
#7  0x000000000204e5e3 in __kmp_invoke_microtask ()
#8  0x0000000001ff888a in __kmp_invoke_task_func ()
#9  0x0000000001ff78e1 in __kmp_launch_thread ()
#10 0x000000000204e9ca in _INTERNAL_26_______src_z_Linux_util_cpp_fb37008b::__kmp_launch_worker(void*) ()
#11 0x00000000020e25a9 in start_thread (arg=...) at pthread_create.c:465
#12 0x000000000243556f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

In DDT:

#9 icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:43 (at 0x000000000040168b)
#8 cice_initmod::cice_initialize () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_InitMod.F90:52 (at 0x0000000000401eb3)
#7 cice_initmod::cice_init () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_InitMod.F90:225 (at 0x0000000000403842)
#6 ice_history::accum_hist (dt=3600) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history.F90:1878 (at 0x0000000000eb6d2d)
#5 __kmpc_fork_call () (at 0x0000000001fc5885)
#4 __kmp_fork_call () (at 0x0000000001ffa136)
#3 __kmp_invoke_task_func () (at 0x0000000001ff888a)
#2 __kmp_invoke_microtask () (at 0x000000000204e5e3)
#1 ice_history::L_ice_history_mp_accum_hist__1878__par_loop0_2_9 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history.F90:3035 (at 0x0000000000f8e089)
#0 ice_history_bgc::accum_hist_bgc (iblk=1) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history_bgc.F90:2344 (at 0x00000000010b5e56)

field_accum is all NaN (it is PP_net one frame above). The same goes for grow_net, upNO and upNH. It looks like the first two are initialized in init_history_bgc, which is called by ice_step, so they are not yet initialized when we write the initial condition.

daley_intel_smoke_gx3_8x2_bgcz_debug, daley_intel_smoke_gx3_8x1_bgcskl_debug

Same for ocean_bio.


Problems in ice_grid.F90

daley_intel_restart_gbox128_4x2_boxdyn_debug_short, daley_intel_smoke_gbox128_2x2_boxadv_debug_short, daley_intel_smoke_gbox128_4x4_boxrestore_debug_short

#4  0x0000000000e62b68 in ice_grid::gridbox_corners () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/infrastructure/ice_grid.F90:2209
#5  0x0000000000dec5af in ice_grid::init_grid2 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/infrastructure/ice_grid.F90:560
#6  0x000000000040206f in cice_initmod::cice_init () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_InitMod.F90:121
#7  0x0000000000401eb3 in cice_initmod::cice_initialize () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_InitMod.F90:52
#8  0x000000000040168b in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:43

lont_bounds is all NaN


Problems in ice_history_fsd.F90

daley_intel_smoke_gx3_4x2_debug_diag24_fsd1_run5day, daley_intel_restart_gx3_4x2_debug_fsd12_short

#4  0x00000000011969ad in ice_history_fsd::accum_hist_fsd (iblk=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history_fsd.F90:404
#5  0x0000000000f8e0a7 in ice_history::L_ice_history_mp_accum_hist__1878__par_loop0_2_9 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history.F90:3041
#6  0x000000000204e5e3 in __kmp_invoke_microtask ()
#7  0x0000000001ff888a in __kmp_invoke_task_func ()
#8  0x0000000001ffa136 in __kmp_fork_call ()
#9  0x0000000001fc5885 in __kmpc_fork_call ()
#10 0x0000000000eb6d2d in ice_history::accum_hist (dt=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/analysis/ice_history.F90:1878
#11 0x0000000000403842 in cice_initmod::cice_init () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_InitMod.F90:225
#12 0x0000000000401eb3 in cice_initmod::cice_initialize () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_InitMod.F90:52
#13 0x000000000040168b in icemodel () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE.F90:43

aicen_init is not initialized (it is initialized in save_init, which is called in ice_step).


Problems in icepack_zsalinity.F90

daley_intel_smoke_gx3_8x2_debug_diag24_run5day_zsal

#4  0x0000000001b71f7a in icepack_zsalinity::merge_zsal_fluxes (aicens=..., zsal_totn=..., zsal_tot=..., fzsal=..., fzsaln=..., fzsal_g=..., fzsaln_g=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/icepack/columnphysics/icepack_zsalinity.F90:1107
#5  0x0000000001b4c1e8 in icepack_zsalinity::zsalinity (n_cat=..., dt=..., nilyr=..., bgrid=..., cgrid=..., igrid=..., trcrn_s=..., trcrn_q=..., trcrn_si=..., ntrcr=..., fbri=..., bsin=..., 
    btin=..., bphin=..., iphin=..., ikin=..., hbr_old=..., hbrin=..., hin=..., hin_old=..., idin=..., darcy_v=..., brine_sal=..., brine_rho=..., ibrine_sal=..., ibrine_rho=..., 
    dh_direct=..., rayleigh_criteria=..., first_ice=..., sss=..., sst=..., dh_top=..., dh_bot=..., fzsal=..., fzsal_g=..., bphi_min=..., nblyr=..., vicen=..., aicen=..., zsal_tot=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/icepack/columnphysics/icepack_zsalinity.F90:177
#6  0x0000000001b26b6e in icepack_zbgc::icepack_biogeochemistry (dt=..., ntrcr=..., nbtrcr=..., upno=..., upnh=..., idi=..., iki=..., zfswin=..., zsal_tot=..., darcy_v=..., grow_net=..., 
    pp_net=..., hbri=..., dhbr_bot=..., dhbr_top=..., zoo=..., fbio_snoice=..., fbio_atmice=..., ocean_bio=..., first_ice=..., fswpenln=..., bphi=..., btiz=..., ice_bio_net=..., 
    snow_bio_net=..., fswthrun=..., rayleigh_criteria=..., sice_rho=..., fzsal=..., fzsal_g=..., bgrid=..., igrid=..., icgrid=..., cgrid=..., nblyr=..., nilyr=..., nslyr=..., n_algae=..., 
    n_zaero=..., ncat=..., n_doc=..., n_dic=..., n_don=..., n_fed=..., n_fep=..., meltbn=..., melttn=..., congeln=..., snoicen=..., sst=..., sss=..., fsnow=..., meltsn=..., hin_old=..., 
    flux_bio=..., flux_bio_atm=..., aicen_init=..., vicen_init=..., aicen=..., vicen=..., vsnon=..., aice0=..., trcrn=..., vsnon_init=..., skl_bgc=...)
    at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/icepack/columnphysics/icepack_zbgc.F90:1060
#7  0x00000000015e5055 in ice_step_mod::biogeochemistry (dt=..., iblk=...) at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/cicedynB/general/ice_step_mod.F90:1467
#8  0x000000000040e907 in cice_runmod::L_cice_runmod_mp_ice_step__211__par_loop0_2_2 () at /fs/homeu1/eccc/cmd/cmde/phb001/code/cice/cicecore/drivers/standalone/cice/CICE_RunMod.F90:227

zsal_tot is NaN.


@apcraig @eclare108213 @JFLemieux73

eclare108213 commented 3 years ago

@phil-blain Thank you for pointing out all of these issues! I suspect that most of them are cases where they aren't actually used in the model calculation until later, when they might have real values, but we should fix them anyhow. Can you tell whether fixing any of them would change the solution? I'm looking for prioritization.

Any volunteers to take care of any of these? (Maybe you're working on related code.)

phil-blain commented 3 years ago

I did not investigate in detail whether fixing them would change the solution... I'm not working on these areas of the code currently.

eclare108213 commented 3 years ago

Looking a little closer at these failures. Thanks again @phil-blain for providing the traceback information. None of these should affect the solution, although they might affect the output.

ice_init_column.F90

array apeffn is used but is not initialized if shortwave= 'ccsm3'
array albpndn is used but is not initialized if shortwave= 'ccsm3'
array albicen is used but is not initialized if calc_Tsfc = .false.

apeffn and albpndn are not used with ccsm3. Rather than putting conditionals inside the loop in order to stop the calculation that uses them (which is just a merge from category values to the cell-aggregate value), it would be easier to just initialize them to zero for any case. Same goes for albicen when calc_Tsfc=F.
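To make the merge argument concrete, here is a minimal sketch in Python rather than the model's Fortran. It assumes the merge is an area-weighted sum over categories; the function name `merge_to_aggregate` and the sample values are hypothetical. The NaN is an ordinary quiet NaN standing in for the signalling NaN that `-init=snan,arrays` writes into uninitialized arrays. Python does not trap on NaN, so the sketch shows the NaN propagating into the aggregate instead of the `floating invalid` abort seen above.

```python
import math

def merge_to_aggregate(category_values, category_areas):
    """Area-weighted sum of per-category values (hypothetical helper)."""
    total = 0.0
    for value, area in zip(category_values, category_areas):
        total += value * area
    return total

ncat = 5
areas = [0.2, 0.1, 0.0, 0.0, 0.0]

# Stand-in for an array left untouched after NaN initialization.
uninitialized = [math.nan] * ncat
# The same array explicitly set to zero, whatever scheme is selected.
zeroed = [0.0] * ncat

print(math.isnan(merge_to_aggregate(uninitialized, areas)))  # NaN propagates
print(merge_to_aggregate(zeroed, areas))                     # harmless zero
```

Note that even categories with zero area contaminate the sum, since NaN times zero is still NaN, which is why zero-initializing is simpler than guarding the loop.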

ice_history_bgc.F90, ice_history_fsd.F90

Would these problems be fixed simply by calling init_history_bgc and save_init during the run initialization process (i.e. in CICE_InitMod.F90)? Maybe we need a general history initialization routine.
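The ordering problem behind this question can be sketched as follows. This is a toy Python model, not the CICE code: the class `HistoryState` and the function `run` are hypothetical, and the method names `init_history_bgc` and `accum_hist` are borrowed only to show why accumulating at the initial-condition write fails when the field is first set inside the time step.

```python
import math

class HistoryState:
    """Toy stand-in for a history field and its accumulator (hypothetical)."""

    def __init__(self, n):
        # Mimic NaN-filled memory from a debug build.
        self.pp_net = [math.nan] * n
        self.accum = [0.0] * n

    def init_history_bgc(self):
        # Give the field a defined value before anything reads it.
        self.pp_net = [0.0] * len(self.pp_net)

    def accum_hist(self):
        # Add the current field into the running accumulator.
        self.accum = [a + f for a, f in zip(self.accum, self.pp_net)]
        return self.accum

def run(init_before_first_write):
    """Return True if the initial-condition write sees a NaN."""
    state = HistoryState(3)
    if init_before_first_write:
        state.init_history_bgc()      # proposed: during run initialization
    first_write = state.accum_hist()  # initial-condition history write
    return any(math.isnan(v) for v in first_write)

print(run(init_before_first_write=False))  # first write sees NaN
print(run(init_before_first_write=True))   # first write is clean
```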

icepack_zsalinity.F90

zsal_tot is NaN

This one does not appear to be initialized in CICE or in the Icepack driver, ever -- definitely a bug. It's strictly a history variable.
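A history-only variable like this can stay uninitialized unnoticed in non-debug builds, because nothing in the solution depends on it. One possible way to catch such cases without a trapping build is to scan fields for NaN before they are written. A minimal Python sketch follows; the helper `find_nan_fields` and the sample data are hypothetical, not part of CICE or Icepack.

```python
import math

def find_nan_fields(fields):
    """Return sorted names of fields containing at least one NaN (hypothetical)."""
    return sorted(
        name for name, values in fields.items()
        if any(math.isnan(v) for v in values)
    )

# Made-up sample: one field never assigned, one with real values.
fields = {
    "zsal_tot": [math.nan, math.nan],
    "aice": [0.9, 0.0],
}
print(find_nan_fields(fields))
```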

phil-blain commented 2 years ago

Problems in ice_grid.F90

Those were fixed in https://github.com/CICE-Consortium/CICE/pull/749