E3SM-Project / E3SM

Energy Exascale Earth System Model source code. NOTE: use "maint" branches for your work. Head of master is not validated.
https://docs.e3sm.org/E3SM

floating invalid error in 20tr_cam5_av1c-04p2 on cori-knl #3061

Closed lxu16 closed 4 years ago

lxu16 commented 5 years ago

I have been debugging this floating invalid error on the cori machine for a while without finding any clues. The error from e3sm.log is included below. The code compiles and runs on edison without any problem; the error appeared only after I switched from edison to cori. I suspect it is related to the r8 (double precision) flag and NetCDF file input, because the code itself is the same as before.

Could I get some help from the E3SM community? I would appreciate any suggestions.

[Screenshot: e3sm.log error output, 2019-07-11]

ndkeen commented 5 years ago

Are you able to provide a simple way for someone else to recreate the error? Either a script or via a create_test command?

Note that switching to cori from edison does change a few things, but we could quite possibly make the code fail in the same way on edison as well. Certainly, adjusting the PE layout can exercise the code in different ways. Things that are easy to try: change the number of MPI tasks, turn off threads, run in DEBUG, run on cori-haswell (instead of cori-knl)...

worleyph commented 5 years ago

What do you mean by "r8 (double precision) flag "?

lxu16 commented 5 years ago

Here is the PE layout I used for the simulation. I will try cori-haswell to see what happens.

else if ( `lowercase $processor_config` == 'customknl' ) then
  e3sm_print 'using custom layout for cori-knl because $processor_config = '$processor_config
  ${xmlchange_exe} MAX_TASKS_PER_NODE="64"
  ${xmlchange_exe} COSTPES_PER_NODE="256"
  ${xmlchange_exe} NTASKS_ATM="5400"
  ${xmlchange_exe} ROOTPE_ATM="0"
  ${xmlchange_exe} NTASKS_LND="320"
  ${xmlchange_exe} ROOTPE_LND="5120"
  ${xmlchange_exe} NTASKS_ICE="5120"
  ${xmlchange_exe} ROOTPE_ICE="0"
  ${xmlchange_exe} NTASKS_OCN="3840"
  ${xmlchange_exe} ROOTPE_OCN="5440"
  ${xmlchange_exe} NTASKS_CPL="5120"
  ${xmlchange_exe} ROOTPE_CPL="0"
  ${xmlchange_exe} NTASKS_GLC="320"
  ${xmlchange_exe} ROOTPE_GLC="5120"
  ${xmlchange_exe} NTASKS_ROF="320"
  ${xmlchange_exe} ROOTPE_ROF="5120"
  ${xmlchange_exe} NTASKS_WAV="5120"
  ${xmlchange_exe} ROOTPE_WAV="0"
  ${xmlchange_exe} NTHRDS_ATM="1"
  ${xmlchange_exe} NTHRDS_LND="1"
  ${xmlchange_exe} NTHRDS_ICE="1"
  ${xmlchange_exe} NTHRDS_OCN="1"
  ${xmlchange_exe} NTHRDS_CPL="1"
  ${xmlchange_exe} NTHRDS_GLC="1"
  ${xmlchange_exe} NTHRDS_ROF="1"
  ${xmlchange_exe} NTHRDS_WAV="1"
endif


lxu16 commented 5 years ago

I mean the FC_AUTO_R8 flag:

-r8

When I read the new file I created for soil erodibility, the values of the variable do not look right without this flag.


rljacob commented 5 years ago

This is still not enough info. We need the "create_newcase" line which you can find in README.case in the case directory. And the full path to your case directory.

lxu16 commented 5 years ago

My case directory is listed below. I changed the access permissions; let me know if you cannot access it.

/global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/case_scripts


ndkeen commented 5 years ago

I still think it's better if we can recreate the case.

cori11% ls /global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/
ls: cannot access '/global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/': Permission denied

lxu16 commented 5 years ago

Can you try one more time to access the case directory? Or let me know how I can share the script with you so that you can recreate the case.

worleyph commented 5 years ago

 <FC_AUTO_R8>
 -r8
 </FC_AUTO_R8>

We STRONGLY discourage autopromotion. Please explicitly type your variables with the correct type (r8 I assume, using the usual "types" module).

ndkeen commented 5 years ago

I copied your run_e3sm script here:

/global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/case_scripts/run_script_provenance/run_solP_F20TRC5AV1C-04P2.ne30_ne30.cori.csh.2019-07-11_15:24:40_PDT

And made some changes to allow this to work for me. When I set include_fire to be true (as it is there), I get the following error during build:

   Calling /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/cime_config/buildnml
ERROR: Command: '/global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/bld/configure -s -ccsm_seq -ice none -ocn docn -comp_intf mct  -spmd -spmd -smp -nosmp -dyn se -dyn_target preqx -res ne30np4  -cosp_libdir /global/cscratch1/sd/ndk/E3SM_simulations/FCTR20yrIC_SolP.ne30_ne30/bld/atm/obj/cosp -phys cam5 -clubb_sgs -microphys mg2 -rain_evap_to_coarse_aero -nlev 72 -chem linoz_mam4_resus_mom_soag_biop -bc_dep_to_snow_updates -cosp ' failed with error 'ERROR: linoz_mam4_resus_mom_soag_biop is not a valid value for parameter chem: valid values are waccm_mozart,waccm_mozart_mam3,waccm_mozart_sulfur,waccm_ghg,trop_mozart,trop_mozart_mam3,trop_mozart_soa,trop_strat_soa,trop_strat_mam3,trop_strat_mam7,super_fast_llnl,super_fast_llnl_mam3,trop_ghg,trop_bam,trop_mam3,trop_mam4,trop_mam4_resus,trop_mam4_resus_soag,trop_mam4_resus_mom,trop_mam4_mom,trop_mam7,linoz_mam3,linoz_mam4_resus,linoz_mam4_resus_soag,linoz_mam4_resus_mom,linoz_mam4_resus_mom_soag,none' from dir '/global/cscratch1/sd/ndk/E3SM_simulations/FCTR20yrIC_SolP.ne30_ne30/case_scripts/Buildconf/camconf'

Are there code changes you are making?

When I set include_fire to be false, it is now building.

One thing you can easily try yourself is building with DEBUG=TRUE. This may catch some floating-point issues and give you more information.

ndkeen commented 5 years ago

Also I see a potential issue in the way your script is setting the PE layout.

You have:
  ${xmlchange_exe} MAX_TASKS_PER_NODE="64"
  ${xmlchange_exe} COSTPES_PER_NODE="256"

And what you want is:
  ${xmlchange_exe} MAX_MPITASKS_PER_NODE="64"
  ${xmlchange_exe} MAX_TASKS_PER_NODE="256"

MAX_MPITASKS_PER_NODE is the new name for the setting your script calls MAX_TASKS_PER_NODE, and it is the most important one. The COSTPES variable is not needed at all.

This will certainly impact your PE layout, but may not fix the error.

In your casedir, the CaseStatus file has:

ERROR: RUN FAIL: Command 'srun  --label  -n 9280 -c 8   --cpu_bind=cores   -m plane=33

Which is not what you want.
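
As a cross-check, the -n 9280 in that srun line is what the layout posted earlier in this thread adds up to. A minimal sketch in Python, assuming the total MPI task count is the largest ROOTPE + NTASKS over all components (that rule and the helper names total_tasks and nodes_needed are assumptions for illustration, not CIME code):

```python
import math

# (ROOTPE, NTASKS) per component, copied from the customknl layout in this thread.
layout = {
    "ATM": (0, 5400),
    "LND": (5120, 320),
    "ICE": (0, 5120),
    "OCN": (5440, 3840),
    "CPL": (0, 5120),
    "GLC": (5120, 320),
    "ROF": (5120, 320),
    "WAV": (0, 5120),
}


def total_tasks(layout):
    """Largest ROOTPE + NTASKS over all components (assumed rule)."""
    return max(rootpe + ntasks for rootpe, ntasks in layout.values())


def nodes_needed(tasks, tasks_per_node):
    """Number of nodes required when each node holds tasks_per_node MPI tasks."""
    return math.ceil(tasks / tasks_per_node)


total = total_tasks(layout)
print(total)                    # 9280
print(nodes_needed(total, 64))  # 145
print(nodes_needed(total, 33))  # 282
```

The total matches the -n 9280 above. If plane=33 means 33 tasks per node (an assumption about how that srun option was applied here), the job would spread over 282 nodes instead of the 145 implied by 64 MPI tasks per node.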

lxu16 commented 5 years ago

I created a new chemistry module called "linoz_mam4_resus_mom_soag_biop", specifically designed to include both soluble and insoluble phosphorus aerosol emitted into the atmosphere from different sources on the landscape (e.g., fires, fossil fuel, dust, etc.). Could you use the option "include_fire=False" (which uses the chemistry module "linoz_mam4_resus_mom_soag") to see whether you can compile and run the model successfully?


lxu16 commented 5 years ago

OK, I will remove this flag from config_compilers.xml.


lxu16 commented 5 years ago

I see. It is good to know that. I will modify the PE layout and try one more time. Thanks!


ndkeen commented 5 years ago

I now see that with or without fire, I would need access to /global/u1/l/lix011 to test the script out as-is.

lxu16 commented 5 years ago

I changed the access permissions on the inputdata directory for the run without fires; please check whether you can access those data now.


ndkeen commented 5 years ago

Let us know how the test goes when you have the correct number of MPI tasks per node. And if you could try with DEBUG=TRUE (I assume you know you can xmlchange DEBUG=TRUE before building to get this).

I did try again but the permissions are still off.

cori07% ls -l /global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc
ls: cannot access '/global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc': Permission denied

lxu16 commented 5 years ago

I can try the correct PE layout and switch on DEBUG=TRUE first.

BTW, I modified the permissions on /global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc.

ndkeen commented 5 years ago

OK, I was able to build/run and even with DEBUG=TRUE, I get the same error as you. Without DEBUG=TRUE, I actually did get a different error though (COSP related).

casedir:  /global/cscratch1/sd/ndk/E3SM_simulations/FCTR20yrIC_noFire.DEBUG.ne30_ne30

5440: (seq_domain_areafactinit) : min/max drv2mdl   0.999994415310287       1.00000412978783    areafact_o_OCN
2112: forrtl: error (65): floating invalid
2112: Image              PC                Routine            Line        Source
2112: e3sm.exe           000000000BB24BDE  Unknown               Unknown  Unknown
2112: e3sm.exe           000000000B3BE420  Unknown               Unknown  Unknown
2112: e3sm.exe           000000000219EE2E  clubb_intr_mp_clu        1584  clubb_intr.F90
2112: e3sm.exe           0000000000D2A820  physpkg_mp_tphysb        2483  physpkg.F90
2112: e3sm.exe           0000000000D00AC6  physpkg_mp_phys_r        1034  physpkg.F90
2112: e3sm.exe           000000000081AD21  cam_comp_mp_cam_r         250  cam_comp.F90
2112: e3sm.exe           00000000007EA132  atm_comp_mct_mp_a         341  atm_comp_mct.F90
2112: e3sm.exe           00000000004563E3  component_mod_mp_         267  component_mod.F90
2112: e3sm.exe           0000000000429B6D  cime_comp_mod_mp_        1962  cime_comp_mod.F90
2112: e3sm.exe           000000000044C23B  MAIN__                     92  cime_driver.F90
2112: e3sm.exe           000000000040A80E  Unknown               Unknown  Unknown
2112: e3sm.exe           000000000BBFE4D9  Unknown               Unknown  Unknown

      !  Compute thermodynamic stuff needed for CLUBB on thermo levels.                                                                                                
      !  Inputs for the momentum levels are set below setup_clubb core                                                                                                 
      do k=1,pver
         p_in_Pa(k+1)         = state1%pmid(i,pver-k+1)                              ! Pressure profile                                                                
         exner(k+1)           = 1._r8/exner_clubb(i,pver-k+1)
         rho_ds_zt(k+1)       = (1._r8/gravit)*(state1%pdel(i,pver-k+1)/dz_g(pver-k+1))
         invrs_rho_ds_zt(k+1) = 1._r8/(rho_ds_zt(k+1))                               ! Inverse ds rho at thermo                                                        
         rho(i,k+1)           = rho_ds_zt(k+1)                                       ! rho on thermo                                                                   
         thv_ds_zt(k+1)       = thv(i,pver-k+1)                                      ! thetav on thermo                                                                
         rfrzm(k+1)           = state1%q(i,pver-k+1,ixcldice)
         radf(k+1)            = radf_clubb(i,pver-k+1)
         qrl_clubb(k+1)       = qrl(i,pver-k+1)/(cpair*state1%pdel(i,pver-k+1))  ! << this line
      enddo

lxu16 commented 5 years ago

That IS exactly the same error I had! BTW, I reran after fixing the PE layout as you recommended above, and the error is still there. I have now submitted the job on cori-haswell to see what happens.

ndkeen commented 5 years ago

Rebuilding with the GNU compiler, I get an error in perhaps the same place:

5440: (seq_domain_areafactinit) : min/max drv2mdl   0.999994415310287       1.00000412978783    areafact_o_OCN
2112:
2112: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
2112:
2112: Backtrace for this error:
2112: #0  0x22e5e2f in ???
2112:   at /home/abuild/rpmbuild/BUILD/glibc-2.22/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
2112: #1  0xb18423 in __clubb_intr_MOD_clubb_tend_cam
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/physics/cam/clubb_intr.F90:1584
2112: #2  0x61d1be in tphysbc
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/physics/cam/physpkg.F90:2485
2112: #3  0x626b68 in __physpkg_MOD_phys_run1
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/physics/cam/physpkg.F90:1036
2112: #4  0x511cd1 in __cam_comp_MOD_cam_run1
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/control/cam_comp.F90:250
2112: #5  0x50bd42 in __atm_comp_mct_MOD_atm_init_mct
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/cpl/atm_comp_mct.F90:341
2112: #6  0x43d4fb in __component_mod_MOD_component_init_cc
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/component_mod.F90:258
2112: #7  0x42ebc8 in __cime_comp_mod_MOD_cime_init
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/cime_comp_mod.F90:1965
2112: #8  0x439337 in cime_driver
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/cime_driver.F90:92
2112: #9  0x4393de in main
2112:   at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/cime_driver.F90:23
srun: error: nid04032: task 2112: Floating point exception

lxu16 commented 5 years ago

Does that mean that the compset "F20TRC5AV1C-04P2" perhaps is not working on cori-knl?

There is no code change at all in the no-fire run (except those input files I created). The model uses the default chemistry module defined in the CAM5 namelist of E3SM.

ndkeen commented 5 years ago

I think there is likely a floating-point issue as the compiler reports. The fact that you did not see it with edison may be more a function of the software (compiler version, for example), than the actual hardware.

I added a write statement just before the line that produces a floating-point issue and it looks like the array qrl is not initialized as the values are either zero or something like 3.953e-323. It's not clear where/when qrl is supposed to be initialized -- I do see there is a line call pbuf_get_field(pbuf, qrl_idx, qrl).

Different compilers treat using uninitialized variables differently, but typically we want to find/fix those.

Here is what I printed:

         write(*,'(a,i8,a,i8,a,i8,a,es10.4,a,es12.4)') "ndk i=", i, " k=", k, " pver=", pver, " state1%pdel(i,pver-k+1)=", state1%pdel(i,pver-k+1), " qrl(i,pver-k+1)=", qrl(i,pver-k+1) 
         qrl_clubb(k+1)       = qrl(i,pver-k+1)/(cpair*state1%pdel(i,pver-k+1))

I don't think cpair is zero as other lines in the code divide by this variable. I was looking for state1%pdel(i,pver-k+1) to be zero somewhere, but that doesn't seem to be the case. The values of qrl are mostly zero, but several are something like "e-318" as noted above.
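
One way to read a value like 3.953e-323: it is a subnormal double, and it is exactly what the 64-bit integer 8 looks like when its bits are reinterpreted as a double. A small Python sketch of that reading (the helper name int_bits_as_double is only for illustration, and this is a guess about the nature of the values, not a diagnosis of where they come from):

```python
import struct
import sys


def int_bits_as_double(n):
    """Reinterpret the bits of a 64-bit signed integer as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<q", n))[0]


x = int_bits_as_double(8)
print(f"{x:.3e}")                    # 3.953e-323
print(0.0 < x < sys.float_info.min)  # True: below the smallest normal double
```

Values this small are not plausible physical quantities, which is consistent with qrl holding leftover memory contents rather than computed data. The "e-318" entries may have a similar origin, though that has not been checked here.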

I also ran this same case with nothing in user_nl_cam and user_nl_clm and I see the same result.

Do we expect this type of test to work? create_test SMS_Ln5.ne30_ne30.F20TRC5AV1C-04P2

I did just try:

SMS_Ln5.ne30_ne30.F20TRC5AV1C-04P2
SMS_D_Ln5.ne30_ne30.F20TRC5AV1C-04P2

which both passed.

ndkeen commented 5 years ago

I also tested without COSP -- same failure. Then I tested using the current master as the repo -- same failure. Now I will attempt reproducing this with an F case and some of the key options here.

lxu16 commented 5 years ago

@ndkeen Thanks for helping me test the code!

I tried several things: switching the master (the newest one vs. the older one from Aug. 2018), changing the compiler options back and forth, and unlimiting the stacksize and coredumpsize. The error then changed to:

[Screenshot: error output, 2019-07-23]

Then I switched off the "-cosp" option in CAM_CONFIG_OPTS, and the 3-day test run succeeded. Afterwards I switched "-cosp" back on but set cosp_lite=.true. in the namelist; so far so good. I have finished a one-year run and the executable seems to be working now.

ndkeen commented 5 years ago

Hmmm. That's interesting that when I switched off COSP entirely, I still got the same error.

lxu16 commented 5 years ago

@ndkeen I tried the newest master but got the same error; when I switched back to the older master (Aug. 2018 version), it surprisingly passed. It IS interesting, isn't it?

ndkeen commented 5 years ago

When I run an F-case (ATM only) using the compset F20TRC5AV1C-04P2, it runs with the current master, as I stated above. Then I started adding some of the options that you have in your script. Primarily, these changes:

./xmlchange --id CLM_BLDNML_OPTS --val "-bgc bgc -nutrient cnp -nutrient_comp_pathway eca -soil_decomp century -methane -nitrif_denitrif"

With these, it will fail with a floating invalid error (run in DEBUG). It is a different error than posted above, but I can change the PE layout in the run_e3sm script and also get this same error:

 45: forrtl: error (65): floating invalid
 45: Image              PC                Routine            Line        Source             
 45: e3sm.exe           000000000C69BF8E  Unknown               Unknown  Unknown
 45: e3sm.exe           000000000BF35120  Unknown               Unknown  Unknown
 45: e3sm.exe           000000000743B047  allocationmod_mp_        2415  AllocationMod.F90
 45: e3sm.exe           0000000007FA2F44  soillittdecompmod         410  SoilLittDecompMod.F90
 45: e3sm.exe           0000000007A94E6D  ecosystemdynmod_m         590  EcosystemDynMod.F90
 45: e3sm.exe           00000000058E6B28  clm_driver_mp_clm        1012  clm_driver.F90
 45: e3sm.exe           000000000AAEB3E3  Unknown               Unknown  Unknown
 45: e3sm.exe           000000000AAA30D7  Unknown               Unknown  Unknown
 45: e3sm.exe           000000000AA75BD4  Unknown               Unknown  Unknown
 45: e3sm.exe           00000000058C063B  clm_driver_mp_clm         560  clm_driver.F90
 45: e3sm.exe           000000000588CC33  lnd_comp_mct_mp_l         509  lnd_comp_mct.F90
 45: e3sm.exe           0000000000466A13  component_mod_mp_         737  component_mod.F90
 45: e3sm.exe           0000000000430284  cime_comp_mod_mp_        2602  cime_comp_mod.F90
 45: e3sm.exe           000000000044EA0D  MAIN__                    133  cime_driver.F90

So then I started taking off some of those options to see if I could narrow anything down. I found that using:

./xmlchange --id CLM_BLDNML_OPTS --val "-bgc bgc -nutrient_comp_pathway eca"

Will cause the above error, while other combinations do not. It looks like I cannot try simply -nutrient_comp_pathway eca -- it must also need -bgc bgc.

The source where it is stopping is here:

                  dsolutionp_dt(c,j) = gross_pmin_vr(c,j) -potential_immob_p_vr(c,j) - &
                       col_plant_pdemand_vr(c,j) + biochem_pmin_vr_col(c,j) + &
                       primp_to_labilep_vr_col(c,j) + pdep_to_sminp(c) *ndep_prof(c,j)

Which makes me think it's another example of an array being used before init.

ndkeen commented 5 years ago

In github issue #3142, I'm now seeing that I get the same error with our "normal" F compset cases -- but only if I force 1 thread (pure MPI).

susburrows commented 5 years ago

Hi @lxu16 and @ndkeen -- just wanted to let you know that I got an identical error message today while building/running an F compset in ne30 from the current maint-v1.0 branch on cori. The traceback indicates the failure is occurring at line 1584 in clubb_intr.F90, the same as @lxu16 's original report. I am testing the solution proposed in @ndkeen 's issue #3142 (which looks like it might be relevant to this issue, too). I'll keep you posted on how this goes, but also wanted to ask if you have made any further progress / resolved the issue in the meantime? Thanks!

lxu16 commented 5 years ago

Did you try to define cosp_lite=.true. in the namelist and use "-cosp" in the runscript? I used this strategy to solve the error.


ndkeen commented 5 years ago

@lxu16: I don't think setting cosp_lite to true is a solution to this problem. It may have allowed you to continue to run, but it could be dangerous. Also, I think you are using a PE layout meant for a coupled case, while you appear to actually be running ATM-only.

lxu16 commented 5 years ago

@ndkeen I agree. This is only a workaround for the error. Please keep me posted if you find a better way to resolve this issue. Thanks!


ndkeen commented 4 years ago

I think the fix in PR3324 should work here, but I'm unable to run the same script as I did before (even after making Cori module changes). If someone could simply add the one line (to init qrl_idx=0) and try again, that would be great. Or provide me an updated script that works on Cori and I can try.