Closed lxu16 closed 4 years ago
Are you able to provide a simple way for someone else to recreate the error? Either a script or via a create_test command?
Note that switching to cori from edison does change a few things, but it could easily be the case that we could make the code fail in the same way on edison as well. Certainly, adjusting the PE layout can exercise the code in different ways. Things easy to try: change the number of MPI tasks, turn off threads, run in DEBUG, run on cori-haswell (instead of cori-knl)...
What do you mean by "r8 (double precision) flag "?
Here is the PE layout I used for the simulation. I will try cori-haswell to see what happens.
else if ( `lowercase $processor_config` == 'customknl' ) then
  e3sm_print 'using custom layout for cori-knl because $processor_config = '$processor_config
  ${xmlchange_exe} MAX_TASKS_PER_NODE="64"
  ${xmlchange_exe} COSTPES_PER_NODE="256"
  ${xmlchange_exe} NTASKS_ATM="5400"
  ${xmlchange_exe} ROOTPE_ATM="0"
  ${xmlchange_exe} NTASKS_LND="320"
  ${xmlchange_exe} ROOTPE_LND="5120"
  ${xmlchange_exe} NTASKS_ICE="5120"
  ${xmlchange_exe} ROOTPE_ICE="0"
  ${xmlchange_exe} NTASKS_OCN="3840"
  ${xmlchange_exe} ROOTPE_OCN="5440"
  ${xmlchange_exe} NTASKS_CPL="5120"
  ${xmlchange_exe} ROOTPE_CPL="0"
  ${xmlchange_exe} NTASKS_GLC="320"
  ${xmlchange_exe} ROOTPE_GLC="5120"
  ${xmlchange_exe} NTASKS_ROF="320"
  ${xmlchange_exe} ROOTPE_ROF="5120"
  ${xmlchange_exe} NTASKS_WAV="5120"
  ${xmlchange_exe} ROOTPE_WAV="0"
  ${xmlchange_exe} NTHRDS_ATM="1"
  ${xmlchange_exe} NTHRDS_LND="1"
  ${xmlchange_exe} NTHRDS_ICE="1"
  ${xmlchange_exe} NTHRDS_OCN="1"
  ${xmlchange_exe} NTHRDS_CPL="1"
  ${xmlchange_exe} NTHRDS_GLC="1"
  ${xmlchange_exe} NTHRDS_ROF="1"
  ${xmlchange_exe} NTHRDS_WAV="1"
endif
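As a quick sanity check on a layout like this one, the total MPI task count should be the largest ROOTPE + NTASKS over all components, assuming the default stride of 1. A minimal Python sketch with the values copied from the block above; the helper names `total_tasks` and `overlapping` are only illustrative and are not part of CIME:

```python
# Values copied from the customknl block: component -> (NTASKS, ROOTPE).
LAYOUT = {
    "ATM": (5400, 0),
    "LND": (320, 5120),
    "ICE": (5120, 0),
    "OCN": (3840, 5440),
    "CPL": (5120, 0),
    "GLC": (320, 5120),
    "ROF": (320, 5120),
    "WAV": (5120, 0),
}


def total_tasks(layout):
    """Highest MPI rank used by any component, plus one (illustrative helper)."""
    return max(ntasks + rootpe for ntasks, rootpe in layout.values())


def overlapping(layout, a, b):
    """True if components a and b share at least one MPI rank."""
    (na, ra), (nb, rb) = layout[a], layout[b]
    return ra < rb + nb and rb < ra + na


print(total_tasks(LAYOUT))                # 9280
print(overlapping(LAYOUT, "ATM", "LND"))  # True: ATM uses ranks 0..5399, LND 5120..5439
print(overlapping(LAYOUT, "ATM", "OCN"))  # False: OCN starts at rank 5440
```

The total of 9280 agrees with the `-n 9280` in the srun command recorded for this case.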
I mean the FC_AUTO_R8 flag.
When I tried to read the new file I created for soil erodibility, the values of the variable did not seem right without this flag.
This is still not enough info. We need the "create_newcase" line which you can find in README.case in the case directory. And the full path to your case directory.
My case directory is listed below. I changed the access permissions; let me know if you cannot access it. /global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/case_scripts
I still think it's better if we can recreate the case.
cori11% ls /global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/
ls: cannot access '/global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/': Permission denied
Can you try one more time to access the case directory? Or let me know how I can share the script with you so you can recreate the case.
<FC_AUTO_R8>
-r8
</FC_AUTO_R8>
We STRONGLY discourage autopromotion. Please explicitly type your variables with the correct type (r8 I assume, using the usual "types" module).
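To illustrate one way a kind mismatch can make values look wrong, here is a small Python sketch (not E3SM code, and not necessarily what happened with the soil erodibility file): the eight bytes of one double-precision value are reinterpreted as two single-precision values. The value 0.25 is made up for the example.

```python
import struct

# One double-precision value as it might sit in memory (value is made up).
raw = struct.pack("<d", 0.25)

# Read back with the matching kind: the value round-trips exactly.
as_double = struct.unpack("<d", raw)[0]

# Read back as two 4-byte reals instead: the same bytes now mean something else.
as_singles = struct.unpack("<2f", raw)

print(as_double)   # 0.25
print(as_singles)  # (0.0, 1.625)
```

Declaring the variable explicitly as r8 keeps the declared kind and the data in agreement without relying on a compiler flag.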
I copied your run_e3sm script here:
/global/cscratch1/sd/lix011/acme_scratch/FCTR20yrIC_SolP.ne30_ne30/case_scripts/run_script_provenance/run_solP_F20TRC5AV1C-04P2.ne30_ne30.cori.csh.2019-07-11_15:24:40_PDT
And made some changes to allow this to work for me. When I set include_fire to be true (as it is there), I get the following error during build:
Calling /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/cime_config/buildnml
ERROR: Command: '/global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/bld/configure -s -ccsm_seq -ice none -ocn docn -comp_intf mct -spmd -spmd -smp -nosmp -dyn se -dyn_target preqx -res ne30np4 -cosp_libdir /global/cscratch1/sd/ndk/E3SM_simulations/FCTR20yrIC_SolP.ne30_ne30/bld/atm/obj/cosp -phys cam5 -clubb_sgs -microphys mg2 -rain_evap_to_coarse_aero -nlev 72 -chem linoz_mam4_resus_mom_soag_biop -bc_dep_to_snow_updates -cosp ' failed with error 'ERROR: linoz_mam4_resus_mom_soag_biop is not a valid value for parameter chem: valid values are waccm_mozart,waccm_mozart_mam3,waccm_mozart_sulfur,waccm_ghg,trop_mozart,trop_mozart_mam3,trop_mozart_soa,trop_strat_soa,trop_strat_mam3,trop_strat_mam7,super_fast_llnl,super_fast_llnl_mam3,trop_ghg,trop_bam,trop_mam3,trop_mam4,trop_mam4_resus,trop_mam4_resus_soag,trop_mam4_resus_mom,trop_mam4_mom,trop_mam7,linoz_mam3,linoz_mam4_resus,linoz_mam4_resus_soag,linoz_mam4_resus_mom,linoz_mam4_resus_mom_soag,none' from dir '/global/cscratch1/sd/ndk/E3SM_simulations/FCTR20yrIC_SolP.ne30_ne30/case_scripts/Buildconf/camconf'
Are there code changes you are making?
When I set include_fire to be false, it is now building.
One thing you can easily try yourself, is building with DEBUG=TRUE. This might easily catch some floating-point issues and give you more information.
Also I see a potential issue in the way your script is setting the PE layout.
You have:
${xmlchange_exe} MAX_TASKS_PER_NODE="64"
${xmlchange_exe} COSTPES_PER_NODE="256"
And what you want is:
${xmlchange_exe} MAX_MPITASKS_PER_NODE="64"
${xmlchange_exe} MAX_TASKS_PER_NODE="256"
MAX_MPITASKS_PER_NODE has a new name and is the most important setting. The COSTPES variable is not needed at all.
This will certainly impact your PE layout, but may not fix the error.
In your casedir, the CaseStatus file has:
ERROR: RUN FAIL: Command 'srun --label -n 9280 -c 8 --cpu_bind=cores -m plane=33
Which is not what you want.
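A rough Python sketch of what the two placements imply for node usage. It assumes that `-m plane=33` ends up placing 33 tasks on each node (Slurm's plane distribution hands out tasks in blocks of that size) and that the intended layout is 64 MPI tasks per node; the helper name `nodes_needed` is hypothetical.

```python
import math


def nodes_needed(total_tasks, tasks_per_node):
    """Whole nodes required to host total_tasks MPI ranks (hypothetical helper)."""
    return math.ceil(total_tasks / tasks_per_node)


TOTAL_TASKS = 9280  # from the srun line recorded for this case

print(nodes_needed(TOTAL_TASKS, 33))  # 282 nodes with the plane=33 placement
print(nodes_needed(TOTAL_TASKS, 64))  # 145 nodes with 64 MPI tasks per node
```

Under those assumptions the job as submitted would occupy nearly twice as many nodes as intended.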
I created the new chemistry module called "linoz_mam4_resus_mom_soag_biop" that is specifically designed to include both soluble and insoluble phosphorus aerosol emitted from different sources from landscapes (e.g., fires, fossil fuel, dust, etc) into the atmosphere. Could you use the option "include_fire=False" (that will use the chemistry module "linoz_mam4_resus_mom_soag") to see if you can compile and run the model successfully?
OK, I will remove this flag from config_compilers.xml.
I see. It is good to know that. I will modify the PE layout and try one more time. Thanks!
I now see that with or without fire, I would need access to /global/u1/l/lix011 to test the script out as-is.
I changed the access permissions on the inputdata directory for the run without fires. Please check whether you can access those data now.
Let us know how the test goes when you have the correct number of MPI tasks per node. And if you could try with DEBUG=TRUE (I assume you know you can xmlchange DEBUG=TRUE before building to get this).
I did try again but the permissions are still off.
cori07% ls -l /global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc
ls: cannot access '/global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc': Permission denied
I can try the correct PE layout and switch on DEBUG=TRUE first.
BTW, I modified the permission for the /global/u1/l/lix011/fire.test/ELMv1.ne30_ne30.restart.clm2.r.1997-01-01-00000.nc.
OK, I was able to build/run and even with DEBUG=TRUE, I get the same error as you. Without DEBUG=TRUE, I actually did get a different error though (COSP related).
casedir: /global/cscratch1/sd/ndk/E3SM_simulations/FCTR20yrIC_noFire.DEBUG.ne30_ne30
5440: (seq_domain_areafactinit) : min/max drv2mdl 0.999994415310287 1.00000412978783 areafact_o_OCN
2112: forrtl: error (65): floating invalid
2112: Image PC Routine Line Source
2112: e3sm.exe 000000000BB24BDE Unknown Unknown Unknown
2112: e3sm.exe 000000000B3BE420 Unknown Unknown Unknown
2112: e3sm.exe 000000000219EE2E clubb_intr_mp_clu 1584 clubb_intr.F90
2112: e3sm.exe 0000000000D2A820 physpkg_mp_tphysb 2483 physpkg.F90
2112: e3sm.exe 0000000000D00AC6 physpkg_mp_phys_r 1034 physpkg.F90
2112: e3sm.exe 000000000081AD21 cam_comp_mp_cam_r 250 cam_comp.F90
2112: e3sm.exe 00000000007EA132 atm_comp_mct_mp_a 341 atm_comp_mct.F90
2112: e3sm.exe 00000000004563E3 component_mod_mp_ 267 component_mod.F90
2112: e3sm.exe 0000000000429B6D cime_comp_mod_mp_ 1962 cime_comp_mod.F90
2112: e3sm.exe 000000000044C23B MAIN__ 92 cime_driver.F90
2112: e3sm.exe 000000000040A80E Unknown Unknown Unknown
2112: e3sm.exe 000000000BBFE4D9 Unknown Unknown Unknown
! Compute thermodynamic stuff needed for CLUBB on thermo levels.
! Inputs for the momentum levels are set below setup_clubb core
do k=1,pver
p_in_Pa(k+1) = state1%pmid(i,pver-k+1) ! Pressure profile
exner(k+1) = 1._r8/exner_clubb(i,pver-k+1)
rho_ds_zt(k+1) = (1._r8/gravit)*(state1%pdel(i,pver-k+1)/dz_g(pver-k+1))
invrs_rho_ds_zt(k+1) = 1._r8/(rho_ds_zt(k+1)) ! Inverse ds rho at thermo
rho(i,k+1) = rho_ds_zt(k+1) ! rho on thermo
thv_ds_zt(k+1) = thv(i,pver-k+1) ! thetav on thermo
rfrzm(k+1) = state1%q(i,pver-k+1,ixcldice)
radf(k+1) = radf_clubb(i,pver-k+1)
qrl_clubb(k+1) = qrl(i,pver-k+1)/(cpair*state1%pdel(i,pver-k+1)) ! << this line
enddo
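For readers unfamiliar with the indexing in this loop: it flips the vertical ordering, reading CAM level pver-k+1 and writing CLUBB level k+1. A minimal Python sketch of the mapping, using the 72 levels from the "-nlev 72" in the configure command; the helper name `level_map` is only illustrative.

```python
PVER = 72  # number of model levels, from "-nlev 72" in the configure command


def level_map(pver):
    """Pairs (clubb_index, cam_index) visited by the loop, with 1-based indices."""
    return [(k + 1, pver - k + 1) for k in range(1, pver + 1)]


pairs = level_map(PVER)
print(pairs[0])   # (2, 72): CLUBB level 2 is filled from CAM level 72
print(pairs[-1])  # (73, 1): CLUBB level 73 is filled from CAM level 1
```

So every CAM level of qrl and state1%pdel is read once by the flagged line, and CLUBB index 1 is not set inside this loop.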
That IS exactly the same error I had! BTW, I tried the run after fixing the PE layout as you recommended above and the error is still there. I have now submitted the job using the compiler on cori-haswell to see what happens.
Rebuilding with the GNU compiler, I get an error in perhaps the same place, but with new info:
5440: (seq_domain_areafactinit) : min/max drv2mdl 0.999994415310287 1.00000412978783 areafact_o_OCN
2112:
2112: Program received signal SIGFPE: Floating-point exception - erroneous arithmetic operation.
2112:
2112: Backtrace for this error:
2112: #0 0x22e5e2f in ???
2112: at /home/abuild/rpmbuild/BUILD/glibc-2.22/nptl/../sysdeps/unix/sysv/linux/x86_64/sigaction.c:0
2112: #1 0xb18423 in __clubb_intr_MOD_clubb_tend_cam
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/physics/cam/clubb_intr.F90:1584
2112: #2 0x61d1be in tphysbc
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/physics/cam/physpkg.F90:2485
2112: #3 0x626b68 in __physpkg_MOD_phys_run1
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/physics/cam/physpkg.F90:1036
2112: #4 0x511cd1 in __cam_comp_MOD_cam_run1
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/control/cam_comp.F90:250
2112: #5 0x50bd42 in __atm_comp_mct_MOD_atm_init_mct
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/components/cam/src/cpl/atm_comp_mct.F90:341
2112: #6 0x43d4fb in __component_mod_MOD_component_init_cc
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/component_mod.F90:258
2112: #7 0x42ebc8 in __cime_comp_mod_MOD_cime_init
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/cime_comp_mod.F90:1965
2112: #8 0x439337 in cime_driver
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/cime_driver.F90:92
2112: #9 0x4393de in main
2112: at /global/cscratch1/sd/ndk/wacmy/E3SM_code/20180822/cime/src/drivers/mct/main/cime_driver.F90:23
srun: error: nid04032: task 2112: Floating point exception
Does that mean that the compset "F20TRC5AV1C-04P2" perhaps is not working on cori-knl?
There is no code change at all in the no-fire run (except those input files I created). The model uses the default chemistry module defined in the CAM5 namelist of E3SM.
I think there is likely a floating-point issue as the compiler reports. The fact that you did not see it with edison may be more a function of the software (compiler version, for example), than the actual hardware.
I added a write statement just before the line that produces a floating-point issue and it looks like the array qrl is not initialized as the values are either zero or something like 3.953e-323. It's not clear where/when qrl is supposed to be initialized -- I do see there is a line call pbuf_get_field(pbuf, qrl_idx, qrl).
Different compilers treat using uninitialized variables differently, but typically we want to find/fix those.
Here is what I printed:
write(*,'(a,i8,a,i8,a,i8,a,es10.4,a,es12.4)') "ndk i=", i, " k=", k, " pver=", pver, " state1%pdel(i,pver-k+1)=", state1%pdel(i,pver-k+1), " qrl(i,pver-k+1)=", qrl(i,pver-k+1)
qrl_clubb(k+1) = qrl(i,pver-k+1)/(cpair*state1%pdel(i,pver-k+1))
I don't think cpair is zero as other lines in the code divide by this variable. I was looking for state1%pdel(i,pver-k+1) to be zero somewhere, but that doesn't seem to be the case. The values of qrl are mostly zero, but several are something like "e-318" as noted above.
I also ran this same case with nothing in user_nl_cam and user_nl_clm and I see the same result.
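Values such as 3.953e-323 are subnormal doubles, which is what a small integer looks like when its bits are read as a real. That is consistent with (though not proof of) memory that was never filled with real data. A minimal Python sketch; both helper names are made up for illustration:

```python
import struct
import sys


def bits_as_double(n):
    """Reinterpret the 64-bit integer n as an IEEE-754 double."""
    return struct.unpack("<d", struct.pack("<q", n))[0]


def is_subnormal(x):
    """True for nonzero values below the smallest normal double (about 2.2e-308)."""
    return 0.0 < abs(x) < sys.float_info.min


x = bits_as_double(8)
print(f"{x:.3e}")       # 3.953e-323
print(is_subnormal(x))  # True
```

The integer 8 happens to reproduce the printed value exactly; where such bits would come from in the model is not something this sketch can tell.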
Do we expect this type of test to work? create_test SMS_Ln5.ne30_ne30.F20TRC5AV1C-04P2
I did just try:
SMS_Ln5.ne30_ne30.F20TRC5AV1C-04P2
SMS_D_Ln5.ne30_ne30.F20TRC5AV1C-04P2
which both passed.
I also tested without COSP -- same failure. Then I tested using the current master as the repo -- same failure. Now I will attempt reproducing this with an F case and some of the key options here.
@ndkeen Thanks for helping me test the code!
I tried several things: switching the master (newest vs. the older one published in Aug. 2018), modifying the compiler options (back and forth), and unlimiting the stacksize and coredumpsize. The error changed to
Then I switched off the "-cosp" option in CAM_CONFIG_OPTS, and the 3-day test run succeeded. Afterwards I switched "-cosp" back on but used cosp_lite=.true. in the namelist; so far so good. I finished a one-year run and the executable seems to be working now.
Hmmm. That's interesting that when I switched off COSP entirely, I still got the same error.
@ndkeen I tried the newest master but got the same error. When I switched back to the older master (Aug. 2018 version), it surprisingly passed. It IS interesting, isn't it?
When I run an F-case (ATM only) using the compset F20TRC5AV1C-04P2, it will run with current master as I stated above. Then I started adding some of the options that you have in your script. Primarily, these changes:
./xmlchange --id CLM_BLDNML_OPTS --val "-bgc bgc -nutrient cnp -nutrient_comp_pathway eca -soil_decomp century -methane -nitrif_denitrif"
With these, it will fail with a floating invalid error (run in DEBUG). It is a different error than posted above, but I can change the PE layout in the run_e3sm script and also get this same error:
45: forrtl: error (65): floating invalid
45: Image PC Routine Line Source
45: e3sm.exe 000000000C69BF8E Unknown Unknown Unknown
45: e3sm.exe 000000000BF35120 Unknown Unknown Unknown
45: e3sm.exe 000000000743B047 allocationmod_mp_ 2415 AllocationMod.F90
45: e3sm.exe 0000000007FA2F44 soillittdecompmod 410 SoilLittDecompMod.F90
45: e3sm.exe 0000000007A94E6D ecosystemdynmod_m 590 EcosystemDynMod.F90
45: e3sm.exe 00000000058E6B28 clm_driver_mp_clm 1012 clm_driver.F90
45: e3sm.exe 000000000AAEB3E3 Unknown Unknown Unknown
45: e3sm.exe 000000000AAA30D7 Unknown Unknown Unknown
45: e3sm.exe 000000000AA75BD4 Unknown Unknown Unknown
45: e3sm.exe 00000000058C063B clm_driver_mp_clm 560 clm_driver.F90
45: e3sm.exe 000000000588CC33 lnd_comp_mct_mp_l 509 lnd_comp_mct.F90
45: e3sm.exe 0000000000466A13 component_mod_mp_ 737 component_mod.F90
45: e3sm.exe 0000000000430284 cime_comp_mod_mp_ 2602 cime_comp_mod.F90
45: e3sm.exe 000000000044EA0D MAIN__ 133 cime_driver.F90
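When scanning many of these tracebacks, the useful entry is usually the first frame that names a source file. A small Python sketch that pulls it out; it assumes lines shaped like the forrtl output above ("rank: image PC routine line source"), and the helper name `first_known_frame` is hypothetical.

```python
def first_known_frame(traceback_lines):
    """Return (routine, line, source) for the first frame with a known source file."""
    for entry in traceback_lines:
        fields = entry.split()
        if len(fields) != 6:
            continue  # not shaped like a traceback frame
        _rank, _image, _pc, routine, line, source = fields
        # Skip "Unknown" frames and header/error lines whose line field is not a number.
        if source != "Unknown" and line.isdigit():
            return routine, int(line), source
    return None


sample = [
    "45: forrtl: error (65): floating invalid",
    "45: Image PC Routine Line Source",
    "45: e3sm.exe 000000000C69BF8E Unknown Unknown Unknown",
    "45: e3sm.exe 000000000743B047 allocationmod_mp_ 2415 AllocationMod.F90",
]
print(first_known_frame(sample))  # ('allocationmod_mp_', 2415, 'AllocationMod.F90')
```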
So then I started taking off some of those options to see if I could narrow anything down. I found that using:
./xmlchange --id CLM_BLDNML_OPTS --val "-bgc bgc -nutrient_comp_pathway eca"
will cause the above error, while other combinations do not. It looks like I cannot try simply -nutrient_comp_pathway eca -- it must also need -bgc bgc.
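Narrowing down by hand gets tedious with six options, so here is a hedged Python sketch that enumerates the option subsets to try, smallest first. The option strings come from the CLM_BLDNML_OPTS value quoted earlier; the only dependency encoded is the one observed here (eca needs -bgc bgc), and there may well be others that this sketch does not know about.

```python
from itertools import combinations

# Options taken from the CLM_BLDNML_OPTS value quoted in this thread.
OPTIONS = [
    "-bgc bgc",
    "-nutrient cnp",
    "-nutrient_comp_pathway eca",
    "-soil_decomp century",
    "-methane",
    "-nitrif_denitrif",
]

# Observation from this thread: the eca pathway cannot be used without -bgc bgc.
REQUIRES = {"-nutrient_comp_pathway eca": "-bgc bgc"}


def candidate_sets(options, requires):
    """All non-empty option subsets, smallest first, that satisfy `requires`."""
    result = []
    for size in range(1, len(options) + 1):
        for subset in combinations(options, size):
            if all(dep in subset for opt, dep in requires.items() if opt in subset):
                result.append(subset)
    return result


# Print the first few xmlchange commands as a preview.
for subset in candidate_sets(OPTIONS, REQUIRES)[:5]:
    print('./xmlchange --id CLM_BLDNML_OPTS --val "' + " ".join(subset) + '"')
```

Each printed command would still need its own build and run, so in practice one would stop at the first failing subset.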
The source where it is stopping is here:
dsolutionp_dt(c,j) = gross_pmin_vr(c,j) -potential_immob_p_vr(c,j) - &
col_plant_pdemand_vr(c,j) + biochem_pmin_vr_col(c,j) + &
primp_to_labilep_vr_col(c,j) + pdep_to_sminp(c) *ndep_prof(c,j)
Which makes me think it's another example of an array being used before init.
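One way to confirm which array is the culprit is to test every term of the expression separately before the sum is formed. A Python sketch of the idea with made-up values; the names follow the Fortran expression, and which term is marked as bad here is arbitrary.

```python
import math


def nonfinite_terms(terms):
    """Names of the terms whose value is NaN or infinite."""
    return [name for name, value in terms.items() if not math.isfinite(value)]


# Made-up values for one (c, j) point.
terms = {
    "gross_pmin_vr": 1.0e-9,
    "potential_immob_p_vr": 2.0e-10,
    "col_plant_pdemand_vr": float("nan"),  # stands in for an unset array element
    "biochem_pmin_vr_col": 0.0,
    "primp_to_labilep_vr_col": 3.0e-12,
    "pdep_to_sminp": 1.0e-13,
    "ndep_prof": 0.5,
}
print(nonfinite_terms(terms))  # ['col_plant_pdemand_vr']
```

In the Fortran source the equivalent would be a temporary write or check on each array just before the flagged line, as was done for qrl.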
In github issue #3142, I'm now seeing that I get the same error with our "normal" F compset cases -- but only if I force 1 thread (pure MPI).
Hi @lxu16 and @ndkeen -- just wanted to let you know that I got an identical error message today while building/running an F compset in ne30 from the current maint-v1.0 branch on cori. The traceback indicates the failure is occurring at line 1584 in clubb_intr.F90, the same as @lxu16 's original report. I am testing the solution proposed in @ndkeen 's issue #3142 (which looks like it might be relevant to this issue, too). I'll keep you posted on how this goes, but also wanted to ask if you have made any further progress / resolved the issue in the meantime? Thanks!
Did you try to define cosp_lite=.true. in the namelist and use "-cosp" in the runscript? I used this strategy to solve the error.
@lxu16: I don't think setting cosp_lite to true is a solution to this problem. It may have allowed you to continue to run, but it could be dangerous. Also, I think you are using a PE layout meant for coupled case, while you appear to actually only be doing ATM-only.
@ndkeen I agree. This is only a workaround for the error. Please keep me posted if you find a better way to resolve this issue. Thanks!
I think the fix in PR3324 should work here, but I'm unable to run the same script as I did before (even after making Cori module changes). If someone could simply add the one line (to init qrl_idx=0) and try again, that would be great. Or provide me an updated script that works on Cori and I can try.
I have been debugging this floating invalid error for a while on cori without finding any clues. I included the e3sm.log error below. I can compile and run the code on edison without any problem, and the issue only appeared after switching from edison to cori. I suspect the error is related to the r8 (double precision) flag and NetCDF file input, because the code is the same as before.
May I ask for some help from the E3SM community? I appreciate your suggestions.