Closed DusanJovic-NOAA closed 1 month ago
One of the ufs-weather-model debug regression tests is failing with a floating-point invalid exception. Here is the backtrace:
96: ==== backtrace (tid:1684736) ====
96: 0 0x0000000000054d90 __GI___sigaction() :0
96: 1 0x0000000002f10983 nh_utils_mod_mp_riem_solver_c_.DIR.OMP.PARALLEL.LOOP.23721.split3724.split3747() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/nh_utils.F90:433
96: 2 0x000000000015ff13 __kmp_invoke_microtask() ???:0
96: 3 0x00000000000cf664 _INTERNALdc23a976::__kmp_serial_fork_call() /nfs/site/proj/openmp/promo/20230118/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:1927
96: 4 0x00000000000cf664 __kmp_fork_call() /nfs/site/proj/openmp/promo/20230118/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:2188
96: 5 0x0000000000088d23 __kmpc_fork_call() /nfs/site/proj/openmp/promo/20230118/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_csupport.cpp:350
96: 6 0x0000000002e499d8 nh_utils_mod_mp_riem_solver_c_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/nh_utils.F90:404
96: 7 0x00000000024645d3 dyn_core_mod_mp_dyn_core_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/dyn_core.F90:628
96: 8 0x000000000256c864 fv_dynamics_mod_mp_fv_dynamics_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:683
96: 9 0x000000000235d824 atmosphere_mod_mp_adiabatic_init_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:1889
96: 10 0x0000000002309421 atmosphere_mod_mp_atmosphere_init_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90:574
96: 11 0x0000000001e82c2b atmos_model_mod_mp_atmos_model_init_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_model.F90:565
96: 12 0x0000000001b726f6 module_fcst_grid_comp_mp_fcst_initialize_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/module_fcst_grid_comp.F90:811
96: 13 0x0000000000ad58c4 ESMCI::FTable::callVFuncPtr() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
96: 14 0x0000000000ad983f ESMCI_FTableCallEntryPointVMHop() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
96: 15 0x0000000001149e8a ESMCI::VMK::enter() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:1247
96: 16 0x00000000012dea88 ESMCI::VM::enter() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
96: 17 0x0000000000ad6d1a c_esmc_ftablecallentrypointvm_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
96: 18 0x00000000009e8e60 esmf_compmod_mp_esmf_compexecute_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
96: 19 0x0000000000dc2db1 esmf_gridcompmod_mp_esmf_gridcompinitialize_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1419
96: 20 0x0000000001b470f5 fv3atm_cap_mod_mp_initializeadvertise_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/fv3_cap.F90:444
96: 21 0x000000000110af78 ESMCI::MethodElement::execute() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:377
96: 22 0x000000000110aee7 ESMCI::MethodTable::execute() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:563
96: 23 0x000000000110b574 c_esmc_methodtableexecuteef_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_MethodTable.C:347
96: 24 0x0000000000ac12bb esmf_attachmethodsmod_mp_esmf_methodgridcompexecute_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/AttachMethods/src/ESMF_AttachMethods.F90:1280
96: 25 0x0000000003a137e6 nuopc_modelbase_mp_initializeipdvxp01_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_ModelBase.F90:687
96: 26 0x0000000000ad58c4 ESMCI::FTable::callVFuncPtr() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
96: 27 0x0000000000ad983f ESMCI_FTableCallEntryPointVMHop() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
96: 28 0x0000000001149c7a ESMCI::VMK::enter() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
96: 29 0x00000000012dea88 ESMCI::VM::enter() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
96: 30 0x0000000000ad6d1a c_esmc_ftablecallentrypointvm_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
96: 31 0x00000000009e8e60 esmf_compmod_mp_esmf_compexecute_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
96: 32 0x0000000000dc2db1 esmf_gridcompmod_mp_esmf_gridcompinitialize_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1419
96: 33 0x000000000098b020 nuopc_driver_mp_loopmodelcompss_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:2889
96: 34 0x00000000009b3948 nuopc_driver_mp_initializeipdv02p1_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:1346
96: 35 0x00000000009bd13b nuopc_driver_mp_initializegeneric_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/addon/NUOPC/src/NUOPC_Driver.F90:484
96: 36 0x0000000000ad58c4 ESMCI::FTable::callVFuncPtr() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:2167
96: 37 0x0000000000ad983f ESMCI_FTableCallEntryPointVMHop() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:824
96: 38 0x0000000001149c7a ESMCI::VMK::enter() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VMKernel.C:2501
96: 39 0x00000000012dea88 ESMCI::VM::enter() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Infrastructure/VM/src/ESMCI_VM.C:1216
96: 40 0x0000000000ad6d1a c_esmc_ftablecallentrypointvm_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMCI_FTable.C:981
96: 41 0x00000000009e8e60 esmf_compmod_mp_esmf_compexecute_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_Comp.F90:1252
96: 42 0x0000000000dc2db1 esmf_gridcompmod_mp_esmf_gridcompinitialize_() /work/noaa/epic/role-epic/spack-stack/hercules/spack-stack-1.6.0/cache/build_stage/spack-stage-esmf-8.6.0-rqrapepmgfb7kpri3ynqlxusquf6npfq/spack-src/src/Superstructure/Component/src/ESMF_GridComp.F90:1419
96: 43 0x000000000042e373 MAIN__() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/driver/UFS.F90:389
96: 44 0x000000000042c83d main() ???:0
96: 45 0x000000000003feb0 __libc_start_call_main() ???:0
96: 46 0x000000000003ff60 __libc_start_main_alias_2() :0
96: 47 0x000000000042c755 _start() ???:0
96: =================================
96: forrtl: error (75): floating point exception
96: Image PC Routine Line Source
96: libc.so.6 00001547B04AED90 Unknown Unknown Unknown
96: ufs_model 0000000002F10983 nh_utils_mod_mp_r 433 nh_utils.F90
96: libiomp5.so 00001547B2BD7F13 __kmp_invoke_micr Unknown Unknown
96: libiomp5.so 00001547B2B47664 __kmp_fork_call Unknown Unknown
96: libiomp5.so 00001547B2B00D23 __kmpc_fork_call Unknown Unknown
96: ufs_model 0000000002E499D8 riem_solver_c 404 nh_utils.F90
96: ufs_model 00000000024645D3 dyn_core 628 dyn_core.F90
96: ufs_model 000000000256C864 fv_dynamics 683 fv_dynamics.F90
96: ufs_model 000000000235D824 adiabatic_init 1889 atmosphere.F90
96: ufs_model 0000000002309421 atmosphere_init 574 atmosphere.F90
96: ufs_model 0000000001E82C2B atmos_model_init 565 atmos_model.F90
96: ufs_model 0000000001B726F6 fcst_initialize 811 module_fcst_grid_comp.F90
96: ufs_model 0000000000AD58C4 Unknown Unknown Unknown
96: ufs_model 0000000000AD983F Unknown Unknown Unknown
96: ufs_model 0000000001149E8A Unknown Unknown Unknown
96: ufs_model 00000000012DEA88 Unknown Unknown Unknown
96: ufs_model 0000000000AD6D1A Unknown Unknown Unknown
96: ufs_model 00000000009E8E60 Unknown Unknown Unknown
96: ufs_model 0000000000DC2DB1 Unknown Unknown Unknown
96: ufs_model 0000000001B470F5 initializeadverti 444 fv3_cap.F90
96: ufs_model 000000000110AF78 Unknown Unknown Unknown
96: ufs_model 000000000110AEE7 Unknown Unknown Unknown
96: ufs_model 000000000110B574 Unknown Unknown Unknown
96: ufs_model 0000000000AC12BB Unknown Unknown Unknown
96: ufs_model 0000000003A137E6 Unknown Unknown Unknown
96: ufs_model 0000000000AD58C4 Unknown Unknown Unknown
96: ufs_model 0000000000AD983F Unknown Unknown Unknown
96: ufs_model 0000000001149C7A Unknown Unknown Unknown
96: ufs_model 00000000012DEA88 Unknown Unknown Unknown
96: ufs_model 0000000000AD6D1A Unknown Unknown Unknown
96: ufs_model 00000000009E8E60 Unknown Unknown Unknown
96: ufs_model 0000000000DC2DB1 Unknown Unknown Unknown
96: ufs_model 000000000098B020 Unknown Unknown Unknown
96: ufs_model 00000000009B3948 Unknown Unknown Unknown
96: ufs_model 00000000009BD13B Unknown Unknown Unknown
96: ufs_model 0000000000AD58C4 Unknown Unknown Unknown
96: ufs_model 0000000000AD983F Unknown Unknown Unknown
96: ufs_model 0000000001149C7A Unknown Unknown Unknown
96: ufs_model 00000000012DEA88 Unknown Unknown Unknown
96: ufs_model 0000000000AD6D1A Unknown Unknown Unknown
96: ufs_model 00000000009E8E60 Unknown Unknown Unknown
96: ufs_model 0000000000DC2DB1 Unknown Unknown Unknown
96: ufs_model 000000000042E373 ufs 389 UFS.F90
96: ufs_model 000000000042C83D Unknown Unknown Unknown
96: libc.so.6 00001547B0499EB0 Unknown Unknown Unknown
96: libc.so.6 00001547B0499F60 __libc_start_main Unknown Unknown
96: ufs_model 000000000042C755 Unknown Unknown Unknown
Line 433:
428 do k=2,km+1
429 do i=is1, ie1
430 pem(i,k) = pem(i,k-1) + dm(i,k-1)
431 #ifdef USE_COND
432 ! Excluding contribution from condensates:
433 peg(i,k) = peg(i,k-1) + dm(i,k-1)*(1.-q_con(i,j,k-1))
434 #endif
435 enddo
436 enddo
I found that the q_con array has some uninitialized (NaN) elements. If I explicitly initialize the q_con array in fv_arrays.F90:
diff --git a/model/fv_arrays.F90 b/model/fv_arrays.F90
index e6e65a6..d18ca37 100644
--- a/model/fv_arrays.F90
+++ b/model/fv_arrays.F90
@@ -1554,6 +1554,7 @@ contains
#else
allocate ( Atm%q_con(isd:isd,jsd:jsd,1) )
#endif
+ Atm%q_con = 0.0
! Notes by SJL
! Place the memory in the optimal shared mem space
the regression test finishes successfully. Should I add this initialization of q_con to zero?
Hi, @DusanJovic-NOAA Which specific test is failing? This may indicate that q_con is not correctly filled with the proper data in some situations.
You are right that it would be best to explicitly initialize q_con, although it should be done in the loops below to ensure the shared memory for OpenMP is set up right. It is also best to set it to real_big; if q_con is set to 0, a physically valid value, then errors would not be detected.
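The point about doing the initialization "in the loops below" is presumably the usual OpenMP first-touch placement: the thread that first writes a portion of the array places those pages in memory close to itself. A minimal, self-contained sketch of the idea (bounds, names, and the clause list are illustrative, not the model's actual code):

program first_touch_demo
  implicit none
  integer, parameter :: isd = -2, ied = 98, jsd = -2, jed = 98, npz = 64
  real,    parameter :: real_big = 1.e8
  real, allocatable  :: q_con(:,:,:)
  integer :: i, j, k

  allocate(q_con(isd:ied, jsd:jed, npz))

  ! Each thread initializes (first-touches) the k-slabs it will later work on.
!$OMP parallel do default(none) shared(q_con) private(i,j,k)
  do k = 1, npz
     do j = jsd, jed
        do i = isd, ied
           q_con(i,j,k) = real_big
        end do
     end do
  end do
!$OMP end parallel do
end program first_touch_demo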
The test that fails is 'control_debug_p8'.
When I initialize q_con to real_big, like:
diff --git a/model/fv_arrays.F90 b/model/fv_arrays.F90
index e6e65a6..b2c1fc9 100644
--- a/model/fv_arrays.F90
+++ b/model/fv_arrays.F90
@@ -1566,6 +1566,9 @@ contains
Atm%va(i,j,k) = real_big
Atm%pt(i,j,k) = real_big
Atm%delp(i,j,k) = real_big
+#ifdef USE_COND
+ Atm%q_con(i,j,k) = real_big
+#endif
enddo
enddo
do j=jsd, jed+1
@@ -1616,6 +1619,9 @@ contains
Atm%ts(i,j) = 300.
Atm%phis(i,j) = real_big
+#ifndef USE_COND
+ Atm%q_con(i,j,1) = real_big
+#endif
enddo
enddo
the model still fails with an invalid floating-point exception. If I add debug prints, like:
diff --git a/model/nh_utils.F90 b/model/nh_utils.F90
index 2b89aea..2a56f7e 100644
--- a/model/nh_utils.F90
+++ b/model/nh_utils.F90
@@ -430,6 +430,9 @@ CONTAINS
pem(i,k) = pem(i,k-1) + dm(i,k-1)
#ifdef USE_COND
! Excluding contribution from condensates:
+ if(i==is1 .and. j==(js-1) .and. k==2) then
+ write(0,*)i,j,k-1, q_con(i,j,k-1)
+ endif
peg(i,k) = peg(i,k-1) + dm(i,k-1)*(1.-q_con(i,j,k-1))
#endif
enddo
@@ -439,6 +442,9 @@ CONTAINS
do i=is1, ie1
dz2(i,k) = gz(i,j,k+1) - gz(i,j,k)
#ifdef USE_COND
+ if(i==is1 .and. j==(js-1) .and. k==1) then
+ write(0,*)i,j,k, peg(i,k),peg(i,k+1)
+ endif
pm2(i,k) = (peg(i,k+1)-peg(i,k))/log(peg(i,k+1)/peg(i,k))
#ifdef MOIST_CAPPA
I see that q_con(is-1,js-1,:) is not properly initialized (it still holds the real_big value), and the model then actually fails in the next loop, where a negative value of peg is passed to the log function.
backtrace:
96: 0 0 1 1.0000000E+08
96: 0 0 1 0.9990000 -6.0600000E+07
96: [hercules-01-40:434242:0:434242] Caught signal 8 (Floating point exception: floating-point invalid operation)
96: ==== backtrace (tid: 434242) ====
96: 0 0x0000000000054d90 __GI___sigaction() :0
96: 1 0x00000000002a08d6 __libm_logf_e7() logf_ct.c:0
96: 2 0x0000000002f131d8 nh_utils_mod_mp_riem_solver_c_.DIR.OMP.PARALLEL.LOOP.24066.split4069.split4093() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cube
d_sphere/model/nh_utils.F90:448
96: 3 0x000000000015ff13 __kmp_invoke_microtask() ???:0
96: 4 0x00000000000cf664 _INTERNALdc23a976::__kmp_serial_fork_call() /nfs/site/proj/openmp/promo/20230118/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtim
e.cpp:1927
96: 5 0x00000000000cf664 __kmp_fork_call() /nfs/site/proj/openmp/promo/20230118/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_runtime.cpp:2188
96: 6 0x0000000000088d23 __kmpc_fork_call() /nfs/site/proj/openmp/promo/20230118/tmp/lin_32e-rtl_int_5_nor_dyn.rel.c0.s0.t1..h1.w1-fxilab153/../../src/kmp_csupport.cpp:350
96: 7 0x0000000002e499e4 nh_utils_mod_mp_riem_solver_c_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/nh_utils.F90:404
96: 8 0x00000000024645d3 dyn_core_mod_mp_dyn_core_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/dyn_core.F90:628
96: 9 0x000000000256c864 fv_dynamics_mod_mp_fv_dynamics_() /work/noaa/fv3-cam/djovic/ufs/support_intelllvm/ufs-weather-model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90:683
. . .
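To make the mechanism concrete, here is a tiny standalone sketch using values close to the printed ones (dm is made up so the numbers roughly match; these are not the model's actual data): the real_big sentinel left in q_con drives peg negative, and log() of a negative ratio raises the invalid-operation exception in a trapping (debug) build.

program peg_log_demo
  implicit none
  real :: q_con_val, dm, peg_up, peg_dn, pm2

  q_con_val = 1.e8     ! real_big sentinel left in an uninitialized halo corner point
  dm        = 0.606    ! arbitrary positive layer "mass", chosen so numbers roughly match
  peg_up    = 0.999    ! pressure at the level above (positive)

  peg_dn = peg_up + dm*(1. - q_con_val)            ! ~ -6.06e7, i.e. negative
  pm2    = (peg_dn - peg_up)/log(peg_dn/peg_up)    ! log of a negative ratio -> invalid/NaN

  print *, peg_up, peg_dn, pm2
end program peg_log_demo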
Isn't it better to initialize all arrays to NaN? That way the model will fail as soon as a NaN value is used, instead of initializing them to real_big and then having the model fail later in the code where real_big is used.
@DusanJovic-NOAA I think you are right; it is better to initialize to NaN so errors can be caught immediately. Thank you for your suggestion.
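(As an aside: a signaling NaN of the default real kind can also be obtained portably from the standard ieee_arithmetic module instead of hard-coding a bit pattern. A minimal sketch with illustrative bounds; the patch below builds the bit pattern from a BOZ constant instead.)

program snan_init_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_signaling_nan
  implicit none
  real, allocatable :: q_con(:,:,:)
  real :: real_snan

  real_snan = ieee_value(0.0, ieee_signaling_nan)   ! signaling NaN of default real kind
  allocate(q_con(0:97, 0:97, 64))                   ! illustrative bounds only
  q_con = real_snan   ! any later arithmetic use traps when FPE trapping is enabled
end program snan_init_demo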
OK. Initializing q_con to NaN:
diff --git a/model/fv_arrays.F90 b/model/fv_arrays.F90
index e6e65a6..c239083 100644
--- a/model/fv_arrays.F90
+++ b/model/fv_arrays.F90
@@ -33,6 +33,12 @@ module fv_arrays_mod
public
+#ifdef OVERLOAD_R4
+ real, parameter:: real_snan=real(Z'FFBFFFFF')
+#else
+ real, parameter:: real_snan=real(Z'FFF7FFFFFFFFFFFF')
+#endif
+
integer, public, parameter :: R_GRID = r8_kind
!Several 'auxiliary' structures are introduced here. These are for
@@ -1566,6 +1572,10 @@ contains
Atm%va(i,j,k) = real_big
Atm%pt(i,j,k) = real_big
Atm%delp(i,j,k) = real_big
+#ifdef USE_COND
+ ! Atm%q_con(i,j,k) = real_big
+ Atm%q_con(i,j,k) = real_snan
+#endif
enddo
enddo
do j=jsd, jed+1
@@ -1616,6 +1626,10 @@ contains
Atm%ts(i,j) = 300.
Atm%phis(i,j) = real_big
+#ifndef USE_COND
+ ! Atm%q_con(i,j,1) = real_big
+ Atm%q_con(i,j,1) = real_snan
+#endif
enddo
enddo
and then printing all values of q_con(is1:ie1,js1:je1,1) in nh_utils before it is used:
diff --git a/model/nh_utils.F90 b/model/nh_utils.F90
index 2b89aea..2502e94 100644
--- a/model/nh_utils.F90
+++ b/model/nh_utils.F90
@@ -400,6 +400,12 @@ CONTAINS
enddo
endif
+ do j=js-1, je+1
+ do i=is-1, ie+1
+ write(0,*)i,j,q_con(i,j,1)
+ end do
+ end do
+ call flush(0)
!$OMP parallel do default(none) shared(js,je,is1,ie1,km,delp,pef,ptop,gz,rgrav,w3,pt, &
#ifdef MULTI_GASES
@@ -430,6 +436,9 @@ CONTAINS
pem(i,k) = pem(i,k-1) + dm(i,k-1)
#ifdef USE_COND
! Excluding contribution from condensates:
+ ! if(i==is1 .and. j==(js-1) .and. k==2) then
+ ! write(0,*)i,j,k-1, q_con(i,j,k-1)
+ ! endif
peg(i,k) = peg(i,k-1) + dm(i,k-1)*(1.-q_con(i,j,k-1))
#endif
enddo
@@ -439,6 +448,9 @@ CONTAINS
do i=is1, ie1
dz2(i,k) = gz(i,j,k+1) - gz(i,j,k)
#ifdef USE_COND
+ ! if(i==is1 .and. j==(js-1) .and. k==1) then
+ ! write(0,*)i,j,k, peg(i,k),peg(i,k+1)
+ ! endif
pm2(i,k) = (peg(i,k+1)-peg(i,k))/log(peg(i,k+1)/peg(i,k))
#ifdef MOIST_CAPPA
I see:
$ grep 'NaN' err | sort -n
0: 0 0 NaN
2: 97 0 NaN
21: 0 97 NaN
23: 97 97 NaN
24: 0 0 NaN
26: 97 0 NaN
45: 0 97 NaN
47: 97 97 NaN
48: 0 0 NaN
50: 97 0 NaN
69: 0 97 NaN
71: 97 97 NaN
72: 0 0 NaN
74: 97 0 NaN
93: 0 97 NaN
95: 97 97 NaN
96: 0 0 NaN
98: 97 0 NaN
117: 0 97 NaN
119: 97 97 NaN
120: 0 0 NaN
122: 97 0 NaN
141: 0 97 NaN
143: 97 97 NaN
There are exactly 24 NaN values, 4 per tile, located at the four corners of each tile. For example:
0: 0 0 NaN
2: 97 0 NaN
21: 0 97 NaN
23: 97 97 NaN
are the lower-left, lower-right, upper-left and upper-right tile corners. The layout is (3,8), i.e. 24 PEs per tile, and ranks 0, 2, 21 and 23 are the corner PEs of tile 1. Shouldn't these corner (halo) points be filled from the neighboring tiles?
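A quick sanity check of the corner-rank arithmetic, assuming ranks are numbered row-major within a tile for a (layout_x, layout_y) = (3, 8) decomposition (the numbering convention is an assumption made here for illustration):

program corner_ranks
  implicit none
  integer, parameter :: layout_x = 3, layout_y = 8
  print '(a,4i4)', ' corner ranks of tile 1:', &
        0, layout_x - 1, layout_x*(layout_y - 1), layout_x*layout_y - 1   ! 0  2  21  23
end program corner_ranks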
Now I think I understand what is happening. On the cubed sphere the values in the outermost corners are meaningless, since that is where three faces come to a point. So the column-wise calculations on these corner cells have no effect, but since they are filled with NaN an FPE is raised. To satisfy the compiler, we could initialize the corner values to 0 while the others are set to NaN; or alternatively, your original idea to just initialize q_con everywhere to 0 would also work. Good catch!
Thanks, Lucas
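For reference, a minimal sketch of the first option (NaN everywhere, zeros only in the outermost halo corner blocks). The bounds, halo width, and variable names are illustrative; the fix that was actually committed may differ.

program corner_zero_demo
  use, intrinsic :: ieee_arithmetic, only: ieee_value, ieee_signaling_nan
  implicit none
  integer, parameter :: is = 1, ie = 96, js = 1, je = 96, ng = 3, npz = 4
  integer, parameter :: isd = is - ng, ied = ie + ng, jsd = js - ng, jed = je + ng
  real :: q_con(isd:ied, jsd:jed, npz)
  integer :: i, j, k

  ! Fill everything with a signaling NaN so genuine use of uninitialized data is caught ...
  q_con = ieee_value(0.0, ieee_signaling_nan)

  ! ... but zero the outermost halo corner blocks, whose values are meaningless on the
  ! cubed sphere (three faces meet there) yet still flow through the column arithmetic.
  do k = 1, npz
     do j = jsd, jed
        do i = isd, ied
           if ((i < is .or. i > ie) .and. (j < js .or. j > je)) q_con(i,j,k) = 0.0
        end do
     end do
  end do
end program corner_zero_demo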
Thank you, Lucas. Please see 1a8820f
I am testing this code on my laptop with Intel 2024.2.0 and I'm getting these compile errors. In fv_mapz.F90:
[ 81%] Building Fortran object FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/model/fv_mapz.F90.o
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_mapz.F90(683): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name. [KMP]
!$OMP parallel default(none) shared(is,ie,js,je,km,kmp,ptop,u,v,pe,ua,va,isd,ied,jsd,jed,kord_mt, &
---------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_mapz.F90(690): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name. [FAST_MP_CONSV]
!$OMP fast_mp_consv,kord_tm, pe4,npx,npy, ccn_cm3, &
------------------------------------^
compilation aborted for /home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_mapz.F90 (code 1)
make[2]: *** [FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/build.make:205: FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/model/fv_mapz.F90.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:492: FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
and in fv_dynamics.F90:
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(434): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[DP1]
!$OMP parallel do default(none) shared(is,ie,js,je,isd,ied,jsd,jed,npz,dp1,zvir,nwat,q,q_con,sphum,liq_wat, &
-----------------------------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(452): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[DP1]
!$OMP parallel do default(none) shared(is,ie,js,je,isd,ied,jsd,jed,npz,dp1,zvir,q,q_con,sphum,liq_wat, &
-----------------------------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(461): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[CAPPA]
!$OMP cappa,kappa,rdg,delp,pt,delz,nwat) &
---------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(593): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[DP1]
!$OMP parallel do default(none) shared(is,ie,js,je,npz,pt,dp1,pkz,q_con)
----------------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(621): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[DTDT_M]
!$OMP parallel do default(none) shared(is,ie,js,je,npz,dtdt_m)
-------------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(655): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[DP1]
!$OMP parallel do default(none) shared(isd,ied,jsd,jed,npz,dp1,delp)
-----------------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(846): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[DTDT_M]
!$OMP parallel do default(none) shared(is,ie,js,je,npz,dtdt_m,bdt)
-------------------------------------------------------^
/home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90(980): error #9184: A list item in an OpenMP* data sharing attribute clause must not be an associate name.
[TE_2D]
!$OMP parallel do default(none) shared(is,ie,js,je,te_2d,teq,dt2,ps2,ps,idiag)
---------------------------------------------------^
compilation aborted for /home/dusan/sufs/simple-ufs/src/model/FV3/atmos_cubed_sphere/model/fv_dynamics.F90 (code 1)
make[2]: *** [FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/build.make:166: FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/model/fv_dynamics.F90.o] Error 1
make[2]: *** Waiting for unfinished jobs....
make[1]: *** [CMakeFiles/Makefile2:492: FV3/atmos_cubed_sphere/CMakeFiles/fv3.dir/all] Error 2
make: *** [Makefile:136: all] Error 2
Previously I was testing this on a few of the RDHPCS systems (Hercules, Hera, Gaea) with slightly older compilers and did not see these errors. I do not know whether this is a regression in the latest compiler or an actual bug that has been fixed.
FYI @climbfuji
@DusanJovic-NOAA Can you check if the error is within the "if intel" CPP branch?
For background: Older Intel compilers up to 2021.1.0 had a bug that forced us to add all these #ifdef statements for Intel. GNU was following the Fortran standard for specifying or not specifying pointers in OMP directives, Intel didn't (the Intel compiler developers confirmed that). They fixed it in 2021.2.0, but kept their compilers backward compatible. We left the workaround for Intel in place, because we had (and still have ... wcoss2 ...) systems with compilers older than 2021.2.0.
If you are seeing the errors in the "if intel" branch of the CPP statements, then that means that Intel removed the backward compatibility and now strictly follows the standard (and what GNU does). Is that the case?
If I simply remove all variables that are 'associate names' from all OMP directives, the code compiles without errors (I have not run the code yet!). Effectively, the OMP directives for the GNU and non-GNU compilers become identical.
@DusanJovic-NOAA @climbfuji - I can't believe Intel would change anything in ifort as it's basically in maintenance mode at this point.
@bensonr The errors I mentioned above are with ifx, not ifort.
That would make sense to me! Time to add more logic to those ifdefs until WCOSS2 gets an updated software stack ;-)
I expected those errors with ifx and wanted to be sure that's what you were testing.
So, if I understand correctly, we need two different OMP directives: one without the variables that are associate names listed as shared, for GNU and IntelLLVM >= 2024.2.0, and another with those variables listed as shared, for Intel Classic and IntelLLVM < 2024.2.0. Is my understanding correct?
I think what you want is ifx or ifort. I think 2024.2.0 still has ifort, and that compiler will very likely behave like the older compiler versions. Either there is a CPP directive that Intel sets for us that lets us distinguish between the two, or we need to do that in cmake.
How about this:
if ( hydrostatic ) then
-#ifdef __GFORTRAN__
+#if defined(__GFORTRAN__) || __INTEL_LLVM_COMPILER >= 20240200
!$OMP parallel do default(none) shared(is,ie,js,je,isd,ied,jsd,jed,npz,zvir,nwat,q,q_con,sphum,liq_wat, &
#else
!$OMP parallel do default(none) shared(is,ie,js,je,isd,ied,jsd,jed,npz,dp1,zvir,nwat,q,q_con,sphum,liq_wat, &
Sorry for playing devil's advocate here. So you are saying the earlier versions of ifx (not ifort) behaved like ifort, or at least accepted the logic we put in for ifort?
Yes. I started seeing these compile errors only when I tried ifx 2024.2.0, which I have on my laptop. The Ursa test system (the replacement for Hera) has 2024.1.0, and I also did not see these errors on the other RDHPCS machines I tried: Hera (2021.5.1), Orion (2021.9.0), Hercules (2021.9.0), Gaea (2023.1.0).
Based on this I guess the behavior changed in ifx between 2024.1.0 and 2024.2.0. It would be nice to confirm this with Intel.
Do we need these associate names? Can't we declare local pointers and point them to the GFDL_interstitial arrays:
real, dimension(:,:,:), pointer :: cappa
real, dimension(:,:,:), pointer :: dp1
real, dimension(:,:,:), pointer :: dtdt_m
real, dimension(:,:), pointer :: te_2d
! ccpp_associate: associate( cappa => GFDL_interstitial%cappa, &
! dp1 => GFDL_interstitial%te0, &
! dtdt_m => GFDL_interstitial%dtdt, &
! last_step => GFDL_interstitial%last_step, &
! te_2d => GFDL_interstitial%te0_2d )
cappa => GFDL_interstitial%cappa
dp1 => GFDL_interstitial%te0
dtdt_m => GFDL_interstitial%dtdt
te_2d => GFDL_interstitial%te0_2d
and just use one set of OMP directives. Do 'associate names' behave differently from pointers?
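To make the question concrete, here is a minimal, self-contained sketch of the pointer form (work and dp1 are illustrative names, not the model's variables). The form rejected by ifx 2024.2.0 with error #9184 is shown in the comments; with an ordinary pointer the name is allowed in the data-sharing clause.

! Rejected (error #9184): an associate name in a data-sharing clause, e.g.
!   associate (dp1 => GFDL_interstitial%te0)
!   !$OMP parallel do default(none) shared(dp1)
!   ...
!   end associate
!
! Accepted: an ordinary pointer in the same clause.
program pointer_in_omp_clause
  implicit none
  real, target  :: work(8, 8, 4)
  real, pointer :: dp1(:,:,:)
  integer :: k

  work = 1.0
  dp1 => work                 ! the pointer plays the role the associate name used to

!$OMP parallel do default(none) shared(dp1) private(k)
  do k = 1, size(dp1, 3)
     dp1(:,:,k) = 2.0*dp1(:,:,k)
  end do
!$OMP end parallel do

  print *, 'max value:', maxval(work)   ! 2.0
end program pointer_in_omp_clause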
I think they both do the same thing. But whether it makes a difference in how ifort and ifx treat them in OMP directives, I don't know.
In this commit https://github.com/NOAA-GFDL/GFDL_atmos_cubed_sphere/pull/353/commits/b947391ed4d23ffa15f5c854e7d5c4849abefc40 I made changes to fv_mapz and fv_dynamics in an attempt to simplify the OMP directives and make them work with the GNU, Intel Classic and Intel LLVM compilers.
In fv_mapz: 1) remove the associate statement, remove the kmp variable (not used), and replace fast_mp_consv with GFDL_interstitial%fast_mp_consv; 2) remove the GFORTRAN ifdef block; the OMP directives previously used only for Intel now also work with GNU.
In fv_dynamics: 1) replace the associate statement with local pointer arrays that point to the corresponding arrays in the GFDL_interstitial DDT; 2) remove the GFORTRAN ifdef block; the OMP directives previously used only for Intel now also work with GNU; 3) replace last_step with GFDL_interstitial%last_step.
I ran a few tests that use threading on WCOSS2, and they passed; the full test suite passed on Hera (GNU and Intel).
@DusanJovic-NOAA - can you check the performance? Some of the things I found online seemed to indicate pointers might be slower than an associate construct. I don't understand why that would be, but...
These are some of the regression tests on Hera. Lines starting with '-' are the current develop branch, lines starting with '+' are this branch. The second number in square brackets is the wall-clock time:
$ git diff logs/RegressionTests_hera.log | grep cpld_2threads_p8_intel
-PASS -- TEST 'cpld_2threads_p8_intel' [06:18, 05:34](3629 MB)
+PASS -- TEST 'cpld_2threads_p8_intel' [08:11, 05:34](3636 MB)
$ git diff logs/RegressionTests_hera.log | grep cpld_control_p8_intel
-PASS -- TEST 'cpld_control_p8_intel' [06:58, 06:07](3338 MB)
+PASS -- TEST 'cpld_control_p8_intel' [09:12, 06:09](3359 MB)
$ git diff logs/RegressionTests_hera.log | grep hrrr_control_2threads_gnu
-PASS -- TEST 'hrrr_control_2threads_gnu' [05:24, 05:01](916 MB)
+PASS -- TEST 'hrrr_control_2threads_gnu' [06:39, 05:04](908 MB)
$ git diff logs/RegressionTests_hera.log | grep rap_2threads_dyn32_phy32_gnu
-PASS -- TEST 'rap_2threads_dyn32_phy32_gnu' [09:05, 08:36](756 MB)
+PASS -- TEST 'rap_2threads_dyn32_phy32_gnu' [10:34, 08:35](750 MB)
Looks good and thanks for confirming.
All tests are done at https://github.com/ufs-community/ufs-weather-model/pull/2224. @laurenchilutti @bensonr @lharris4 @XiaqiongZhou-NOAA this PR can be merged.
Description
This PR adds support for the Intel LLVM compiler.
Changes:
- cmake/compiler_flags_IntelLLVM_Fortran.cmake and cmake/fv3_compiler_flags.cmake to define Intel LLVM compiler flags
- driver/fvGFS/atmosphere.F90 to read the namelist from a file instead of from a character variable, to avoid an internal compiler error

Fixes # (issue)
How Has This Been Tested?
The ufs-weather-model regression tests have been run on Hera. See https://github.com/ufs-community/ufs-weather-model/pull/2224
Checklist:
Please check all whether they apply or not