Closed MinsukJi-NOAA closed 3 years ago
Dusan pointed out a suspicious line in BNDLYR.f:
Why is the if test at line 277 in BNDLYR.f checking the values of T,Q,UH,VH when those arrays are not used in this loop? When I comment out this if test (and the else block), the model finishes successfully.
! IF(T(I,J,LBND)<spval.and.Q(I,J,LBND)<spval.and.&
!    UH(I,J,LBND)<spval.and.VH(I,J,LBND)<spval) THEN
@MinsukJi-NOAA I made some changes in my branch post_bndlyr at https://github.com/WenMeng-NOAA/EMC_post. With them, the rt test control_debug in ufs-weather-model runs to completion on WCOSS. A new upp lib (built in non-debug mode) is at /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install, please test it. Thanks!
I verified that the control_debug rt test passes. I will do more tests (e.g. regional as suggested by @junwang-noaa) and post findings.
Minsuk, do you still have the run directory for the control_debug rt test? I can take a look at the results. Thanks
source dir: /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2
run dir: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_136089/control_debug
gnu compilation fails with the new upp library:
[ 98%] Building Fortran object FV3/CMakeFiles/fv3atm.dir/io/module_write_nemsio.F90.o
cd /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3 && /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich/3.3.2/bin/mpif90 -DDEBUG -DENABLE_QUAD_PRECISION -DESMF_VERSION_MAJOR=8 -DGFS_PHYS -DINTERNAL_FILE_NML -DOVERLOAD_R4 -DOVERLOAD_R8 -Duse_WRTCOMP -Duse_libMPI -Duse_netCDF -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3/ccpp/physics -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3/mod -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3/ccpp/mod -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/FV3/ccpp/framework/src -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/stochastic_physics/mod -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/fms/2020.04.03/include_r4 -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/netcdf/4.7.4/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/esmf/8_1_1-debug/mod -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/esmf/8_1_1-debug/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/nemsio/2.5.2/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/w3emc/2.7.3/include_d -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/bacio/2.4.1/include_4 -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/w3nco/2.4.1/include_d -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/sp/2.3.3/include_d -I/scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/crtm/2.3.0/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/g2/3.4.2/include_4 -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/png/1.6.35/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/zlib/1.2.11/include 
-I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/g2tmpl/1.10.0/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/ip/3.3.3/include_4 -ggdb -fbacktrace -cpp -fcray-pointer -ffree-line-length-none -fno-range-check -g -O0 -fno-unsafe-math-optimizations -frounding-math -fsignaling-nans -ffpe-trap=invalid,zero,overflow -fbounds-check -Jmod -fopenmp -c /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/FV3/io/module_write_nemsio.F90 -o CMakeFiles/fv3atm.dir/io/module_write_nemsio.F90.o
f951: Fatal Error: Reading module 'vrbls4d' at line 1 column 2: Unexpected EOF
compilation terminated.
make[2]: *** [FV3/CMakeFiles/fv3atm.dir/io/post_nems_routines.F90.o] Error 1
The current develop does not have this problem because the CI test control_debug uses gnu with WRITE_DOPOST turned on.
regional_quilt does not run with write_dopost turned on in debug mode:
current upp: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_252799/
Wen's new upp: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_173460/
regional_quilt does run with write_dopost turned off in debug mode:
/scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_307868/
source: /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2
It looks like another issue in ALLOCATE_ALL.f. I will look into it.
Sorry, I forgot to mention that I am running regional_quilt in debug mode @WenMeng-NOAA
@MinsukJi-NOAA How do I run regional_quilt in debug mode? I don't see the config option in rt.conf.
@WenMeng-NOAA Here are the steps:
Modify rt.conf
diff --git a/tests/rt.conf b/tests/rt.conf
index 8f3597e..132efd1 100644
--- a/tests/rt.conf
+++ b/tests/rt.conf
@@ -134,6 +134,7 @@ RUN | control_thompson_extdiag_debug
COMPILE | -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn,FV3_GSD_v0,FV3_RRFS_v1beta,FV3_RRFS_v1alpha -D32BIT=ON -DDEBUG=ON | | fv3 |
RUN | regional_control_debug | | fv3 |
+RUN | regional_quilt_debug | | fv3 |
RUN | fv3_gsd_debug | | fv3 |
RUN | fv3_gsd_diag3d_debug | | fv3 |
RUN | fv3_rrfs_v1beta_debug | | fv3 |
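The rt.conf change above amounts to adding one RUN line under the debug COMPILE entry. As a minimal sketch only, assuming rt.conf entries are pipe-separated as they appear in the diff, the same edit could be scripted as follows. The helper name `add_run_line` is hypothetical and is not part of the ufs-weather-model test tools.

```python
def add_run_line(conf_text, after_test, new_test):
    """Return conf_text with a RUN line for new_test added after after_test's RUN line.

    Hypothetical helper for illustration; it assumes pipe-separated rt.conf
    entries of the form 'RUN | <test name> | ... |'.
    """
    out = []
    inserted = False
    for line in conf_text.splitlines():
        out.append(line)
        fields = [f.strip() for f in line.split("|")]
        is_anchor = len(fields) >= 2 and fields[0] == "RUN" and fields[1] == after_test
        if is_anchor and not inserted:
            # Copy the anchor line so the new entry keeps the same trailing columns.
            out.append(line.replace(after_test, new_test, 1))
            inserted = True
    if not inserted:
        raise ValueError("no RUN line found for " + after_test)
    return "\n".join(out) + "\n"


# Sample text shaped like the entries in the diff above.
sample = (
    "COMPILE | -DAPP=ATM -D32BIT=ON -DDEBUG=ON | | fv3 |\n"
    "RUN | regional_control_debug | | fv3 |\n"
    "RUN | fv3_gsd_debug | | fv3 |\n"
)
updated = add_run_line(sample, "regional_control_debug", "regional_quilt_debug")
print(updated)
```

Placing the new line directly after an existing debug RUN entry keeps it under the COMPILE line that sets -DDEBUG=ON, which is what the manual diff does.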
I ran the regional_quilt_debug test on WCOSS3 and reproduced @MinsukJi-NOAA's failure:
[0] in fcst run phase 2, na= 0
[60] forrtl: error (65): floating invalid
[60] Image PC Routine Line Source
[60] fv3.exe 000000000933522E Unknown Unknown Unknown
[60] libpthread-2.17.s 00002B8B07AA8630 Unknown Unknown Unknown
[60] fv3.exe 0000000008BC6F73 allocate_all_ 809 ALLOCATE_ALL.f
[60] libiomp5.so 00002B8B05EC17A3 __kmp_invoke_micr Unknown Unknown
[60] libiomp5.so 00002B8B05E8F9C7 __kmp_fork_call Unknown Unknown
[60] libiomp5.so 00002B8B05E58ABC __kmpc_fork_call Unknown Unknown
[60] fv3.exe 0000000008B7FBC5 allocate_all_ 798 ALLOCATE_ALL.f
[60] fv3.exe 0000000003445BA8 post_alctvars_ 122 post_nems_routines.F90
[60] fv3.exe 00000000030CE84A post_regional_mp_ 122 post_regional.F90
[60] fv3.exe 0000000002FC3F3A inline_post_mp_in 49 inline_post.F90
[60] fv3.exe 0000000002F61D2E module_wrt_grid_c 1489 module_wrt_grid_comp.F90
[60] fv3.exe 00000000012B0DE4 _ZN5ESMCI6FTable1 2036 ESMCI_FTable.C
[60] fv3.exe 00000000012ACADB ESMCI_FTableCallE 765 ESMCI_FTable.C
Lines 798 and 809 in ALLOCATE_ALL.f can be found at: https://github.com/NOAA-EMC/EMC_post/blob/develop/sorc/ncep_post.fd/ALLOCATE_ALL.f#L798
I don't see any obvious error in ALLOCATE_ALL.f. @junwang-noaa and @DusanJovic-NOAA Do you have any suggestions on which routines I should look into? ALLOCATE_ALL should be the first routine from the UPP code called in inline post. Thanks!
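When reading a traceback like the one above, only the frames that carry a source file and line number are actionable. The following is a rough sketch, assuming each frame has the six whitespace-separated columns shown in the traceback (rank tag, image, PC, routine, line, source). The function name `source_frames` is made up for illustration.

```python
def source_frames(traceback_text):
    """Return (routine, line, source) for frames that report a numeric line number."""
    frames = []
    for line in traceback_text.splitlines():
        parts = line.split()
        # Expected columns: [rank] image pc routine line source.
        # Frames without debug info report "Unknown" and are skipped.
        if len(parts) != 6 or not parts[4].isdigit():
            continue
        frames.append((parts[3], int(parts[4]), parts[5]))
    return frames


# A few lines copied from the traceback above.
trace = """\
[60] forrtl: error (65): floating invalid
[60] Image PC Routine Line Source
[60] fv3.exe 0000000008BC6F73 allocate_all_ 809 ALLOCATE_ALL.f
[60] libiomp5.so 00002B8B05EC17A3 __kmp_invoke_micr Unknown Unknown
[60] fv3.exe 0000000008B7FBC5 allocate_all_ 798 ALLOCATE_ALL.f
[60] fv3.exe 0000000003445BA8 post_alctvars_ 122 post_nems_routines.F90
"""
for routine, lineno, source in source_frames(trace):
    print(source, lineno, routine)
```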
It turns out there may be more floating-point violations (e.g. overflow, underflow) in debug mode. I will work on further changes.
@MinsukJi-NOAA I built a upp lib at /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install on Hera. Can you test it with the rt test control_debug? This upp lib is expected to work for control_debug only. More effort may be needed to make regional_quilt_debug work. We will commit the current changes to the develop branch and work on regional inline post in debug mode later.
@WenMeng-NOAA, regional_quilt_debug was able to compile and run to completion, but control_debug failed to compile.
source directory: /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests
log files:
/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests/log_hera.intel_control_debug
/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests/log_hera.intel_regional_quilt_debug
@MinsukJi-NOAA It is weird that your test results are the opposite of mine. I checked out the latest ufs-weather-model and ran the following tests with my upp lib from /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install:
1) control: PASS
log: /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model/tests/log_hera.intel_control
2) control_debug: PASS (post grib2 files generated)
log: /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model/tests/log_hera.intel_control_debug
3) regional_quilt: post grib2 files generated (need a new baseline)
log: /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model/tests/log_hera.intel_regional_quilt
4) regional_quilt_debug: failed at inline post
The ufs-weather-model source is at /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model
It seems to me that you changed rt.sh in your test:
[Wen.Meng@hfe02 tests]$ pwd
/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests
[Wen.Meng@hfe02 tests]$ git diff rt.sh
diff --git a/tests/rt.sh b/tests/rt.sh
index 8283071..6db8387 100755
--- a/tests/rt.sh
+++ b/tests/rt.sh
@@ -53,7 +53,7 @@ rt_single() {
break
fi
fi
@WenMeng-NOAA That was from an unrelated test I did today. I will change that back.
The fix is applied in SLP_new.f. It changes the MSLET and 1000 mb HGT results in RRFS datasets when the RRFS output domain is larger than the computational domain. The difference in MSLET shows:
The sigma-level temperature definition turned out to cause the failure to write a grib2 message in the RRFS PRSLEV dataset when the regional inline post ran in debug mode. The RRFS team confirmed that this sigma-level temperature was inherited from the NAM products and can be removed from the RRFS products. After the RRFS control files were updated, the regional_quilt_debug test runs to completion.
The upp lib was rebuilt with the latest version of Wen's branch post_bndlyr at /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install on Hera. The following tests were completed:
control : logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/control_log_hera.intel working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_296154/
control_debug: logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/control_debug_log_hera.intel working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_166948
regional_quilt: logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/regional_quilt_log_hera.intel working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_271380
regional_quilt_debug: logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/regional_quilt_debug_log_hera.intel working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_217740
The ufs-weather-model source code of my tests is at /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model
@junwang-noaa and @MinsukJi-NOAA Can you take a look at my tests and let me know your comments? Thanks!
Wen, thank you very much for looking into this issue. I took a quick look. The post results in the global tests (control and control_debug) look good to me, but in the regional_quilt test, I see:
1024:34525988:vt=2018101501:30-0 mb above ground:1 hour fcst:MCONV Horizontal Moisture Convergence [kg/kg/s]: ndata=53261:undef=6177:mean=4.81635e+18:min=0:max=9.99e+20
Are these values reasonable? All other fields look OK to me.
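A check like the one above can be scripted against the text that `wgrib2 -stats` prints. The following is a minimal sketch, assuming the `key=value` layout shown in the MCONV line. The 1e20 threshold is an assumption chosen only because the reported max was 9.99e+20; it is not taken from the UPP or wgrib2 sources, and the function names are hypothetical.

```python
import re

# Matches the statistics fields in a stats line; \b keeps "cos_wt_mean" from matching "mean".
STAT_RE = re.compile(r"\b(ndata|undef|mean|min|max)=([-+0-9.eE]+)")


def parse_stats(line):
    """Extract ndata, undef, mean, min and max from one stats line as floats."""
    return {key: float(value) for key, value in STAT_RE.findall(line)}


def looks_contaminated(stats, threshold=1e20):
    """Flag a field whose min or max is as large as an undefined-value marker.

    The threshold is an illustrative assumption, not an official constant.
    """
    return abs(stats["max"]) >= threshold or abs(stats["min"]) >= threshold


suspect = parse_stats(
    "1024:34525988:vt=2018101501:30-0 mb above ground:1 hour fcst:"
    "MCONV Horizontal Moisture Convergence [kg/kg/s]: "
    "ndata=53261:undef=6177:mean=4.81635e+18:min=0:max=9.99e+20"
)
print(looks_contaminated(suspect))
```

A mean many orders of magnitude above typical moisture-convergence values suggests that undefined points were averaged in as data, which is what this kind of check is meant to surface.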
@junwang-noaa Thanks for catching this. I just fixed it. MCONV should now have values like:
[Wen.Meng@hfe12 regional_quilt_debug]$ wgrib2 NATLEV.GrbF01 -match MCONV -stats
1424:41039794:ndata=53261:undef=7324:mean=8.9183e-09:min=-2.51e-06:max=3.96e-06:cos_wt_mean=1.10947e-08
I reran the tests and updated my test locations above. Please let me know your comments. Thanks!
Great, thanks for fixing it!
Please see the related issue https://github.com/ufs-community/ufs-weather-model/issues/686