NOAA-EMC / UPP

Runtime error when running a regression test in debug mode #347

Closed MinsukJi-NOAA closed 3 years ago

MinsukJi-NOAA commented 3 years ago

Please see the related issue https://github.com/ufs-community/ufs-weather-model/issues/686

WenMeng-NOAA commented 3 years ago

Dusan pointed out a suspicious line in BNDLYR.f:

Why is the if test at line 277 in BNDLYR.f checking the values of T,Q,UH,VH when those arrays are not used in this loop? When I comment out this if test (and else block) model finishes successfully.

    ! IF(T(I,J,LBND)<spval.and.Q(I,J,LBND)<spval.and.&
    ! UH(I,J,LBND)<spval.and.VH(I,J,LBND)<spval) THEN
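To make the concern concrete, here is a minimal Python sketch (not the UPP Fortran) of the guard pattern in question: a point is computed only when the values the computation actually reads are below spval, and is flagged with spval otherwise. The function name, the SPVAL value, and the averaging are assumptions for illustration only.

```python
SPVAL = 9.99e20  # assumed missing-value sentinel; placeholder for illustration


def guarded_mean(layers, spval=SPVAL):
    """Average the given layer values, or return spval if any input is missing.

    `layers` holds only the values this computation actually reads, so the
    guard never depends on unrelated arrays.
    """
    if all(value < spval for value in layers):
        return sum(layers) / len(layers)
    return spval


# A point with valid inputs is averaged; a point with a missing input is flagged.
valid = guarded_mean([280.0, 282.0, 284.0])
missing = guarded_mean([280.0, SPVAL, 284.0])
```

The point of the sketch is that the guard and the computation look at the same inputs; a guard on arrays the loop never uses would not protect the computation.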

WenMeng-NOAA commented 3 years ago

@MinsukJi-NOAA I made some changes in my branch post_bndlyr at https://github.com/WenMeng-NOAA/EMC_post. With them, the rt test control_debug in ufs-weather-model runs through on WCOSS. A new upp lib was built (not in debug mode) at /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install, please test it. Thanks!

MinsukJi-NOAA commented 3 years ago

I verified that the control_debug rt test passes. I will do more tests (e.g. regional as suggested by @junwang-noaa) and post findings.

junwang-noaa commented 3 years ago

Minsuk, do you still have the run directory for the control_debug rt test? I can take a look at the results. Thanks

MinsukJi-NOAA commented 3 years ago

source dir: /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2
run dir: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_136089/control_debug

MinsukJi-NOAA commented 3 years ago

gnu compilation fails with the new upp library:

[ 98%] Building Fortran object FV3/CMakeFiles/fv3atm.dir/io/module_write_nemsio.F90.o
cd /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3 && /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich/3.3.2/bin/mpif90 -DDEBUG -DENABLE_QUAD_PRECISION -DESMF_VERSION_MAJOR=8 -DGFS_PHYS -DINTERNAL_FILE_NML -DOVERLOAD_R4 -DOVERLOAD_R8 -Duse_WRTCOMP -Duse_libMPI -Duse_netCDF -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3/ccpp/physics -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3/mod -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/FV3/ccpp/mod -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/FV3/ccpp/framework/src -I/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/build_fv3_001/stochastic_physics/mod -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/fms/2020.04.03/include_r4 -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/netcdf/4.7.4/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/esmf/8_1_1-debug/mod -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/esmf/8_1_1-debug/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/nemsio/2.5.2/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/mpich-3.3.2/w3emc/2.7.3/include_d -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/bacio/2.4.1/include_4 -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/w3nco/2.4.1/include_d -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/sp/2.3.3/include_d -I/scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/crtm/2.3.0/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/g2/3.4.2/include_4 -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/png/1.6.35/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/zlib/1.2.11/include 
-I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/g2tmpl/1.10.0/include -I/scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/gnu-9.2.0/ip/3.3.3/include_4 -ggdb -fbacktrace -cpp -fcray-pointer -ffree-line-length-none -fno-range-check -g -O0 -fno-unsafe-math-optimizations -frounding-math -fsignaling-nans -ffpe-trap=invalid,zero,overflow -fbounds-check -Jmod -fopenmp -c /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/FV3/io/module_write_nemsio.F90 -o CMakeFiles/fv3atm.dir/io/module_write_nemsio.F90.o
f951: Fatal Error: Reading module 'vrbls4d' at line 1 column 2: Unexpected EOF
compilation terminated.
make[2]: *** [FV3/CMakeFiles/fv3atm.dir/io/post_nems_routines.F90.o] Error 1

The current develop does not have this problem because the CI test control_debug uses gnu with WRITE_DOPOST turned on.
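One possible cause of an "Unexpected EOF" when reading a module, not confirmed in this thread, is that the .mod files in the install directory were written by a different compiler than the one reading them. gfortran has written its .mod files gzip-compressed since GCC 4.9, so a quick check is whether a file starts with the gzip magic bytes. A minimal Python sketch follows; the function name is hypothetical, and the check is a heuristic, not a full validation of the module file.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of any gzip stream


def looks_like_gfortran_mod(path):
    """Return True if the file starts with the gzip magic bytes.

    Recent gfortran .mod files are gzip-compressed; a file that fails this
    check was probably not written by a recent gfortran.
    """
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

It could be pointed at the vrbls4d module file under the install/include directory used in the compile command above.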

MinsukJi-NOAA commented 3 years ago

regional_quilt does not run with write_dopost turned on in debug mode:
current upp: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_252799/
Wen's new upp: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_173460/

regional_quilt does run with write_dopost turned off in debug mode: /scratch1/NCEPDEV/stmp2/Minsuk.Ji/FV3_RT/rt_307868/

source: /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2

WenMeng-NOAA commented 3 years ago

It looks like another issue in ALLOCATE_ALL.f. I will look into it.

MinsukJi-NOAA commented 3 years ago

Sorry, I forgot to mention that I am running regional_quilt in debug mode @WenMeng-NOAA

WenMeng-NOAA commented 3 years ago

@MinsukJi-NOAA How do I run regional_quilt in debug mode? I don't see the config option in rt.conf.

MinsukJi-NOAA commented 3 years ago

@WenMeng-NOAA Here are the steps:

  1. Modify rt.conf

    diff --git a/tests/rt.conf b/tests/rt.conf
    index 8f3597e..132efd1 100644
    --- a/tests/rt.conf
    +++ b/tests/rt.conf
    @@ -134,6 +134,7 @@ RUN     | control_thompson_extdiag_debug
    
    COMPILE | -DAPP=ATM -DCCPP_SUITES=FV3_GFS_v15_thompson_mynn,FV3_GSD_v0,FV3_RRFS_v1beta,FV3_RRFS_v1alpha -D32BIT=ON -DDEBUG=ON     |                                         | fv3 |
    RUN     | regional_control_debug                                                                                                  |                                         | fv3 |
    +RUN     | regional_quilt_debug                                                                                                    |                                         | fv3 |
    RUN     | fv3_gsd_debug                                                                                                           |                                         | fv3 |
    RUN     | fv3_gsd_diag3d_debug                                                                                                    |                                         | fv3 |
    RUN     | fv3_rrfs_v1beta_debug                                                                                                   |                                         | fv3 |
  2. copy /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_2/tests/tests/regional_quilt_debug
  3. ./rt.sh -n regional_quilt_debug -k >out 2>&1 &
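
Step 1 above can also be scripted. The following is a minimal Python sketch that inserts a RUN line for a new test right after an existing one; it assumes the pipe-delimited rt.conf layout visible in the diff, and the function name is hypothetical.

```python
def add_run_entry(conf_text, after_test, new_test):
    """Insert a RUN line for new_test directly after the RUN line for after_test.

    Assumes the pipe-delimited layout shown in the diff above:
    "RUN | <test name> | <machines> | fv3 |".
    """
    out = []
    inserted = False
    for line in conf_text.splitlines():
        out.append(line)
        parts = line.split("|")
        if inserted or len(parts) < 2:
            continue
        if parts[0].strip() == "RUN" and parts[1].strip() == after_test:
            # Keep the column width of the anchor line for readability.
            width = len(parts[1])
            parts[1] = " " + new_test.ljust(width - 1)
            out.append("|".join(parts))
            inserted = True
    if not inserted:
        raise ValueError(f"no RUN line found for {after_test!r}")
    trailing = "\n" if conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing
```

Calling `add_run_entry(text, "regional_control_debug", "regional_quilt_debug")` on the contents of tests/rt.conf would reproduce the one-line addition in the diff, with the remaining columns copied from the anchor line.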

WenMeng-NOAA commented 3 years ago

I ran the regional_quilt_debug test on WCOSS3 and reproduced @MinsukJi-NOAA's failure:

[0]  in fcst run phase 2, na=           0
[60] forrtl: error (65): floating invalid
[60] Image              PC                Routine            Line        Source
[60] fv3.exe            000000000933522E  Unknown               Unknown  Unknown
[60] libpthread-2.17.s  00002B8B07AA8630  Unknown               Unknown  Unknown
[60] fv3.exe            0000000008BC6F73  allocate_all_             809  ALLOCATE_ALL.f
[60] libiomp5.so        00002B8B05EC17A3  __kmp_invoke_micr     Unknown  Unknown
[60] libiomp5.so        00002B8B05E8F9C7  __kmp_fork_call       Unknown  Unknown
[60] libiomp5.so        00002B8B05E58ABC  __kmpc_fork_call      Unknown  Unknown
[60] fv3.exe            0000000008B7FBC5  allocate_all_             798  ALLOCATE_ALL.f
[60] fv3.exe            0000000003445BA8  post_alctvars_            122  post_nems_routines.F90
[60] fv3.exe            00000000030CE84A  post_regional_mp_         122  post_regional.F90
[60] fv3.exe            0000000002FC3F3A  inline_post_mp_in          49  inline_post.F90
[60] fv3.exe            0000000002F61D2E  module_wrt_grid_c        1489  module_wrt_grid_comp.F90
[60] fv3.exe            00000000012B0DE4  _ZN5ESMCI6FTable1        2036  ESMCI_FTable.C
[60] fv3.exe            00000000012ACADB  ESMCI_FTableCallE         765  ESMCI_FTable.C

Lines 798 and 809 in ALLOCATE_ALL.f can be found at: https://github.com/NOAA-EMC/EMC_post/blob/develop/sorc/ncep_post.fd/ALLOCATE_ALL.f#L798

I don't see any obvious error in ALLOCATE_ALL.f. @junwang-noaa and @DusanJovic-NOAA Do you have any suggestions for which routines I should look into? ALLOCATE_ALL should be the first routine from the UPP code called in inline post. Thanks!
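For picking the relevant frames out of a traceback like the one above, here is a minimal Python sketch that keeps only frames with a known source line. The regular expression is an assumption based on the column layout shown in this comment, and the function name is hypothetical.

```python
import re

# Matches lines such as:
# [60] fv3.exe  0000000008BC6F73  allocate_all_  809  ALLOCATE_ALL.f
FRAME = re.compile(
    r"^\[(?P<rank>\d+)\]\s+(?P<image>\S+)\s+(?P<pc>[0-9A-Fa-f]{16})\s+"
    r"(?P<routine>\S+)\s+(?P<line>\d+|Unknown)\s+(?P<source>\S+)\s*$"
)


def frames_with_source(traceback_text):
    """Return (routine, line, source) for frames that carry a source line."""
    frames = []
    for raw in traceback_text.splitlines():
        match = FRAME.match(raw.strip())
        if match and match.group("line") != "Unknown":
            frames.append(
                (match.group("routine"), int(match.group("line")), match.group("source"))
            )
    return frames
```

Applied to the traceback above, the first frame returned is the innermost one with source information, i.e. line 809 of ALLOCATE_ALL.f.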

WenMeng-NOAA commented 3 years ago

It was found that there may be more computational violations (e.g. overflow, underflow) in debug mode. I will work on further changes.

WenMeng-NOAA commented 3 years ago

@MinsukJi-NOAA I built a upp lib at /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install on Hera. Can you test it with the rt test control_debug? This upp lib is expected to work for control_debug only. More work may be needed to get regional_quilt_debug running. We will commit the current changes to the develop branch and work on regional inline post in debug mode later.

MinsukJi-NOAA commented 3 years ago

@WenMeng-NOAA, regional_quilt_debug was able to compile and run to completion, but control_debug failed to compile.
source directory: /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests
log files:
/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests/log_hera.intel_control_debug
/scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests/log_hera.intel_regional_quilt_debug

WenMeng-NOAA commented 3 years ago

@MinsukJi-NOAA It is weird that your test results are the opposite of mine. I checked out the latest ufs-weather-model and ran the following tests with my upp lib from /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install:

  1. control: PASS
     log: /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model/tests/log_hera.intel_control
  2. control_debug: PASS (post grib2 files generated)
     log: /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model/tests/log_hera.intel_control_debug
  3. regional_quilt: post grib2 files generated (need a new baseline)
     log: /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model/tests/log_hera.intel_regional_quilt
  4. regional_quilt_debug: failed at inline post

The ufs-weather-model source is at /scratch2/NCEPDEV/ovp/Wen.Meng/ufs_0805/ufs-weather-model

It seems to me that you changed rt.sh in your test:

    [Wen.Meng@hfe02 tests]$ pwd
    /scratch2/NCEPDEV/stmp1/Minsuk.Ji/upp_debug_test_3/tests
    [Wen.Meng@hfe02 tests]$ git diff rt.sh
    diff --git a/tests/rt.sh b/tests/rt.sh
    index 8283071..6db8387 100755
    --- a/tests/rt.sh
    +++ b/tests/rt.sh
    @@ -53,7 +53,7 @@ rt_single() {
     break
     fi
     fi

MinsukJi-NOAA commented 3 years ago

@WenMeng-NOAA That was from an unrelated test I did today. I will change that back.

WenMeng-NOAA commented 3 years ago

A fix was applied in SLP_new.f. It changes the MSLET and 1000 mb HGT results in RRFS datasets when the RRFS output domain is larger than the computation domain. The MSLET difference is shown here:

[image: MSLET difference]

WenMeng-NOAA commented 3 years ago

The sigma-level temperature definition was found to cause a failure when writing the grib2 message in the RRFS PRSLEV dataset while the regional inline post ran in debug mode. The RRFS team confirmed that this sigma-level temperature was inherited from the NAM products and can be removed from the RRFS products. After the RRFS control files were updated, the regional_quilt_debug test is able to run through.

WenMeng-NOAA commented 3 years ago

The upp lib was rebuilt with the latest version of Wen's branch post_bndlyr at /scratch1/NCEPDEV/stmp2/Wen.Meng/upp_lib/EMC_post/tests/install on Hera. The following tests were completed:

  1. control:
     logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/control_log_hera.intel
     working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_296154/

  2. control_debug:
     logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/control_debug_log_hera.intel
     working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_166948

  3. regional_quilt:
     logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/regional_quilt_log_hera.intel
     working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_271380

  4. regional_quilt_debug:
     logs: /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model/tests/regional_quilt_debug_log_hera.intel
     working directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_217740

The ufs-weather-model source code of my tests is at /scratch1/NCEPDEV/stmp2/Wen.Meng/ufs/ufs-weather-model

@junwang-noaa and @MinsukJi-NOAA Can you take a look at my tests and let me know your comments? Thanks!

junwang-noaa commented 3 years ago

Wen, thank you very much for looking into this issue. I took a quick look; the post results in the global tests (control and control_debug) look good to me, but in the regional_quilt test, I see:

1024:34525988:vt=2018101501:30-0 mb above ground:1 hour fcst:MCONV Horizontal Moisture Convergence [kg/kg/s]: ndata=53261:undef=6177:mean=4.81635e+18:min=0:max=9.99e+20

Are these values reasonable? All other fields look OK to me.
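
A mean near 4.8e+18 together with a maximum of 9.99e+20 is what one would see if a few missing-value sentinels were counted as ordinary data. A minimal Python sketch of that effect follows; treating 9.99e20 as the sentinel is an assumption based on the maximum above, the sample values are made up, and the function name is hypothetical.

```python
SENTINEL = 9.99e20  # assumed missing-value marker, taken from the reported max


def field_stats(values, sentinel=SENTINEL):
    """Return (mean, min, max, n_missing), skipping values at or above sentinel."""
    valid = [v for v in values if v < sentinel]
    n_missing = len(values) - len(valid)
    if not valid:
        return None, None, None, n_missing
    return sum(valid) / len(valid), min(valid), max(valid), n_missing


# Made-up sample: three plausible values and one leaked sentinel.
sample = [1.0e-6, -2.0e-6, 4.0e-6, SENTINEL]
naive_mean = sum(sample) / len(sample)         # dominated by the sentinel
mean, lo, hi, n_missing = field_stats(sample)  # sentinel excluded
```

With the sentinel excluded, the statistics fall back to the magnitude of the real values, which is the kind of range one would check for in the corrected output.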

WenMeng-NOAA commented 3 years ago

@junwang-noaa Thanks for catching this. I just fixed it. MCONV should now have values like:

    [Wen.Meng@hfe12 regional_quilt_debug]$ wgrib2 NATLEV.GrbF01 -match MCONV -stats
    1424:41039794:ndata=53261:undef=7324:mean=8.9183e-09:min=-2.51e-06:max=3.96e-06:cos_wt_mean=1.10947e-08

I reran the tests and updated my test locations above. Please let me know your comments. Thanks!
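
A stats line like the one quoted in this comment can be checked automatically. Here is a minimal Python sketch that parses the key=value items and flags a field whose min or max exceeds a chosen magnitude; the function names and the limit are hypothetical, and the parsing assumes the colon-separated layout shown in this thread.

```python
def parse_stats(line):
    """Split a wgrib2 -stats style line into a dict of float-valued fields."""
    stats = {}
    for item in line.strip().split(":"):
        key, sep, value = item.strip().partition("=")
        if sep:
            try:
                stats[key] = float(value)
            except ValueError:
                pass  # keep only numeric key=value items
    return stats


def within_limit(stats, limit):
    """True if both min and max have magnitude below the given limit."""
    return abs(stats["min"]) < limit and abs(stats["max"]) < limit
```

With a generous limit such as 1.0, the corrected MCONV line passes, while the earlier line with max=9.99e+20 does not.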

junwang-noaa commented 3 years ago

Great, thanks for fixing it!
