NOAA-EMC / UPP

UPP bugfixes for inline post and g2tmpl 1.12.0 compatibility (and number concentration updates) #974

Closed SamuelTrahanNOAA closed 2 weeks ago

SamuelTrahanNOAA commented 3 weeks ago

Updates UPP to be compatible with the ufs-weather-model inline post. Also fixes a bug that produced thousands of error messages from g2tmpl.

This was originally a PR to add number concentration on pressure levels, but I had to fix those unrelated bugs to get this in the ufs-weather-model.

SamuelTrahanNOAA commented 3 weeks ago

The GCC Linux Build is failing due to an error in a spack stack script:

/home/runner/work/UPP/UPP/spack/lib/spack/env/gcc/gcc: 246: [[: not found

The script is a bash script, but claims to be sh. Hence, sh is rejecting the line with the optional feature [[. This script was probably developed on a Red Hat-like system, and never tested with the more limited /bin/sh found on most other UNIX variants.

You must either add this as the first line:

#! /bin/bash

or rewrite the script to use only POSIX sh without its optional features.
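
As a minimal sketch of the second option: a pattern test that bash would write with [[ can usually be expressed with `case`, which plain POSIX sh supports. The function name `is_gcc_like` and the patterns are hypothetical and are not taken from the spack wrapper.

```shell
#!/bin/sh
# Hypothetical helper: POSIX-only pattern matching, no [[ ]] required.
# The bash-only equivalent would be:  [[ "$1" == *gcc* || "$1" == *g++* ]]
is_gcc_like() {
    case "$1" in
        *gcc*|*g++*) return 0 ;;
        *)           return 1 ;;
    esac
}

if is_gcc_like "/usr/bin/gcc"; then
    echo "gcc-like compiler"
fi
```

Because `case` is part of POSIX sh, this form behaves the same whether /bin/sh is bash, dash, or another minimal shell.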

WenMeng-NOAA commented 3 weeks ago

@SamuelTrahanNOAA Can you provide model output in netcdf for my UPP standalone test?

SamuelTrahanNOAA commented 3 weeks ago

Can you provide model output in netcdf for my UPP standalone test?

For testing what, specifically?

WenMeng-NOAA commented 3 weeks ago

Can you provide model output in netcdf for my UPP standalone test?

For testing what, specifically?

Run your branch with hafsarcontrol file to generate grib2 file including these number concentrations on pressure levels.

SamuelTrahanNOAA commented 3 weeks ago

EDIT: I have fixed the problem described in this comment.

The problem that was fixed:

Presently, the UPP won't run my latest txt file because the xml_perl_data.f can't read the output of PostXMLPreprocessor.pl.

```
221: forrtl: severe (59): list-directed I/O syntax error, unit 22, file /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_2870606/gnv1_nested_intel/postxconfig-NT_FH00.txt
221: Image PC Routine Line Source
221: fv3.exe 00000000048C0BE8 Unknown Unknown Unknown
221: fv3.exe 00000000048FDA02 Unknown Unknown Unknown
221: fv3.exe 00000000048FC467 Unknown Unknown Unknown
221: fv3.exe 000000000401343F xml_perl_data_mp_ 290 xml_perl_data.f
221: fv3.exe 0000000003F0FAB9 read_xml_ 59 READ_xml.f
221: fv3.exe 000000000226FEB5 post_fv3_mp_post_ 162 post_fv3.F90
221: fv3.exe 000000000222E842 module_wrt_grid_c 2036 module_wrt_grid_comp.F90
```

It fails here:

```f90
read(22,*)paramset(i)%param(j)%scale_fact_1st_size
read(22,*)paramset(i)%param(j)%scale_val_1st_size
read(22,*)paramset(i)%param(j)%scale_fact_2nd_size
read(22,*)paramset(i)%param(j)%scale_val_2nd_size ! <----- fails here
read(22,*)paramset(i)%param(j)%typ_intvl_wvlen
call filter_char_inp(paramset(i)%param(j)%typ_intvl_wvlen)
```

SamuelTrahanNOAA commented 3 weeks ago

This may be due to problems in my regression test changes. I'll investigate and report back soon...

EDIT: It was, but I have a new problem!

SamuelTrahanNOAA commented 3 weeks ago

EDIT: I have fixed the problem described in this comment.

The problem that has been fixed:

My regression test was using an older postxconfig file. Now the inline post is failing due to missing deallocates:

- https://github.com/NOAA-EMC/UPP/issues/975

I've fixed that, and now I seek the next bug...

SamuelTrahanNOAA commented 3 weeks ago

@WenMeng-NOAA - Do you know why the inline post is making this complaint? I'm seeing a lot of them.

234:  get_g2_fixedsurfacetypes key:          255  not found in table 4.5

You can find the output here:

LOG: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3037467/gnv1_nested_intel.log
RUN: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3037467/gnv1_nested_intel

Many ranks give the same error many times, but it's always the same error.

SamuelTrahanNOAA commented 3 weeks ago

Code is here:

https://github.com/ufs-community/ufs-weather-model/pull/2326

The regression test is gnv1_nested, but you can only run on Hera. Other machines don't have g2tmpl 1.12.0 in the same Spack Stack as ufs-weather-model prerequisites.

WenMeng-NOAA commented 3 weeks ago

@SamuelTrahanNOAA Can you sync your branch with the UPP develop?

WenMeng-NOAA commented 3 weeks ago

The GCC Linux Build is failing due to an error in a spack stack script:

/home/runner/work/UPP/UPP/spack/lib/spack/env/gcc/gcc: 246: [[: not found

The script is a bash script, but claims to be sh. Hence, sh is rejecting the line with the optional feature [[. This script was probably developed on a Red Hat-like system, and never tested with the more limited /bin/sh found on most other UNIX variants.

You must either add this as the first line:

#! /bin/bash

or rewrite the script to use only POSIX sh without its optional features.

@AlexanderRichert-NOAA Could you add the fix per @SamuelTrahanNOAA 's suggestion? Thanks!

WenMeng-NOAA commented 3 weeks ago

@WenMeng-NOAA - Do you know why the inline post is making this complaint? I'm seeing a lot of them.

234:  get_g2_fixedsurfacetypes key:          255  not found in table 4.5

You can find the output here:

LOG: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3037467/gnv1_nested_intel.log
RUN: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3037467/gnv1_nested_intel

Many ranks give the same error many times, but it's always the same error.

@SamuelTrahanNOAA Is this test for global or regional domain?

WenMeng-NOAA commented 3 weeks ago

RUN: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_3037467/gnv1_nested_intel

@SamuelTrahanNOAA Can you update itag in the run directory as

&MODEL_INPUTS
 MODELNAME='FV3R'
/
&NAMPGB
KPO=47,PO=1000.,975.,950.,925.,900.,875.,850.,825.,800.,775.,750.,725.,700.,675.,650.,625.,600.,575.,550.,525.,500.,475.,450.,425.,400.,375.,350.,325.,300.,275.,250.,225.,200.,175.,150.,125.,100.,70.,50.,30.,20.,10.,7.,5.,3.,2.,1.,
/
climbfuji commented 3 weeks ago

@SamuelTrahanNOAA The bug fix for the spack wrapper was merged. Submodule pointer update for spack-stack will follow this morning. Thanks for figuring this out. I had tested on Ubuntu, macOS, Oracle Linux and CentOS (both Red Hat derivatives). What system is UPP CI running on?

SamuelTrahanNOAA commented 3 weeks ago

Is this test for global or regional domain?

"And," not "or."

There's a globe with a regional nest.

SamuelTrahanNOAA commented 3 weeks ago

Can you update itag in the run directory as

I think you're telling me the pressure levels in itag don't match the xml. That is because I started with the HAFS xml, but forgot to update the pressure levels. I'm going to make the opposite change to what you suggest: copy the GFS pressure levels into the hafs_ar xml.

EDIT: I meant to type "hafs_ar xml." This has been corrected.

SamuelTrahanNOAA commented 3 weeks ago

The itag and postxconfig files match, I've changed 'GFS' to 'FV3R', and removed rdaod=.true., but I still get the error:

237:  get_g2_fixedsurfacetypes key:          255  not found in table 4.5

The code and logs are in the same place as before.

LOG: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_916423/gnv1_nested_intel.log
RUN: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_916423/gnv1_nested_intel/

CODE: https://github.com/ufs-community/ufs-weather-model/pull/2326

The regression test is gnv1_nested

SamuelTrahanNOAA commented 3 weeks ago

@WenMeng-NOAA - Can you give me updated xml and txt files for the ufs-weather-model regression tests? It would let me determine if the error I'm seeing is specific to the gnv1_nested case.

SamuelTrahanNOAA commented 3 weeks ago

@SamuelTrahanNOAA The bug fix for the spack wrapper was merged. Submodule pointer update for spack-stack will follow this morning. Thanks for figuring this out. I had tested on Ubuntu, macOS, Oracle Linux and CentOS (both Red Hat derivatives). What system is UPP CI running on?

@DomHeinzeller - I'm running my tests on Hera.

SamuelTrahanNOAA commented 3 weeks ago

@WenMeng-NOAA - I have to agree with the error message. A key of 255 is invalid in template 4.5. Even the NCEP internal templates don't have it.

I see 255 nowhere on this page:

https://www.nco.ncep.noaa.gov/pmb/docs/grib2/grib2_doc/grib2_temp4-5.shtml

A key of 255 in table 4.5 is erroneous, as the error message says.

237: get_g2_fixedsurfacetypes key: 255 not found in table 4.5

WenMeng-NOAA commented 3 weeks ago

@WenMeng-NOAA - I have to agree with the error message. A key of 255 is invalid in template 4.5. Even the NCEP internal templates don't have it.

I see 255 nowhere on this page:

https://www.nco.ncep.noaa.gov/pmb/docs/grib2/grib2_doc/grib2_temp4-5.shtml

A key of 255 in table 4.5 is erroneous, as the error message says.

237: get_g2_fixedsurfacetypes key: 255 not found in table 4.5

It sounds like the surface type defined in your UPP control file 'postxconfig-NT.txt' is not found in table 4.5.

WenMeng-NOAA commented 3 weeks ago

@SamuelTrahanNOAA In /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_916423/gnv1_nested_intel, there are two sets of model outputs:

1) atmfnc and sfcfnc
2) atm.nest.fnc and sfc.nest.fnc

Which set are you using for the inline post?

SamuelTrahanNOAA commented 3 weeks ago

It sounds like the surface type defined in your UPP control file 'postxconfig-NT.txt' is not found in table 4.5.

All of the surface fields were copied from other models' post xml files. I don't know which one could be causing a problem. Can you examine the xml file and let me know?

It is parm/postcntrl_hafs_ar_nosat.xml in the UPP repository.

https://github.com/SamuelTrahanNOAA/UPP/blob/number-concentration/parm/postcntrl_hafs_ar_nosat.xml

EDIT: I updated the link to point to the branch's xml file instead of a specific hash.

SamuelTrahanNOAA commented 3 weeks ago

Which set are you using for the inline post?

Sorry, I misunderstood.

Both sets are sent, one at a time. It posts the global and the nest.

SamuelTrahanNOAA commented 3 weeks ago

The Intel Linux Build / setup (pull_request) github check failed. I don't see any errors; I think it hit its 15 minute wallclock limit while building Spack Stack. It stopped part way through jasper.

WenMeng-NOAA commented 3 weeks ago

@WenMeng-NOAA - Can you give me updated xml and txt files for the ufs-weather-model regression tests? It would let me determine if the error I'm seeing is specific to the gnv1_nested case.

@SamuelTrahanNOAA Please update the following files under ufs-weather-model/test/parm/ from my directory /home/Wen.Meng/stmp2/xml/xml_to_txt:

postxconfig-NT-fv3lam.txt
postxconfig-NT-gfs.txt
postxconfig-NT-gfs_FH00.txt
postxconfig-NT-hafs.txt

SamuelTrahanNOAA commented 3 weeks ago

@WenMeng-NOAA - Can you give me updated xml and txt files for the ufs-weather-model regression tests? It would let me determine if the error I'm seeing is specific to the gnv1_nested case.

@SamuelTrahanNOAA Please update the following files under ufs-weather-model/test/parm/ from my directory /home/Wen.Meng/stmp2/xml/xml_to_txt:

postxconfig-NT-fv3lam.txt
postxconfig-NT-gfs.txt
postxconfig-NT-gfs_FH00.txt
postxconfig-NT-hafs.txt

I've made those changes and I'm generating baselines for the regression test suite now. My ufs-weather-model PR is updated with your new postxconfig files.

The tests are running here:

HERA: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148

SamuelTrahanNOAA commented 3 weeks ago

I'm seeing this error in numerous log files. The table 4.5 problem is not limited to the global static nest (gnv1_nested) case.

148: get_g2_fixedsurfacetypes key: 255 not found in table 4.5

This time, I'm running it here:

HERA: /scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148

CODE: https://github.com/ufs-community/ufs-weather-model/pull/2326

SamuelTrahanNOAA commented 3 weeks ago

I see the error in these files so far:

/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/control_flake_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/control_iovr4_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/control_iovr5_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/control_lndp_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/control_stochy_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/cpld_bmark_p8_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/cpld_control_gfsv17_iau_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/cpld_control_sfs_intel.log

The regression tests are still running, so there may be more errors over the next few hours.

SamuelTrahanNOAA commented 3 weeks ago

The tests have progressed, and I see the Table 4.5 error in regional cases:

/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/regional_control_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/regional_spp_sppt_shum_skeb_intel.log
/scratch1/NCEPDEV/stmp2/Samuel.Trahan/FV3_RT/rt_1297148/regional_wofs_intel.log

Thus, it is not limited to global postxconfig files.

I checked the MPI rank list, and it happens only in the write component, so it must be coming from UPP.
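
A rough sketch of that kind of check, run against stand-in logs in a temporary directory rather than the real rt_1297148 output. The log contents, the rank numbers 148 and 149, and the file quiet_intel.log are invented for illustration; only the message text comes from the logs quoted above.

```shell
#!/bin/sh
# Stand-in logs with invented contents; each line carries an "<MPI rank>:" prefix.
tmpdir=$(mktemp -d)
msg='get_g2_fixedsurfacetypes key: 255 not found in table 4.5'
printf '%s\n' "148: $msg" "149: $msg" "148: $msg" "  3: some other output" \
    > "$tmpdir/control_flake_intel.log"
printf '%s\n' "  0: no warnings in this run" > "$tmpdir/quiet_intel.log"

# Names of the logs that contain the message (-l prints file names only).
affected=$(grep -l 'get_g2_fixedsurfacetypes key' "$tmpdir"/*.log)

# Number of matching lines in one log (-c counts lines).
count=$(grep -c 'get_g2_fixedsurfacetypes key' "$tmpdir/control_flake_intel.log")

# Distinct rank prefixes that emitted the message.
ranks=$(cat "$tmpdir"/*.log | grep 'get_g2_fixedsurfacetypes key' | cut -d: -f1 | sort -un)

echo "affected: $(basename "$affected")"
echo "count: $count"
echo "ranks:" $ranks
rm -rf "$tmpdir"
```

Comparing the resulting rank list against the ranks assigned to the write component is then a manual step; how those ranks are laid out depends on the model configuration.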

SamuelTrahanNOAA commented 3 weeks ago

@WenMeng-NOAA - I don't see how the post or its input files could generate this error directly. They never refer to table 4.5. That error is probably coming from g2 or g2tmpl. Is there a developer of those libraries who can examine this problem?

It is quite widespread; many model configurations get the same error from the inline post. I don't know if it happens with the offline post.

WenMeng-NOAA commented 3 weeks ago

@SamuelTrahanNOAA I checked the off-line post log /home/Wen.Meng/stmp2/fv3gefs_2022042400_pe_test/outpost_nems_2022042612. The warnings of 'get_g2_fixedsurfacetypes key' are also in the off-line post. I am not sure if these warnings are related to PR #929. There are some changes in POST-XML-Library-NT.pl.

SamuelTrahanNOAA commented 3 weeks ago

@WenMeng-NOAA - We can test your hypothesis by reverting the changes #929 made to post_avblflds.xml and regenerating the txt files. From my reading of the changes in #929, that should revert the PR's effects. That is, if the Perl and Fortran code changes were correct.

WenMeng-NOAA commented 3 weeks ago

@WenMeng-NOAA - We can test your hypothesis by reverting the changes #929 made to post_avblflds.xml and regenerating the txt files. From my reading of the changes in #929, that should revert the PR's effects. That is, if the Perl and Fortran code changes were correct.

@SamuelTrahanNOAA I conducted two tests for hafs:

1) The UPP version: f7bc0cb committed before #929
old control file
run: post_hafs_2022092800-before at /home/Wen.Meng/stmp2
log: /home/Wen.Meng/stmp2/post_hafs_2022092800-before/outpost_2022092809
No warnings.

2) The UPP version: 1916cb2 committed from #929
new control file
run: post_hafs_2022092800-after at /home/Wen.Meng/stmp2
log: /home/Wen.Meng/stmp2/post_hafs_2022092800-after/outpost_2022092809
There are warnings of 'get_g2_fixedsurfacetypes key: 255 not found in table 4.5'

It seems to me these warnings were introduced by #929. However, the grib2 files generated from the two tests are bitwise identical.

SamuelTrahanNOAA commented 3 weeks ago

Nothing in this UPP pull request causes the problem, so I think we should go ahead with merging this.

However, we do need to decide whether to update the fv3atm and ufs-weather-model with this UPP version, or whether we should wait for the warnings to go away.

SamuelTrahanNOAA commented 3 weeks ago

Also, where are the issue trackers for the g2tmpl and g2 libraries?

That message is probably coming from one of those libraries.

WenMeng-NOAA commented 3 weeks ago

Also, where are the issue trackers for the g2tmpl and g2 libraries?

That message is probably coming from one of those libraries.

@SamuelTrahanNOAA The message is from g2tmpl. See here.

SamuelTrahanNOAA commented 3 weeks ago

This is the surface type the post requests for that variable:

<fixed_sfc1_type>surface</fixed_sfc1_type>

If you remove #929's changes to the avblflds, it still requests that surface type, but the warnings are gone.

SamuelTrahanNOAA commented 3 weeks ago

I dug deeper. This message indicates the code is looking for an empty key:

148: get_g2_fixedsurfacetypes key: 255 not found in table 4.5

It comes from here:

    print *, 'get_g2_fixedsurfacetypes key: ', trim(key), value,  &
         ' not found in table 4.5'

Note that the key is printed before the value (255). That means the whitespace before the 255 is the empty key string.

The value of 255 is defined:

data table4_5(66) /fixed_surface_types('missing',255)/

But for it to match, the key must be "missing" not the empty string.

       if (trim(table4_5(n)%fixedsurfacetypeskey).eq.trim(key)) then
          value=table4_5(n)%fixedsurfacetypesval
          return
       endif
WenMeng-NOAA commented 3 weeks ago

I dug deeper. This message indicates the code is looking for an empty key:

148: get_g2_fixedsurfacetypes key: 255 not found in table 4.5

It comes from here:

    print *, 'get_g2_fixedsurfacetypes key: ', trim(key), value,  &
         ' not found in table 4.5'

Note that the key is printed before the value (255). That means the whitespace before the 255 is the empty key string.

The value of 255 is defined:

data table4_5(66) /fixed_surface_types('missing',255)/

But for it to match, the key must be "missing" not the empty string.

       if (trim(table4_5(n)%fixedsurfacetypeskey).eq.trim(key)) then
          value=table4_5(n)%fixedsurfacetypesval
          return
       endif

@SamuelTrahanNOAA It's weird. If the 'fixed_surface_type' read from postxconfig-NT.txt is 'missing' or blank, how is the correct surface type encoded in the grib2 data?

SamuelTrahanNOAA commented 3 weeks ago

It's weird. If the 'fixed_surface_type' read from postxconfig-NT.txt is 'missing' or blank, how is the correct surface type encoded in the grib2 data?

It's only "missing" or the empty string if the value is unspecified. For example, no fixed_sfc2_type. Then the subroutine doesn't assign to its output argument (value) and the grib2 file has whatever was in that variable before the call. The fixed_sfc2_type comes after fixed_sfc1_type. Thus, a missing fixed_sfc2_type would have the value of fixed_sfc1_type. If fixed_sfc1_type was also invalid, it would be filled with whatever quantity was in the field before it (not a surface type).
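
The behaviour described above can be modelled in a few lines of shell. This is only a sketch of the control flow, not the g2tmpl source: the function name `lookup_surface_type` is hypothetical, and the 'surface' to 1 mapping is an assumption (1 is the GRIB2 table 4.5 code for ground or water surface); only the 'missing' to 255 entry is quoted from the table above.

```shell
#!/bin/sh
# Shell model of the lookup: "value" is assigned only when the key matches.
value=0
lookup_surface_type() {
    case "$1" in
        surface) value=1 ;;      # assumed mapping, for illustration
        missing) value=255 ;;    # entry quoted from table4_5(66)
        *) echo "key: '$1' not found in table 4.5" >&2; return 1 ;;
    esac
}

lookup_surface_type surface
sfc1=$value

# An empty key matches nothing, so "value" keeps the previous result.
lookup_surface_type "" 2>/dev/null || true
sfc2=$value

echo "sfc1=$sfc1 sfc2=$sfc2"   # prints "sfc1=1 sfc2=1"
```

The second call fails to match but leaves `value` untouched, which is why the second surface type silently inherits the first one.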

WenMeng-NOAA commented 3 weeks ago

It's weird. If the 'fixed_surface_type' read from postxconfig-NT.txt is 'missing' or blank, how is the correct surface type encoded in the grib2 data?

It's only "missing" or the empty string if the value is unspecified. For example, no fixed_sfc2_type. Then the subroutine doesn't assign to its output argument (value) and the grib2 file has whatever was in that variable before the call. The fixed_sfc2_type comes after fixed_sfc1_type. Thus, a missing fixed_sfc2_type would have the value of fixed_sfc1_type. If fixed_sfc1_type was also invalid, it would be filled with whatever quantity was in the field before it (not a surface type).

@SamuelTrahanNOAA So these warning messages come from the second call to get_g2_fixedsurfacetypes, for fixed_sfc2_type? They are kind of false alarms.

    call get_g2_fixedsurfacetypes(lvl_type1, value, ierr)
    ipdstmpl8(10) = value
    ipdstmpl8(11) = scale_fac1
    ipdstmpl8(12) = scaled_val1
    !
    call get_g2_fixedsurfacetypes(lvl_type2, value, ierr)
    ipdstmpl8(13) = value

Might there be a condition implemented for "call get_g2_fixedsurfacetypes(lvl_type2, value, ierr)" in g2tmpl code?

SamuelTrahanNOAA commented 3 weeks ago

I deleted my last post since I see some of it was wrong. When running the gnv1_nested, I see 12360 warning messages about the invalid types. There are 8 files and 1550 grib records per file, which would be 12400 records.

If there are five records per file with a fixed_sfc2_type (40 in total, the difference between 12400 and 12360), then Wen's theory would explain the messages. The fix would be to add fixed_sfc2_type values for all records so the post never sends the empty string. I'll try that and see if the error vanishes.
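
The arithmetic behind that estimate, as a quick check using only the counts quoted above (the variable names are just for illustration):

```shell
#!/bin/sh
files=8
records_per_file=1550
warnings=12360

total=$((files * records_per_file))   # all grib records written
silent=$((total - warnings))          # records that produced no warning
per_file=$((silent / files))          # silent records in each file

echo "total=$total silent=$silent per_file=$per_file"
# prints "total=12400 silent=40 per_file=5"
```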

SamuelTrahanNOAA commented 3 weeks ago

I can confirm the cause is uninitialized fixed_sfc2_type. My branch has a fix for that: if there is no fixed_sfc2_type, it gets the value of fixed_sfc1_type.
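
As a sketch of that defaulting rule only: the real change is in the UPP Fortran code, and the helper name `effective_sfc2_type` and the key 'other_type' below are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pick the key to send for the second surface type.
# $1 = fixed_sfc1_type, $2 = fixed_sfc2_type (may be empty)
effective_sfc2_type() {
    if [ -n "$2" ]; then
        printf '%s\n' "$2"    # an explicit second type is used as-is
    else
        printf '%s\n' "$1"    # otherwise fall back to the first type
    fi
}

effective_sfc2_type surface ""           # prints "surface"
effective_sfc2_type surface other_type   # prints "other_type"
```

With a rule like this the lookup never receives an empty key, so it has nothing to complain about.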

This bug is broader than fixed_sfc2_type, so I made an issue here:

SamuelTrahanNOAA commented 3 weeks ago

I made an issue for the "255 not found in Table 4.5" message here, and I mentioned it is a special case of #977

This PR has a proper fix for the "255 not found in Table 4.5" bug, but it can't fix the broader problem of invalid keys sent from UPP to g2tmpl. That requires refactoring the xml and grib2 code on the UPP side, and adding better error handling to g2tmpl. Such work is far beyond the (ever-expanding) scope of this PR.

SamuelTrahanNOAA commented 3 weeks ago

I'm about to run the full ufs-weather-model regression test suite. This should give a broader confirmation that the "255 missing in Table 4.5" error is gone.

I need @WenMeng-NOAA to run the UPP regression tests on this branch to confirm there are no unintended changes.

WenMeng-NOAA commented 3 weeks ago

I'm about to run the full ufs-weather-model regression test suite. This should give a broader confirmation that the "255 missing in Table 4.5" error is gone.

I need @WenMeng-NOAA to run the UPP regression tests on this branch to confirm there are no unintended changes.

@SamuelTrahanNOAA I will run the UPP RT to test your fix in grib2_module.f.

SamuelTrahanNOAA commented 2 weeks ago

I'm testing @WenMeng-NOAA's alternative fix of using "missing" instead of fixed_sfc1_type. It's more than halfway through the baseline generation with no messages yet. By tomorrow evening, I should have the full testing complete.

EDIT: My tests are only on Hera at the moment. The ufs-weather-model code managers will test on all platforms.

WenMeng-NOAA commented 2 weeks ago

@SamuelTrahanNOAA The bug fixes were implemented in the RRFS branch 'release/rrfs_v1' to solve the issue of unduplicated RRFS grib2 metadata. I will submit a PR to sync these changes to the develop branch before your PR is processed.

SamuelTrahanNOAA commented 2 weeks ago

The bug fixes were implemented in the RRFS branch 'release/rrfs_v1' to solve the issue of unduplicated RRFS grib2 metadata.

@WenMeng-NOAA - I looked at #979 and I don't see any of the bug fixes from this PR in that one. There are bug fixes, but they're entirely unrelated and do not overlap.