NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

Optimizations based on GFS v16 DA #106

Closed · RussTreadon-NOAA closed this issue 1 year ago

RussTreadon-NOAA commented 3 years ago

Examination of GFS v16 DA job run times identified potential areas of optimization. This issue is opened to document these areas and the changes made to reduce run times.

RussTreadon-NOAA commented 3 years ago

Optimization changes will be committed to branch feature/optimize in RussTreadon-NOAA/GSI

RussTreadon-NOAA commented 3 years ago

scripts/exgdas_atmos_chgres_forenkf.sh

Tests on WCOSS_D found that ObsProc wall times for prepobs_prepdata and syndat_syndata increase when reading the GFS v16 atmf006.nc. Part of the wall time increase is due to the jump from 64 to 127 layers; part is due to uncompressing the atmf006.nc file during the read. It was suggested that ObsProc be run with an uncompressed atmf006.nc.

GFS v16 ObsProc scripts were modified toward this end. In doing so it was noted that the "cp" of large atmfXXX.nc files could be replaced with "ln". These two changes were tested over 8 gfs and gdas cycles covering the period 2021012200 through 2021012318. Below are the average prep step run times for the GFS v16 parallel (para v16) and the test (test v16):

| job | para v16 | test v16 |
| --- | --- | --- |
| gfs prep | 03:47 | 02:45 |
| gdas prep | 03:49 | 02:52 |

About 1 minute is saved for both the gfs and gdas prep steps. The prepbufr, prepbufr.acft_profiles, and nsstbufr files created by test v16 are b4b identical with their control (para v16) counterparts.

For these gains to be realized, the ObsProc prep scripts need to be updated to use the uncompressed atmf006.nc, and "cp" needs to be replaced by "ln". These changes fall outside the scope of the NOAA-EMC/GSI repository.

Creation of the uncompressed atmf006.nc file can be done via a NOAA-EMC/GSI job. Upon examination of the workflow and job dependencies, JGDAS_ATMOS_CHGRES_FORENKF was identified as a good location in which to uncompress atmf006.nc. This job runs chgres on atmfXXX.nc for XXX=003, 006, and 009. Script exgdas_atmos_chgres_forenkf.sh runs the three realizations of chgres in parallel using CFP. A fourth command, which uses the NetCDF utility nccopy to uncompress atmf006.nc, was added to the CFP command file.

The option to uncompress atmf006.nc is controlled by script variable UNCOMPRESS_ATMF. The default, UNCOMPRESS_ATMF="NO", means no uncompression. When UNCOMPRESS_ATMF="YES", nccopy is executed and the uncompressed output file is given the suffix ".uncompress". The locally modified ObsProc scripts pick up this ".uncompress" version of atmf006.nc.
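As a rough sketch, the addition might look like the following. UNCOMPRESS_ATMF and the ".uncompress" suffix come from the description above; the paths, the command-file name, and the nccopy flags are illustrative assumptions rather than the exact code committed to feature/optimize.

```bash
# Sketch of the nccopy addition to the chgres CFP command file.
# UNCOMPRESS_ATMF and the ".uncompress" suffix follow the text above;
# everything else (paths, variable names, nccopy options) is illustrative.
export UNCOMPRESS_ATMF=${UNCOMPRESS_ATMF:-"NO"}

# Command file already containing the three chgres realizations (003, 006, 009)
cmdfile=$DATA/mp_chgres.sh

if [ "$UNCOMPRESS_ATMF" = "YES" ]; then
  # nccopy -d 0 rewrites the netCDF-4 file with deflation level 0 (no compression)
  echo "nccopy -d 0 $COMIN/gdas.t${cyc}z.atmf006.nc $COMIN/gdas.t${cyc}z.atmf006.nc.uncompress" >> "$cmdfile"
fi

# The command file is then handed to CFP so all four commands run in parallel;
# the launcher invocation varies by platform, e.g.:
#   mpirun cfp "$cmdfile"
```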

The average run time for job JGDAS_ATMOS_CHGRES_FORENKF is 04:21, and nccopy takes 01:48 to uncompress atmf006.nc, so adding nccopy to JGDAS_ATMOS_CHGRES_FORENKF does not increase the total job run time. It does, however, increase the job node count from 3 to 4 nodes. If using an additional node is deemed unacceptable, the way nccopy is implemented in exgdas_atmos_chgres_forenkf.sh can be revisited.

The modified exgdas_atmos_chgres_forenkf.sh was committed to feature/optimize at cd49b72.

RussTreadon-NOAA commented 3 years ago

SATWND optimization

The addition of timers to GFS v16 source code file read_obs.F90 revealed that processing the satwnd dump file can take up to three minutes. This is significantly longer than the processing time for the much larger radiance dump files. This finding is not surprising when one recalls that radiance dump files are processed in parallel whereas satwnd processing is serial.

The parallel paradigm used in the radiance readers could be added to read_satwnd.f90. Doing so would likely require a major rewrite of read_satwnd.f90. This is not a bad thing, but given the transition to JEDI, especially JEDI UFO, refactoring read_satwnd.f90 may not be the best use of DA staff time. Given this, an alternative option was explored.

The satwnd bufr file is a collection of atmospheric motion vectors (AMVs) from various satellites and tracking algorithms, each of which is identified by a unique bufr subset. Both ObsProc and NCEPLIBS have very efficient utilities to split a bufr file into subset-specific files. The ObsProc executable is named "gsb". The NCEPLIBS executable, split_by_subset.x, is described in NOAA-EMC/NCEPLIBS-bufr issue #89.

Script exglobal_atmos_analysis.sh was modified to execute split_by_subset.x on the satwnd dump file in the run time directory. The single satwnd entry in GSI namelist OBS_INPUT was replaced with multiple satwnd_NC005XXX. Thus, instead of one task reading the entire satwnd dump file, N tasks read N satwnd subset files in parallel.
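A hedged sketch of this change is shown below. split_by_subset.x is the NCEPLIBS-bufr utility mentioned above; the run-directory layout, the dfile naming, and the OBS_INPUT row format are illustrative placeholders, not the exact edits committed to feature/optimize.

```bash
# Sketch: split the satwnd dump into per-subset files and give each subset its
# own OBS_INPUT entry so that N tasks read N files in parallel.
# Paths, the executable location, and the OBS_INPUT column layout are illustrative.
cd "$DATA"                                  # GSI run-time directory (assumed)

# Split the satwnd dump: one file per BUFR subset (NC005030, NC005031, ...)
"$EXECbufr/split_by_subset.x" satwnd

# Build one OBS_INPUT row per non-empty subset file
rm -f obs_input_satwnd.txt
for subset in NC005*; do
  [ -s "$subset" ] || continue              # skip empty or missing subsets
  mv "$subset" "satwnd_${subset}"
  # Column layout mirrors the existing satwnd row and is only sketched here
  echo "   satwnd_${subset}   uv   null   uv   0.0   0   0" >> obs_input_satwnd.txt
done

# obs_input_satwnd.txt would then replace the single satwnd entry in the
# OBS_INPUT section of the GSI namelist.
```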

The modified exglobal_atmos_analysis.sh was exercised in 8 gfs and gdas cycles covering the period 2021012200 through 2021012318. Below are the average gfs and gdas atmos_analysis run times for the control (NCO's v16 parallel) and the test:

| job | para v16 | test v16 |
| --- | --- | --- |
| gfs anal | 28:47 | 27:28 |
| gdas anal | 38:02 | 36:27 |

Processing satwnd subsets reduced the average gfs analysis run time by 01:19. The decrease was a bit larger for the gdas analysis, 01:35.

Examination of the test and control analysis increment files found that they are NOT b4b identical. Differences were found in the initial satwnd penalties between the two sets of runs. These differences were traced to the application of the duplicate check in setupw.f90.

The duplicate check adjusts the observation error for duplicate observations. Duplicate observations are those observations with the same {x, y, p}. Note: if logical twodvar_regional is .true., p (pressure) is not part of the duplicate check.

When a single satwnd dump file is processed, satwnd uv is listed once in OBS_INPUT. Thus, setupw is called once for all satwnd observations. All satwnd subsets pass through the duplicate check together. Those observations with the same {x, y, p} are flagged as duplicates. Some of these {x, y, p} duplicates are for different satwnd subsets. The observation errors for all obs flagged as a duplicate are adjusted.

When satwnd subsets are processed, satwnd uv is listed in OBS_INPUT once for each subset, and setupw is called once per subset. Since the satwnd subsets are processed separately, cross-subset duplicates are not found. As a result, not all the satwnd observations flagged as duplicates in the control run are flagged as duplicates in the test. Different observation errors in the test yield different penalties, different minimizations, and ultimately different analyses with respect to the control.

As a test, the duplicate check in the control setupw was modified to include the AMV observation type: AMV observations with the same {x, y, p} were NOT flagged as duplicates unless the AMV observation type was also the same. With this modification the control identifies the same satwnd duplicates as the test. This was confirmed by rerunning the 2021012512 gdas case.

Iliana pointed out that some satwnd subsets are not processed by the GSI even though they are present in the satwnd dump file. She suggested another test: split_by_subset.x was used to split satwnd into subsets, the subsets not processed by the GSI were removed, and the remaining subsets were concatenated in the same order as found in the original satwnd dump file (a sketch of this workflow follows the table below). With this change the resulting analysis increments were b4b identical with the control. The GSI wall time was slightly reduced with respect to the control, but much of the wall time gain found in the satwnd subset run was erased.

| job | para v16 | test v16 (remove unread) |
| --- | --- | --- |
| gfs anal | 28:47 | 28:31 |
| gdas anal | 38:02 | 37:44 |
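For reference, a sketch of this "remove unread subsets" test is shown below. The READ_SUBSETS list is purely illustrative (the authoritative list is whatever read_satwnd.f90 actually processes), and plain concatenation is assumed to be safe because BUFR messages are self-delimiting.

```bash
# Sketch: rebuild a trimmed satwnd dump containing only the subsets the GSI
# actually reads, keeping the subset order of the original dump.
# READ_SUBSETS below is an illustrative placeholder, not the real read list.
READ_SUBSETS="NC005030 NC005031 NC005032 NC005034 NC005039"

"$EXECbufr/split_by_subset.x" satwnd        # one file per BUFR subset

rm -f satwnd.trimmed
for subset in $READ_SUBSETS; do
  # BUFR messages are self-delimiting, so plain concatenation is assumed valid
  [ -s "$subset" ] && cat "$subset" >> satwnd.trimmed
done

mv satwnd.trimmed satwnd                    # GSI then reads the smaller dump as before
```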

The changes to scripts/exglobal_atmos_analysis.sh, src/gsi/read_obs.F90, and src/gsi/setupw.f90 to run these various tests were committed to feature/optimize at cd49b72.

Note that setupw.f90 at cd49b72 still contains commented-out lines used to run these various tests. The subroutine will be cleaned up and unnecessary code removed once a final approach is decided upon.

RussTreadon-NOAA commented 3 years ago

c7521f4 and ccfadb8

Used the Fetch upstream button on the RussTreadon-NOAA/GSI GitHub page to bring recent commits from the authoritative NOAA-EMC/GSI repo into the fork.

RussTreadon-NOAA commented 3 years ago

Merge RussTreadon-NOAA/GSI master at 911a6a3 into feature/optimize. Done at 5f00d2a.

ilianagenkova commented 3 years ago

I ran eight cycles twice, once assimilating the single satwnd dump file and once assimilating the subsets split from the original satwnd file. Here are the wall time results (in seconds) showing the time saved when assimilating subsets in parallel:

| date | AN hour | one file | split files | saved time |
| --- | --- | --- | --- | --- |
| 20210503 | 18z | 1858 | 1763 | 95 |
| 20210504 | 00z | 1789 | 1729 | 60 |
| 20210504 | 06z | 1793 | 1737 | 56 |
| 20210504 | 12z | 1774 | 1729 | 45 |
| 20210510 | 18z | 1854 | 1731 | 123 |
| 20210511 | 00z | 1814 | 1722 | 92 |
| 20210511 | 06z | 1883 | 1800 | 83 |
| 20210511 | 12z | 1876 | 1819 | 57 |

RussTreadon-NOAA commented 3 years ago

Additional ObsProc Tests

NCO contacted EMC regarding variability in global prep step job run times. Setting SYNDATA=NO or DO_BOGUS=NO is not an acceptable solution to reduce job run time. Setting either variable to NO alters prepbufr and prepbufr.acft_profiles which, in turn, alters the analysis and the subsequent forecast.

While enhancing program SYNDAT_SYNDATA, and optimizing this and other ObsProc codes, is a worthwhile task, another option was described at the start of this issue. Much of the increased wall time for ObsProc executables when moving from L64 nemsio to L127 compressed netcdf files is due to file processing. As documented above, reading uncompressed netcdf files decreases ObsProc executable wall time, and replacing cp with ln for atmfXXX.nc files provides additional savings.

The following test was run on the production WCOSS_D:

  1. Configure the emc.glopara v16ops parallel slot to run the operational gfs.v16.1.2.
  2. Run gdas prep for 2021072406, 2021072606, and 2021072706. These three cycles contain 3, 2, and 1 storms, respectively, in the syndata.tcvitals file. The 2021072606 case was also run with a zero-length syndata.tcvitals to simulate a cycle with no storms. Each run reproduced its operational counterpart, except (obviously) the zero-storm 2021072606 test.
  3. Record the job run time and the wall times for programs PREPOBS_PREPDATA and SYNDAT_SYNDATA for each run. Both of these programs read atmf006.nc.

The gdas.t00z.atmf006.nc file used for each gdas prep cycle was manually uncompressed using the NetCDF utility nccopy. The following changes were then made to a working copy of obsproc_prep_RB-5.4.0: the prep scripts were pointed at the uncompressed atmf006.nc, and the "cp" of the large atmfXXX.nc files was replaced with "ln".
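A minimal sketch of those two changes is shown below, assuming the prep script stages the guess file into its working directory; the staged file name "sgesprep", the $COMINgdas path, and the variable names are illustrative, not the actual obsproc_prep_RB-5.4.0 code.

```bash
# Sketch of the two ObsProc-side changes: read the uncompressed atmf006 and
# replace the copy of the large netCDF guess file with a symbolic link.
# $COMINgdas, $cyc, and the staged name "sgesprep" are illustrative placeholders.

# Before (illustrative): copy the compressed guess file into the work directory
# cp $COMINgdas/gdas.t${cyc}z.atmf006.nc sgesprep

# After: link the uncompressed file written by JGDAS_ATMOS_CHGRES_FORENKF
ln -sf "$COMINgdas/gdas.t${cyc}z.atmf006.nc.uncompress" sgesprep
```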

The v16ops config.base was updated to point at the modified obsproc_prep and the gdas prep cases rerun. Each run generated prepbufr and prepbufr.acft_profiles which were bit-4-bit identical with their operational counterparts.

Tabulated below are the prep job run times (minutes:seconds) for the control (operations) and the test (processing uncompressed netcdf files with "ln").

| cycle / storms | control | test |
| --- | --- | --- |
| 2021072406, 3 storms | 06:04 | 04:12 |
| 2021072606, 2 storms | 05:52 | 03:59 |
| 2021072706, 1 storm | 06:04 | 04:05 |
| 2021072606, 0 storms | 03:34 | 02:28 |

Below are similar tables, but with wall times (seconds) for executables PREPOBS_PREPDATA and SYNDAT_SYNDATA.

First, PREPOBS_PREPDATA:

| cycle / storms | control | test |
| --- | --- | --- |
| 2021072406, 3 storms | 141.816695 | 94.583356 |
| 2021072606, 2 storms | 143.482867 | 93.776878 |
| 2021072706, 1 storm | 144.274038 | 94.779594 |
| 2021072606, 0 storms | 142.240700 | 93.808766 |

Second, SYNDAT_SYNDATA:

| cycle / storms | control | test |
| --- | --- | --- |
| 2021072406, 3 storms | 143.622996 | 96.613691 |
| 2021072606, 2 storms | 136.511219 | 88.841422 |
| 2021072706, 1 storm | 132.424489 | 84.072876 |
| 2021072606, 0 storms | 0.0 | 0.0 |

Since the GFS no longer runs vortex relocation, the gdas atm[gm3, ges, gp3].nc files are direct copies of the previous cycle gdas atmf[003, 006, 009].nc files, respectively. A check of operational job log files found that neither the gfs nor the gdas cycle of the GFS uses the gdas atm[gm3, ges, gp3].nc files. Thus, the sections of scripts/exglobal_makeprepbufr.sh.ecf in obsproc_global_RB-3.4.0 which copy the previous cycle sg*prep files to $COMOUT could be removed. That said, downstream applications or external users may use the atm[gm3, ges, gp3].nc files. These users should be informed to use the previous cycle atmf[003, 006, 009].nc files, since this, in fact, is what they are currently doing.

Removing the sg*prep copies from scripts/exglobal_makeprepbufr.sh.ecf simplifies the script and may yield a small reduction in run time.
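For illustration only, the kind of copy block that could be dropped from exglobal_makeprepbufr.sh.ecf might resemble the sketch below; the exact file and variable names in obsproc_global_RB-3.4.0 are not reproduced here.

```bash
# Illustrative sketch of the sg*prep copy block that could be removed, since the
# gdas atm[gm3, ges, gp3].nc files are direct copies of the previous cycle's
# atmf[003, 006, 009].nc files and are not read by the gfs or gdas cycles.
# File and variable names are placeholders, not the actual script contents.
for tt in gm3 ges gp3; do
  cp "sg${tt}prep" "$COMOUT/${RUN}.t${cyc}z.sg${tt}prep"   # candidate for removal
done
```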

ilianagenkova commented 3 years ago

Here is @ShelleyMelchior-NOAA's contribution to this work, comparing the time needed to prepare one satwnd dump file versus the time needed for its component bufr_d files: "...

I took a moment to look at the job run times for dumping satwnd as a dump group vs dumping satwnd individual components.

| GFS | split (s) | whole (s) | delta (s) | GDAS | split (s) | whole (s) | delta (s) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 20210718 00 | 269 | 257 | -12 | 20210718 00 | 278 | 269 | -9 |
| 20210718 12 | 194 | 183 | -11 | 20210718 12 | 205 | 196 | -9 |
| 20210718 18 | 328 | 322 | -6 | 20210718 18 | 343 | 331 | -12 |
| 20210719 18 | 323 | 318 | -5 | 20210719 18 | 335 | 325 | -10 |
| 20210722 18 | 199 | 191 | -8 | 20210722 18 | 212 | 205 | -7 |
| 20210723 06 | 111 | 102 | -9 | 20210723 06 | 125 | 115 | -10 |
| 20210723 12 | 151 | 145 | -6 | 20210723 12 | 166 | 158 | -8 |
| average | | | -8.142857143 | average | | | -9.285714286 |

I spot checked a handful of cycles that Sudhir ran in his crons. The conclusion: keeping the satwnd constituents grouped as a single entity is, on average, 8 seconds faster for GFS and 9 seconds faster for GDAS than busting them up into individual dump files.

I don't know how that timing compares to how long it takes the GSI to split the satwnd dump file into its components. If the extra 8-9 seconds on the dump side is still shorter than the GSI routine, then this is something we can consider doing in obsproc.