NOAA-EMC / obsproc


Revisit obsproc dump groups distribution and review run times #44

Open · ilianagenkova opened this issue 2 years ago

ilianagenkova commented 2 years ago

Before the post-go-live obsproc upgrade (for v16.3), i.e. early July 2022, and once generation of all new dumps (Sentinel-6, snow, gmi, etc.) has been tested on WCOSS2, revisit the distribution of the load between the dump groups. (See also emails from Steven Earle, subjects "cdas obsproc" and "obsproc nam", Apr 20, 2020.)

There are two factors here:

1. Look into redistributing the load so that each dump group takes roughly the same amount of time, as best as can be balanced.
2. Reorder the threads so that the longest one runs first, so that it ends at roughly the same time the others finish up (see the sketch below).

That rebalancing needs to be explored for all networks, not just global.
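
As an illustration of the second point, here is a minimal sketch of greedy longest-processing-time scheduling: sort the dump groups by measured run time, longest first, then hand each one to the currently least-loaded rank. The group names, times, and rank count below are illustrative, not operational values.

```bash
#!/bin/bash
# Greedy longest-processing-time assignment (illustrative values only).
nranks=4
declare -a load
for ((r = 0; r < nranks; r++)); do load[r]=0; done

# "group seconds" pairs, sorted by run time descending (longest first).
printf '%s\n' "g1 208" "g2 133" "g3 111" "g4 84" "g5 78" "g6 23" "g7 4" |
sort -k2,2 -rn |
while read -r group secs; do
    # Find the currently least-loaded rank.
    best=0
    for ((r = 1; r < nranks; r++)); do
        (( load[r] < load[best] )) && best=$r
    done
    (( load[best] += secs ))
    echo "assign $group (${secs}s) -> rank $best (rank total ${load[best]}s)"
done
```

With these sample times the rank totals come out to 208, 137, 134, and 162 seconds; the 208-second group dominates, which is exactly why it should be launched first.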

Q. In exglobal_dump.sh, one can see:

```
$ushscript_dump/bufr_dump_obs.sh $dumptime 3.0 1 avcspm esmhs 1bmhs \
  airsev atmsdb gome omi trkob gpsro $crisf4
```

Why are some dump names preceded by `$` and some are not?

A. The `$` preceding some dump names is there to test first whether any tanks are available for that data before attempting to dump it. You can see how the variable is used further up in the script, prior to the call that runs bufr_dump_obs.sh.
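
For illustration, the pattern looks roughly like the sketch below, assuming a glob test on the tank directory; the tank path and the test itself are hypothetical stand-ins, not copied from exglobal_dump.sh:

```bash
# Hypothetical sketch: set $crisf4 only if tanks exist for that data type.
# The tank directory and file pattern are placeholders for illustration.
crisf4=""
if compgen -G "$TANK/$PDY/b021/xx206*" > /dev/null; then
    crisf4="crisf4"
fi

# An empty $crisf4 simply drops that mnemonic from the argument list, so
# bufr_dump_obs.sh never attempts a dump that has no input tanks.
$ushscript_dump/bufr_dump_obs.sh $dumptime 3.0 1 avcspm esmhs 1bmhs \
    airsev atmsdb gome omi trkob gpsro $crisf4
```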

ShelleyMelchior-NOAA commented 2 years ago

One important thing to do prior to making any script adjustments is to establish a baseline: take some time to benchmark the current arrangement in WCOSS2 operations. Almost everything you need can be gleaned from the output global dump log files. For example, grep for `CFP RANK` in a log file:

```
CFP RANK   0    TOTAL RANK RUN TIME: 133.0 sec    Return status: 00000000 hex
CFP RANK   1    TOTAL RANK RUN TIME: 207.5 sec    Return status: 00000000 hex
CFP RANK   2    TOTAL RANK RUN TIME: 207.2 sec    Return status: 00000000 hex
CFP RANK   3    TOTAL RANK RUN TIME:  83.7 sec    Return status: 00000000 hex
CFP RANK   4    TOTAL RANK RUN TIME: 110.8 sec    Return status: 00000000 hex
CFP RANK   5    TOTAL RANK RUN TIME:  77.5 sec    Return status: 00000000 hex
CFP RANK   6    TOTAL RANK RUN TIME:   3.8 sec    Return status: 00000000 hex
CFP RANK   7    TOTAL RANK RUN TIME:  23.2 sec    Return status: 00000000 hex
CFP RANK   8    TOTAL RANK RUN TIME:   0.6 sec    Return status: 00000000 hex
CFP RANK   9    TOTAL RANK RUN TIME:   0.0 sec    Return status: 00000000 hex
CFP RANK  10    TOTAL RANK RUN TIME:   0.0 sec    Return status: 00000000 hex
CFP RANK  11    TOTAL RANK RUN TIME:   0.0 sec    Return status: 00000000 hex
CFP RANK  12    TOTAL RANK RUN TIME:   0.0 sec    Return status: 00000000 hex
CFP RANK  13    TOTAL RANK RUN TIME:   0.0 sec    Return status: 00000000 hex
```

Next you need to map threads (dump groups) to CFP RANK. To do this, grep a pattern out of the log file. For thread_1, run `grep "s:thread_1:L" logfilename | grep set`. You will get a return along these lines:

```
++ 0s:thread_1:L6 + set +x
++ 208s:thread_1:L149 + set +x
```

For thread_2, run `grep "s:thread_2:L" logfilename | grep set`. You will get a return along these lines:

```
++ 0s:thread_2:L6 + set +x
++ 78s:thread_2:L109 + set +x
```

From this you can conclude that thread_1 ran as CFP RANK 1 (207.5 sec) and thread_2 ran as CFP RANK 5 (77.5 sec): the elapsed time stamped on each thread's final `set +x` trace line (208 s and 78 s above) matches that rank's total run time.

Rinse and repeat for all 13 threads (for gdas/gfs; other networks have different numbers of threads).
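
The per-thread greps lend themselves to a small helper. Here is a sketch, assuming the log format shown above; the script interface and the hard-coded thread count are illustrative:

```bash
#!/bin/bash
# Sketch: pull CFP rank run times and each thread's final elapsed time out
# of a global dump log so the two can be matched up by eye.
log=${1:?usage: $0 logfilename}

echo "== CFP rank run times =="
grep "CFP RANK" "$log"

echo
echo "== Final elapsed time per thread =="
for n in $(seq 1 13); do
    # The last "<secs>s:thread_N:L<line> + set +x" trace line carries that
    # thread's total elapsed seconds in its prefix.
    grep "s:thread_${n}:L" "$log" | grep set | tail -1 |
        sed -n "s/^++ *\([0-9]*\)s:thread_${n}:.*/thread_${n}: \1 sec/p"
done
```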

ilianagenkova commented 1 year ago

Grouping and wall/run times were investigated by @dmerkova and @AshleyStanfield-NOAA. Results were presented to NCO (Steven Earle). Alternative groupings did not help. The obsproc v1.1.0 grouping (Nov 30-Dec 1, 2022) is the agreed arrangement in operations for now.

@AshleyStanfield-NOAA split the radiances into two groups (introducing a new one) in the global dump section. That code has been added to obsproc v1.2.0 (still under development).