Refine TCDIAG output from TC-Pairs as needed

JohnHalleyGotway commented 2 years ago

Describe the Enhancement

Pull request #2315 for issue #392 added a new TCDIAG line type to the output of TC-Pairs. That issue contained the majority of the work but some cleanup tasks remain. Please see below:

[x] @jvigh write the Storm Diagnostics section of the tc-pairs.rst chapter.
[ ] @jvigh (Unrelated to the MET repository) Consider adding METplus use case(s) to demonstrate the use of these diagnostics in TC-Pairs. When doing so, consider documenting details about these diagnostic data sources in the METplus Verification Datasets Guide.
[x] @JohnHalleyGotway update docs to remove references to the -tcdiag and -lsdiag command line options that were consolidated into a single -diag command line option.
[ ] @JohnHalleyGotway investigate why tc-stat is relatively slow reading the tc-pairs output and try to speed it up.
[ ] @KathrynNewman this is somewhat related. MET distributes an old version of the watch/warning file. Should we update that with each release?
[x] @JohnHalleyGotway modify usage statement to indicate that LSDIAG_DEV is not currently supported but will be added in the future.
[x] @jvigh will coordinate with CSU folks to figure out what we should name their diagnostics data source. Currently there's a namespace conflict between calling the CSU data "TCDIAG" as well as the output line type "TCDIAG" and the proposed new MET tool "TC-Diag".

Do NOT plan to do this for version 11.0.0:

[ ] @jvigh and @KathrynNewman continue learning about the realtime and development LSDIAG data sources. Consider how realtime LSDIAG track information should be handled.
[ ] @JohnHalleyGotway add support for the developmental LSDIAG data source which contains 20+ years of data in a single file.
- [ ] Consider adding a use_diag_track = TRUE/FALSE; config file option. If true, use the diag lat/lon location to define the track rather than matching to an existing ADECK track. This idea requires more thought. Would this be a benefit or just cause confusion? We could add logic to bool TrackInfoArray::add_diag_data(DiagFile &diag_file, const StringArray &diag_name) for this.

Time Estimate

Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the enhancement down into sub-issues.

[ ] Add a checkbox for each sub-issue here.

Relevant Deadlines

List relevant project deadlines here or state NONE.

Funding Source

Define the source of funding and account keys here or state NONE.

Define the Metadata

Assignee

[x] Select engineer(s) or no engineer required
[x] Select scientist(s) or no scientist required

Labels

[x] Select component(s)
[x] Select priority
[x] Select requestor(s)

Projects and Milestone

[x] Select Repository and/or Organization level Project(s) or add alert: NEED PROJECT ASSIGNMENT label
[x] Select Milestone as the next official version or Future Versions

Define Related Issue(s)

Consider the impact to the other METplus components.

[x] METplus, MET, METdataio, METviewer, METexpress, METcalcpy, METplotpy No impacts for this set of changes.

Enhancement Checklist

See the METplus Workflow for details.

[ ] Complete the issue definition above, including the Time Estimate and Funding Source.
[ ] Fork this repository or create a branch of develop. Branch name: feature_<Issue Number>_<Description>
[ ] Complete the development and test your changes.
[ ] Add/update log messages for easier debugging.
[ ] Add/update unit tests.
[ ] Add/update documentation.
[ ] Push local changes to GitHub.
[ ] Submit a pull request to merge into develop. Pull request: feature <Issue Number> <Description>
[ ] Define the pull request metadata, as permissions allow. Select: Reviewer(s) and Linked issues Select: Repository level development cycle Project for the next official release Select: Milestone as the next official version
[ ] Iterate until the reviewer(s) accept and merge your changes.
[ ] Delete your fork or branch.
[ ] Close this issue.

jvigh commented 2 years ago

After consultation with CIRA, it was decided to revise the naming schema for the DIAG_SOURCE values to include the group/center that is responsible for the diagnostics (alternatively, the actual model name). The DEV suffix indicates that a perfect prog technique is used to derive the diagnostics (based on analyses). The RT suffix indicates that the diagnostics are computed along some forecast track in real-time.

The DIAG_SOURCE column should support the following values at this time:

SHIPS_DIAG_DEV (renamed from LSDIAG_DEV)
SHIPS_DIAG_RT (renamed from LSDIAG_RT)
CIRA_DIAG_DEV
CIRA_DIAG_RT (renamed from TCDIAG)
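
For anyone updating scripts or config files that still use the old names, the renames listed above can be captured in a small lookup. This is only a sketch in Python; `normalize_diag_source` is a hypothetical helper for illustration and is not part of MET.

```python
# Old-to-new DIAG_SOURCE names, taken from the list above.
# CIRA_DIAG_DEV is new and has no old name, so it does not appear as a key.
DIAG_SOURCE_RENAMES = {
    "LSDIAG_DEV": "SHIPS_DIAG_DEV",
    "LSDIAG_RT": "SHIPS_DIAG_RT",
    "TCDIAG": "CIRA_DIAG_RT",
}


def normalize_diag_source(name):
    """Return the new-style name; anything that is not an old name passes through unchanged."""
    return DIAG_SOURCE_RENAMES.get(name, name)
```

Passing `"TCDIAG"` returns `"CIRA_DIAG_RT"`, while a name that is already new-style, such as `"CIRA_DIAG_DEV"`, comes back as-is.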

Because a stretch goal for this project is to allow for the diagnostics to be computed along any arbitrary specified track (in which the model vortex would be first removed), as well as to allow for different resolutions of model fields to be used, several additional attributes are needed to fully define the source, track, and resolution of the diagnostics.

Thus, the following additional attributes need to be added to the TCDIAG line type in TC-Pairs:

TRACK_SOURCE - a string specifying the model whose track is used to compute the diagnostics. Example values could include ATCF TECH IDs such as: GFS, HWRF, AC00, AP01,...,AP30, or user-specified strings specifying an experimental model/tracker/combo (e.g., EXP_model_GRIPE_tracker_GFDLv5beta3).
FIELD_SOURCE - a string specifying the model whose fields are used to compute the diagnostics. Example values could include ATCF TECH IDs such as: GFS, HWRF, AC00, AP01,...,AP30, or user-specified strings specifying an experimental model (e.g., EXP_model_GRIPE).
FIELD_RESOLUTION - the resolution of the fields being used to compute the diagnostics. Example values could include: 0p50, 0p25.

jvigh commented 2 years ago

John and Jonathan discussed this offline. The plan is to meet Monday to hash all this out and figure out the best solution.

Issues for discussion:

Should TC-Pairs be able to attach multiple sets of diagnostics to the same track object? (Currently only one is permitted.)
What is the best way to solve the namespace conflicts if a user uploads multiple sets of diagnostics from the same source, but from different models/tracks/resolutions?
Are these additional columns needed? If so, how should they be populated? From the config file?

JohnHalleyGotway commented 2 years ago

As discussed on 11/10/22, we could add new output column(s) between the existing DIAG_SOURCE and N_DIAG columns to indicate what data was used to create the model diagnostics.

Ideally we'd be able to extract this metadata from the diagnostic data files being read, but they do NOT currently contain this info, so that's not a viable option. It is possible that a future version of the CIRA diagnostics could add this metadata. For now, recommend that we define it via the TC-Pairs configuration file.

At the TC-Diag project meeting on 11/14/2022, @jvigh, @KathrynNewman, and @JohnHalleyGotway met and decided to make the following changes:

[x] Modify diagnostic source names as indicated using SHIPS_DIAG_DEV, SHIPS_DIAG_RT, CIRA_DIAG_DEV, and CIRA_DIAG_RT. Actually, just support SHIPS_DIAG_RT and CIRA_DIAG_RT.
[x] Add 2 new columns to the TCDIAG line type: TRACK_SOURCE and FIELD_SOURCE. While the input model resolution may be important, it could easily be indicated in the FIELD_SOURCE string. None of the other MET tools explicitly state the model resolution in the output, but we provide the model and desc config options to indicate that type of info... or any other important info that would need to be distinguished in the output... e.g. GFS_0p5 versus GFS_1p0.

[x] Move the existing diag_name config file entry into a larger diag_info_map entry where each of these settings (including the requested diag names) can be specified separately for each diagnostic source. The default diag_name setting of an empty list just processes ALL diagnostics found in the input.

diag_info_map = [
{
 diag_source    = "SHIPS_DIAG_RT"; // Defines behavior for `-diag SHIPS_DIAG_RT path` on the command line
 track_source   = "OFCL";          // Written to TRACK_SOURCE output column
 field_source   = "GFS_0p50";      // Written to FIELD_SOURCE output column
 match_to_track = [ "OFCL" ];      // Matches these diagnostic values to each of the listed tracks
 diag_name      = [];              // List of requested diagnostics (empty means "all")
},
{
 diag_source    = "CIRA_DIAG_RT";  // Defines behavior for `-diag CIRA_DIAG_RT path` on the command line
 track_source   = "GFS";           // Written to TRACK_SOURCE output column
 field_source   = "GFS_0p50";      // Written to FIELD_SOURCE output column
 match_to_track = [ "GFS" ];       // Matches these diagnostic values to each of the listed tracks
 diag_name      = [];              // List of requested diagnostics (empty means "all")
}
];

[x] REMOVE the "model=string" handling for -diag from the command line since match_to_track replaces its functionality and all metadata should be defined in the same spot.
[x] Let the default TCPairsConfig file define the default values for each known source. So we need to define these for each source as accurately as possible. But it'll be easy for users to define metadata for new or modified diagnostic sources.
[x] In diag_convert_map rename source to diag_source for consistency with diag_info_map and the DIAG_SOURCE output column name.
[x] In diag_info_map and diag_convert_map, allow for partial string matching of the diag_source name. That means we won't have to duplicate all the conversion functions for SHIPS_DIAG_DEV and SHIPS_DIAG_RT. Instead we'd just set diag_source = "SHIPS_DIAG".
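
To make the partial-matching idea concrete, here is a rough Python sketch of how a source name given to `-diag` might be resolved against diag_info_map entries. It assumes "partial" means the configured diag_source occurs as a substring of the command-line name and that the first matching entry wins; the actual MET logic may differ. `find_diag_info` is a hypothetical name, and the entry values are copied from the config example in this comment.

```python
def find_diag_info(diag_info_map, cmdline_source):
    """Return the first entry whose diag_source occurs within cmdline_source, else None.

    Assumption: substring matching with first-match-wins ordering.
    """
    for entry in diag_info_map:
        if entry["diag_source"] in cmdline_source:
            return entry
    return None


# Entries mirror the config example; "SHIPS_DIAG" is the shortened name
# suggested for covering both SHIPS_DIAG_DEV and SHIPS_DIAG_RT.
diag_info_map = [
    {"diag_source": "SHIPS_DIAG", "track_source": "OFCL",
     "field_source": "GFS_0p50", "match_to_track": ["OFCL"], "diag_name": []},
    {"diag_source": "CIRA_DIAG_RT", "track_source": "GFS",
     "field_source": "GFS_0p50", "match_to_track": ["GFS"], "diag_name": []},
]
```

With these entries, both `SHIPS_DIAG_DEV` and `SHIPS_DIAG_RT` resolve to the single `SHIPS_DIAG` entry, and a name that matches nothing returns `None` so the caller can decide whether to warn or fall back to defaults.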

We could eventually support Python embedding in this context, where the user provides diagnostics in any format but also provides a Python script for parsing that format! That Python script would serve up a known data structure consisting of lists of diagnostic names and values, with some things like lat, lon, and lead_time being required elements.
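
As a purely illustrative sketch of what such a "known data structure" could look like, the Python below uses a dictionary of equal-length lists plus a validator. Nothing here is an existing MET interface: the key names other than lat, lon, and lead_time, the `diags` layout, the `check_diag_data` helper, and the sample values are all assumptions.

```python
# Only lat, lon, and lead_time are named as required in the comment above.
REQUIRED = ("lat", "lon", "lead_time")


def check_diag_data(data):
    """Validate the hypothetical structure and return the number of lead times.

    Raises ValueError if a required element is missing or if any list
    does not have one value per lead time.
    """
    missing = [key for key in REQUIRED if key not in data]
    if missing:
        raise ValueError("missing required elements: " + ", ".join(missing))
    n_lead = len(data["lead_time"])
    for key in ("lat", "lon"):
        if len(data[key]) != n_lead:
            raise ValueError(key + " must have one value per lead time")
    for name, values in data.get("diags", {}).items():
        if len(values) != n_lead:
            raise ValueError("diagnostic " + name + " must have one value per lead time")
    return n_lead


# Made-up sample values; DIAG_A and DIAG_B are placeholder diagnostic names.
example = {
    "lead_time": [0, 6, 12],
    "lat": [25.0, 25.4, 25.9],
    "lon": [-70.0, -70.6, -71.3],
    "diags": {"DIAG_A": [12.0, 14.0, 15.0], "DIAG_B": [1.0, 2.0, 3.0]},
}
```

Validating up front like this would let the tool reject a malformed user script early rather than failing partway through track matching.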

dtcenter / MET