Enhance TC-Gen to verify genesis probabilities from ATCF e-deck files.

JohnHalleyGotway commented 3 years ago

Describe the New Feature

Please see the attached slides, which illustrate the 2 main changes required for TC-Genesis verification. This issue describes the first of those 2 enhancements: enhance tc_gen to verify genesis probabilities from ATCF e-deck files. NHC identifies disturbances and issues 3 probability forecasts for how likely it is that each disturbance will develop into a tropical storm. The probabilities cover 3 time windows of 48, 120, and 168 hours.

@halperin-erau has provided some sample data containing these e-deck probabilities. However, as of May 2021, their format is still under development. In the existing and historical versions of these files, both the lat/lon location and the valid timestamps are absent. If any of those columns are missing from the input data, tc_gen should print a warning message and ignore that input.

This task is to enhance tc_gen to parse those e-deck probabilities and use them to populate Nx2 probabilistic contingency tables. Create 1 table for each of the time windows (48, 120, and 168) but make that a user-configurable option. Write the resulting probabilistic output.
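
As a rough illustration of the Nx2 table described above, the sketch below counts events and non-events per probability bin for one time window. The function name, the bin edges, and the sample pairs are all hypothetical; this is not tc_gen's implementation, only one way the counting could work.

```python
from bisect import bisect_right

def build_nx2_table(pairs, thresholds):
    """Count events and non-events per probability bin.

    pairs: iterable of (probability in [0, 1], event_occurred bool)
    thresholds: ascending bin edges starting at 0.0 and ending at 1.0
    Returns one [n_event, n_nonevent] row per bin.
    """
    n_bins = len(thresholds) - 1
    table = [[0, 0] for _ in range(n_bins)]
    for prob, occurred in pairs:
        # Find the bin whose lower edge is <= prob; prob == 1.0 goes in the last bin.
        row = min(bisect_right(thresholds, prob) - 1, n_bins - 1)
        table[row][0 if occurred else 1] += 1
    return table

# Hypothetical 48-hour forecasts: (probability, did genesis occur in the window?)
pairs_48h = [(0.2, True), (0.2, False), (0.3, False), (0.9, True), (1.0, True)]
edges = [0.0, 0.25, 0.5, 0.75, 1.0]  # illustrative bin edges, an assumption
table_48h = build_nx2_table(pairs_48h, edges)
print(table_48h)
```

Building one such table per time window (48, 120, 168) would only require grouping the pairs by their lead time before calling the function.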

Be sure to subset output by basin, time window, and perhaps forecaster initials. Support the application of both the development and operational logic, but note that the operational logic will be used by NHC.

Acceptance Testing

List input data types and sources. Describe tests required for new functionality.

Time Estimate

Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the new feature down into sub-issues.

[ ] Add a checkbox for each sub-issue here.

Relevant Deadlines

The project for 7790901 originally ended in August, 2021. After the no-cost extension, the updated date is February 28th, 2022.

Funding Source

7790901

Define the Metadata

Assignee

[x] Select engineer(s) or no engineer required
[x] Select scientist(s) or no scientist required

Labels

[x] Select component(s)
[x] Select priority
[x] Select requestor(s)

Projects and Milestone

[x] Select Repository and/or Organization level Project(s) or add alert: NEED PROJECT ASSIGNMENT label
[x] Select Milestone as the next official version or Future Versions

Define Related Issue(s)

Consider the impact to the other METplus components.

[x] METplus, MET, METdatadb, METviewer, METexpress, METcalcpy, METplotpy Recommend adding a METplus use-case to demonstrate this new functionality, assuming enough data is available to do so.

New Feature Checklist

See the METplus Workflow for details.

[ ] Complete the issue definition above, including the Time Estimate and Funding source.
[ ] Fork this repository or create a branch of develop. Branch name: feature_<Issue Number>_<Description>
[ ] Complete the development and test your changes.
[ ] Add/update log messages for easier debugging.
[ ] Add/update unit tests.
[ ] Add/update documentation.
[ ] Push local changes to GitHub.
[ ] Submit a pull request to merge into develop. Pull request: feature <Issue Number> <Description>
[ ] Define the pull request metadata, as permissions allow. Select: Reviewer(s) and Linked issues Select: Repository level development cycle Project for the next official release Select: Milestone as the next official version
[ ] Iterate until the reviewer(s) accept and merge your changes.
[ ] Delete your fork or branch.
[ ] Close this issue.

JohnHalleyGotway commented 3 years ago

On 5/24/21, Dan H, Tara, and John HG met to discuss these details. Dan provided the additional followup information below:

Thank you for the discussion today. I've attached some sample data that may be helpful as development continues. The al*.dat files are TWO sample files. I have included at least one developing and one non-developing disturbance during the years when NHC issued 48 h, 48/120 h, and 48/120/168 h probabilities.

eal152020.dat is the e-deck that NHC wrote without any modification by me. Only the lines with "GN" in the 4th column are relevant to us. Note that the lat/lon information is blank and there is no information regarding forecast genesis time.

eal152020-model.dat is a modified version of eal152020.dat to illustrate a hypothetical probabilistic forecast from post-processed GFSO output. This is the format that TC-Gen would use. Only the GN lines are included. The storm number in the 2nd column was changed to an arbitrary number. Lat/lon information are included in the 7th/8th columns. Forecast genesis valid time is included in the 13th column. Let me know if you have any questions about the data.

I looked in a few e-deck files and did not see any of the "GS" (genesis shape) entries. I'll confirm with NHC regarding whether they would like to verify the shape files.

edeck-two-sample-data.tar.txt

JohnHalleyGotway commented 2 years ago

@Kathryn Newman I want to get going on MET #1809. Reading through the details, the first thing I want to figure out is WHERE I should do this work. It could be in tc_pairs or tc_gen. tc_pairs already includes an -edeck command line option for verifying probability of RI and writes probabilistic vx output for that. tc_gen already includes the genesis algorithm and is meant to handle genesis "stuff" but it does NOT currently include an -edeck command line argument. So should we have 1 tool (tc_pairs) that processes all the -edeck data? Or split that across 2 tools (tc_pairs and tc_gen) depending on the contents of the -edeck data file?

JohnHalleyGotway commented 2 years ago

Held a project meeting on 10/28/21 and laid out plans:

@JohnHalleyGotway proceeds with MET #1809 followed by MET #1810.
MET #1809:
- The publicly available NHC E-deck data does NOT include a column for the forecast genesis valid time.
- @halperin-erau can provide sample data that does have that column... be sure to check that the forecast valid time column is actually present and contains sane values. If not, error out with a useful error message.
For MET #1810:
- Enhance tc_gen to verify the probabilities contained within the tropical wx shapefiles.
- Use the shapefile areas, not the points or lines.
- In general, NHC will verify basin-wide but we should think about how to handle user-specified masking regions.
- Write probabilistic output to a .stat file.
@KathrynNewman and @halperin-erau will coordinate on refining and/or creating a METplus Use Case that demonstrates running tc_gen for this project.
May need to create METplus GitHub issue for that new use case.
Ideally, also create a METplus-Training module to demonstrate running that use case, and potentially modifying it in some way, and talk about interpreting results.

JohnHalleyGotway commented 2 years ago

@halperin-erau question about the sample files you provided:

From eal152020-model.dat, I understand that the line below is a forecast that genesis will occur at (31, -78.5) on 8/31/2020 at 06Z. And there's a 20% chance this will actually occur within 120 hours of the issue time of 8/29/2020 at 18Z.
```
AL, 77, 2020082918, GN, GFSO, 120, 310N,  785W,  20,  120, JHT, genFcst, 2020083106, ,  0, 034,
```
From eal152020.dat, I understand that the line below is also a genesis forecast but the predicted genesis location and time are NOT included. And tc_gen should just ignore these lines.
```
AL, 15, 2020082918, GN, OFCL, 120,     ,      ,  20,  120, JLB, genFcst, , ,  0, 034, 
```
QUESTION 1: Is it the location that really matters rather than the time? For example, if the location is given but no time, should we include it in the verification?

QUESTION 2: From eal152020.dat, I'm wondering about these 2 lines:
```
AL, 15, 2020083018, GN, OFCL,   0,     ,      , 100,    0,    , invest, 2020083018, al902020,  1, 034, 
AL, 15, 2020083118, GN, OFCL,   0,     ,      , 100,    0,    , genesis, 2020083118, al152020,  1, 034,
```
These have GN in the 4th column, indicating genesis e-deck info. But column 12 has "invest" and "genesis" instead of "genFcst". I assume we want to verify only GN lines that have "genFcst" in the 12th column. @halperin-erau can you please confirm?

Checking the edeck documentation, I see the following details:

TC GENESIS PROBABILITY

ProbItem - time period, ie genesis during next xxx hours, 0 for genesis or dissipate event, 0 - 240 hrs,  4 char.
Initials - forecaster initials,  3 char.
GenOrDis - "invest", "genFcst", "genesis", "disFcst" or "dissipate"
DTG - Genesis or dissipated event Date-Time-Group, yyyymmddhhmm: 0000010100 through 9999123123,  12 char.
stormID - cyclone ID if the genesis developed into an invest area or cyclone ID of dissipated TC, e.g. al032014
min - minutes, associated with DTG in common fields (3rd field in record), 0 - 59 min
genesisNum - genesis number, if spawned from a genesis area (1-999)
undefined - TBD
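
To make the column layout concrete, here is a minimal parsing sketch based only on the sample lines quoted above. The function names and the returned dictionary keys are hypothetical, the 10-digit DTG format is assumed from the samples (the documentation mentions 12 characters), and lines without a location or a forecast genesis valid time are simply skipped. It is not how tc_gen reads e-decks.

```python
from datetime import datetime

def parse_latlon(text):
    """Convert ATCF tenths-of-a-degree strings such as '310N' or '785W' to signed degrees."""
    text = text.strip()
    if not text:
        return None
    value = int(text[:-1]) / 10.0
    return -value if text[-1] in "SW" else value

def parse_genesis_line(line):
    """Return a dict for a usable GN/genFcst line, or None if it should be skipped."""
    cols = [c.strip() for c in line.split(",")]
    if len(cols) < 13 or cols[3] != "GN" or cols[11] != "genFcst":
        return None
    lat, lon = parse_latlon(cols[6]), parse_latlon(cols[7])
    if lat is None or lon is None or not cols[12]:
        return None  # location or forecast genesis valid time is missing
    return {
        "basin": cols[0],
        "init": datetime.strptime(cols[2], "%Y%m%d%H"),
        "model": cols[4],
        "lat": lat,
        "lon": lon,
        "prob_percent": int(cols[8]),
        "lead_hours": int(cols[9]),  # ProbItem: genesis during the next N hours
        "initials": cols[10],
        # Assumes a 10-digit yyyymmddhh value, as in the sample data.
        "genesis_valid": datetime.strptime(cols[12], "%Y%m%d%H"),
    }

model_line = "AL, 77, 2020082918, GN, GFSO, 120, 310N,  785W,  20,  120, JHT, genFcst, 2020083106, ,  0, 034,"
ofcl_line = "AL, 15, 2020082918, GN, OFCL, 120,     ,      ,  20,  120, JLB, genFcst, , ,  0, 034, "
invest_line = "AL, 15, 2020083018, GN, OFCL,   0,     ,      , 100,    0,    , invest, 2020083018, al902020,  1, 034, "

forecast = parse_genesis_line(model_line)
print(forecast)
print(parse_genesis_line(ofcl_line), parse_genesis_line(invest_line))
```

Only the model line yields a record; the OFCL line lacks a location and valid time, and the invest line is not a "genFcst" entry.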

halperin-erau commented 2 years ago

Hi John,

That's correct -- we only want to verify GN lines with "genFcst" in the 12th column.

Thanks, Dan

JohnHalleyGotway commented 2 years ago

Notes from 11/11/21 project meeting:

Consider adding a PROBGEN_MPR output line to list individual matched pair info.
TC-Gen is now writing probabilistic output, although the results are not yet accurate.
Need to correctly determine whether each forecast probability actually verifies using the Dev and Ops methodology.
Need to commit the unit_tc_gen.xml updates.
Need to update the user's guide documentation.
Need to check if we need a METplus issue for handling the new -edeck option for tc_gen.
Also need a METplus issue for a new use case demonstrating this using data provided by @halperin-erau.

JohnHalleyGotway commented 2 years ago

@KathrynNewman and @halperin-erau, I'm working on the scoring logic for the genesis probabilities and am having a tough time understanding the difference between the DEV and OPS methods and whether or not they apply to this data.

Let's work through an example. Here's a group of genesis probabilities:

AL, 77, 2020083000, GN, GFSO,  48, 319N,  770W,  20,   48, JHT, genFcst, 2020083112, ,  0, 034, 
AL, 77, 2020083000, GN, GFSO, 120, 319N,  770W,  30,  120, JHT, genFcst, 2020083112, ,  0, 034, 
AL, 77, 2020083000, GN, GFSO, 168, 319N,  770W,  30,  168, JHT, genFcst, 2020083112, ,  0, 034,

So at 8/30/2020 at 00Z the GFSO model predicts that genesis will occur at (31.9, -77) on 8/31/2020 at 12Z. There's a 20% chance it'll happen within 48 hours: i.e. between 2020083000 and 2020090100. There's a 30% chance it'll happen within 120 hours: i.e. between 2020083000 and 2020090400. There's a 30% chance it'll happen within 168 hours: i.e. between 2020083000 and 2020090600.
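
The window end times quoted above follow directly from adding each lead time to the initialization time. A small sketch of that arithmetic (the helper name is hypothetical):

```python
from datetime import datetime, timedelta

def probability_window(init, lead_hours):
    """Return the (start, end) of the period covered by one genesis probability."""
    return init, init + timedelta(hours=lead_hours)

init = datetime(2020, 8, 30, 0)
windows = {lead: probability_window(init, lead) for lead in (48, 120, 168)}

for lead, (start, end) in windows.items():
    print(f"{lead:3d} h: {start:%Y%m%d%H} to {end:%Y%m%d%H}")
```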

We inspect the BEST tracks and see this BEST genesis event:

DEBUG 6: [Genesis 1 of 1] GenesisInfo: StormId = "AL152020", Technique = "BEST", GenesisTime = "20200831_060000", InitTime = "NA", LeadTime = "000000", Lat = 30.60000, Lon = -78.20000, DLand = 157.81879

We apply the genesis_match_radius and see that the BEST track genesis location (30.6, -78.2) is within 500 km of the predicted location (31.9, -77), so it falls within the genesis_match_radius:

genesis_match_radius = 500;
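
For reference, the radius check can be reproduced with a great-circle distance. The sketch below uses the haversine formula with a mean Earth radius of 6371 km; both choices are assumptions of this example and may differ from the distance calculation tc_gen performs internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two points given in signed degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

genesis_match_radius = 500.0  # km, as in the config value above

# Predicted genesis location vs. BEST track genesis location from the example.
distance = great_circle_km(31.9, -77.0, 30.6, -78.2)
is_match = distance <= genesis_match_radius
print(f"distance = {distance:.1f} km, within radius: {is_match}")
```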

In addition, this genesis event occurs in all 3 time windows listed above. So the event verifies for all 3 probabilities. This seems like the simplest logic to me.

It really only uses the "genesis_match_radius" configuration option and not any of the others like "genesis_match_window", "dev_hit_radius", "dev_hit_window", or "ops_hit_window". In addition, it only uses the predicted genesis location and NOT the predicted genesis time.

This logic would result in a single Nx2 probabilistic contingency table rather than separate ones for a DEV method vs OPS method.

Should I proceed with this simple logic? Or should I actually be including the predicted genesis time in the verification in some way... along with the other dev/ops configuration options?

halperin-erau commented 2 years ago

Hi John,

We'll want to use the same matching logic that we have in the OPS scoring method, but you're correct that the dev_hit_radius and *hit_window config options will not be used. Working through the example you provided, our forecast data are:

AL, 77, 2020083000, GN, GFSO, 48, 319N, 770W, 20, 48, JHT, genFcst, 2020083112, , 0, 034,
AL, 77, 2020083000, GN, GFSO, 120, 319N, 770W, 30, 120, JHT, genFcst, 2020083112, , 0, 034,
AL, 77, 2020083000, GN, GFSO, 168, 319N, 770W, 30, 168, JHT, genFcst, 2020083112, , 0, 034,

So at 8/30/2020 at 00Z the GFSO model predicts that genesis will occur at (31.9, -77) on 8/31/2020 at 12Z. There's a 20% chance it'll happen within 48 hours: i.e. between 2020083000 and 2020090100. There's a 30% chance it'll happen within 120 hours: i.e. between 2020083000 and 2020090400. There's a 30% chance it'll happen within 168 hours: i.e. between 2020083000 and 2020090600.

For most applications, we'll assume that the genesis_match_window should begin and end at zero (i.e., the a- or b-deck files should contain an exact time match with the forecast genesis valid time). Here our forecast genesis valid time from the e-deck is 2020083112. After checking all available b-decks, we find in bal152020.dat:

AL, 15, 2020083112, , BEST, 0, 315N, 774W, 30, 1009, TD, 34, NEQ, 0, 0, 0, 0, 1013, 100, 60, 35, 0, L, 0, , 0, 0, INVEST, S, 0, , 0, 0, 0, 0, genesis-num, 034,

TC-Gen should use the genesis_match_radius to compare the forecast genesis location (31.9, -77.0) with the location of the storm at the corresponding time in the b-deck (31.5, -77.4). This distance is within our

genesis_match_radius = 500;

so we can match the forecast to AL152020. Now, we find the best-track genesis information for AL152020:

DEBUG 6: [Genesis 1 of 1] GenesisInfo: StormId = "AL152020", Technique = "BEST", GenesisTime = "20200831_060000", InitTime = "NA", LeadTime = "000000", Lat = 30.60000, Lon = -78.20000, DLand = 157.81879

Then we verify whether best-track genesis occurred within the time periods specified by the data (48, 120, 168 h) by comparing the forecast init time to the best-track genesis time. The time periods in the e-deck file effectively replace our ops_hit_window, where the window always begins at zero and ends at the time listed in the e-deck.

We need to apply the matching logic because (1) there will be forecasts that occur before an Invest is declared operationally, and (2) there will be forecasts of genesis that are too early. The forecast genesis valid time may be so early that there is no corresponding entry at that time in the b-decks. If that's the case, the CARQ entries at forecast hour zero in the a-decks should also be checked for a potential match, as TC-Gen does for the deterministic genesis forecast matching.

Dan

JohnHalleyGotway commented 2 years ago

@halperin-erau great, thanks for clarifying. I'll use the same BEST track/operational track logic that we're using for the categorical approach. And as you point out, that logic uses these config options (default values listed):

genesis_match_point_to_track = TRUE;
genesis_match_radius = 500;
genesis_match_window = { beg = 0; end = 0 };

So each genesis probability forecast either matches a BEST track or it doesn't. If not, then the forecast obviously does not verify. If it does, then in order to verify, the BEST track genesis time must fall between the forecast initialization time and the initialization time plus the lead time (i.e. 48, 120, or 168 hours). And that's it.
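
A minimal sketch of that decision rule, assuming the track matching has already been done and that both ends of the window are inclusive (the thread does not state this). The function name and the second, later genesis time are hypothetical and only illustrate a case where the shortest window misses.

```python
from datetime import datetime, timedelta

def probability_verifies(init, lead_hours, matched_genesis_time):
    """Decide the observed outcome for one genesis probability.

    matched_genesis_time is the BEST track genesis time of the storm the
    forecast was matched to, or None when no track matched.
    """
    if matched_genesis_time is None:
        return False  # unmatched forecasts do not verify
    window_end = init + timedelta(hours=lead_hours)
    return init <= matched_genesis_time <= window_end

init = datetime(2020, 8, 30, 0)
al15_genesis = datetime(2020, 8, 31, 6)   # BEST genesis time of AL152020
late_genesis = datetime(2020, 9, 2, 12)   # hypothetical later genesis, for contrast

outcomes = {lead: probability_verifies(init, lead, al15_genesis) for lead in (48, 120, 168)}
late_outcomes = {lead: probability_verifies(init, lead, late_genesis) for lead in (48, 120, 168)}
unmatched = probability_verifies(init, 48, None)
print(outcomes, late_outcomes, unmatched)
```

For the AL152020 example all three probabilities verify, while the hypothetical later genesis would verify only for the 120 and 168 hour windows.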

So we never actually compare the predicted genesis location with actual BEST track genesis location... making sure that they're close enough to each other? Right?

So there's only one set of verification logic, not two, not DEV and OPS, right? If so, I'll plan to set FCST_VAR = OBS_VAR = "PROBGENESIS".

halperin-erau commented 2 years ago

> So we never actually compare the predicted genesis location with actual BEST track genesis location... making sure that they're close enough to each other? Right?

Correct -- we only compare the predicted genesis location to the storm location in the a- or b-decks at the forecast genesis valid time for matching purposes.

> So there's only one set of verification logic, not two, not DEV and OPS, right? If so, I'll plan to set FCST_VAR = OBS_VAR = "PROBGENESIS".

Correct -- the verification logic here is similar to the OPS method, except that instead of defining ops_hit_window in the config file, the verification window is defined by the forecast period in the e-deck file.

JohnHalleyGotway commented 2 years ago

Needed doc updates:

[x] Added -edeck command line option.
[x] New TC-Gen config options for prob_genesis_thresh and output_flag (pct, pstd, pjc, prc).
[x] Added PROB_LEAD and PROB_VAL to the GENMPR line type.
[x] Description of probgen vx.
