dtcenter / MET

Model Evaluation Tools
https://dtcenter.org/community-code/model-evaluation-tools-met
Apache License 2.0

Enhance TC-Gen to verify genesis probabilities from ATCF e-deck files. #1809

Closed JohnHalleyGotway closed 2 years ago

JohnHalleyGotway commented 3 years ago

Describe the New Feature

tc_gen_probabilistic_algorithm_v2.pdf

Please see the attached slides, which illustrate 2 main changes that are required for TC-Genesis verification. This issue describes the first of those 2 enhancements: enhance tc_gen to verify genesis probabilities from ATCF e-deck files. NHC identifies disturbances and issues 3 probability forecasts for how likely it is that each disturbance will develop into a tropical storm. The probabilities cover 3 time windows of 48, 120, and 168 hours.

@halperin-erau has provided some sample data containing these e-deck probabilities. However, as of May 2021, their format is still under development. In the existing and historical versions of these files, both the lat/lon location and the valid timestamps are absent. If any of those columns are missing from the input data, tc_gen should print a warning message and ignore that input.

This task is to enhance tc_gen to parse those e-deck probabilities and use them to populate Nx2 probabilistic contingency tables. Create 1 table for each of the time windows (48, 120, and 168) but make that a user-configurable option. Write the resulting probabilistic output.
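For readers unfamiliar with the Nx2 layout, the following is a minimal Python sketch, not TC-Gen code, of how probability forecasts and yes/no outcomes could be binned into an Nx2 table. The function name `nx2_table` and the bin edges are assumptions made for illustration; the e-deck stores probabilities as percentages (e.g. 20, 30), while this sketch takes fractions in [0, 1].

```python
from bisect import bisect_right

def nx2_table(probs, events, thresholds=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Count (event, non-event) pairs per probability bin.

    probs      -- forecast probabilities in [0, 1]
    events     -- booleans, True when genesis was observed
    thresholds -- bin edges (hypothetical; first edge 0.0, last edge 1.0)
    """
    n_bins = len(thresholds) - 1
    # One row per probability bin: [event count, non-event count].
    table = [[0, 0] for _ in range(n_bins)]
    for p, observed in zip(probs, events):
        # Find the bin whose lower edge is <= p; p == 1.0 goes in the last bin.
        row = min(bisect_right(thresholds, p) - 1, n_bins - 1)
        table[row][0 if observed else 1] += 1
    return table

table = nx2_table([0.20, 0.30, 0.30, 0.80], [True, True, False, True])
```

One such table would be built per time window (48, 120, 168), with the set of windows left configurable as described above.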

Be sure to subset output by basin, time window, and perhaps forecaster initials. Support the application of both the development and operational logic, but note that the operational logic will be used by NHC.

Acceptance Testing

List input data types and sources. Describe tests required for new functionality.

Time Estimate

Estimate the amount of work required here. Issues should represent approximately 1 to 3 days of work.

Sub-Issues

Consider breaking the new feature down into sub-issues.

Relevant Deadlines

The project for 7790901 originally ended in August, 2021. After the no-cost extension, the updated date is February 28th, 2022.

Funding Source

7790901

Define the Metadata

Assignee

Labels

Projects and Milestone

Define Related Issue(s)

Consider the impact to the other METplus components.

New Feature Checklist

See the METplus Workflow for details.

JohnHalleyGotway commented 3 years ago

On 5/24/21, Dan H, Tara, and John HG met to discuss these details. Dan provided the additional followup information below:

Thank you for the discussion today. I've attached some sample data that may be helpful as development continues.

The al*.dat files are TWO sample files. I have included at least one developing and one non-developing disturbance during the years when NHC issued 48 h, 48/120 h, and 48/120/168 h probabilities.

eal152020.dat is the e-deck that NHC wrote without any modification by me. Only the lines with "GN" in the 4th column are relevant to us. Note that the lat/lon information is blank and there is no information regarding forecast genesis time.

eal152020-model.dat is a modified version of eal152020.dat to illustrate a hypothetical probabilistic forecast from post-processed GFSO output. This is the format that TC-Gen would use. Only the GN lines are included. The storm number in the 2nd column was changed to an arbitrary number. Lat/lon information is included in the 7th/8th columns. Forecast genesis valid time is included in the 13th column.

Let me know if you have any questions about the data.

I looked in a few e-deck files and did not see any of the "GS" (genesis shape) entries. I'll confirm with NHC regarding whether they would like to verify the shape files.

edeck-two-sample-data.tar.txt

JohnHalleyGotway commented 2 years ago

@Kathryn Newman I want to get going on MET #1809. Reading through the details, the first thing I want to figure out is WHERE I should do this work. It could be in tc_pairs or tc_gen. tc_pairs already includes an -edeck command line option for verifying probability of RI and writes probabilistic vx output for that. tc_gen already includes the genesis algorithm and is meant to handle genesis "stuff" but it does NOT currently include an -edeck command line argument. So should we have 1 tool (tc_pairs) that processes all the -edeck data? Or split that across 2 tools (tc_pairs and tc_gen) depending on the contents of the -edeck data file?

JohnHalleyGotway commented 2 years ago

Held a project meeting on 10/28/21 and laid out plans:

JohnHalleyGotway commented 2 years ago

@halperin-erau question about the sample files you provided:

Checking the edeck documentation, I see the following details:

TC GENESIS PROBABILITY

ProbItem - time period, ie genesis during next xxx hours, 0 for genesis or dissipate event, 0 - 240 hrs,  4 char.
Initials - forecaster initials,  3 char.
GenOrDis - "invest", "genFcst", "genesis", "disFcst" or "dissipate"
DTG - Genesis or dissipated event Date-Time-Group, yyyymmddhhmm: 0000010100 through 9999123123,  12 char.
stormID - cyclone ID if the genesis developed into an invest area or cyclone ID of dissipated TC, e.g. al032014
min - minutes, associated with DTG in common fields (3rd field in record), 0 - 59 min
genesisNum - genesis number, if spawned from a genesis area (1-999)
undefined - TBD

halperin-erau commented 2 years ago

Hi John,

That's correct -- we only want to verify GN lines with "genFcst" in the 12th column.

Thanks, Dan
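As a rough illustration of the filtering described in this exchange, here is a hedged Python sketch that keeps only GN lines with "genFcst" in the 12th column and skips records lacking a location or genesis valid time. The function name, the returned dictionary keys, and the reading of the 9th column as the probability and the 10th as the time window are assumptions based on the sample lines in this thread, not on TC-Gen source code.

```python
def parse_genesis_line(line):
    """Parse one e-deck line; return a dict for usable GN/genFcst records, else None."""
    cols = [c.strip() for c in line.split(",")]
    # Keep only genesis probability lines: "GN" in column 4, "genFcst" in column 12.
    if len(cols) < 13 or cols[3] != "GN" or cols[11] != "genFcst":
        return None
    lat_str, lon_str, valid = cols[6], cols[7], cols[12]
    if not lat_str or not lon_str or not valid:
        # Location or genesis valid time is blank: the caller should warn and skip.
        return None
    # ATCF stores tenths of a degree with a hemisphere suffix, e.g. 319N, 770W.
    lat = int(lat_str[:-1]) / 10.0 * (-1 if lat_str[-1] == "S" else 1)
    lon = int(lon_str[:-1]) / 10.0 * (-1 if lon_str[-1] == "W" else 1)
    return {
        "basin": cols[0],
        "init": cols[2],
        "model": cols[4],
        "lat": lat,
        "lon": lon,
        "prob": int(cols[8]),           # percent (assumed column meaning)
        "window_hours": int(cols[9]),   # ProbItem time period (assumed column meaning)
        "initials": cols[10],
        "genesis_valid": valid,
    }

rec = parse_genesis_line(
    "AL, 77, 2020083000, GN, GFSO,  48, 319N,  770W,  20,   48, "
    "JHT, genFcst, 2020083112, ,  0, 034, "
)
```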

JohnHalleyGotway commented 2 years ago

Notes from 11/11/21 project meeting:

JohnHalleyGotway commented 2 years ago

@KathrynNewman and @halperin-erau, I'm working on the scoring logic for the genesis probabilities and am having a tough time understanding the difference between the DEV and OPS methods and whether or not they apply to this data.

Let's work through an example. Here's a group of genesis probabilities:

AL, 77, 2020083000, GN, GFSO,  48, 319N,  770W,  20,   48, JHT, genFcst, 2020083112, ,  0, 034, 
AL, 77, 2020083000, GN, GFSO, 120, 319N,  770W,  30,  120, JHT, genFcst, 2020083112, ,  0, 034, 
AL, 77, 2020083000, GN, GFSO, 168, 319N,  770W,  30,  168, JHT, genFcst, 2020083112, ,  0, 034, 

So on 8/30/2020 at 00Z the GFSO model predicts that genesis will occur at (31.9, -77) on 8/31/2020 at 12Z. There's a 20% chance it'll happen within 48 hours: i.e. between 2020083000 and 2020090100. There's a 30% chance it'll happen within 120 hours: i.e. between 2020083000 and 2020090400. There's a 30% chance it'll happen within 168 hours: i.e. between 2020083000 and 2020090600.

We inspect the BEST tracks and see this BEST genesis event:

DEBUG 6: [Genesis 1 of 1] GenesisInfo: StormId = "AL152020", Technique = "BEST", GenesisTime = "20200831_060000", InitTime = "NA", LeadTime = "000000", Lat = 30.60000, Lon = -78.20000, DLand = 157.81879

We apply the genesis_match_radius and see that the BEST track genesis location (30.6, -78.2) lies within 500 km of the predicted location (31.9, -77), so it satisfies the genesis_match_radius:

genesis_match_radius = 500;

In addition, this genesis event occurs in all 3 time windows listed above. So the event verifies for all 3 probabilities. This seems like the simplest logic to me.
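The window arithmetic in this example can be checked with a short Python sketch. The function name `window_hits` is hypothetical, and treating both window endpoints as inclusive is an assumption; the thread does not say how TC-Gen handles the boundaries.

```python
from datetime import datetime, timedelta

def window_hits(init_str, genesis_str, windows=(48, 120, 168)):
    """Return {window_hours: True/False} for a BEST genesis time.

    init_str    -- forecast initialization time, YYYYMMDDHH
    genesis_str -- BEST track genesis time, YYYYMMDD_HHMMSS
    """
    init = datetime.strptime(init_str, "%Y%m%d%H")
    genesis = datetime.strptime(genesis_str, "%Y%m%d_%H%M%S")
    hits = {}
    for hours in windows:
        window_end = init + timedelta(hours=hours)
        # Inclusive endpoints are an assumption of this sketch.
        hits[hours] = init <= genesis <= window_end
    return hits

hits = window_hits("2020083000", "20200831_060000")
```

For the values above, the genesis time of 20200831_060000 falls 30 hours after initialization, so it lies inside all three windows.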

It really only uses the "genesis_match_radius" configuration option and not any of the others like "genesis_match_window", "dev_hit_radius", "dev_hit_window", or "ops_hit_window". In addition, it only uses the predicted genesis location and NOT the predicted genesis time.

This logic would result in a single Nx2 probabilistic contingency table rather than separate ones for a DEV method vs OPS method.

Should I proceed with this simple logic? Or should I actually be including the predicted genesis time in the verification in some way... along with the other dev/ops configuration options?

halperin-erau commented 2 years ago

Hi John,

We'll want to use the same matching logic that we have in the OPS scoring method, but you're correct that the dev_hit_radius and *hit_window config options will not be used. Working through the example you provided, our forecast data are:

AL, 77, 2020083000, GN, GFSO, 48, 319N, 770W, 20, 48, JHT, genFcst, 2020083112, , 0, 034,
AL, 77, 2020083000, GN, GFSO, 120, 319N, 770W, 30, 120, JHT, genFcst, 2020083112, , 0, 034,
AL, 77, 2020083000, GN, GFSO, 168, 319N, 770W, 30, 168, JHT, genFcst, 2020083112, , 0, 034,

So on 8/30/2020 at 00Z the GFSO model predicts that genesis will occur at (31.9, -77) on 8/31/2020 at 12Z. There's a 20% chance it'll happen within 48 hours: i.e. between 2020083000 and 2020090100. There's a 30% chance it'll happen within 120 hours: i.e. between 2020083000 and 2020090400. There's a 30% chance it'll happen within 168 hours: i.e. between 2020083000 and 2020090600.

For most applications, we'll assume that the genesis_match_window should begin and end at zero (i.e., the a- or b-deck files should contain an exact time match with the forecast genesis valid time). Here our forecast genesis valid time from the e-deck is 2020083112. After checking all available b-decks, we find in bal152020.dat:

AL, 15, 2020083112, , BEST, 0, 315N, 774W, 30, 1009, TD, 34, NEQ, 0, 0, 0, 0, 1013, 100, 60, 35, 0, L, 0, , 0, 0, INVEST, S, 0, , 0, 0, 0, 0, genesis-num, 034,

TC-Gen should use the genesis_match_radius to compare the forecast genesis location (31.9, -77.0) with the location of the storm at the corresponding time in the b-deck (31.5, -77.4). This distance is within our

genesis_match_radius = 500;

so we can match the forecast to AL152020. Now, we find the best-track genesis information for AL152020:

DEBUG 6: [Genesis 1 of 1] GenesisInfo: StormId = "AL152020", Technique = "BEST", GenesisTime = "20200831_060000", InitTime = "NA", LeadTime = "000000", Lat = 30.60000, Lon = -78.20000, DLand = 157.81879

Then we verify whether best-track genesis occurred within the time periods specified by the data (48, 120, 168 h) by comparing the forecast init time to the best-track genesis time. The time periods in the e-deck file effectively replace our ops_hit_window, where the window always begins at zero and ends at the time listed in the e-deck.

We need to apply the matching logic because (1) there will be forecasts that occur before an Invest is declared operationally, and (2) there will be forecasts of genesis that are too early. The forecast genesis valid time may be so early that there is no corresponding entry at that time in the b-decks. If that's the case, the CARQ entries at forecast hour zero in the a-decks should also be checked for a potential match, as TC-Gen does for the deterministic genesis forecast matching.

Dan
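The two steps described in this comment (match by location at the forecast genesis valid time, then verify by best-track genesis time) can be sketched in Python as follows. This is an illustration under stated assumptions, not the TC-Gen implementation: the function names, the `fcst` dictionary layout, the haversine formula with a 6371 km mean Earth radius, and the inclusive window endpoints are all choices made for this sketch.

```python
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption, not MET's constant

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in km between two (lat, lon) points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi, dlam = p2 - p1, radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def verify_prob(fcst, track_point, best_genesis_time, match_radius_km=500.0):
    """Return True when the forecast matches a storm and genesis falls in the window.

    fcst        -- dict with init (datetime), lat, lon, window_hours (hypothetical)
    track_point -- (lat, lon) of the storm at the forecast genesis valid time,
                   or None when no a-/b-deck entry exists at that time
    """
    # Step 1: match the forecast to a storm by location at the genesis valid time.
    if track_point is None:
        return False
    if great_circle_km(fcst["lat"], fcst["lon"], *track_point) > match_radius_km:
        return False
    # Step 2: best-track genesis must fall between init and init + window.
    window_end = fcst["init"] + timedelta(hours=fcst["window_hours"])
    return fcst["init"] <= best_genesis_time <= window_end

fcst = {"init": datetime(2020, 8, 30, 0), "lat": 31.9, "lon": -77.0, "window_hours": 48}
event = verify_prob(fcst, (31.5, -77.4), datetime(2020, 8, 31, 6))
```

With the numbers from this example, the forecast location and the b-deck location at 2020083112 are roughly 58 km apart, well inside the 500 km radius, and the best-track genesis time falls inside the 48-hour window.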

JohnHalleyGotway commented 2 years ago

@halperin-erau great, thanks for clarifying. I'll use the same BEST track/operational track logic that we're using for the categorical approach. And as you point out, that logic uses these config options (default values listed):

genesis_match_point_to_track = TRUE;
genesis_match_radius = 500;
genesis_match_window = { beg = 0; end = 0 };

So each genesis probability forecast either matches a BEST track or it doesn't. If not, then the forecast obviously does not verify. If it does, then in order to verify, the BEST track genesis time must fall between the forecast initialization time and the initialization time plus the forecast period (i.e. 48, 120, or 168 hours). And that's it.

So we never actually compare the predicted genesis location with actual BEST track genesis location... making sure that they're close enough to each other? Right?

So there's only one set of verification logic, not two, not DEV and OPS, right? If so, I'll plan to set FCST_VAR = OBS_VAR = "PROBGENESIS".

halperin-erau commented 2 years ago

> So we never actually compare the predicted genesis location with actual BEST track genesis location... making sure that they're close enough to each other? Right?

Correct -- we only compare the predicted genesis location to the storm location in the a- or b-decks at the forecast genesis valid time for matching purposes.

> So there's only one set of verification logic, not two, not DEV and OPS, right? If so, I'll plan to set FCST_VAR = OBS_VAR = "PROBGENESIS".

Correct -- the verification logic here is similar to the OPS method, except that instead of defining genesis_hit_window in the config file, the verification window is defined by the forecast period in the e-deck file.

JohnHalleyGotway commented 2 years ago

Needed doc updates: