JCSDA-internal / eva

Evaluation and Verification of the Analysis
Apache License 2.0
5 stars 12 forks source link

Add avg and sdv calculations across a given time span #76

Closed EdwardSafford-NOAA closed 1 year ago

EdwardSafford-NOAA commented 1 year ago

Description

For DA Monitoring 2.0 we need the ability to calculate avg and std values in a dataset for a given variable across a specified time span. Typically these time spans are 1 day (4 cycles) and 30 days (180 cycles) and are plotted by channel (x axis).

Add necessary function(s) in utilities/stats.py to accomplish this and extend data/mon_data_space.py as necessary.

@EdwardSafford-NOAA will complete this work.

Requirements

Given inputs of a dataset containing data from multiple cycles, a variable in the dataset, and a range, return datasets containing the average and standard deviation values for the variable. The range is to start with the latest cycle in the input dataset.

Additionally data/mon_data_space.py will need to be extended to parse yaml instructions to create these avg and sdv values and then create a line plot with those values by channel (or other specified variable).

Acceptance Criteria (Definition of Done)

Given a yaml file containing data files for a range of cycle times and instructions for plotting the avg and sdv values of a variable (i.e. obs count) for a time span that includes some and all of the cycle times (2 cases), create a line plot with the values arranged by channel.

Dependencies

None

EdwardSafford-NOAA commented 1 year ago

Progress update.

I'm working on constructing summary plots for the monitoring. They include both single cycle plots (typically last 4 cycles) and averages (4 cycles, 28 cycles, 120 cycles). I'm starting plotting the last 4 cycles together. I've figured out what I need is most a transform operation as contained in src/eva/transform. It doesn't look like what I need is fully implemented, but it's hard to tell since I'm really just learning both xarray and the mapping from yaml to src/eva/transform. So I'm hard coding my way through the xarray manipulation within src/eva/data/mon_data_space to get what I need working and then I'm going to map that to the capabilities in transform and extend the transform capabilities as needed.

I'm reading in multiple binary files and paring down the dataset to the requested vars from the yaml, which is just obs count to keep it simple. What I need to do is select the count variable for the first 4 cycles and merge that into the dataset as a new variable so it can be plotted. What I'm doing is:

ds_cnt = ds[['GsiIeee::count']] ds_cnt1 = ds_cnt.sel(times='2015051418') ds_cnt1 = ds_cnt1.rename(name_dict={'GsiIeee::count':'GsiIeee::count1'}) ds = ds.merge(ds_cnt1)

And that works:

xarray.Dataset { dimensions: times = 2 ; channels = 19 ; regions = 1 ;

variables: float32 GsiIeee::count(times, channels, regions) ; datetime64[ns] GsiIeee::cycle(times, channels, regions) ; float64 GsiIeee::channel(times, channels, regions) ; int64 channels(channels) ; int64 regions(regions) ; <U10 times(times) ; float32 GsiIeee::count1(channels, regions) ;

But when I try to make a lineplot with x=channels and y=count1 things go pear shaped: IndexError: boolean index did not match indexed array along dimension 0; dimension is 19 but corresponding boolean dimension is 38

I assume that what's happening is that the channel coordinate is getting duplicated during the merge. I've tried a number of options on the merge, so far without success. I'll keep pounding sand next week.

EdwardSafford-NOAA commented 1 year ago

Some success: image

This the first of 4 plots in the RadMon summary plots. The next step is to generalize this solution using src/transforms and yaml. I'm reasonably confident this same solution can be expanded to create the multi-cycle stats needed for the other summary plots.

EdwardSafford-NOAA commented 1 year ago

Got a little smarter and figured out how to add channel to the dataset as a single dimension variable, if 'channel' is a member of the variable list. I'm including that in changes to data/mon_data_space.py.

It's a little tricky because the channel information comes from the control file and channel is a dimension for all data in the binary file(s). But if plotting by channel is desired then channel needs to be included the variable list in the yaml file and the resulting data_collection.

EdwardSafford-NOAA commented 1 year ago

I've gotten a new transform function, select_single_cycle, working and am able to generate the 4 cycle plot (above) using it (and yaml). Next step is to use that same approach to generate multi-cycle avd and sdv values and add them to data_collections for plotting.

EdwardSafford-NOAA commented 1 year ago

This is the second type of RadMon summary plot (lower figure) showing the total bias correction for the last cycle and the average of the last 4 cycles. The data for the single cycle and the 4 cycle average were created by transform.

summary hirs4_metop-a(1)

The logic to selecting a variable for a single time and creating an average over a time span is similar enough that I'd like to combine them into a single routine. I'm going to work on that next.

EdwardSafford-NOAA commented 1 year ago

In the process of trying to generate the 3rd panel of the RadMon summary plots a long-standing bug was discovered which produces inaccurate standard deviation plots. The problem is that the extracted RadMon data does not include standard deviation data, nor does it appear to ever have. Issue 73 (https://github.com/NOAA-EMC/GSI-Monitor/issues/73) has been opened to address this. It might be worth ignoring for now though (in the entire RadMon image suite there is just the single panel on the summary plots that displays any std information) and focusing efforts on Monitoring 2.0.

At any rate the new transform select_time is working and allows for time-based selection by single cycle or by slice.

EdwardSafford-NOAA commented 1 year ago

Per above comments the possible work on this has been done. Anything more will require re-engineering the contents of the RadMon binary data files.