AusClimateService / unseen

Code for UNprecedented Simulated Extremes using ENsembles (UNSEEN) analysis.
GNU General Public License v3.0

Add option to save independence test outputs to a netcdf file #73

Closed · stellema closed this issue 19 hours ago

stellema commented 6 days ago

The independence test results (mean_correlations and null_correlation_bounds) are usually only saved as a scatter plot, but it would be useful to store the results themselves (e.g., to mask dependent lead times at each grid point). Here is an example of how we could convert the dictionaries to a single dataset:

import numpy as np
import xarray as xr
from unseen.independence import run_tests

# Create example data array (with init_dates every 6 months)
coords = dict(
    ensemble=np.arange(5),
    lead_time=np.arange(10),
    init_date=xr.cftime_range(start="2000-07-01", periods=10, freq="6MS"),
    lat=np.arange(0, 30),
    lon=np.arange(0, 30),
)
# DataArray with dimensions=(ensemble, lead_time, init_date, lat, lon)
da = xr.DataArray(
    np.random.uniform(size=[len(v) for v in coords.values()]),
    coords=coords,
    dims=list(coords),
)

# Run independence tests
mean_correlations, null_correlation_bounds = run_tests(
    da,
    init_dim="init_date",
    lead_dim="lead_time",
    ensemble_dim="ensemble",
)
print(mean_correlations)
print(null_correlation_bounds)

# Create dataset
da_mean_cor = xr.concat(
    [v.assign_coords({"month": k}) for k, v in mean_correlations.items()], "month"
)
ds = da_mean_cor.to_dataset(name="mean_correlations")

ds.coords["bounds"] = ["lower", "upper"]
ds["null_correlation_bounds"] = xr.DataArray(
    [[v for v in null_correlation_bounds[k]] for k in null_correlation_bounds],
    dims=["month", "bounds"],
)
print(ds)

output:

# mean_correlations
{1: <xarray.DataArray (lead_time: 10, lat: 30, lon: 30)> Size: 72kB
dask.array<mean_agg-aggregate, shape=(10, 30, 30), dtype=float64, chunksize=(10, 30, 30), chunktype=numpy.ndarray>
Coordinates:
  * lead_time  (lead_time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * lat        (lat) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29
  * lon        (lon) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29, 7: <xarray.DataArray (lead_time: 10, lat: 30, lon: 30)> Size: 72kB
dask.array<mean_agg-aggregate, shape=(10, 30, 30), dtype=float64, chunksize=(10, 30, 30), chunktype=numpy.ndarray>
Coordinates:
  * lead_time  (lead_time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * lat        (lat) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29
  * lon        (lon) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29}

# null_correlation_bounds
{1: (-0.21999999999999997, 0.38), 7: (-0.21999999999999997, 0.38)}

# ds
<xarray.Dataset> Size: 145kB
Dimensions:                  (lead_time: 10, lat: 30, lon: 30, month: 2,
                              bounds: 2)
Coordinates:
  * lead_time                (lead_time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * lat                      (lat) int64 240B 0 1 2 3 4 5 ... 24 25 26 27 28 29
  * lon                      (lon) int64 240B 0 1 2 3 4 5 ... 24 25 26 27 28 29
  * month                    (month) int64 16B 1 7
  * bounds                   (bounds) <U5 40B 'lower' 'upper'
Data variables:
    mean_correlations        (month, lead_time, lat, lon) float64 144kB dask.array<chunksize=(1, 10, 30, 30), meta=np.ndarray>
    null_correlation_bounds  (month, bounds) float64 32B -0.22 0.38 -0.22 0.38
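
As a rough sketch (assuming a lead time counts as dependent when its mean correlation falls outside the null bounds), the combined dataset could then be used for the masking use case like this:

# Sketch: mask out lead times whose mean correlation falls outside the
# null bounds (the exact pass/fail criterion here is an assumption)
lower = ds["null_correlation_bounds"].sel(bounds="lower")
upper = ds["null_correlation_bounds"].sel(bounds="upper")
independent = (ds["mean_correlations"] > lower) & (ds["mean_correlations"] < upper)
masked = ds["mean_correlations"].where(independent)
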
stellema commented 6 days ago

Alternatively, we could modify independence.run_tests to return this dataset?

Btw, null_correlation_bounds has the wrong dimensions in that example. It's an easy fix - I just need to specify dim='k' in the quantiles at the end of independence._get_null_correlation_bounds.
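
As a sketch, the fix would look something like this (the variable names and quantile levels inside _get_null_correlation_bounds are placeholders, not the actual code):

# Sketch of the fix: reduce over the resampling dimension "k" only,
# instead of collapsing all dimensions (names and levels are placeholders)
lower_bound = null_correlations.quantile(0.025, dim="k")
upper_bound = null_correlations.quantile(0.975, dim="k")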

DamienIrving commented 6 days ago

That looks like a nice solution!

I think we do want to modify independence.run_tests to return the dataset.

So in _main() in independence.py I think we want:

ds_independence = run_tests(
    da_fcst,
    init_dim=args.init_dim,
    lead_dim=args.lead_dim,
    ensemble_dim=args.ensemble_dim,
)

and then, instead of a call to create_plot, we want _main() to end the way similarity.py does, with whatever code is needed to write the dataset to file:

infile_logs = {args.fcst_file: ds_fcst.attrs["history"]}
ds_independence.attrs["history"] = fileio.get_new_log(infile_logs=infile_logs)

if args.output_chunks:
    ds_independence = ds_independence.chunk(args.output_chunks)

if "zarr" in args.outfile:
    fileio.to_zarr(ds_independence, args.outfile)
else:
    ds_independence.to_netcdf(args.outfile)

That means people can either use the independence.py command line program to get a netCDF (or Zarr) file with the independence test numbers in it, or they can open a notebook and call independence.run_tests() to get an xarray Dataset with the test numbers in it.
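
For example, the notebook path might look something like this (a sketch, assuming da_fcst is a forecast DataArray like da in the example above, and the output filename is arbitrary):

# Sketch of the notebook workflow once run_tests returns a Dataset
from unseen import independence

ds_independence = independence.run_tests(
    da_fcst,  # forecast DataArray with ensemble/init/lead dimensions (assumed)
    init_dim="init_date",
    lead_dim="lead_time",
    ensemble_dim="ensemble",
)
ds_independence.to_netcdf("independence_test.nc")  # arbitrary filename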

How does that sound?

stellema commented 5 days ago

Yup, that sounds good. What do you think would be the best default command line behaviour in terms of plotting vs saving the dataset (or both)?

DamienIrving commented 5 days ago

Hmm. Not 100% sure what the best approach would be.

Maybe it's easiest if it just produces a data file in all cases, and if someone wants to plot the data they can read in that data file, import independence and then use independence.create_plot()?

DamienIrving commented 5 days ago

Actually, we probably need an independence.point_plot() function (i.e. the existing create_plot() function) and a new independence.spatial_plot() function that plots a spatial map showing the first lead time that passes the independence test.
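
As a sketch, the core of spatial_plot() could be something like this (assuming the dataset layout from the example above; the pass criterion and filename are placeholders):

# Sketch of the proposed spatial_plot logic: map the first lead time at
# which the mean correlation falls within the null bounds (criterion assumed)
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.open_dataset("independence_test.nc")  # placeholder filename
lower = ds["null_correlation_bounds"].sel(bounds="lower")
upper = ds["null_correlation_bounds"].sel(bounds="upper")
passes = (ds["mean_correlations"] > lower) & (ds["mean_correlations"] < upper)
# idxmax returns the lead_time coordinate of the first True at each point;
# mask grid points where no lead time passes
first_pass = passes.idxmax("lead_time").where(passes.any("lead_time"))
first_pass.isel(month=0).plot()
plt.show()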