AusClimateService / unseen

Code for UNprecedented Simulated Extremes using ENsembles (UNSEEN) analysis.
GNU General Public License v3.0

Add option to save independence test outputs to a netcdf file #73

Closed · stellema closed this issue 19 hours ago

stellema commented 6 days ago

The independence test results (mean_correlations and null_correlation_bounds) are usually only saved as a scatter plot, but it would be useful to store the results themselves (e.g., to mask dependent lead times at each grid point). Here is an example of how we could convert the dictionaries to a single dataset:

import numpy as np
import xarray as xr
from unseen.independence import run_tests

# Create example data array (with init_dates every 6 months)
coords = dict(
    ensemble=np.arange(5),
    lead_time=np.arange(10),
    init_date=xr.cftime_range(start="2000-07-01", periods=10, freq="6MS"),
    lat=np.arange(0, 30),
    lon=np.arange(0, 30),
)
# DataArray with dimensions=(ensemble, lead_time, init_date, lat, lon)
da = xr.DataArray(
    np.random.uniform(size=[len(v) for v in coords.values()]),
    coords=coords,
    dims=list(coords),
)

# Run independence tests
mean_correlations, null_correlation_bounds = run_tests(
    da,
    init_dim="init_date",
    lead_dim="lead_time",
    ensemble_dim="ensemble",
)
print(mean_correlations)
print(null_correlation_bounds)

# Create dataset
da_mean_cor = xr.concat(
    [v.assign_coords({"month": k}) for k, v in mean_correlations.items()], "month"
)
ds = da_mean_cor.to_dataset(name="mean_correlations")

ds.coords["bounds"] = ["lower", "upper"]
ds["null_correlation_bounds"] = xr.DataArray(
    [[v for v in null_correlation_bounds[k]] for k in null_correlation_bounds],
    dims=["month", "bounds"],
)
print(ds)

output:

# mean_correlations
{1: <xarray.DataArray (lead_time: 10, lat: 30, lon: 30)> Size: 72kB
dask.array<mean_agg-aggregate, shape=(10, 30, 30), dtype=float64, chunksize=(10, 30, 30), chunktype=numpy.ndarray>
Coordinates:
  * lead_time  (lead_time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * lat        (lat) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29
  * lon        (lon) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29, 7: <xarray.DataArray (lead_time: 10, lat: 30, lon: 30)> Size: 72kB
dask.array<mean_agg-aggregate, shape=(10, 30, 30), dtype=float64, chunksize=(10, 30, 30), chunktype=numpy.ndarray>
Coordinates:
  * lead_time  (lead_time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * lat        (lat) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29
  * lon        (lon) int64 240B 0 1 2 3 4 5 6 7 8 ... 21 22 23 24 25 26 27 28 29}

# null_correlation_bounds
{1: (-0.21999999999999997, 0.38), 7: (-0.21999999999999997, 0.38)}

# ds
<xarray.Dataset> Size: 145kB
Dimensions:                  (lead_time: 10, lat: 30, lon: 30, month: 2,
                              bounds: 2)
Coordinates:
  * lead_time                (lead_time) int64 80B 0 1 2 3 4 5 6 7 8 9
  * lat                      (lat) int64 240B 0 1 2 3 4 5 ... 24 25 26 27 28 29
  * lon                      (lon) int64 240B 0 1 2 3 4 5 ... 24 25 26 27 28 29
  * month                    (month) int64 16B 1 7
  * bounds                   (bounds) <U5 40B 'lower' 'upper'
Data variables:
    mean_correlations        (month, lead_time, lat, lon) float64 144kB dask.array<chunksize=(1, 10, 30, 30), meta=np.ndarray>
    null_correlation_bounds  (month, bounds) float64 32B -0.22 0.38 -0.22 0.38
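
As a rough sketch (assuming a lead time counts as dependent when its mean correlation falls outside the null bounds), the combined dataset could then be used for the masking use case like this:

# Sketch: mask out lead times whose mean correlation falls outside the
# null bounds (the exact pass/fail criterion here is an assumption)
lower = ds["null_correlation_bounds"].sel(bounds="lower")
upper = ds["null_correlation_bounds"].sel(bounds="upper")
independent = (ds["mean_correlations"] > lower) & (ds["mean_correlations"] < upper)
masked = ds["mean_correlations"].where(independent)
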
stellema commented 6 days ago

Alternatively, we could modify independence.run_tests to return this dataset?

Btw, null_correlation_bounds has the wrong dimensions in that example. It's an easy fix - I just need to specify dim='k' in the quantiles at the end of independence._get_null_correlation_bounds.
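
As a sketch, the fix would look something like this (the variable names and quantile levels inside _get_null_correlation_bounds are placeholders, not the actual code):

# Sketch of the fix: reduce over the resampling dimension "k" only,
# instead of collapsing all dimensions (names and levels are placeholders)
lower_bound = null_correlations.quantile(0.025, dim="k")
upper_bound = null_correlations.quantile(0.975, dim="k")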

DamienIrving commented 6 days ago

That looks like a nice solution!

I think we do want to modify independence.run_tests to return the dataset.

So in _main() in independence.py I think we want:

ds_independence = run_tests(
    da_fcst,
    init_dim=args.init_dim,
    lead_dim=args.lead_dim,
    ensemble_dim=args.ensemble_dim,
)

and then, instead of a call to create_plot, we want _main() to end the way similarity.py does, with whatever code is needed to write the dataset to file:

infile_logs = {args.fcst_file: ds_fcst.attrs["history"]}
ds_independence.attrs["history"] = fileio.get_new_log(infile_logs=infile_logs)

if args.output_chunks:
    ds_independence = ds_independence.chunk(args.output_chunks)

if "zarr" in args.outfile:
    fileio.to_zarr(ds_independence, args.outfile)
else:
    ds_independence.to_netcdf(args.outfile)

That means people can either use the independence.py command line program to get a netCDF (or Zarr) file with the independence test numbers in it, or they can open a notebook and call independence.run_tests() to get an xarray Dataset with the test numbers in it.
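
For example, the notebook path might look something like this (a sketch, assuming da_fcst is a forecast DataArray like da in the example above, and the output filename is arbitrary):

# Sketch of the notebook workflow once run_tests returns a Dataset
from unseen import independence

ds_independence = independence.run_tests(
    da_fcst,  # forecast DataArray with ensemble/init/lead dimensions (assumed)
    init_dim="init_date",
    lead_dim="lead_time",
    ensemble_dim="ensemble",
)
ds_independence.to_netcdf("independence_test.nc")  # arbitrary filename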

How does that sound?

stellema commented 5 days ago

Yup, that sounds good. What do you think would be the best default command line behaviour in terms of plotting vs saving the dataset (or both)?

DamienIrving commented 5 days ago

Hmm. Not 100% sure what the best approach would be.

Maybe it's easiest if it just produces a data file in all cases, and if someone wants to plot the data they can read in that data file, import independence and then use independence.create_plot()?

DamienIrving commented 5 days ago

Actually, we probably need an independence.point_plot() function (i.e. the existing create_plot() function) and a new independence.spatial_plot() function that plots a spatial map showing the first lead time that passes the independence test.
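
As a sketch, the core of spatial_plot() could be something like this (assuming the dataset layout from the example above; the pass criterion and filename are placeholders):

# Sketch of the proposed spatial_plot logic: map the first lead time at
# which the mean correlation falls within the null bounds (criterion assumed)
import xarray as xr
import matplotlib.pyplot as plt

ds = xr.open_dataset("independence_test.nc")  # placeholder filename
lower = ds["null_correlation_bounds"].sel(bounds="lower")
upper = ds["null_correlation_bounds"].sel(bounds="upper")
passes = (ds["mean_correlations"] > lower) & (ds["mean_correlations"] < upper)
# idxmax returns the lead_time coordinate of the first True at each point;
# mask grid points where no lead time passes
first_pass = passes.idxmax("lead_time").where(passes.any("lead_time"))
first_pass.isel(month=0).plot()
plt.show()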