AusClimateService / unseen

Code for UNprecedented Simulated Extremes using ENsembles (UNSEEN) analysis.
GNU General Public License v3.0

Add "UNSEEN trends" functionality #45

Closed · DamienIrving closed this issue 9 months ago

DamienIrving commented 10 months ago

The UNSEEN trends method (Kelder et al 2020) involves fitting a time-varying GEV.
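
(For reference, "time-varying" here means the GEV parameters are functions of a time covariate t, typically a linear location μ(t) = μ0 + μ1·t and scale σ(t) = σ0 + σ1·t, with the shape parameter ξ held constant.)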

DamienIrving commented 10 months ago

We currently use scipy.stats.genextreme when fitting GEVs; e.g. https://github.com/AusClimateService/unseen/blob/master/unseen/indices.py#L134
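
For reference, the stationary fit is essentially a one-liner; a minimal sketch with made-up data:

from scipy.stats import genextreme

# Hypothetical block maxima; in indices.py the data come from the index calculation.
data = genextreme.rvs(-0.1, loc=300, scale=10, size=1000, random_state=0)
shape, loc, scale = genextreme.fit(data)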

There are other options though; see the developer notes in the frequency analysis repo I put together for some CSIRO Climate Innovation Hub work.

stellema commented 10 months ago

I'm not sure how correct the method is, but here is an example of how we could implement this based on Robin and Ribes (2020):

import numpy as np
import scipy
import xarray as xr

class NonStationaryGEV(scipy.stats.rv_continuous):
    """Wrapper for non-stationary GEV distribution."""

    def _params_init_guess(self, data):
        """Initial guess of GEV parameters c0, c1, mu0, mu1, sigma0, sigma1.

        Note sigma0 is on the log scale because fit applies a log link to the scale.
        """
        return 0.1, 0.0, float(data.mean()), 0.1, float(np.log(data.std())), 0.1

    def fit(self, params, data, t):
        """Return the NS-GEV negative log-likelihood term for the observation at time t.

        Parameters
        ----------
        params : tuple of floats
            c0, c1, mu0, mu1, sigma0, sigma1
        data : xarray.DataArray
            Data time series
        t : int
            Time index
        """
        c0, c1, mu0, mu1, sigma0, sigma1 = params

        # Shape
        c = c0 + c1 * t

        # Location
        mu = mu0 + mu1 * t

        # Scale
        sigma = np.exp(sigma0 + sigma1 * t)

        # Subset the data array at time t
        x = data.isel(time=t)

        # Per-observation negative log-likelihood; see Appendix A of
        # https://ascmo.copernicus.org/articles/6/205/2020/ascmo-6-205-2020.pdf
        Z = 1 + c * (x - mu) / sigma
        f = (1 + 1 / c) * np.log(Z) + Z**(-1 / c) + np.log(sigma)
        return f

def nllf(params, data, times):
    """NS-GEV negative log-likelihood function for each time (i.e., cost function)."""
    nsgev = NonStationaryGEV()
    f_cost = []
    for t in times:
        f_cost.append(nsgev.fit(params, data, t))
    f_cost_sum = np.sum(f_cost)
    return f_cost_sum

def optimise_nsgev_fit(data, times):
    """Minimise the NS-GEV params over the negative log-likelihood function."""
    nsgev = NonStationaryGEV()
    params_i = nsgev._params_init_guess(data)
    # Parameter bounds could be supplied to the minimiser here,
    # e.g. bounds=[(None, None), (0, 0), ...] to hold c1 fixed at zero.
    result = scipy.optimize.minimize(nllf, params_i, args=(data, times), method='Nelder-Mead')
    return result

# Test dataset: a noisy increasing line.
times = np.arange(500)
ds = xr.Dataset(coords={'time': times})
var = 'tmp_var_name'
ds[var] = ('time', scipy.stats.gumbel_r.rvs(loc=0.5, scale=1, size=ds.time.size) + times * 1e-2)
ds[var] = ds[var].where(ds[var] > 0, 0)
da = ds[var]

# Calculate parameters.
res = optimise_nsgev_fit(da, times)
print(', '.join(['{}={:.04f}'.format(k, v) for k, v in zip(['c0', 'c1', 'μ0', 'μ1', 'σ0', 'σ1'], res.x)]))

DamienIrving commented 10 months ago

Awesome - at first glance that looks like a good general setup (cc'ing @dougiesquire).

I guess the challenge now is to figure out how correct the method is. A few initial thoughts:

* The shape parameter is currently allowed to vary with time (`c = c0 + c1 * t`), whereas in the Kelder paper it isn't.
* Your test data at the moment is a noisy increasing line... if you changed it to (or had a second test dataset that was) a noisy steady line (i.e. no trend over time) then presumably the estimates for sigma1 and mu1 should be very close to zero and the estimates for c, sigma0 and mu0 should be very close to what you'd get from running (where gev is scipy.stats.genextreme):

c, mu0, sigma0 = gev.fit(data, loc=mu0_estimate, scale=sigma0_estimate)

(The only difference might be the sign of c. For some reason scipy uses a reverse sign convention for the shape parameter relative to most other GEV implementations.)
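
A quick way to see the convention difference (a minimal sketch; scipy's c is the negative of the usual ξ):

import scipy.stats

# Sample from a GEV with a heavy upper tail: ξ = 0.2, i.e. scipy c = -0.2.
sample = scipy.stats.genextreme.rvs(-0.2, loc=0, scale=1, size=10000, random_state=1)
print(scipy.stats.genextreme.fit(sample))  # fitted c should be close to -0.2
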
stellema commented 10 months ago
> The shape parameter is currently allowed to vary with time (`c = c0 + c1 * t`), whereas in the Kelder paper it isn't.

Yeah, to keep the shape parameter time-independent using this method, c1 would have to stay zero (i.e. by restricting its bounds in scipy.optimize.minimize; see the sketch below). We could do something similar to rv_continuous.fit, where parameters passed with an 'f' prefix on the keyword argument (e.g. fc, floc, fscale) are held fixed. Or we could remove the flexibility and change c0 and c1 to c.
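
E.g. a sketch of the bounds approach (assuming scipy >= 1.7, where Nelder-Mead supports bounds; the degenerate (0, 0) bound pins c1 at zero):

bounds = [
    (None, None),  # c0
    (0, 0),        # c1 held at zero -> time-independent shape
    (None, None),  # mu0
    (None, None),  # mu1
    (None, None),  # sigma0
    (None, None),  # sigma1
]
result = scipy.optimize.minimize(nllf, params_i, args=(data, times),
                                 method='Nelder-Mead', bounds=bounds)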

> Your test data at the moment is a noisy increasing line... if you changed it to (or had a second test dataset that was) a noisy steady line (i.e. no trend over time) then presumably the estimates for sigma1 and mu1 should be very close to zero and the estimates for c, sigma0 and mu0 should be very close to what you'd get from running:

I gave this a try and sigma1 and mu1 are small (at least compared to the line with a trend), and the results are fairly similar to the scipy.stats.genextreme ones.

For example, using times = np.arange(500) (and keeping c constant):

Noisy increasing line

da = xr.DataArray(scipy.stats.gumbel_r.rvs(loc=0.5, scale=1, size=times.size) + times * 1e-2, coords={'time': times})

output:
c=0.1962, μ0=0.5452, μ1=0.0097, σ0=0.0512, σ1=-0.0001

Noisy stable line

da_stable = xr.DataArray(scipy.stats.gumbel_r.rvs(loc=0.5, scale=1, size=times.size), coords={'time': times})

output:
c=0.391, μ0=0.363, μ1=0.000, σ0=0.132, σ1=0.000

Noisy stable line (genextreme)

# Test genextreme on da_stable using the same initial guesses
# (gev = scipy.stats.genextreme; params order here is (c, mu0, mu1, sigma0, sigma1), with c constant)
params_i = NonStationaryGEV()._params_init_guess(da_stable)
c, mu0, sigma0 = gev.fit(da_stable, loc=params_i[1], scale=params_i[3], method='MLE', optimizer=scipy.optimize.fmin)
print(*['{}={:.03f}'.format(k, v) for k, v in zip(['c', 'μ0', 'σ0'], [c, mu0, sigma0])])

output:
c=-0.542 μ0=0.461 σ0=0.612

# Test genextreme on da_stable using the ns-gev results as initial guesses
# (res2 is the ns-gev optimisation result for da_stable)
c, mu0, sigma0 = gev.fit(da_stable, loc=res2.x[1], scale=res2.x[3], method='MLE', optimizer=scipy.optimize.fmin)
print(*['{}={:.03f}'.format(k, v) for k, v in zip(['c', 'μ0', 'σ0'], [c, mu0, sigma0])])

output:
c=-0.342 μ0=0.317 σ0=0.146

Also, do you think that the argument t in fit should be an integer or a slice?

DamienIrving commented 10 months ago

I reckon remove the flexibility and change c0 and c1 to c. In Appendix A of that paper you reference they fit the coefficients θ = (μ0, μ1, σ0, σ1, ξ0) - there's no ξ1.

It might be worth plotting the stable case to see what the distributions look like. Something like:

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import genextreme as gev

fig, ax = plt.subplots(figsize=[10, 6])
xvals = np.linspace(float(da_stable.min()), float(da_stable.max()), 200)

# Plot data
da_stable.plot.hist(bins=40, density=True, alpha=0.5, label='Histogram')

# Plot genextreme GEV fit (parameters from the gev.fit call above)
pdf_genext = gev.pdf(xvals, c_genext, mu0_genext, sigma0_genext)
plt.plot(xvals, pdf_genext, label='GEV fit using genextreme')

# Plot NonStationaryGEV fit (stationary part of the ns-gev parameters;
# remember the shape sign convention differs from scipy's)
pdf_nsgev = gev.pdf(xvals, c_nsgev, mu0_nsgev, sigma0_nsgev)
plt.plot(xvals, pdf_nsgev, label='GEV fit using NonStationaryGEV')

plt.ylabel('probability')
plt.legend()
plt.show()

Not sure about t... you might have to talk me through the NonStationaryGEV class on Monday.

stellema commented 10 months ago

I get mixed results when plotting the distributions. I found that the results can vary with the optimisation initial guesses, so I worked on improving the way we initialise the parameters. I followed the method in Robin and Ribes (2020) (Appendix A), which involves quantile regression and calculating L-moments, but it doesn't seem to produce better parameter guesses. The μ1 and σ1 values aren't always close to zero for the non-trended data either, so it's hard to compare the results with the genextreme results.
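
A simpler fallback initialisation might be a linear detrend plus a stationary fit, e.g. (a sketch, not the Robin and Ribes method; assumes a linear rather than log-link scale and flips scipy's shape sign to match the ξ convention in the ns-gev code):

mu1_guess = np.polyfit(times, da.values, 1)[0]  # linear trend slope
c_scipy, mu0_guess, sigma0_guess = scipy.stats.genextreme.fit(da.values - mu1_guess * times)
params_i = (-c_scipy, mu0_guess, mu1_guess, sigma0_guess, 0.0)  # (c, mu0, mu1, sigma0, sigma1)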

If possible, it would be good to get your thoughts on how I should proceed (e.g. find a new initialisation method, change the optimisation method, use larger/different data samples for testing, or work on performance).

DamienIrving commented 10 months ago

In terms of larger data samples for testing:

In terms of different data samples for testing:

To test whether our code is working as expected, we might also want to completely remove μ1 and σ1 (i.e. produce a StationaryGEV class) and just make sure we can replicate the output of scipy.stats.genextreme in the stationary case.
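
A rough sketch of that sanity check (hand-rolled stationary NLL in the ξ convention, so the shape sign is flipped relative to scipy's):

import numpy as np
import scipy.optimize
import scipy.stats

def stationary_nllf(params, data):
    """Stationary GEV negative log-likelihood (ξ = -c in scipy terms)."""
    shape, loc, scale = params
    s = (data - loc) / scale
    Z = 1 + shape * s
    if scale <= 0 or np.any(Z <= 0):
        return np.inf
    return np.sum(np.log(scale) + (1 + 1 / shape) * np.log(Z) + Z ** (-1 / shape))

data = scipy.stats.genextreme.rvs(-0.2, loc=1, scale=0.5, size=1000, random_state=0)
res = scipy.optimize.minimize(stationary_nllf, [0.1, float(data.mean()), float(data.std())],
                              method='Nelder-Mead')
c, loc, scale = scipy.stats.genextreme.fit(data)
print(res.x, (-c, loc, scale))  # these should be close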

In terms of initialisation, the only thing I've tried in the past is first fitting to a subset of the data, and then using the parameters of that fit as estimates for a fit to the full dataset (which is a pretty primitive approach). I have seen other papers note that they scanned for (and threw out) cases where the fit gave crazy parameter values, so clearly other authors also struggle with getting sensible convergence in every case.
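
i.e. roughly:

# Rough sketch of the two-stage initialisation: fit a subset first,
# then reuse those parameters as the starting point for the full fit.
res_subset = scipy.optimize.minimize(nllf, params_i, args=(data[:100], times[:100]),
                                     method='Nelder-Mead')
res_full = scipy.optimize.minimize(nllf, res_subset.x, args=(data, times),
                                   method='Nelder-Mead')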

stellema commented 9 months ago

I found much better results if I didn't use a log-link transform to keep the scale parameter positive, and instead specified a (0, None) bound for the scale parameter in scipy.optimize.minimize.

In the negative log-likelihood function, I also added the same penalty system that scipy uses for parameter values where the distribution isn't defined (instead of just returning infinity, even if the parameters failed for only one time step).

def nllf(self, params, data, times):
    """NS-GEV penalized negative log-likelihood function.

    Parameters
    ----------
    params : tuple of floats
        Shape, location and scale parameters (shape, loc0, loc1, scale0, scale1).
    data : xarray.DataArray
        Data time series.
    times : numpy.ndarray
        Time indices (the covariate).
    """
    # Non-stationary GEV parameters.
    shape, loc0, loc1, scale0, scale1 = params
    loc = loc0 + loc1 * times
    scale = scale0 + scale1 * times

    # Negative Log-likelihood of GEV probability density function.
    s = (data - loc) / scale

    if shape != 0:
        Z = 1 + shape * s
        f = np.log(scale) + (1 + 1 / shape) * np.ma.log(Z) + np.ma.power(Z, -1 / shape)
        # The distribution is only defined where scale > 0 and Z > 0.
        f = np.where((scale > 0) & (Z > 0), f, np.inf)
    else:
        f = np.log(scale) + s + np.exp(-s)
        f = np.where(scale > 0, f, np.inf)

    # Penalise non-finite terms the same way scipy's generic fit machinery does.
    total, n_bad = scipy.stats._distn_infrastructure._sum_finite(f)
    return total + n_bad * scipy.stats._constants._LOGXMAX * 100

(N.B. repository for the non-stationary GEV parameter fitting function here: https://github.com/stellema/non-stationary-gev)
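
For completeness, the corresponding minimize call might look something like this (a sketch, written as if nllf were a plain function; assumes scipy >= 1.7 for Nelder-Mead bounds support):

bounds = [(None, None),  # shape
          (None, None),  # loc0
          (None, None),  # loc1
          (0, None),     # scale0 must be positive
          (None, None)]  # scale1
res = scipy.optimize.minimize(nllf, params_i, args=(data, times),
                              method='Nelder-Mead', bounds=bounds)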

The results are pretty good:

* It returns similar results to scipy.stats.genextreme.fit for non-trended data (when optimising 3 and 5 parameters).
* For the trended data, the shape, loc0 and scale0 parameters are similar to the non-trended parameters; only the loc1 and scale1 parameters differ.
* The non-stationary parameters describe the data at different times (see the example below).

It doesn't do a very good job of describing the distribution of the trended data as a whole, though, or when there is a strong trend.

For example, using:

da = scipy.stats.genextreme.rvs(0.8, loc=3.2, scale=0.5, size=1000, random_state=0)
da_trended = da + np.arange(da.size) * 3e-4
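
To evaluate the fitted distribution at a particular time (as in the t=50 and t=975 plots below), the fitted parameters can be evaluated at that time index, e.g. (a sketch; note the shape sign flip when passing to scipy):

# res.x = (shape, loc0, loc1, scale0, scale1) from the ns-gev fit (hypothetical).
t = 975
loc_t = res.x[1] + res.x[2] * t
scale_t = res.x[3] + res.x[4] * t
xvals = np.linspace(loc_t - 2 * scale_t, loc_t + 8 * scale_t, 200)
pdf_t = scipy.stats.genextreme.pdf(xvals, -res.x[0], loc=loc_t, scale=scale_t)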

Non-trended data

ns-gev: c=-0.80036, μ0=3.19587, μ1=0.00000, σ0=0.50169, σ1=-0.00000
genext fit: c=0.7977, μ=3.1969, σ=0.5011

[image: fitted distributions for the non-trended data]

Trended data

ns-gev: c=-0.80015, μ0=3.19056, μ1=0.00031, σ0=0.50583, σ1=-0.00001
genext fit: c= 0.6142, μ=3.3015, σ=0.5017

The standard parameters are similar to the ones used to create the data (and previous stationary results) which is good. It doesn't match the histogram for the entire trended dataset, though.

[image: fitted distributions for the trended data]

But it's pretty good at describing the data distribution for subsets near the start and end of the time series.

At t=50:

[image: fitted distribution at t=50]

At t=975:

[image: fitted distribution at t=975]