MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0

wrong averaging of AR standard deviations #307

Open mathause opened 11 months ago

mathause commented 11 months ago

In train_gv_AR we take the average over the AR params - including the standard deviation. However, this is wrong: you have to average the covariances (or even account for the sample sizes: https://stats.stackexchange.com/q/25848)

https://github.com/MESMER-group/mesmer/blob/acd80e40ef12ab584dff438fa56fd54f6ba43b51/mesmer/calibrate_mesmer/train_gv.py#L195-L208

These are used here:

https://github.com/MESMER-group/mesmer/blob/acd80e40ef12ab584dff438fa56fd54f6ba43b51/mesmer/create_emulations/create_emus_gv.py#L176

Originally posted by @mathause in https://github.com/MESMER-group/mesmer/issues/306#issuecomment-1736922860

mathause commented 11 months ago

cc @leabeusch

I defer this issue to v1.0 as I don't want to introduce "numerically changing" results in v0.9

mathause commented 11 months ago

We may have to add nobs from AR_result to be able to calculate a more correct mean covariance:

https://github.com/MESMER-group/mesmer/blob/acd80e40ef12ab584dff438fa56fd54f6ba43b51/mesmer/stats/auto_regression.py#L227-L230

veni-vidi-vici-dormivi commented 1 month ago

I started looking into this. At the moment, this is what we do:

https://github.com/MESMER-group/mesmer/blob/f8287ba17e6b4221e51cfa5dff64f7c93d4f2358/mesmer/stats/_auto_regression.py#L60-L105

To get the average std, we need to average the variances, taking into account the number of observations (of the residuals):

$\mathrm{avg}(\sigma) = \sqrt{\frac{\sum_{l=0}^{k} (n_l-1) \cdot \sigma_l^2}{\sum_{l=0}^{k} (n_l - 1)}}$ with $l$ indexing the $k$ ensemble members (for one scenario (?, see below)).
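A minimal numerical sketch of this pooling (the array names and sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical AR residuals for three ensemble members of one scenario
# (different lengths only to make the weighting visible)
residuals = [rng.standard_normal(n) for n in (150, 150, 80)]

# wrong: averaging the standard deviations directly
std_wrong = np.mean([r.std(ddof=1) for r in residuals])

# pooled: average the variances, weighted by degrees of freedom (n_l - 1)
dof = np.array([r.size - 1 for r in residuals])
variances = np.array([r.var(ddof=1) for r in residuals])
std_pooled = np.sqrt((dof * variances).sum() / dof.sum())
```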

Now I have a question. We first average over ensemble members and then over scenarios. This means that we give all scenarios equal weight, no matter how many ensemble members each contains. Or, from a different perspective, we give ensemble members different weights, depending on how many members their scenario contains. Is that what we want? From a coding perspective it would be easier to give each ensemble member equal weight, since then we can simply stack the ensemble and scenario dimensions and then do the averaging. Otherwise we would also have to think about how to do the "second averaging" over scenarios for the standard deviation. Given that we assume the variance of the driving white noise process to be the same for each ensemble member and scenario, I think stacking the dimensions would be most intuitive (and I think the same holds for the AR process).
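The difference between the two averaging orders can be seen in a toy example (illustrative numbers, not MESMER output):

```python
import numpy as np

# hypothetical per-member estimates: scenario A has 4 members, B has 1
scen_a = np.array([1.0, 1.2, 0.8, 1.0])
scen_b = np.array([2.0])

# current approach: average within each scenario, then across scenarios
nested = np.mean([scen_a.mean(), scen_b.mean()])  # -> 1.5

# stacked approach: every ensemble member gets equal weight
stacked = np.concatenate([scen_a, scen_b]).mean()  # -> 1.2
```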

mathause commented 1 month ago

For the linear regression we stack all the data and we weight each scenario equally (that means we down-weight ensemble members for scenarios with many ensemble members - but that's not yet implemented for the new mesmer code path...). The idea is that we don't want the hist or ssp585 scenarios to dominate the estimates. (I think this argument is less important for the AR process and the std.)

We cannot stack the data for the AR process as there are boundaries (we could stack the residuals). Lukas would maybe argue to do it anyways.

So we could go two ways (which is my difficulty here):

  1. Follow the formula and weight the variances by the number of data points (n_year * n_ens).
  2. Do a weighted average where we weight each variance by 1 / (n_year * n_ens).
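The two ways pull the average in opposite directions; a sketch with made-up variances and sample sizes:

```python
import numpy as np

# hypothetical per-scenario residual variances and sample sizes
var_scen = np.array([1.1, 0.9, 1.3])
n_obs = np.array([165 * 50, 86 * 3, 86 * 10])  # n_year * n_ens, illustrative

# way 1: weight each variance by its number of data points
var_1 = (n_obs * var_scen).sum() / n_obs.sum()

# way 2: weight each variance by 1 / (n_year * n_ens),
# i.e. down-weight scenarios with many members
w = 1.0 / n_obs
var_2 = (w * var_scen).sum() / w.sum()
```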

:confused:

So are we actually correct by taking the mean of the variances?

Not sure if that answers your questions?


Question: in our application the mean of the residuals is 0, right? This matters because the mean plays into the combined variance...

$$\mathrm{Var}(Y) = \mathrm{E}[\mathrm{Var}(Y \mid X)] + \mathrm{Var}(\mathrm{E}[Y \mid X])$$

```python
import numpy as np

# ---
# same size a & b - mean != 0

a = np.random.randn(100)
b = np.random.randn(100)

var_dir = np.var(np.concatenate([a, b]))
# mean of the variances plus the variance of the means
var_via = (a.var() + b.var()) / 2 + ((a.mean() - b.mean()) / 2) ** 2

np.testing.assert_allclose(var_dir, var_via)

# ---
# different size a & b - mean == 0

a = np.random.randn(100)
b = np.random.randn(20)

a -= a.mean()
b -= b.mean()

var_dir = np.var(np.concatenate([a, b]))
# variances weighted by sample size
var_via = (a.var() * a.size + b.var() * b.size) / (a.size + b.size)

np.testing.assert_allclose(var_dir, var_via)
```

I did not manage to find a formula for different size and mean != 0.
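For what it's worth, for population (ddof=0) variances such a formula does follow from the law of total variance: weight both the group variances and the squared deviations of the group means from the overall mean by group size. A quick check in the same style as above:

```python
import numpy as np

# different size a & b - mean != 0
a = np.random.randn(100) + 0.5
b = np.random.randn(20) - 1.0

n = np.array([a.size, b.size])
means = np.array([a.mean(), b.mean()])
variances = np.array([a.var(), b.var()])  # ddof=0

# E[Var(Y|X)] + Var(E[Y|X]), both weighted by group size
grand_mean = (n * means).sum() / n.sum()
var_via = (n * (variances + (means - grand_mean) ** 2)).sum() / n.sum()

var_dir = np.var(np.concatenate([a, b]))

np.testing.assert_allclose(var_dir, var_via)
```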

veni-vidi-vici-dormivi commented 1 month ago

> We cannot stack the data for the AR process as there are boundaries

I don't mean stacking them for fitting but simply stacking them for the averaging. So I just wanted to say that we could just compute the average over all ensemble members, regardless of which scenario they are in.

> The idea is that we don't want the hist or ssp585 scenarios to dominate the estimates. (I think this argument is less important for the AR process and the std.)

Hm, I read Lea's paper again and understand this better now. Did she/you ever try weighting the ensemble members equally? That would be interesting. But for now I guess we should keep it as is. Then I would opt for averaging the variances weighted by n_years and n_ens, as in your option 1.

> So are we actually correct by taking the mean of the variances?

I mean, since the ensemble members generally have the same number of time steps, it's not very wrong. But averaging the standard deviations definitely is.

> I did not manage to find a formula for different size and mean != 0.

I guess the approach is to always transform to a mean of 0 first, which we do, so that is fine, no?

veni-vidi-vici-dormivi commented 3 weeks ago

So as I see it there are now several options for doing the weighting according to the number of time steps and ensemble members. And when we do implement the weighting, we not only need to do it for the variances but also for the coefficients and intercepts!

The options:

  1. Leave as is, meaning scenarios are weighted equally, not considering the number of ensemble members or time steps.
  2. Weight each scenario by its number of ensemble members but not by the number of time steps.
  3. Weight each scenario by its number of time steps and number of ensemble members.
  4. Weight each ensemble member by the number of time steps it has, but not by the number of members in its scenario.

🙃

I prefer either one or three. We could also implement a choice between the two.
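A sketch of what per-member weights for options one and three could look like (the ensemble layout numbers are hypothetical):

```python
import numpy as np

# hypothetical ensemble layout: members and years per scenario
n_ens = {"hist": 10, "ssp126": 3, "ssp585": 10}
n_year = {"hist": 165, "ssp126": 86, "ssp585": 86}

def member_weight(scen, option):
    """Weight of one ensemble member's estimate under option 1 or 3."""
    if option == 1:
        # scenarios weighted equally, split evenly among their members
        return 1.0 / (len(n_ens) * n_ens[scen])
    if option == 3:
        # scenarios weighted by their total number of time steps
        total = sum(n_ens[s] * n_year[s] for s in n_ens)
        return n_year[scen] / total
    raise ValueError(f"unknown option: {option}")

# sanity check: the weights over all members sum to 1 for both options
w1 = sum(n_ens[s] * member_weight(s, 1) for s in n_ens)
w3 = sum(n_ens[s] * member_weight(s, 3) for s in n_ens)
```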

veni-vidi-vici-dormivi commented 1 week ago

Or let the user give weights?