MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0

random samples from a multivariate normal distribution in _draw_auto_regression_correlated_np #168

Open jschwaab opened 2 years ago

jschwaab commented 2 years ago

Currently the innovations are drawn in the following way:

innovations = np.random.multivariate_normal(...)

This process is quite slow and consumes a lot of computational resources. It also seems to be a major reason why creating emulations can take a while.

1.) It might be an option to switch to the multivariate_normal method of a default_rng() instance (https://numpy.org/doc/stable/reference/random/generated/numpy.random.Generator.multivariate_normal.html#numpy.random.Generator.multivariate_normal) instead of the legacy function (https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html). That would make it possible to specify the factorization method and to change from the default "svd" (singular value decomposition) to "cholesky" or "eigh" (eigendecomposition); see the sketch after this list. In a first test the cholesky method was about 40 times faster than the svd method, but there could be a trade-off since the cholesky method is stated to be less robust (the eigh method was still 3 times as fast as svd).

2.) Since the process takes a lot of resources (so much as to slow down the whole machine), I am wondering if there is a way to actively control the number of threads used for the linear algebra operations in numpy. I was able to do this by setting OMP_NUM_THREADS (os.environ["OMP_NUM_THREADS"] = "5"), but I am wondering if there is a better/different way?
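A minimal sketch of option 1.) with a placeholder covariance (the real call would use MESMER's localized empirical covariance; the sizes are illustrative):

```python
import numpy as np

n_gridpoints, n_timesteps = 100, 250
mean = np.zeros(n_gridpoints)
cov = np.eye(n_gridpoints)  # placeholder for the localized empirical covariance

# current code path: legacy interface, always factorizes cov with svd
innovations_legacy = np.random.multivariate_normal(mean, cov, size=n_timesteps)

# proposed code path: Generator interface, the factorization method can be chosen
rng = np.random.default_rng(seed=0)
innovations = rng.multivariate_normal(mean, cov, size=n_timesteps, method="cholesky")
```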

mathause commented 2 years ago

Just to mention - this code has recently moved here:

https://github.com/MESMER-group/mesmer/blob/9291c09bb05ae646a3ee6449617da67374299da8/mesmer/core/auto_regression.py#L63

This probably also happens in

https://github.com/MESMER-group/mesmer/blob/9291c09bb05ae646a3ee6449617da67374299da8/mesmer/calibrate_mesmer/train_lv.py#L346


1) I think this is a good idea. rng.multivariate_normal is the more modern approach and we should move to it at some point anyhow. And I am always up for making things faster ;-) I am not experienced enough to comment on the choice of decomposition algorithm, though. However, when we do this we should expose the method keyword so that it can be chosen by the user. One caveat: this will (I think) be a backwards-incompatible change and lead to different emulations than before.

2) You want numpy to use fewer threads?

From https://numpy.org/doc/stable/reference/routines.linalg.html:

> Because those libraries are multithreaded and processor dependent, environmental variables and external packages such as threadpoolctl may be needed to control the number of threads or specify the processor architecture.
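A small sketch of the threadpoolctl route (threadpoolctl is an external package; the covariance below is just a placeholder):

```python
import numpy as np
from threadpoolctl import threadpool_limits  # external package: pip install threadpoolctl

rng = np.random.default_rng(seed=0)
mean = np.zeros(100)
cov = np.eye(100)  # placeholder covariance

# limit the BLAS thread pool only for this block, instead of setting
# OMP_NUM_THREADS globally before importing numpy
with threadpool_limits(limits=4, user_api="blas"):
    innovations = rng.multivariate_normal(mean, cov, size=250)
```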

jschwaab commented 2 years ago

Thanks for the help with the numpy threads!

Yes, I agree that it will probably not be backwards-compatible. Even when using the same method (svd) rng.multivariate_normal seems to yield slightly different results than random.multivariate_normal.

Exposing the method keyword would definitely be great.
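A tiny sketch of why the results differ even with the same method: the legacy interface and the Generator use different bit generators (MT19937 vs. PCG64 by default), so the underlying random streams are not comparable:

```python
import numpy as np

mean, cov = np.zeros(3), np.eye(3)

# legacy RandomState (MT19937) - the current MESMER code path
np.random.seed(0)
samples_legacy = np.random.multivariate_normal(mean, cov, size=2)

# new Generator (PCG64) with the same nominal seed and the same svd factorization
rng = np.random.default_rng(0)
samples_new = rng.multivariate_normal(mean, cov, size=2, method="svd")

print(np.allclose(samples_legacy, samples_new))  # False - different random streams
```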

mathause commented 9 months ago

I looked into this a bit today. Some comments:

mathause commented 9 months ago

To compare the methods, I think it makes sense to compare the factor A and not the random samples. To compute A, the new function numpy.random.Generator.multivariate_normal uses the following code:

```python
import numpy as np


def multivariate_normal_factor(cov, method, check_valid="raise", tol=1e-8):

    # code from
    # https://github.com/numpy/numpy/blob/35b14fe3ffe18ca56c3896d441dd5edffcd177db/numpy/random/_generator.pyx#L3795-L3832

    import warnings

    # GH10839, ensure double to make tol meaningful
    cov = cov.astype(np.double)
    if method == 'svd':
        from numpy.linalg import svd
        (u, s, vh) = svd(cov)
    elif method == 'eigh':
        from numpy.linalg import eigh
        # could call linalg.svd(hermitian=True), but that calculates a vh we don't need
        (s, u) = eigh(cov)
    else:
        from numpy.linalg import cholesky
        l = cholesky(cov)

    # make sure check_valid is ignored when method == 'cholesky'
    # since the decomposition will have failed if cov is not valid.
    if check_valid != 'ignore' and method != 'cholesky':
        if check_valid != 'warn' and check_valid != 'raise':
            raise ValueError(
                "check_valid must equal 'warn', 'raise', or 'ignore'")
        if method == 'svd':
            psd = np.allclose(np.dot(vh.T * s, vh), cov, rtol=tol, atol=tol)
        else:
            psd = not np.any(s < -tol)
        if not psd:
            if check_valid == 'warn':
                warnings.warn("covariance is not symmetric positive-semidefinite.",
                              RuntimeWarning)
            else:
                raise ValueError("covariance is not symmetric positive-semidefinite.")

    if method == 'cholesky':
        _factor = l
    elif method == 'eigh':
        # if check_valid == 'ignore' we need to ensure that np.sqrt does not
        # return a NaN if s is a very small negative number that is
        # approximately zero or when the covariance is not positive-semidefinite
        _factor = u * np.sqrt(abs(s))
    else:
        _factor = u * np.sqrt(s)

    return _factor
```
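For reference, a small sketch of how such a factor turns standard-normal draws into correlated samples (reusing multivariate_normal_factor from above; this mirrors the idea, not the exact numpy internals): since A @ A.T == cov, z @ A.T has covariance cov for z ~ N(0, I).

```python
import numpy as np

rng = np.random.default_rng(0)

cov = np.array([[2.0, 0.5],
                [0.5, 1.0]])  # toy covariance

A = multivariate_normal_factor(cov, method="cholesky")

z = rng.standard_normal(size=(100_000, 2))  # uncorrelated standard-normal draws
samples = z @ A.T                           # correlated samples with covariance ~cov

print(np.allclose(A @ A.T, cov))                    # the factor reproduces cov
print(np.round(np.cov(samples, rowvar=False), 2))   # empirical covariance close to cov
```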
mathause commented 9 months ago

The reason I looked at this is that I found very different emulations for exactly one gridpoint between the original method and my new one. I was able to trace this back to the svd decomposition in multivariate_normal.

(figures: diff_rnd — difference in random samples; diff_cov_factor — difference in cov and in the factor A; produced by the script below)

```python
mean = np.zeros(118)

rng = np.random.default_rng(seed=0)
samples_old = rng.multivariate_normal(mean, params_lv['loc_ecov_AR1_innovs']["tas"], size=251, method="svd")

rng = np.random.default_rng(seed=0)
samples_new = rng.multivariate_normal(mean, localized_covariance_adjusted, size=251, method="svd")

factor_old = multivariate_normal_factor(params_lv['loc_ecov_AR1_innovs']["tas"], method="svd", check_valid="raise", tol=1e-8)
factor_new = multivariate_normal_factor(localized_covariance_adjusted, method="svd", check_valid="raise", tol=1e-8)

f, ax = plt.subplots()
diff = samples_new - samples_old
h = ax.pcolormesh(diff.T, cmap="RdBu", vmax=0.5, vmin=-0.5)
plt.colorbar(h, ax=ax, orientation="horizontal")
ax.set_title("difference in random samples")
ax.set_ylabel("grid point")
ax.set_xlabel("time")
plt.savefig("diff_rnd.png", dpi=300, bbox_inches="tight")

f, axs = plt.subplots(1, 2, sharey=True)

ax = axs[0]
diff = localized_covariance_adjusted - params_lv['loc_ecov_AR1_innovs']["tas"]
h = ax.pcolormesh(diff, cmap="RdBu", vmax=1e-15, vmin=-1e-15)
plt.colorbar(h, ax=ax, orientation="horizontal")

ax = axs[1]
diff = factor_new - factor_old
# h = ax.pcolormesh(diff, cmap="RdBu", vmax=1e-14, vmin=-1e-14)
h = ax.pcolormesh(diff, cmap="RdBu", vmax=0.1, vmin=-0.1)
plt.colorbar(h, ax=ax, orientation="horizontal")

axs[0].set_title("difference in cov")
axs[1].set_title("difference in A")
axs[0].set_ylabel("grid point")
axs[0].set_xlabel("grid point")
axs[1].set_xlabel("grid point")
plt.savefig("diff_cov_factor.png", dpi=300, bbox_inches="tight")
```
mathause commented 9 months ago

Given the above we can (1) compare how stable the factorization is (i.e., given two covariance matrices that are very similar, how much does A differ?) and (2) check how well it reproduces the covariance (since A @ A.T = cov).

~~For this example svd fares quite badly and cholesky best (which would be nice, as cholesky is also the fastest). But of course I don't know if this generalizes or whether these are even good metrics.~~

EDIT: I had one formula wrong. See below.

```
Compare difference between the two localized cov matrices
rmse_cov = 5.08e-17

check stability of factorization
rmse_svd = 1.67e-03
rmse_eigh= 2.56e-15
rmse_chol= 5.92e-17

check error of `A @ A.T`
rmse_factor_cov_svd = 1.20e-01
rmse_factor_cov_eigh= 8.81e-02
rmse_factor_cov_chol= 2.94e-02
```

```python
factor_old_svd = multivariate_normal_factor(params_lv['loc_ecov_AR1_innovs']["tas"], method="svd")
factor_new_svd = multivariate_normal_factor(localized_covariance_adjusted, method="svd")
factor_old_eigh = multivariate_normal_factor(params_lv['loc_ecov_AR1_innovs']["tas"], method="eigh")
factor_new_eigh = multivariate_normal_factor(localized_covariance_adjusted, method="eigh")
factor_old_chol = multivariate_normal_factor(params_lv['loc_ecov_AR1_innovs']["tas"], method="cholesky")
factor_new_chol = multivariate_normal_factor(localized_covariance_adjusted, method="cholesky")

print("Compare difference between the two localized cov matrices")
rmse_cov = np.sqrt(np.mean((params_lv['loc_ecov_AR1_innovs']["tas"] - localized_covariance_adjusted)**2))
print(f"{rmse_cov = :0.2e}")
print()

print("check stability of factorization")
rmse_svd = np.sqrt(np.mean((factor_old_svd - factor_new_svd)**2))
rmse_eigh = np.sqrt(np.mean((factor_old_eigh - factor_new_eigh)**2))
rmse_chol = np.sqrt(np.mean((factor_old_chol - factor_new_chol)**2))
print(f"{rmse_svd = :0.2e}")
print(f"{rmse_eigh= :0.2e}")
print(f"{rmse_chol= :0.2e}")
print()

print("check error of `A @ A.T`")
rmse_factor_cov_svd = np.sqrt(np.mean(((factor_new_svd.T @ factor_new_svd) - localized_covariance_adjusted)**2))
rmse_factor_cov_eigh = np.sqrt(np.mean(((factor_new_eigh.T @ factor_new_eigh) - localized_covariance_adjusted)**2))
rmse_factor_cov_chol = np.sqrt(np.mean(((factor_new_chol.T @ factor_new_chol) - localized_covariance_adjusted)**2))
print(f"{rmse_factor_cov_svd = :0.2e}")
print(f"{rmse_factor_cov_eigh= :0.2e}")
print(f"{rmse_factor_cov_chol= :0.2e}")
```
mathause commented 8 months ago

The numbers above were wrong because I had the formula wrong (factor.T @ factor instead of factor @ factor.T). So the reconstructed arrays are much closer than shown above.

```
Compare difference between the two localized cov matrices
rmse_cov = 6.63e-16

check stability of factorization
rmse_svd = 2.13e-13
rmse_eigh= 1.78e-03
rmse_chol= 7.72e-15

check error of `A @ A.T`
rmse_factor_cov_svd = 2.34e-16
rmse_factor_cov_eigh= 1.89e-16
rmse_factor_cov_chol= 1.14e-17
```

```python
def rmse(a, b):
    return np.sqrt(np.mean((a - b)**2))

print("Compare difference between the two localized cov matrices")
rmse_cov = rmse(params_lv['loc_ecov_AR1_innovs']["tas"], localized_covariance_adjusted)
print(f"{rmse_cov = :0.2e}")
print()

print("check stability of factorization")
rmse_svd = rmse(factor_old_svd, factor_svd)
rmse_eigh = rmse(factor_old_eigh, factor_eigh)
rmse_chol = rmse(factor_old_chol, factor_chol)
print(f"{rmse_svd = :0.2e}")
print(f"{rmse_eigh= :0.2e}")
print(f"{rmse_chol= :0.2e}")
print()

print("check error of `A @ A.T`")
rmse_factor_cov_svd = rmse(factor_svd @ factor_svd.T, localized_covariance_adjusted)
rmse_factor_cov_eigh = rmse(factor_eigh @ factor_eigh.T, localized_covariance_adjusted)
rmse_factor_cov_chol = rmse(factor_chol @ factor_chol.T, localized_covariance_adjusted)
print(f"{rmse_factor_cov_svd = :0.2e}")
print(f"{rmse_factor_cov_eigh= :0.2e}")
print(f"{rmse_factor_cov_chol= :0.2e}")
```
mathause commented 8 months ago

One more aspect: in find_localized_empirical_covariance we use scipy.stats.multivariate_normal, i.e. a different library.

https://github.com/MESMER-group/mesmer/blob/616c8f2d69b0ca0cd88e108711414d4633752ec4/mesmer/stats/_localized_covariance.py#L270

This function uses scipy.linalg.eigh (see scipy/stats/_multivariate.py).
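For illustration, a hedged sketch of that third code path (placeholder data, not the exact mesmer call): scipy builds its own factorization of the covariance via scipy.linalg.eigh when evaluating the log likelihood.

```python
import numpy as np
from scipy.stats import multivariate_normal

cov = np.eye(4)                 # placeholder for a localized covariance
residuals = np.zeros((10, 4))   # placeholder residuals

# scipy factorizes cov internally with scipy.linalg.eigh - a third code path,
# separate from numpy's svd/eigh/cholesky options discussed above
log_likelihood = multivariate_normal.logpdf(residuals, mean=np.zeros(4), cov=cov).sum()
```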

veni-vidi-vici-dormivi commented 5 months ago

> but there could be a trade-off since the cholesky method is stated to be less robust

I am not actually sure what they mean by that in the numpy documentation, but I guess cholesky counts as less robust because it needs the covariance matrix to be positive definite, whereas eigh and svd only require it to be positive semidefinite.

However, they also write in their documentation: "This function internally uses linear algebra routines, and thus results may not be identical (even up to precision) across architectures, OSes, or even builds. For example, this is likely if cov has multiple equal singular values and method is 'svd' (default). In this case, method='cholesky' may be more robust." This is actually the problem we run into in #402.
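A small toy sketch of the positive-definite vs. positive-semidefinite point above (not mesmer code): eigh and svd happily factorize a singular covariance, while cholesky raises.

```python
import numpy as np

# rank-deficient covariance: positive semidefinite, but not positive definite
cov = np.array([[1.0, 1.0],
                [1.0, 1.0]])

np.linalg.eigh(cov)   # works: eigenvalues are [0, 2]
np.linalg.svd(cov)    # works: singular values are [2, 0]

try:
    np.linalg.cholesky(cov)
except np.linalg.LinAlgError:
    # cholesky requires a strictly positive definite matrix
    print("cholesky fails for a singular (PSD-only) covariance")
```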