MESMER-group / mesmer

spatially-resolved ESM-specific multi-scenario initial-condition ensemble emulator
https://mesmer-emulator.readthedocs.io/en/latest/
GNU General Public License v3.0

Power Transformer Stability #440

Open veni-vidi-vici-dormivi opened 1 month ago

veni-vidi-vici-dormivi commented 1 month ago

I played around a bit with our power transformer. Try the following example:

import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
from mesmer.mesmer_m.power_transformer import PowerTransformerVariableLambda, lambda_function

np.random.seed(0)
n_years = 10_000

# make skewed data
skew = -5
local_monthly_residuals = scipy.stats.skewnorm.rvs(skew, size=n_years)
yearly_T = np.random.randn(n_years)

# find coefficients
pt = PowerTransformerVariableLambda()
pt.coeffs_ = pt._yeo_johnson_optimize_lambda(local_monthly_residuals, yearly_T)

# transform data
lmbdaleft = lambda_function(pt.coeffs_, yearly_T)
transformed = pt._yeo_johnson_transform(local_monthly_residuals, lmbdaleft)

# plot
fig, axs = plt.subplots(3)
axs[0].hist(local_monthly_residuals, bins=100)
axs[0].set_title('original (left skewed) data, skew = {}'.format(round(scipy.stats.skew(local_monthly_residuals), 2)))
axs[1].hist(transformed, bins=100)
axs[1].set_title('transformed data, skew = {}'.format(round(scipy.stats.skew(transformed),2)))
axs[2].hist(local_monthly_residuals-transformed, bins=100)
axs[2].set_title('difference')
plt.tight_layout()
plt.show()

The data is nicely de-skewed (see the histograms produced by the code above).

But let's try it with 250 data points, which is how many data points we have for each month and grid cell. With that sample size the power transformer fails to normalize the data.

Admittedly, our data is probably never skewed this heavily, but I wanted to leave this here for future reference in case we want to revisit it.

Edited to not use years x 12 samples, since this is not what we have in the application.

veni-vidi-vici-dormivi commented 1 month ago

Apparently it's not only the sample size: the optimization is generally rather sensitive to its initial conditions. Similar things happen for some variations of the bounds, the first guess, or the yearly_T.

mathause commented 1 month ago

Interesting - what coeffs do you get out?

veni-vidi-vici-dormivi commented 1 month ago

In the first case the coefficients come out around (0, 0.1) and in the second (0.96, 0.013). Here I set the bounds to (0, 1) for the first coefficient and to (-0.1, 0.1) for the second, with a first guess of (1, 0). When I do the same without any bounds, even the example with 10,000 samples is much less normalized, with coefficients of (0.6, 0.2). When I try bounds of (0, 100) and (-100, 100), the data is more skewed after the transform than before, with coefficients of (1.4, 96.2).

veni-vidi-vici-dormivi commented 1 month ago

Maybe we should check the scipy.optimize.OptimizeResult.success attribute...
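
Something along these lines (only a sketch with a dummy loss; the idea is to keep the full result of optimize.minimize instead of taking .x directly):

import warnings

import numpy as np
from scipy import optimize

def loss(coeffs):
    # dummy stand-in for the negative log-likelihood of the power transformer
    return np.sum((coeffs - np.array([0.5, 0.05])) ** 2)

res = optimize.minimize(
    loss,
    x0=np.array([1.0, 0.0]),
    bounds=[(0, 1), (-0.1, 0.1)],
    method="SLSQP",
)
if not res.success:
    warnings.warn(f"Estimating the transformation coefficients did not converge: {res.message}")
coeffs = res.x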

veni-vidi-vici-dormivi commented 1 month ago

@mathause and I looked into this a little more.

The first interesting thing is that, as Mathias already mentioned, lambda does not need to be between 0 and 2. Thus, our lambda function may not be very well suited to estimate lambda. We should look into the possible benefits of making it a rational function (allowing $\xi_0 < 0$) or maybe a completely different lambda function.

Secondly, the definition of the loss function https://github.com/MESMER-group/mesmer/blob/7e7ac269cb807b2dfd2fcf87df89775869d3c112/mesmer/mesmer_m/power_transformer.py#L127-L135 is not intuitively clear to us. It is, however, the same as in sklearn, but in our opinion it is not the same as in the original paper:

[screenshot of the log-likelihood definition from the original Yeo & Johnson (2000) paper]
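
For reference, the sklearn-style loss (generalized to a per-sample lambda via our logistic lambda function) looks roughly like the sketch below; the function and argument names are only illustrative, not the actual mesmer code:

import numpy as np

def neg_log_likelihood(coeffs, residuals, yearly_T, transform, lambda_function):
    # per-sample lambdas, derived from the yearly temperature via the logistic lambda function
    lambdas = lambda_function(coeffs, yearly_T)
    transformed = transform(residuals, lambdas)
    n_samples = residuals.shape[0]
    # concentrated (profile) log-likelihood as in sklearn's PowerTransformer:
    # variance term of the transformed data ...
    loglike = -n_samples / 2 * np.log(transformed.var())
    # ... plus the log of the Jacobian of the Yeo-Johnson transformation
    loglike += ((lambdas - 1) * np.sign(residuals) * np.log1p(np.abs(residuals))).sum()
    return -loglike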
mathause commented 1 month ago

Additional comments:

mathause commented 1 month ago

Can you try your example with n_years = 250 again, this time commenting out the jac=rosen_der part? That just worked for me.

veni-vidi-vici-dormivi commented 1 month ago

True, works for me too. But it's not a universal fix: when I do the same with skew = 5 it does not work anymore.

mathause commented 1 month ago

Can you try with the following diff (from main):

diff --git a/mesmer/mesmer_m/power_transformer.py b/mesmer/mesmer_m/power_transformer.py
index d746921..a3020cc 100644
--- a/mesmer/mesmer_m/power_transformer.py
+++ b/mesmer/mesmer_m/power_transformer.py
@@ -144,7 +144,7 @@ class PowerTransformerVariableLambda(PowerTransformer):
         local_yearly_T = local_yearly_T[~np.isnan(local_yearly_T)]

         # choosing bracket -2, 2 like for boxcox
-        bounds = np.c_[[0, -0.1], [1, 0.1]]
+        bounds = np.c_[[0, -0.1], [np.inf, 0.1]]
         # first guess is that data is already normal distributed
         first_guess = np.array([1, 0])

@@ -153,7 +153,7 @@ class PowerTransformerVariableLambda(PowerTransformer):
             first_guess,
             bounds=bounds,
             method="SLSQP",
-            jac=rosen_der,
+            # jac=rosen_der,
         ).x

     def _yeo_johnson_transform(self, local_monthly_residuals, lambdas):
veni-vidi-vici-dormivi commented 1 month ago

Aha! I had already tried the infinite upper bound, but not without rosen_der; that combination seems to work well! (The infinite upper bound should have been implemented in #427.)

veni-vidi-vici-dormivi commented 1 month ago

I actually had a feeling that the PRs above would not be the end of the story here. It turns out that using method="SLSQP" leads to numerical instabilities in the coefficients (differences between operating systems, see #430). It is here in the code:

https://github.com/MESMER-group/mesmer/blob/73779460232d30edda805ad1828a68d04b771a9d/mesmer/mesmer_m/power_transformer.py#L152-L157

The scipy optimize.minimize documentation says that bounds can be used with the methods Nelder-Mead, L-BFGS-B, TNC, SLSQP, Powell, trust-constr, and COBYLA. However, SLSQP, trust-constr, and COBYLA are later listed under "constrained minimization", while the others are listed under "bound-constrained minimization". I think the constrained methods are mainly intended for when one passes an actual constraint to the constraints parameter of optimize.minimize.

Concerning the bound-constrained methods, Nelder-Mead is listed as the most robust one, but "if numerical computation of derivative can be trusted, other algorithms using the first and/or second derivatives information might be preferred for their better performance in general".

I am not exactly sure if this is the case here. We minimize the negative log-likelihood, which takes in the variance of the transformed residuals (a constant), the original residuals, and lambda (derived from a logistic function). In general, it is not guaranteed that the negative log-likelihood is twice differentiable (see here).

I tried the other options.

All of them are actually faster than SLSQP. Moreover, when testing on the coarse grid data, I get several warnings with SLSQP that the coefficients went outside the bounds, which does not seem to be a problem with the other methods.

I think our best option for now is to use Nelder-Mead: judging from the documentation it seems to be the safest/most robust one, and it solves the instability issue. Of course it would be better to read into all of this and make a more informed choice, but at the moment I think it is not worth the time.
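
For concreteness, a rough sketch of what the call could look like with Nelder-Mead, reusing local_monthly_residuals, yearly_T, pt and lambda_function from the example at the top and the neg_log_likelihood sketch above (not the actual mesmer code):

import numpy as np
from scipy import optimize

bounds = np.c_[[0, -0.1], [np.inf, 0.1]]  # (lower, upper) per coefficient
first_guess = np.array([1.0, 0.0])        # lambda = 1 everywhere, i.e. data already ~normal

res = optimize.minimize(
    neg_log_likelihood,
    first_guess,
    args=(local_monthly_residuals, yearly_T, pt._yeo_johnson_transform, lambda_function),
    bounds=bounds,
    method="Nelder-Mead",  # derivative-free; supports bounds since scipy 1.7
)
coeffs = res.x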

mathause commented 1 month ago

Thanks for looking into this! I tried to figure out the difference between bounds and constraints. If I understand correctly, bounds restrict the values individual params can take, while constraints restrict the values of all params jointly (e.g. x[0] + x[1] < 50 or so). So we need bounds but not constraints (although I think bounds could be expressed as constraints).
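
A tiny illustration of the difference (made-up objective, nothing to do with the power transformer):

import numpy as np
from scipy import optimize

def objective(x):
    return (x[0] - 3) ** 2 + (x[1] - 4) ** 2

# bounds: box limits on each parameter individually
res_bounds = optimize.minimize(
    objective, x0=[0.0, 0.0], bounds=[(0, 2), (0, 2)], method="L-BFGS-B"
)

# constraints: a joint condition on the parameters, here x[0] + x[1] <= 5
res_constraints = optimize.minimize(
    objective,
    x0=[0.0, 0.0],
    constraints=[{"type": "ineq", "fun": lambda x: 5 - x[0] - x[1]}],
    method="SLSQP",
)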


Not for now, but one thing we could play around with: would it be more efficient to reformulate the logistic lambda function such that no bounds are needed? I.e. write it as $2 / (1 + \exp(- (\beta_0 + \beta_1 \cdot x)))$ instead of $2 / (1 + \beta_0 \cdot \exp(\beta_1 \cdot x))$ and then use an unbounded optimization? (This formulation ensures the expression cannot be negative.)
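
A quick sketch of the two parameterizations mentioned above (coefficient names are just placeholders):

import numpy as np

def lambda_current(coeffs, yearly_T):
    # current form: 2 / (1 + beta_0 * exp(beta_1 * x)); needs beta_0 > 0 to keep lambda in (0, 2)
    return 2 / (1 + coeffs[0] * np.exp(coeffs[1] * yearly_T))

def lambda_reparameterized(coeffs, yearly_T):
    # proposed form: 2 / (1 + exp(-(beta_0 + beta_1 * x)));
    # stays in (0, 2) for any real beta_0, beta_1, so no bounds would be needed
    return 2 / (1 + np.exp(-(coeffs[0] + coeffs[1] * yearly_T)))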

veni-vidi-vici-dormivi commented 1 month ago

That's a very interesting point! I'll leave the issue open for now.