veni-vidi-vici-dormivi opened 1 month ago
Apparently it's not only the sample size: the fit is generally rather sensitive to the initial conditions. Similar things happen for some variations of the bounds, the first guess, or the `yearly_T`.
Interesting - what coefficients do you get out?
In the first case around (0, 0.1) and in the second (0.96, 0.013). Here I set the bounds to (0, 1) for the first coefficient and to (-0.1, 0.1) for the second; the first guess is (1, 0). When I do the same without any bounds, even the example with 10,000 samples is much less normalized, with coefficients coming out as (0.6, 0.2). When I try with bounds of (0, 100) and (-100, 100), the data is more skewed after the transform than before, and the coefficients are (1.4, 96.2).
Maybe we should check the `scipy.optimize.OptimizeResult.success` attribute...
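A minimal sketch of that check (the objective here is a hypothetical stand-in for the actual negative log-likelihood in `power_transformer.py`):

```python
import numpy as np
from scipy.optimize import minimize

# Toy objective standing in for the negative log-likelihood
# minimized in power_transformer.py (hypothetical placeholder).
def objective(coeffs):
    return (coeffs[0] - 1) ** 2 + coeffs[1] ** 2

res = minimize(
    objective,
    x0=np.array([1.0, 0.0]),
    bounds=[(0, 1), (-0.1, 0.1)],
    method="SLSQP",
)

# OptimizeResult.success is False when the optimizer did not converge;
# res.message explains why.
if not res.success:
    raise RuntimeError(f"lambda fit did not converge: {res.message}")
```

This would at least surface silent failures instead of returning nonsense coefficients.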
@mathause and I looked into this a little more.
The first interesting thing is that, as Mathias mentioned already, lambda does not need to be between 0 and 2. Thus, our lambda function is perhaps not well suited to estimate lambda. We should look into the possible benefits of making it a rational function (allowing $\xi_0 < 0$) or maybe a completely different lambda function.
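To illustrate the limitation (a sketch; `lambda_function` is a standalone helper mirroring the logistic form $2 / (1 + \xi_0 \cdot \exp(\xi_1 \cdot T))$ used in the code, not the actual method): for any $\xi_0 > 0$ the function can only produce values strictly inside (0, 2), so lambdas outside that range are unreachable no matter how the coefficients are optimized.

```python
import numpy as np

def lambda_function(xi_0, xi_1, local_yearly_T):
    # Logistic lambda function as currently used: for xi_0 > 0 the
    # denominator is always > 1 * 0 + 1, so lambda stays in (0, 2).
    return 2.0 / (1.0 + xi_0 * np.exp(xi_1 * local_yearly_T))

T = np.linspace(-5, 5, 1001)
for xi_0, xi_1 in [(0.5, 0.05), (2.0, -0.1), (0.01, 0.1)]:
    lam = lambda_function(xi_0, xi_1, T)
    # never reaches lambda <= 0 or lambda >= 2
    assert np.all((lam > 0) & (lam < 2))
```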
Secondly, the definition of the loss function https://github.com/MESMER-group/mesmer/blob/7e7ac269cb807b2dfd2fcf87df89775869d3c112/mesmer/mesmer_m/power_transformer.py#L127-L135 is not intuitively clear to us. It is the same as in sklearn, but in our opinion not the same as in the original paper:
Additional comments: `rosen_der` is passed as the Jacobian to the `minimize` function for the power transformer - should that be removed?
https://github.com/MESMER-group/mesmer/blob/7e7ac269cb807b2dfd2fcf87df89775869d3c112/mesmer/mesmer_m/power_transformer.py#L156

Can you try your example with `n_years = 250` again, commenting out the `jac=rosen_der` part? That just worked for me.
True, works for me too. But it's at least not a universal fix, because when I do it with `skew = 5` it does not work anymore.
Can you try with the following diff (from main):
```diff
diff --git a/mesmer/mesmer_m/power_transformer.py b/mesmer/mesmer_m/power_transformer.py
index d746921..a3020cc 100644
--- a/mesmer/mesmer_m/power_transformer.py
+++ b/mesmer/mesmer_m/power_transformer.py
@@ -144,7 +144,7 @@ class PowerTransformerVariableLambda(PowerTransformer):
         local_yearly_T = local_yearly_T[~np.isnan(local_yearly_T)]
         # choosing bracket -2, 2 like for boxcox
-        bounds = np.c_[[0, -0.1], [1, 0.1]]
+        bounds = np.c_[[0, -0.1], [np.inf, 0.1]]
         # first guess is that data is already normal distributed
         first_guess = np.array([1, 0])
@@ -153,7 +153,7 @@ class PowerTransformerVariableLambda(PowerTransformer):
             first_guess,
             bounds=bounds,
             method="SLSQP",
-            jac=rosen_der,
+            # jac=rosen_der,
         ).x

     def _yeo_johnson_transform(self, local_monthly_residuals, lambdas):
```
Aha! I had already tried the infinite upper bound, but not without `rosen_der` - that seems to work well! (The infinite upper bound should have been implemented in #427.)
I actually had a feeling that the PRs above were not the final fix here. Turns out that using `method="SLSQP"` leads to numerical instabilities in the coefficients (between operating systems, see #430). It is here in the code:
The scipy `optimize.minimize` documentation says that `bounds` can be used with the methods Nelder-Mead, L-BFGS-B, TNC, SLSQP, Powell, trust-constr, and COBYLA. However, SLSQP, trust-constr, and COBYLA are later listed under "constrained minimization", while the others are listed under "bound-constrained minimization". I think the former methods are preferred when one passes an actual constraint to the `constraints` parameter of `optimize.minimize`.
Concerning the bound-constrained methods, Nelder-Mead is listed as the most robust one, but "if numerical computation of derivative can be trusted, other algorithms using the first and/or second derivatives information might be preferred for their better performance in general".
I am not exactly sure if this is the case here. We minimize the negative log-likelihood, which takes in the variance of the transformed residuals (a constant), the original residuals, and lambda (derived from a logistic function). In general, it is not guaranteed that the negative log-likelihood is twice differentiable, see here.
I tried the other options, and all of them are actually faster than SLSQP. Moreover, testing on the coarse grid data, I get several warnings with SLSQP that the coefficients went outside the bounds, which does not seem to be a problem with the other methods.
I think our best option for now is to use Nelder-Mead, because it seems to be the safest/most robust one judging from the documentation, and it solves the instability issue. Of course it would be better to read up on all of this and make a more informed choice, but at the moment I think it is not worth the time.
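The switch itself is a one-line change; a sketch of the resulting call (the objective is a hypothetical stand-in for the negative log-likelihood, the other names mirror the snippet in the diff above):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in for the negative log-likelihood objective.
def neg_log_likelihood(coeffs):
    return (coeffs[0] - 0.8) ** 2 + 10 * coeffs[1] ** 2

bounds = np.c_[[0, -0.1], [np.inf, 0.1]]
first_guess = np.array([1.0, 0.0])

coeffs = minimize(
    neg_log_likelihood,
    first_guess,
    bounds=bounds,
    method="Nelder-Mead",  # instead of "SLSQP"; derivative-free, no jac needed
).x
```

Since Nelder-Mead is derivative-free, the `jac` argument becomes irrelevant, which also removes the stray `rosen_der` question entirely.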
Thanks for looking into this! I tried to figure out the difference between bounds and constraints. If I understand correctly, bounds restrict the values individual parameters can take, while constraints restrict the values of all parameters jointly (e.g. `x[0] + x[1] < 50` or so). So we need bounds but not constraints (although I think bounds could be expressed as constraints).
Not for now, but one thing we could play around with: would it be more efficient to reformulate the logistic regression such that no bounds are needed? I.e. write it as $2 / (1 + \exp(- (\beta_0 + \beta_1 \cdot x)))$ instead of $2 / (1 + \beta_0 \cdot \exp(\beta_1 \cdot x))$ and then use an unbounded optimization? (This formulation ensures the expression cannot be negative.)
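A sketch of that reparameterization (the objective and its target lambdas are made up for illustration; this is not implemented in the code): since $\exp(-(\beta_0 + \beta_1 \cdot x))$ is always positive, lambda stays in (0, 2) for any real-valued $\beta$, so no bounds are needed:

```python
import numpy as np
from scipy.optimize import minimize

def lambda_reparam(beta, local_yearly_T):
    # 2 / (1 + exp(-(beta_0 + beta_1 * T))): the denominator is always > 1,
    # actually always in (1, inf), so lambda is in (0, 2) for any real beta.
    return 2.0 / (1.0 + np.exp(-(beta[0] + beta[1] * local_yearly_T)))

# Made-up fitting targets: lambda(T=1) = 1.2 and lambda(T=-1) = 0.8.
def objective(beta):
    return ((lambda_reparam(beta, 1.0) - 1.2) ** 2
            + (lambda_reparam(beta, -1.0) - 0.8) ** 2)

# No bounds argument needed: any beta yields a valid lambda.
res = minimize(objective, x0=np.array([0.0, 0.0]), method="Nelder-Mead")
# res.x is approximately (0, log(1.5)); lambdas stay strictly inside (0, 2)
```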
That's a very interesting point! I'll leave the issue open for now.
I played around a bit with our power transformer. Try the following example:
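The original example code is not preserved in this excerpt; a sketch of the kind of setup discussed (skewed data generated with `scipy.stats.skewnorm`, and sklearn's plain `PowerTransformer` standing in for our variable-lambda version):

```python
import numpy as np
import scipy.stats
from sklearn.preprocessing import PowerTransformer

# Sketch only: heavily right-skewed sample, large n so the fit is easy.
rng = np.random.default_rng(0)
skewed = scipy.stats.skewnorm.rvs(a=5, size=10_000, random_state=rng)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
transformed = pt.fit_transform(skewed.reshape(-1, 1)).ravel()

print("skew before:", scipy.stats.skew(skewed))
print("skew after: ", scipy.stats.skew(transformed))
```

With this many samples the transform normalizes the data well; the point below is what happens when the sample is much smaller.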
But let's try it with 250 data points, which is how many datapoints we have for each month and gridcell. The output looks like this:
The power transformer fails to normalize the data.
Arguably, our data is probably never skewed this heavily, but I wanted to leave this here for future reference in case we want to revisit it.
*edited to not have years × 12 since this is not what we have in the application*