Why do we do this? Is it because of the sentence "A scaling of the equations should always be done to ensure a well-conditioned inverse" just below Eqn (51)? Is there some other reason?
Do we have control over whether this does what we want? We scale D before passing it to create_coefficient_matrix, where H is defined as:

```
H := D + S * W
   = scale(E + observation_values - response_ensemble) + S * W
   = scale(center(E) + observation_values - response_ensemble) + S * W
```
H is the right hand side of equation (42), but for the equation to hold, the same scaling must be done on both sides of the equation. Is this really the case here?
I want to investigate this.
If it checks out, it should be rewritten so all scaling happens in one place. A property that should hold is `solve(scale(arg1), scale(arg2)) = solve(arg1, arg2)`. This can and should be tested; see the sketch below.
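Here is a minimal sketch of what such a property test could look like. The `scale` and `solve` helpers are hypothetical stand-ins (row-wise division by observation standard deviations and a plain linear solve), not the library's actual functions:

```python
import numpy as np

def scale(matrix, obs_std):
    # Hypothetical scaling: divide each row (one row per observation) by its std. dev.
    return matrix / obs_std[:, None]

def solve(lhs, rhs):
    # Hypothetical stand-in for the solve inside create_coefficient_matrix
    return np.linalg.solve(lhs, rhs)

def test_solve_is_invariant_under_scaling():
    rng = np.random.default_rng(42)
    lhs = rng.normal(size=(5, 5))
    rhs = rng.normal(size=(5, 3))
    obs_std = rng.uniform(0.5, 2.0, size=5)
    # Scaling both sides of the equation by the same row scaling must not change the solution.
    np.testing.assert_allclose(
        solve(scale(lhs, obs_std), scale(rhs, obs_std)),
        solve(lhs, rhs),
        rtol=1e-6,
    )
```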
Back to the sentence. I don't understand "A scaling of the equations should always be done to ensure a well-conditioned inverse". What type of scaling? Does it always improve the conditioning? How so?
Consider solving $(S^T S + D)x = S^T$ as in eqn (51), where $D$ is diagonal. We could scale it to $(D^{-1/2} S^T S D^{-1/2} + I) (D^{1/2} x) = D^{-1/2} S^T$, solve that for $y = D^{1/2} x$, and then recover $x = D^{-1/2} y$. Is this what is meant? I'm not sure.
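To convince myself the algebra above is right, here is a quick numerical check (using the same toy $S^T$ and $D$ as in the counterexample below) that the scaled system recovers the same $x$:

```python
import numpy as np

# Toy problem: treat S^T as a vector and D as a diagonal matrix
sT = np.array([1.0, 2.0, 3.0])
D = np.diag([3.0, 2.0, 1.0])

# Unscaled: solve (S^T S + D) x = S^T
x = np.linalg.solve(np.outer(sT, sT) + D, sT)

# Scaled: solve (D^{-1/2} S^T S D^{-1/2} + I) y = D^{-1/2} S^T, then recover x = D^{-1/2} y
D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(D)))
y = np.linalg.solve(D_inv_sqrt @ np.outer(sT, sT) @ D_inv_sqrt + np.eye(3), D_inv_sqrt @ sT)
x_from_scaled = D_inv_sqrt @ y

print(np.allclose(x, x_from_scaled))  # True
```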
Anyway, the condition number of the left hand side matrix does not always improve. Here is a counterexample:
```python
import numpy as np

# Create S^T (as a vector) and the diagonal matrix D
sT = np.arange(3) + 1
D = np.diag([3, 2, 1])

# Condition number of the unscaled left-hand side S^T S + D
print(np.linalg.cond(np.outer(sT, sT) + D))  # 9.299360

# Condition number after scaling with D^{-1/2}
D_inv_sq = np.diag(1 / np.sqrt(np.diag(D)))
print(np.linalg.cond(D_inv_sq @ np.outer(sT, sT) @ D_inv_sq + np.eye(3)))  # 12.33333
```
In his 2012 paper, Emerick proposes rescaling using the Cholesky factorization (Appendix A1). But this is different from both (1) scaling by dividing by standard deviations and (2) scaling by transforming covariance matrices to correlation matrices. The reason for his proposal is to avoid truncating small singular values in the SVD when measurements are of different orders of magnitude, which seems like a different reason than "ensuring a well-conditioned inverse".
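To illustrate that point with a toy sketch (made-up numbers, a diagonal error covariance C_D, and not the exact procedure from Appendix A1): when measurements differ by orders of magnitude, so do the singular values of S, and an energy-based truncation throws away the information in the small-magnitude measurements. Whitening with the inverse Cholesky factor of C_D makes the singular values comparable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two groups of "measurements" with very different magnitudes,
# e.g. pressures ~1e7 and water cuts ~1e-1 (illustrative numbers only)
S = np.vstack([
    1e7 * rng.normal(size=(5, 20)),
    1e-1 * rng.normal(size=(5, 20)),
])

# Diagonal error covariance with standard deviations proportional to the magnitudes
C_D = np.diag(np.concatenate([np.full(5, 1e6**2), np.full(5, 1e-2**2)]))

print(np.linalg.svd(S, compute_uv=False))  # singular values span ~8 orders of magnitude

# Whiten with the inverse Cholesky factor of C_D (just 1/std here, since C_D is diagonal)
L = np.linalg.cholesky(C_D)
print(np.linalg.svd(np.linalg.solve(L, S), compute_uv=False))  # comparable magnitudes
```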
Thoughts @dafeda? I think you mentioned something about this once. If not, I will investigate this in the future.
Scaling the equations by transforming the covariance matrix to a correlation matrix seems to improve the conditioning.
There's no guarantee that it does. One can easily construct counterexamples.
There are two scalings that can be considered: scaling the covariance matrix to a correlation matrix, or scaling it to the identity matrix (using Cholesky factors).
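For concreteness, a toy sketch of the two scalings applied to a generic covariance matrix (not the library's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
C = A @ A.T + 4 * np.eye(4)  # a symmetric positive definite "covariance" matrix

# (1) Scale to a correlation matrix: divide rows and columns by the standard deviations
std = np.sqrt(np.diag(C))
corr = C / np.outer(std, std)
print(np.diag(corr))  # all ones; off-diagonals are correlations

# (2) Scale to the identity using the Cholesky factor L (C = L L^T): L^{-1} C L^{-T} = I
L = np.linalg.cholesky(C)
identity = np.linalg.solve(L, np.linalg.solve(L, C).T).T
print(np.allclose(identity, np.eye(4)))  # True
```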
In #136, the scaling has been documented in detail in the code. It's still part of the code, but much more understandable now IMO. Closing this, since much of the initial confusion is gone.
In this line we scale the covariance matrix to a correlation matrix:
https://github.com/equinor/iterative_ensemble_smoother/blob/24d771149134299102f8770c397b670c1d1efbf8/src/iterative_ensemble_smoother/utils.py#L120
And here we scale D and E:
https://github.com/equinor/iterative_ensemble_smoother/blob/24d771149134299102f8770c397b670c1d1efbf8/src/iterative_ensemble_smoother/_iterative_ensemble_smoother.py#L124