Verify that SIES works when realizations become inactive over the sequence

Blunde1 commented 11 months ago

The function propose_W_masked takes the argument ensemble_mask. This is to handle realizations that are inactive (typically forward model fails on a realization, and this cannot be avoided and must be handled).

In the implementation, the proposed W will, I think, change shape to (N, K) where N is previous realizations and K<N is currently active.

This situation is not accounted for in the reference paper, but the situation cannot be avoided. We should verify that the IES algorithm still works when doing so.

Blunde1 commented 11 months ago

Also, in ERT we use

 X + X @ sies_smoother.W / np.sqrt(
                    len(iens_active_index)
                )

is this the correct N to scale with?

tommyod commented 11 months ago

We have two tests for dying realizations:

I think these are sensible, and would fail if the implementation was obviously wrong. Can we think of more, or better, tests?

tommyod commented 11 months ago

Also, in ERT we use
 X + X @ sies_smoother.W / np.sqrt(
                    len(iens_active_index)
                )
is this the correct N to scale with?

I think is this wrong. We should have a minus one. See this issue.

I am not sure if it's the correct N to scale with. I guess is that $N$ should be the number of realizations still alive. But not sure.

tommyod commented 10 months ago

Will soon close this issue, since:

Minus one bug in ERT has been fixed: https://github.com/equinor/ert/pull/6839
This PR adds even more tests for dying realizations: https://github.com/equinor/iterative_ensemble_smoother/pull/198

I think we are as safe as we can be, unless we dig into the SIES paper and generalizes the equations to the case of dying realizations with proofs etc. This is not a priority as this time.

geirev commented 10 months ago

As far as I remember, I implemented a strategy where, if a realization fails, we remove the corresponding row and column in W and proceed with a smaller ensemble.

When the IES algorithm is used with ES and ESMDA, there are no issues related to this strategy as we don't store and use the W from the previous step (it is always initialized to zero).

However, with the IES algorithm, we need to be more careful and ensure the removal of the correct row and column. The tests were designed to verify this algorithm and to check that the iterations still converge.

geirev commented 10 months ago

Note that we change the shape of W from W(N,N) to W(K,K) using the notation in the thread, where K<N is the new number of active realizations. W is always quadratic.

tommyod commented 10 months ago

I implemented removal of the columns of $W$, not the rows and columns. My intuition might be wrong, but here is an example:

Ex1. Suppose the prior for $X$ is $I$ with shape (100, 100). Suppose the posterior mean is approximately $(25, 25, ...., 25) \in \mathbb{R}$ with 100 elements. Assume that we run 99 iterations, where in each iteration a realization dies in consecutive order. At the end we have one realization left in the posterior. If we remove the rows and columns of $W$, the posterior must be in the space spanned by $(0, 0, 0, ...., 0, 25)$. If we instead only remove columns, then the posterior will be approx $(25, 25, ..., 25)$ - it will be in the space spanned by the prior, but when realization $i$ dies at iteration $i$, then that posterior realization is not updated in the future.

Ex2. A more concrete example. Suppose the prior consists of vectors $(1, 0)$, $(1, 0)$ and $(0, 1)$. The posterior mean is approx $(6, 6)$ and each column in $W$ is around $(2, 2, 2)$ or so. We're very close to the solution. With the square $W$, if the first realization dies, then columns update to $(2, 2)$ and we remove the first column in $X$. The solution no long near the posterior mean. If we instead remove the first column in $W$, then we get a posterior where every entry is still close to $(2, 2, 2)$ - we simply lost one of the realizations.

In summary removing rows and columns forces the posterior to lie in the subspace spanned by the prior realizations that are still alive at the end. Removing columns forces the posterior to lie in the subspace spanned by the realizations in the previous iteration.

This is a bit hand-wavy, but hopefully clear. If not let me know.

Blunde1 commented 10 months ago

In summary removing rows and columns forces the posterior to lie in the subspace spanned by the prior realizations that are still alive at the end. Removing columns forces the posterior to lie in the subspace spanned by the realizations in the previous iteration.

I think this is really the key, and a great insight @tommyod. Do you agree @geirev ?

geirev commented 10 months ago

I see your point about retaining the complete initial ensemble. But, in the iterative methods IES and ESMDA, you need a new set of predicted measurements in each iteration or step. And if a realization fails, you can not get these predicted measurements. So, my thought was that when a realization fails, we remove it from the computation. In that case, we don't change anything in the algorithm, we just call it with a smaller ensemble size. Makes sense?

tommyod commented 10 months ago

But, in the iterative methods IES and ESMDA, you need a new set of predicted measurements in each iteration or step. And if a realization fails, you can not get these predicted measurements.

Yes, so if $X$ has e.g. 20 realizations and 3 fail, then $Y = g(X)$ has 17 columns. The 3 failed realizations must be removed from $X$ and $Y$ in that iteration. No disagreement with this. Also, if realizations fail on the first iteration then I agree we can just re-define the problem and forget we ever had the realizations that failed. What follows concerns an arbitrary iteration.

So, my thought was that when a realization fails, we remove it from the computation. In that case, we don't change anything in the algorithm, we just call it with a smaller ensemble size. Makes sense?

Removing dying realizations from the computation makes sense. The question is exactly how to do it. Here we seem to have two different approaches. Let's look at the equations.

Let $K \leq N$ be the number of realizations still "alive."

If $N -K$ rows and columns are removed from $W$, then in line 6 we solve a system with shapes $\Omega^T \in \mathbb{R}^{K \times K}$, $S^T \in \mathbb{R}^{K \times m}$ and $Y^T \in \mathbb{R}^{K \times m}$.
If columns are removed from $W$, then we first compute $\Omega$ using a full $W$, then slice out the relevant columns. We solve a system of shapes $\Omega^T \in \mathbb{R}^{K \times N}$, $S^T \in \mathbb{R}^{N \times m}$ and $Y^T \in \mathbb{R}^{K \times m}$. In the code this approach found in the method propose_W_masked, which computes a $W \in \mathbb{R}^{N \times K}$.

My intuition for this is outlined above, so I won't repeat it in detail. If the $W$ in line 9 has shape $K \times K$, then $X{i+1}$ must be a linear combination of the realizations that are still alive at the end of the iterations. If $W$ has shape $N \times K$, then $X{i+1}$ can be a linear combination of all realizations in the prior. Either way $X_{i+1}$ has $K$ columns and dying realizations are not updated further.

The question is which of the specific approaches outlined above that is correct and why.

geirev commented 9 months ago

I would say you should eliminate and forget all dead realizations. Including them in them un-updated in the posterior ensemble leads to a too-large and incorrect posterior uncertainty that will impact the updates in the following steps.

Blunde1 commented 9 months ago

I would say you should eliminate and forget all dead realizations. Including them in them un-updated in the posterior ensemble leads to a too-large and incorrect posterior uncertainty that will impact the updates in the following steps.

@geirev how does @tommyod's proposal include dead realizations, and how does this inclusion leads to a too-large posterior uncertainty? I do not understand how it is possible to include dead realizations, as there is no response.

So, my thought was that when a realization fails, we remove it from the computation. In that case, we don't change anything in the algorithm, we just call it with a smaller ensemble size. Makes sense?

No, this only makes sense if you forget about the state $W$. You can forget state only at the first iteration because $W_0=0$ and here the quadratic $K\times K$ method and $N\times K$ methods will be equivalent (guessing, I have not done the math).

In the extreme, If you have 100 iterations and all realizations live until the last iteration where all but one die, then the influence of the dead realizations on the state $W$ is massive. Two things go wrong.

Subsetting it to $1\times 1$ and doing the iteration calculation is not the same as "call it with a smaller ensemble size". The properties of the subsetted $W$ are unknown, but certainly not the same as IES with a smaller ensemble size, and doing an iteration on it is unknown territory.
We discard information that the now dead realizations gave in the 99 previous iterations while they were living, and which is captured in $W_{99}$.

In the other extreme, the quadratic $W$ approach will be okay if $K/N \approx 1$, because the influence of dead realizations on the state $W$ will be marginal.

The question is if the proposal of @tommyod better alleviates the two things going wrong with $K\times K$ approach pointed out above. What are your thoughts?

Edit: the world of ERT is that $N$ is small and the dimensions of our problems large, so the points 1. and 2. above are important. We are not in an asymptotic $N$ large case where dimensions and number of observations are small in comparison.

geirev commented 9 months ago

I think the extreme case is a bit too extreme. If you loose more than a few realizations you have a problem and ERT is not going to fix it. Also, the failing realizations are typically failing for a reason, and the reason is likely the initial parameterization for the failing realization. Hence, it makes sense to get rid of it.

I don't understand the argumentation for keeping the dead realizations. Are you only keeping them in the current iteration when they fail, and not in the subsequent iterations? In that case it doesn't make a big difference (except you may introduce some poorly defined parameters into the other successful realizations).

If you retain the dead not-updated realizations in the prior ensemble for the next iteration you will have a too large and inconsistent ensemble variance.

Blunde1 commented 9 months ago

I think the extreme case is a bit too extreme. If you loose more than a few realizations you have a problem and ERT is not going to fix it.

The extreme case showcases how to think and understand the problems. The problems then occur but to a different extent, how large depends on multiple factors. It is not known if they can be discarded.

Also, the failing realizations are typically failing for a reason, and the reason is likely the initial parameterization for the failing realization. Hence, it makes sense to get rid of it.

What does "get rid of it" mean, precisely? If one believes realizations fail due to a bad sample that should not be sampled, then one should change the prior. One way to change the prior is to run the full algorithm again only with the realizations that survived (use the previous run as a filter on the prior). I think this is what you are thinking is happening when we subset $W$ to $K\times K$, but this is not the case. The subset of $W$ will still be influenced by the dead realizations, but its behaviour and properties is undefined. It is not the same as running IES from scratch.

I don't understand the argumentation for keeping the dead realizations. Are you only keeping them in the current iteration when they fail, and not in the subsequent iterations? In that case it doesn't make a big difference (except you may introduce some poorly defined parameters into the other successful realizations).

If the realizations fail, at some later iteration, due to random chance. The state of $W$ before the iteration where they die is correctly influenced by them. This information should be brought along, even when realizations die at the current iteration. To "get rid of" realizations gracefully then implies bringing this information along with the current iteration. This is what @tommyod attempts with the $N\times K$ approach. But we would like the mathematics to back this up.

If you retain the dead not-updated realizations in the prior ensemble for the next iteration you will have a too large and inconsistent ensemble variance.

Realizations that die will not remain un-updated and brought along with @tommyod's approach. The $W$ is $N\times K$. Only the information from when they lived is brought along through $W$.

Blunde1 commented 9 months ago

I did the math. The square $K \times K$ subset is wrong and @tommyod is correct, given that I have calculated correctly.

Assume no realizations die until iteration $i$. From definition we have $X_i = X_0 + AW_i = X_0(I+W_i/\sqrt{N-1})$ (Eq. 42 in the reference paper). So all columns in $X_i$ are linear combinations of the columns in $X_0$ at this iteration.

Assume further that at the next iteration, computing $X_{i+1}$, we are left with $K < N$ realizations, as some $N-K$ realizations somehow could not compute $g((Xi)_j)$ . The average sensitivity $\overline{G}\{i}$ must now be based on the $K$ realizations, say $X_i^\ast=X_i[:,\texttt{living\_realizations}]$ that could compute a corresponding $m\times K$ matrix $Y_i^\ast=g(Xi^\ast)$. So $\overline{G}\{i}=Y_i^\ast A_i^{\ast +}$.

We could use the above average sensitivity directly in Eq. 26 to update the living realizations. And this would be fine.

However, using the notation of the paper with $\Omega$ and how Algorithm 1 works. We can also do it the way @tommyod has defined. From the definition of $\Omega_i$ in Eq. 31, we can write $A_i^{\ast}$ as $$A_i^{\ast}=c(X_i^\ast)=c(X_i)[:,\texttt{living\_realizations}]=A\Omega_i[:,\texttt{living\_realizations}]=A\Omega_i^\ast$$ where $c(\cdot)$ does centering. This is exactly how $\Omega_i^\ast$ is defined in the code, i.e. $\Omega_i^\ast=\Omega_i[:,\texttt{living\_realizations}]$.

$S_i^\ast$ follows from Eq. 38 and is implemented as such in the code. Finally $W_{i+1}^\ast$ follows from line 8 in Algorithm 1 together other $\ast$ definitions and in particular $W_i^\ast$ defined from $$X_i^\ast = X_i[:,\texttt{living\_realizations}] = X_0[:,\texttt{living\_realizations}]+AW_i[:,\texttt{living\_realizations}] = X_0^\ast + A W_i^\ast$$

Edit: In particular, we have $$X_i^\ast \neq X_0[:,\texttt{living\_realizations}] + A[:,\texttt{living\_realizations}] W_i[\texttt{living\_realizations},\texttt{living\_realizations}]$$ so the $K\times K$ modification does not work.

geirev commented 9 months ago

Hi again, I don't fully understand the discussion, but below is my take on the issue of losing realizations.

You have to look at the sizes of the matrices.

In iteration $i$, you would normally have: $X_i\in \Re^{n\times N}$, $Y_i=g(X_i)\Pi \in \Re^{m\times N}$, $S\in \Re^{m\times N}$, and $H\in \Re^{m\times N}$. where $m$ is the number of measurements, $n$ is the number of state variables, and $N$ is the ensemble size.

If you lose one realization in iteration $i$, you will have the following: $X_i\in \Re^{n\times N}$, $Y_i^\ast=g(X_i)\Pi^\ast \in \Re^{m\times (N-1)}$, $S^\ast_i\in \Re^{m\times (N-1)}$, and $H^\ast_i\in \Re^{m\times (N-1)}$ where the failed realizations are excluded from $Y^\ast_i$, $S^\ast_i$, and $H^\ast_i$.

We compute the update from:

$W_{i+1} = W_i - \gamma (W_i - S^T (S S^T + E E^T)^{-1} H)$

Here, $C=S S^T + E E^T \in \Re^{m\times m}$ and $S^\ast (S^\ast)^T + E E^T \in \Re^{m\times m}$ Note that we can use as many realizations as we like to represent $E$, and I often increase the size of $E$ to reduce sampling errors.

Now, in the case of lost realizations, we have

$W_{i+1} = W - \gamma ( W_i - (S_i^\ast)^T(m,N-1) C^{-1}(m,m) H(m,N-1) )$,

where it is clear that the increment

$(S_i^\ast)^T(m,N-1) C^{-1}(m,m) H(m,N-1) ) \in \Re^{N-1 \times N-1}$

So, we can not compute updates to $W_i$ for the row and columns corresponding to the lost realizations, and this is the argument for entirely removing all lost realizations and defining a new $W^\ast \in \Re^{(N-1)\times (N-1)}$ in the case of on lost realization.

For IES, we will only compute the update in the space spanned by the initial `surviving' realizations.

We only compute one update for ES, and the situation is the same as for IES.

For ESMDA, the situation changes. Here, you compute increments to the previous update, so $X_{i+1}=X_i W_i $. Thus, if you lose a realization at iteration $i$, we will remove it from the continued iterations, but the other realizations will still contain the impact of this realization through the previous MDA updates.

Maybe we are all talking around each other, and it would be helpful to discuss on a blackboard.

Blunde1 commented 9 months ago

Thanks, @geirev

For ESMDA, the situation changes. Here, you compute increments to the previous update, so $X_{i+1} = X_iW_i$. Thus, if you lose a realization at iteration, we will remove it from the continued iterations, but the other realizations will still contain the impact of this realization through the previous MDA updates.

I believe this is the case also with IES. It follows from Equation 22 in the reference paper, on how to update a realization $j$ at iteration $i$ of a "weight" $w_j^{i+1}$ as a function of $wj^i$, $\overline{G}_{i}$, $A$, and $C{dd}$.

I think we fully agree that $\overline{G}_i$ can only be estimated from living realizations.

I think we do not agree on $S^\ast$. From definitions, working backwards

Eq. 38 defining $S_i^\ast$: $S_i^\ast = Y_i^\ast \Omega_i^{\ast-1}$.
Eq. 31 defining $\Omega_i^\ast$: $A_i^\ast = A\Omega_i^\ast$. Note that $A$ is the original prior ensemble.
Eq. 31 defining $A_i^*$: $A_i^\ast = c(X_i^\ast)$.

We use the full $A$ even though we base $\overline{G}_i^\ast$ on only living realizations.

So, we can not compute updates to $W_i$ for the row and columns corresponding to the lost realizations.

We cannot compute updates to the columns of $W_i$ corresponding to failing realizations. The rows are a different matter. They hold the information on previous computations leading to $W_i$ for all columns, also for living realizaitons.

Edit: But I agree on moving this discussion to a blackboard if we still do not agree.

geirev commented 9 months ago

You want to retain the information from previous iterations in the IES coming from failed realizations.

You must then work with a rectangular $W_i^\ast$ and $\Omega^\ast$. I will need to spend time on the algorithm to comprehend what this implies fully. E.g., how do you define $S^\ast_i = Y^\ast_i (\Omega^\ast_i)^{-1}$ when $\Omega^\ast_i$ is a rectangular matrix? The columns in $S$ are scaled versions of the columns in $Y$.

I am curious whether this approach will improve the results compared to abandoning lost realizations, but an implementation and comparison would help.

An alternative could be built on the following redefinition of the IES algorithm:

$X_0$ $X_1=X_0 T_1$ $X_2=X_0 T_2 = X_1 T_1^{-1} T_2 = X_1 T_2^\ast$ $X_3=X_0 T_3 = X_2 T_2^{-1} T_3 = X_2 T_3^\ast$ and so on.

Then, we compute the updates $X_{i+1}$ from the previous iterate $X_i$, and the algorithm will carry forward the information from the previous iterations, just like in the ESMDA.

Let's look at this on the blackboard at the next occasion.

equinor / iterative_ensemble_smoother

Verify that SIES works when realizations become inactive over the sequence #196