equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Regularize linear regression using LASSO (without structure) #6599

Open Blunde1 opened 8 months ago

Blunde1 commented 8 months ago

In ES we are updating a realization, $j$, of a parameter $x_i$ according to

$$x_{i,j}^a = x_{i,j}^f + K_{EnKF} (d_j-y_j)$$

where "a" means "analyzed" (posterior), and "f" means forecasted (prior). The $(d_j-yj)$ is the "innovation" term, i.e. the perturbed observations minus the responses from the forward model. The $K{EnKF}$ is an estimator of the Kalman gain $K=\Sigma_{x,y}\Sigmad^{-1}$ where $\Sigma{x,y}$ is the cross-covariance between parameters $x$ and responses $y$, and the inverse of the observation covariance matrix. Sometimes we use knowledge that randomness from $y=h(x)$ and observation measurement error $\epsilon$ is uncorrelated and write $\Sigma_d=\Sigmay+\Sigma\epsilon$.

In general $\Sigma_{x,y}$ and $\Sigma_y$ are unknown analytically. We therefore must estimate them to arrive at an estimator for $K_{EnKF}$.

In the TDI-SUB hackathon, team Ted-LASSO used the representation

$$K = \Sigma_{x} \hat{H}^\top(\hat{H}\Sigma_{x}\hat{H}^\top + \Sigma_{\epsilon})^{-1}$$

where $\hat{H}\approx E[\nabla_x h(x)]$ for $h:x\mapsto y$ was estimated using the LASSO with re-scaled coefficients. This was possible for GenKW parameters because, due to parameter transforms, ERT samples standard normal variables and thus $\Sigma_{x}=I$. This means that structure was used, which is much harder to know and employ for surfaces and fields. The code estimating $\hat{H}$ can however be re-used.
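
For illustration, a minimal sketch (not the actual Ted-LASSO code) of this construction on synthetic standard-normal parameters, fitting one LassoCV per response and rescaling the coefficients by the feature standard deviations; all names and shapes are hypothetical:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical toy ensemble: standard-normal (GenKW-like) parameters, linear responses.
N, n_params, n_obs = 200, 10, 4
X = rng.standard_normal((N, n_params))              # rows: realizations, cols: parameters
H_true = rng.standard_normal((n_obs, n_params))
Y = X @ H_true.T + 0.1 * rng.standard_normal((N, n_obs))

# Estimate each row of H by regressing one response on the scaled parameters,
# then rescale the coefficients back to the original parameter scale.
scaler = StandardScaler().fit(X)
X_scaled = scaler.transform(X)
H_hat = np.empty((n_obs, n_params))
for i in range(n_obs):
    model = LassoCV(cv=5).fit(X_scaled, Y[:, i])
    H_hat[i] = model.coef_ / scaler.scale_          # beta_tilde = beta / sigma

# With Sigma_x = I, the Kalman gain becomes K = H^T (H H^T + Sigma_eps)^{-1}.
Sigma_eps = 0.1 * np.eye(n_obs)
K = H_hat.T @ np.linalg.inv(H_hat @ H_hat.T + Sigma_eps)
```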

If we do not know (or do not want to assume) any structure, like $\Sigma_{x}=I$, then we may reformulate the estimated Kalman gain again as

$$ K_{EnKF} \approx XD^+ \to E[\nabla_d \tilde{h}^{-1}(d)] $$

where $\tilde{h}^{-1}:d\mapsto x$. This shows that

  1. we are approximating a type of gradient descent $x_{i,j}^a = x_{i,j}^f + E[\nabla_d \tilde{h}^{-1}(d)] (d_j-y_j)$, and
  2. we are approximating it (approximately) using least squares linear regression (LLS), $XD^+$.

Similarly, Ted-LASSO replaced the LLS $YX^{+}$ with rescaled LASSO coefficients, and this is why the code can be re-used.
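
A small numerical check (toy anomaly matrices, hypothetical shapes) that the $XD^+$ estimate is exactly the multi-output LLS regression of parameters on perturbed observations:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy anomaly matrices: X (n_params x N), D (n_obs x N).
n_params, n_obs, N = 6, 3, 150
X = rng.standard_normal((n_params, N))
D = rng.standard_normal((n_obs, N))

# Kalman-gain estimate as X D^+ (Moore-Penrose pseudo-inverse).
K_pinv = X @ np.linalg.pinv(D)

# The same estimate as multi-output least squares: regress each parameter
# (regression response) on the perturbed observations (regression features).
B, *_ = np.linalg.lstsq(D.T, X.T, rcond=None)
K_lls = B.T

print(np.allclose(K_pinv, K_lls))  # True up to numerical tolerance
```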

The multiple LASSO regression is found e.g. here: https://github.com/Blunde1/ert/blob/aee10a5722752f0ce1d4a91ae37167b528450001/src/ert/analysis/_es_update.py#L481-L507 Here, X are the features (the perturbed observations D in our case) and Y are the multiple response variables (the parameters X in our case).

A few things are worth noting:

  1. We are splitting the multiple linear regression into single regressions, which are still potentially high-dimensional, so regularization is important.
  2. The regularization strength is optimized for each observation-to-parameter map. This is important for adaptively handling complexity, so that users do not have to think about tuning. We do this using cross-validation, which is costly, so computational savings are likely to come from work done here. E.g. it is not infeasible that one parameter group could share a regularization strength $\alpha$ learnt from, say, 10-50 (throwing numbers out) random parameters in that group.
  3. We are not learning coefficients for the problem $y=\beta^\top x$ but rather $y=\beta^\top c(x)$, where $c(x_k)=\sigma_k^{-1}(x_k-\mu_k)$ is the sklearn standard scaler (important so that the $x_k$ do not carry scale when the corresponding coefficients are penalised). Consequently, for the update, we must find the corresponding $\tilde{\beta}$ so that $\tilde{\beta}^\top x + \beta_0 = \beta^\top c(x)$. For a single coefficient this yields $\tilde{\beta} = \sigma^{-1} \beta$. For a parameter $x_i$ we then find $K_{i,k}=\tilde{\beta}_{i,k}$ as the $k$'th rescaled coefficient in the $i$'th regression. This is used here: https://github.com/Blunde1/ert/blob/aee10a5722752f0ce1d4a91ae37167b528450001/src/ert/analysis/_es_update.py#L513-L514
  4. It can be done according to the streaming algorithm over parameters: choose one parameter, fit the regression from perturbed observations to that parameter, extract and re-scale the regression coefficients, and update. (A minimal sketch of this pattern is given after this list.)
  5. The scaling of observations should only be done once (it is outside the loop over responses).
  6. We may use other suitable regression types (e.g. Ridge), but LASSO has the benefit of sparsity (important for model interpretability, and if used together with structure). Note that e.g. a multi-task (multiple-response) LASSO will, I think, force the same features to be used across responses (but I am definitely not 100% certain).
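
A minimal sketch of the pattern in points 2-5 (toy stand-in data in ERT-like orientation with rows as realizations; names and the plain LassoCV call are illustrative, not the actual ERT code):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Hypothetical toy data: rows are realizations.
N, n_obs, n_params = 100, 20, 50
D = rng.standard_normal((N, n_obs))      # perturbed observations (regression features)
Y = rng.standard_normal((N, n_obs))      # forward-model responses
X = rng.standard_normal((N, n_params))   # parameters (one regression per column)

# Scale the features once, outside the loop over parameters (point 5).
scaler = StandardScaler().fit(D)
D_scaled = scaler.transform(D)

X_posterior = X.copy()
for i in range(n_params):
    # One regression per parameter, with cross-validated regularization (point 2).
    model = LassoCV(cv=5).fit(D_scaled, X[:, i])
    # Rescale coefficients back to the unscaled observation space (point 3):
    # beta_tilde = beta / sigma gives row i of the estimated Kalman gain.
    K_i = model.coef_ / scaler.scale_
    # Streaming update of parameter i across all realizations (point 4).
    X_posterior[:, i] = X[:, i] + (D - Y) @ K_i
```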

I think the code I pasted expects data in the format row: realizations, col: features, i.e. the transpose of what ERT normally employs.

tommyod commented 8 months ago

A comment on point 3. If we center and scale before fitting, then $y = \sum_i \beta_i (x_i - \mu_i)/\sigma_i = \sum_i (\beta_i / \sigma_i ) x_i - \sum_i (\beta_i / \sigma_i ) \mu_i$. So I agree that the coefficients in the original untransformed space become $\beta_i / \sigma_i$. But what about the intercept $- \sum_i (\beta_i / \sigma_i ) \mu_i$ we implicitly introduce? How does this affect the approach?

Blunde1 commented 8 months ago

> But what about the intercept $-\sum_i (\beta_i/\sigma_i)\mu_i$ we implicitly introduce? How does this affect the approach?

It should be unimportant, because we seek an expected gradient w.r.t. the features (not the full regression or some prediction), which for linear regression therefore "picks out" the coefficients on the features (and not the intercept). Additionally, if the responses are centered (which they are in the ERT ES case), then the intercept terms should sum to zero (I think), but even if they do not, it should not matter.
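
Spelled out in the notation of point 3 above: the gain entries come from the gradient of the fitted affine map with respect to the features, and that gradient drops the intercept,

$$\nabla_x\left(\tilde{\beta}^\top x + \beta_0\right) = \tilde{\beta}, \qquad \beta_0 = -\sum_i \frac{\beta_i}{\sigma_i}\mu_i,$$

so the implicit intercept never enters $K$.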

kwinkunks commented 8 months ago

A small note on your (very nice!) implementation.

You fit a Lasso model again after the CV step:

https://github.com/Blunde1/ert/blob/aee10a5722752f0ce1d4a91ae37167b528450001/src/ert/analysis/_es_update.py#L501-L503

But I think you can skip this step, because sklearn does this for you (as in its other CV classes). See Notes in the docs:

> In fit, once the best parameter alpha is found through cross-validation, the model is fit again using the entire training set.

So after CV, you can treat model_cv as the optimized model. Not a huge optimization, but a little less code :)
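
For concreteness, a minimal sketch of what skipping the refit looks like (toy stand-in data, not the actual ERT code):

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 20))   # stand-in for the scaled perturbed observations
y = rng.standard_normal(100)         # stand-in for one parameter column

model_cv = LassoCV(cv=5).fit(X, y)

# No second Lasso(alpha=model_cv.alpha_).fit(X, y) is needed: after the CV
# search, LassoCV refits on the full training set with the best alpha,
# so the coefficients can be read off directly.
coefficients = model_cv.coef_
best_alpha = model_cv.alpha_
```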

kwinkunks commented 8 months ago

I also note this remark in the docs:

> To avoid unnecessary memory duplication the X argument of the fit method should be directly passed as a Fortran-contiguous numpy array.

NumPy arrays are normally C-contiguous; you can check with print(arr.flags).

I have no idea how much difference it will make, but you might want to check performance after converting with something like np.asfortranarray(arr) or passing order='F' when reshaping.
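
For example (toy array; shapes are arbitrary):

```python
import numpy as np

X = np.random.default_rng(4).standard_normal((1000, 200))   # C-contiguous by default
print(X.flags["C_CONTIGUOUS"], X.flags["F_CONTIGUOUS"])     # True False

# Convert once before fitting so sklearn's coordinate descent
# does not have to copy the array internally.
X_f = np.asfortranarray(X)
print(X_f.flags["F_CONTIGUOUS"])                            # True
```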

kwinkunks commented 8 months ago

We talked about this, but just for the record you should maybe create a pipeline containing the scaler and the LassoCV model, and pass in unscaled data. That way, the scaling in each CV fold will only use the statistics of the training fold.
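
A minimal sketch of that pipeline (toy stand-in data; the final rescaling follows point 3 in the issue description):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
D = rng.standard_normal((100, 20))   # stand-in for unscaled perturbed observations
x = rng.standard_normal(100)         # stand-in for one parameter column

# Scaler + LassoCV as one estimator: each CV fold is scaled using only
# the statistics of its own training part.
pipeline = make_pipeline(StandardScaler(), LassoCV(cv=5)).fit(D, x)

scaler = pipeline.named_steps["standardscaler"]
lasso = pipeline.named_steps["lassocv"]
coefficients = lasso.coef_ / scaler.scale_   # rescaled back to the unscaled space
```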

Blunde1 commented 8 months ago

> We talked about this, but just for the record you should maybe create a pipeline containing the scaler and the LassoCV model, and pass in unscaled data. That way, the scaling in each CV fold will only use the statistics of the training fold.

Actually we will have only one scaling (on x), but multiple trained models (different responses y), so I am less certain that a pipeline makes sense.

dafeda commented 8 months ago

Adding Patrick's talk from the last EnKF workshop, as it helps in understanding this:

$$ K_{EnKF} \approx XD^+ \to E[\nabla_d \tilde{h}^{-1}(d)] $$

https://ncda-fs.web.norce.cloud/WS2023/raanes.pdf

Blunde1 commented 8 months ago

See in particular https://github.com/equinor/graphite-maps/blob/main/graphite_maps/linear_regression.py; we could import this directly.

dafeda commented 5 months ago

Moving this out of the Adaptive Localization milestone as the definition of done of that milestone does not include LASSO. This is just a technicality so that we can close the milestone.