Add Stratified Sampling to ERT

sondreso commented 2 months ago

Feature Request:

Implement stratified sampling in ERT to improve the sampling process. Stratified sampling could provide better coverage of the sample space and reduce the risk of clustering. The suggestion includes a configuration option on the parameter group that defines the sampling strategy, such as RANDOM or STRATIFIED, with the possibility of adding LATIN_HYPERCUBE in the future.

Suggested Feature

Stratified Sampling Implementation: Introduce stratified sampling as an option for parameter updates within ERT. This would involve setting d=1 and converting the result to a normal distribution, following the method described in this Stack Overflow post.
Configuration Option: Add a new configuration option SAMPLING_STRATEGY on the parameter group level in the ERT configuration files. The user could specify SAMPLING_STRATEGY:STRATIFIED to enable stratified sampling for that parameter group.
Naming Consideration: Instead of naming the random sampling strategy as RANDOM, consider using MONTE_CARLO, if this is appropriate.
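
A minimal sketch of what the d=1 approach could look like, assuming it means one uniform draw per equal-probability stratum, mapped through the inverse CDF of the standard normal. This uses only the Python standard library and is not necessarily the method from the linked Stack Overflow post; the function name `stratified_standard_normal` is hypothetical and not part of ERT.

```python
import random
from statistics import NormalDist


def stratified_standard_normal(n, seed=None):
    """Draw n standard-normal samples, one from each of n equal-probability strata.

    Hypothetical helper for illustration only; not an ERT function.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    samples = []
    for i in range(n):
        # Uniform draw inside stratum [i/n, (i+1)/n).
        u = (i + rng.random()) / n
        # The inverse CDF is undefined at exactly 0, so clamp away from it.
        u = max(u, 1e-12)
        samples.append(std_normal.inv_cdf(u))
    # Shuffle so the realization index says nothing about the stratum.
    rng.shuffle(samples)
    return samples


if __name__ == "__main__":
    print(stratified_standard_normal(5, seed=42))
```

Each parameter would get its own independent call (and its own shuffle), which is what keeps the parameters independent of each other in this sketch.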

Benefits

Better sample space coverage.
Reduced risk of clustering in parameter sampling.
More intuitive user experience with control over parameter sampling.
Alignment with requests for support of Latin Hypercube sampling.

Considerations

Determine if the stratified sampling should be applied per parameter or for the entire parameter group.
Assess the impact of the proposed stratified sampling on the update process and correlated parameters.
Discuss the naming and approach with statisticians for validation.
- In ERT we understand the update through the statistical properties of the ensemble X (independently sampled). We estimate Cov(X), have Monte Carlo samples of Y=g(X), estimate Cov(Y) and regress Y on X. The properties are understood. By changing the sampling, one risks losing that understanding. The statistical properties, at least of these specific estimators, as a function of an LHC sample, must be understood before changing ERT.
Consider making STRATIFIED the default sampling method. This is a breaking change and would require communication to users.

Additional Context

Support for Latin Hypercube has been requested multiple times, with an issue dating back to 2018.
Stratified sampling is seen as a more straightforward approach and could be the first step before considering more complex sampling strategies like Latin Hypercube.
See this PR for context on how the UPDATE:FALSE option was added.

This feature request has been compiled from an internal discussion (link).

Blunde1 commented 2 months ago

This should be under Considerations:

In ERT we understand the update through the statistical properties of the ensemble X (independently sampled). We estimate Cov(X), have Monte Carlo samples of Y=g(X), estimate Cov(Y) and regress Y on X. The properties are understood. By changing the sampling, one risks losing that understanding. The statistical properties, at least of these specific estimators, as a function of an LHC sample, must be understood before changing ERT.

sondreso commented 2 months ago

If we use the d=1 option, the variables would still be independently sampled, or no? 🤔

Blunde1 commented 2 months ago

If we use the d=1 option, the variables would still be independently sampled, or no? 🤔

I think across dimensions, yes, but not across samples.
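
To illustrate the "not across samples" part, here is a small sketch under the same assumption as above (one uniform draw per equal-probability stratum, mapped to a standard normal). If the realizations were independent, the variance of the ensemble mean would be close to 1/n. The helper names are hypothetical and none of this is ERT code.

```python
import random
from statistics import NormalDist, fmean, pvariance


def iid_normal(n, rng):
    # Plain Monte Carlo: n independent standard-normal draws.
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def stratified_normal(n, rng):
    # One uniform draw per stratum [i/n, (i+1)/n), clamped away from 0,
    # then mapped through the standard-normal inverse CDF.
    std_normal = NormalDist()
    return [
        std_normal.inv_cdf(max((i + rng.random()) / n, 1e-12))
        for i in range(n)
    ]


def variance_of_ensemble_mean(sampler, n=10, repeats=2000, seed=0):
    # Repeatedly draw an ensemble of size n and record its mean.
    rng = random.Random(seed)
    means = [fmean(sampler(n, rng)) for _ in range(repeats)]
    return pvariance(means)


iid_var = variance_of_ensemble_mean(iid_normal)
stratified_var = variance_of_ensemble_mean(stratified_normal)
print(f"iid: {iid_var:.4f}, stratified: {stratified_var:.4f}")
```

The stratified ensemble mean varies much less than 1/n would predict, which is only possible if the realizations are negatively dependent on each other. That dependence is what the Cov(X) and regression estimators would need to be re-examined against.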
