equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

On the data format of numerical records #1731

Closed markusdregi closed 1 year ago

markusdregi commented 3 years ago

Introduction

Data is being passed around between ert-storage, webviz-ert, prefectevaluator, ert and ert3 as records. While records are to encompass functionality to persist them to/from storage and file, to be transmitted (transported lazily) and to give some provenance data, their main responsibility is to carry data. And while we have multiple different types of record data, the only one whose content ert should be opinionated about is the numerical record. It is about time we have a discussion on the format of a numerical record and create a consistent first definition that can be used across the above-mentioned responsibilities.

Motivation

To ensure a coherent user experience we need to strictly separate business data from opaque data. Opaque data is data that we take responsibility for transporting around, but whose content we do not inspect. Business data is data that we both transport around and care about the content of. In our setting, records for ERT, business data is numerical data...

We should standardize on a layout for numerical data such that as long as the data is valid numerical data we can:

According to the development strategy the data model of ert3 should be a superset of the ert2 data model. An immediate consequence of this is that all current data types should fit into the numerical record concept.

Current data types

Here we explore the currently supported data types of ert2. Notice that the current data types of ert3 are basically GEN_KW and GEN_PARAM. Hence, support for the ert3 data types follows from the ert2 data types. For more detailed information we refer to the ERT documentation.

GEN_KW: Represents data which is a mapping from strings to floats.

FIELD: Represents 3D data which can be indexed using integer coordinates (i, j, k), each element being a float.

SURFACE: Represents 2D data which can be indexed using integer coordinates (i, j), each element being a float.

GEN_PARAM: Represents 1D data which can be indexed using integer coordinates (i), each element being a float.

SUMMARY_DATA: Represents summary data from Eclipse. We can start by considering this to be time series data, each element being a float.

GEN_DATA: Represents 1D data which can be indexed using integer coordinates (i), each element being a float.
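As a rough illustration of the shapes described above, here is a minimal sketch using plain Python containers. All names and numbers are made up for the example, and `shape` is a hypothetical helper, not something from ert.

```python
from datetime import date

# All names and numbers below are made up for illustration.

# GEN_KW: a mapping from strings to floats.
gen_kw = {"porosity": 0.2, "permeability": 150.0}

# GEN_PARAM and GEN_DATA: 1D data indexed by (i).
gen_param = [1.5, 2.5, 3.5]

# SURFACE: 2D data indexed by (i, j).
surface = [[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]]

# FIELD: 3D data indexed by (i, j, k).
field = [[[float(i + j + k) for k in range(4)] for j in range(3)] for i in range(2)]

# SUMMARY_DATA: a time series, here keyed by date.
summary = {date(2020, 1, 1): 100.0, date(2020, 2, 1): 95.5}


def shape(nested):
    """Return the shape of a non-empty rectangular nested list.

    Only the first element of each nesting level is inspected.
    """
    dims = []
    while isinstance(nested, list):
        dims.append(len(nested))
        nested = nested[0]
    return tuple(dims)
```

With these values, `shape(gen_param)`, `shape(surface)` and `shape(field)` give one, two and three dimensions respectively, while `gen_kw` and `summary` are 1D data whose index is a set of strings or dates rather than integers.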

Content, index and dimensionality

I suggest that we start by creating a natural, yet limited data format that allows us to represent the above data. A suggestion would be that all numerical data can be considered as a:

This should allow us to represent all of the above data types.

sondreso commented 3 years ago

I think this is very good! The only thing I think is missing is how we should do float indexes (with units). Maybe we could generalise the index concept so that for each dimension we store a mapping from the indexes to a set of integers, a set of strings, a set of ISO timestamps or a set of floats, with units.

markusdregi commented 3 years ago

Thanks for the feedback @sondreso :) I would say that we should do floats with units similarly to how we do the other indices, but I deliberately left them out as we don't have an end-to-end use case for them yet (that will definitely come though) and I think we should avoid implementing support for functionality that is left unused by the consumers...

pinkwah commented 3 years ago

Can you explain the motivation behind doing sets of integers as indices? Dimensionality of max 3 is a bit arbitrary for my liking, why not do any? It would require (very slightly) more code to limit the dimensionality.

This is mostly in line with what I had in mind for storing the data. However, my concern is that for visualisation purposes, we'll need more context than is present in the matrix. I hope we're not assuming that there exists a one-size-fits-all visualisation solution that works for e.g. any 3D matrix with integer labels. Data can describe a lot of entirely different things even though it is in the same format, and needs to be treated on a case-by-case basis. That is, I hope we agree that this is just for transporting and storing the data.

xjules commented 3 years ago

Nice start @markusdregi ! For webviz-ert purposes we need to classify the numerical records into parameters, responses and observations, which means we also need to provide a fixed format (i.e. dimensionality, index semantics, etc.).

The only thing I think is missing is how we should do float indexes (with units). Maybe we could generalise the index concept so that for each dimension we store a mapping from the indexes to a set of integers, a set of strings, a set of ISO timestamps or a set of floats, with units.

Not sure that I understand what you mean @sondreso, i.e. axis labels are currently strings, which could of course be floats. What role would units play here?

sondreso commented 3 years ago

Not sure that I understand what you mean @sondreso, i.e. axis labels are currently strings, which could of course be floats. What role would units play here?

Units play a role when you for example want to do interpolation, or any operation which requires a space where the norm is well defined.
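To make the interpolation point concrete, here is a minimal sketch of linear interpolation over a float index. The function name `interpolate` and the depth values are assumptions made for the example; nothing here is taken from ert.

```python
def interpolate(xs, ys, x):
    """Linearly interpolate y at position x; xs must be sorted ascending."""
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    raise ValueError(f"{x} is outside the index range")


# Hypothetical record with a float index given in metres.
depth_m = [0.0, 1000.0]
values = [10.0, 20.0]

# Querying at 0.5 km only works once the query is expressed in metres.
correct = interpolate(depth_m, values, 0.5 * 1000.0)

# Using the raw number 0.5 silently treats kilometres as metres.
wrong = interpolate(depth_m, values, 0.5)
```

Both calls succeed, but only `correct` lands halfway between the two values; `wrong` stays very close to the first one. Without a unit attached to the index there is nothing to catch the mix-up.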

I would say that we should do floats with units similarly to how we do the other indices, but I deliberately left them out as we don't have an end-to-end use case for them yet (that will definitely come though) and I think we should avoid implementing support for functionality that is left unused by the consumers...

I agree, and my point was rather in the "store a mapping" part, in the sense that I think we should not couple the indexes and the matrices themselves too tightly together. The current behaviour of the index in the Record in ert3 is slightly strange, in that some record types also keep the index as part of the data field, while others only have it stored in the index field. I think we should keep the matrix and the axis labels/indexes separate as much as we can, which I also think will make it much easier to extend them to floats in the future or add additional metadata such as units.
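One way to picture this separation is a container that holds the values and the per-dimension labels in different fields. This is only a sketch: `LabelledMatrix`, its fields and its `get` method are hypothetical names for illustration and do not reflect the actual ert3 Record.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, List, Tuple


@dataclass
class LabelledMatrix:
    """Hypothetical container: values and axis labels are stored separately."""

    values: List  # nested lists of floats, one nesting level per dimension
    axes: Tuple[Tuple[Hashable, ...], ...]  # one tuple of labels per dimension
    metadata: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Compare label counts with the first element of each nesting level.
        # Ragged data is not detected in this sketch.
        level = self.values
        for dim, labels in enumerate(self.axes):
            if len(set(labels)) != len(labels):
                raise ValueError(f"duplicate labels in dimension {dim}")
            if len(level) != len(labels):
                raise ValueError(
                    f"dimension {dim} has {len(level)} entries "
                    f"but {len(labels)} labels"
                )
            level = level[0] if level else []

    def get(self, *labels: Hashable) -> float:
        """Look up one element by giving one label per dimension."""
        if len(labels) != len(self.axes):
            raise KeyError(f"expected {len(self.axes)} labels, got {len(labels)}")
        level = self.values
        for axis_labels, label in zip(self.axes, labels):
            level = level[axis_labels.index(label)]
        return level


# A 2D example: integer labels on the first axis, string labels on the second.
surface = LabelledMatrix(
    values=[[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    axes=((0, 1000), ("a", "b", "c")),
)
element = surface.get(1000, "b")
```

Because the labels live outside `values`, swapping string labels for floats, or adding a unit to `metadata`, would not touch the matrix itself.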

pinkwah commented 3 years ago

Units only play a role when you want to perform actions between different units. As long as you're within the same units then everything should be good. I'd say having ERT Storage and co. understand units can wait a while, since that is something that needs to be done right. In the meantime, my proposal has been to treat units as strings in the metadata, and have webviz-ert blindly put the units on the axes.
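A minimal sketch of what "blindly putting the units on the axes" could look like, assuming the unit is stored under a `"unit"` key in a plain metadata dictionary. Both the key name and the `axis_title` helper are hypothetical.

```python
def axis_title(name, metadata):
    """Build an axis title, appending the unit string verbatim if present."""
    unit = metadata.get("unit")
    return f"{name} [{unit}]" if unit else name


with_unit = axis_title("depth", {"unit": "m"})
without_unit = axis_title("realization", {})
```

The unit is never parsed or converted here; it is passed through as an opaque string.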

markusdregi commented 3 years ago

Can you explain the motivation behind doing sets of integers as indices?

I have certain integers that translate into something in my model that I want to control from outside the template model 🤷🏻 If I want {0: x, 1000: y} it would be odd to either make a full array with 1001 elements or force them to be strings when they are in reality numerical values...
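The trade-off can be sketched with plain Python values. The numbers are made up, and `x` and `y` are replaced by arbitrary floats.

```python
# A sparse integer index stores only the two entries that exist.
sparse = {0: 1.5, 1000: 2.5}

# A dense array needs 1001 slots to hold the same two values.
dense = [None] * 1001
dense[0] = 1.5
dense[1000] = 2.5

# Forcing the index to strings loses the numerical ordering.
int_order = sorted([1000, 0, 20])
str_order = sorted(["1000", "0", "20"])
```

Here `int_order` is `[0, 20, 1000]`, whereas `str_order` is `['0', '1000', '20']` because strings compare character by character.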

Dimensionality of max 3 is a bit arbitrary for my liking, why not do any? It would require (very slightly) more code to limit the dimensionality.

3D is not a max, it is a min ;)

markusdregi commented 3 years ago

When it comes to visualization I would expect a numerical record to be possible to visualize independently of whether it is a parameter, response or some other record as long as it satisfies the current requirements... That we can make it nicer, or add more to the visualization given context I agree with, but the raw data should be possible to visualize independently of additional context.

sondreso commented 3 years ago

This could be a good candidate for the numerical records: http://xarray.pydata.org/en/stable/getting-started-guide/why-xarray.html

eivindjahren commented 2 years ago

I believe this issue needs to be reframed in the light of ert3 no longer being the development direction.

dafeda commented 1 year ago

Could this issue be closed now that we are using xarray? What do you think @oyvindeide ?

oyvindeide commented 1 year ago

I think that makes sense, yes. We could potentially convert it to a discussion, but I am fine with closing it.