equinor / ert

ERT - Ensemble based Reservoir Tool - is designed for running ensembles of dynamical models such as reservoir models, in order to do sensitivity analysis and data assimilation. ERT supports data assimilation using the Ensemble Smoother (ES), Ensemble Smoother with Multiple Data Assimilation (ES-MDA) and Iterative Ensemble Smoother (IES).
https://ert.readthedocs.io/en/latest/
GNU General Public License v3.0

Reconsider the design of the Record class hierarchy #1877

Closed verveerpj closed 2 years ago

verveerpj commented 3 years ago

Currently the Record class hierarchy is somewhat flawed: it is an abstract data class, but it lacks an abstract data property, which is in fact its most important property. The main reason for this is that the derived classes use pydantic and define concrete types for the data field. Having an existing field in the base would interfere with pydantic. Pydantic has a workaround involving renamed fields, but using it interferes with Python's abstract class machinery.

The solution of not defining a data member in the base class has two problems:

  1. The abstract type has no data field, which subverts the idea of an ABC as an interface.
  2. Static typing with mypy fails, since accessing the data field of a record that is statically typed as the base is considered an error. Hence, we ended up with several type: ignore comments, which defeats the purpose of type hints (see the sketch below).
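
To make the mypy problem concrete, here is a minimal sketch of the current shape; the class names match the ones discussed in this thread, but the field types are heavily simplified:

```python
# Simplified sketch of the current shape (field types reduced for illustration).
from abc import ABC
from typing import List

from pydantic import BaseModel


class Record(BaseModel, ABC):
    # No abstract `data` field here: declaring one would clash with the
    # concrete, differently typed `data` fields that pydantic generates
    # in the subclasses.
    pass


class NumericalRecord(Record):
    data: List[float]


class BlobRecord(Record):
    data: bytes


def first_element(record: Record) -> object:
    # mypy: '"Record" has no attribute "data"', so a type: ignore is needed.
    return record.data[0]  # type: ignore
```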

There are a few options to consider:

  1. Find a better solution for pydantic, ABC, and mypy to work together, if possible.
  2. Consider if the use of pydantic is necessary for records, do we use a lot of its features? Note that pydantic may also imply a performance penalty.
  3. As soon as we can move to Python 3.8, investigate the use of protocols (structural subtyping) instead of ABCs; see the sketch after this list.
  4. Live with it, and document it properly.
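
For reference, a rough sketch of what option 3 could look like; the protocol name and the data types below are illustrative only:

```python
# Illustrative sketch of option 3: a Protocol instead of an ABC (Python 3.8+).
from typing import List, Protocol, Union

RecordData = Union[List[float], bytes]  # simplified stand-in for the real types


class SupportsData(Protocol):
    @property
    def data(self) -> RecordData:
        ...


def data_size(record: SupportsData) -> int:
    # Any object exposing a compatible `data` attribute satisfies the protocol
    # structurally, without having to inherit from a shared base class.
    return len(record.data)
```
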
markusdregi commented 3 years ago

I'm thinking that most likely we want to skip using pydantic for the data itself at some point. For numerical data there are a vast number of alternatives from numpy, pandas and others that are far superior for this. I would prefer to keep the structure somewhat simple and a personal favourite at the moment is the structured arrays from numpy. It is like the simplistic little brother of pandas Dataframes 🤷🏻
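
For illustration, a tiny structured-array sketch (the field names here are made up):

```python
import numpy as np

# One named, typed field per column; one record per row.
records = np.array(
    [(0.21, 150.0), (0.24, 180.0)],
    dtype=[("porosity", "f8"), ("permeability", "f8")],
)

records["porosity"]  # array([0.21, 0.24])
records[0]           # (0.21, 150.0)
```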

tldr: I'm guessing we actively want to drop pydantic for a better alternative at some point...

verveerpj commented 3 years ago

I'm thinking that most likely we want to skip using pydantic for the data itself at some point. For numerical data there are a vast number of alternatives from numpy, pandas and others that are far superior for this. I would prefer to keep the structure somewhat simple and a personal favourite at the moment is the structured arrays from numpy. It is like the simplistic little brother of pandas Dataframes 🤷🏻

Using numpy makes a lot of sense, considering the application domain.

tldr: I'm guessing we actively want to drop pydantic for a better alternative at some point...

I do like pydantic for storing configuration data, but it seems a bit of overkill for data validation of core data structures.

jondequinor commented 3 years ago

I do like pydantic for storing configuration data, but it seems a bit of overkill for data validation of core data structures.

Yes, so we need two objects representing the record in these two different domains. The record should be a pydantic-derived type in the configuration domain, and another type in the arithmetical domain. A function can map between these two domains.

Shoehorning these two types into a hierarchical structure doesn't really make sense.
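
A rough sketch of what that split could look like (all names below are hypothetical, not existing ert code):

```python
from typing import List

from pydantic import BaseModel


class RecordConfig(BaseModel):
    """Configuration-domain record: validated by pydantic on construction."""

    name: str
    values: List[float]


class NumericalRecord:
    """Arithmetical-domain record: a plain, cheap-to-create container."""

    def __init__(self, values: List[float]) -> None:
        self.values = values


def to_numerical(config: RecordConfig) -> NumericalRecord:
    """Map a record from the configuration domain to the arithmetical domain."""
    return NumericalRecord(values=config.values)
```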

DanSava commented 3 years ago

I was a bit unaware of this discussion and I already proposed a small change to the Record class: mainly removing the need for the Record class to be a pure abstract class, and moving the functionality of the make_record helper function into the Record class. https://github.com/equinor/ert/commit/4092406c87c01f020e393a077f94ea05646a4b86
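
Roughly, the idea looks like the following sketch (illustrative only; the linked commit is authoritative and may differ):

```python
from typing import List, Union

RecordData = Union[List[float], bytes]  # simplified


class Record:
    """No longer purely abstract; construction logic lives on the class itself."""

    def __init__(self, data: RecordData) -> None:
        self.data = data

    @classmethod
    def from_data(cls, data: RecordData) -> "Record":
        # Replaces the external make_record helper: dispatch on the payload type.
        if isinstance(data, bytes):
            return BlobRecord(data)
        return NumericalRecord(data)


class NumericalRecord(Record):
    pass


class BlobRecord(Record):
    pass
```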

verveerpj commented 3 years ago

When rewriting the RecordCollection class I also noted that pydantic did not add much, and did not play too well with mypy. So, same considerations apply.

jondequinor commented 3 years ago

For the time being, I think this is blocked by https://github.com/equinor/ert/issues/2009

jondequinor commented 3 years ago

I picked this up and implemented the NumericalRecord and BlobRecord as subclasses of a pure ABC Record without using Pydantic here.

We used pydantic in a number of ways: for type conversion, for validators, for dunder methods, and for runtime enforcement of type annotations.

It was fairly trivial to replace the conversion, the validators (they were few and simple), and the dunder methods. Runtime enforcement of type annotations was less trivial to replace, and the pydantic checker is sadly not publicly exported. I used the constant-time type checker beartype, which does a random-walk check of datasets to ensure correctness. False positives are possible but, as the author says, not very likely for scientific data. All ert3 tests pass (locally at least), and instantiation of a record is now 2-3 orders of magnitude faster.
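
Schematically, the new shape is something like this (a simplified sketch; the real record and data types are richer):

```python
from abc import ABC, abstractmethod
from typing import List

from beartype import beartype


class Record(ABC):
    @property
    @abstractmethod
    def data(self) -> object:
        """The record's payload; every concrete record must provide it."""


class NumericalRecord(Record):
    @beartype  # runtime-checks the annotations at (near) constant cost
    def __init__(self, data: List[float]) -> None:
        self._data = data

    @property
    def data(self) -> List[float]:
        return self._data


class BlobRecord(Record):
    @beartype
    def __init__(self, data: bytes) -> None:
        self._data = data

    @property
    def data(self) -> bytes:
        return self._data
```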

As you have stated earlier, there are benefits to a pure-ABC Record where the data property is defined. A lot of code won't care what specific type of record it is, and it is easier to reason about a concrete (uh) abstract contract.

I've also looked at introducing xarray through composition in order to implement meta-records, as well as playing with what is discussed above: letting numerical data be ndarrays. An ABC will allow the implementation to vary, since it's likely going to be quite volatile at the start.
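
A speculative sketch of the composition idea (purely illustrative, not a concrete proposal):

```python
import xarray as xr


class NumericalRecord:
    """Holds its payload by composition; the container type can change later."""

    def __init__(self, data: xr.DataArray) -> None:
        self._data = data

    @property
    def data(self) -> xr.DataArray:
        return self._data
```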

verveerpj commented 3 years ago

I had a quick look at beartype, which seems pretty impressive. It appears to be very low-cost, so using it may have a real benefit. However, do we really need runtime validation for records? Assuming reasonable converters when we create them, should that not be sufficient? After all, these are internal data structures that we handle only ourselves, or get from our own storage, not from an external source. I'm not saying that using beartype is wrong, just wondering if it may be a bit of overkill.

jondequinor commented 3 years ago

However, do we really need runtime validation for records?

The unit tests currently indicate that we do.

that we handle only ourselves, or get from our own storage, not from an external source

We do get record data from external sources, so the validation of that data needs to live somewhere. Probably not in the record, but in the converters/transformers. Do they exist currently?
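
If such transformers were introduced, the validation could look roughly like this (hypothetical function, not existing ert code):

```python
from typing import List


def transform_numerical_input(raw: object) -> List[float]:
    """Validate external input once, at the boundary, before building a record."""
    if not isinstance(raw, list) or not all(isinstance(x, (int, float)) for x in raw):
        raise ValueError("expected a list of numbers")
    return [float(x) for x in raw]
```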

verveerpj commented 3 years ago

We do get record data from external sources, so the validation of that data needs to live somewhere. Probably not in the record, but in the converters/transformers. Do they exist currently?

They don't, I think, because we relied on pydantic. You do have a point: we load data into records, so that needs to be validated. I guess our issue was that pydantic is not so suitable for large data volumes. So if pydantic is now replaced with something more efficient, that'll do the trick.