I'm thinking that most likely we want to skip using pydantic for the data itself at some point. For numerical data there are a vast number of alternatives from numpy, pandas and others that are far superior for this. I would prefer to keep the structure somewhat simple, and a personal favourite at the moment is the structured arrays from numpy. It is like the simplistic little brother of pandas DataFrames 🤷🏻
tldr: I'm guessing we actively want to drop pydantic for a better alternative at some point...
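For illustration, a minimal sketch of a numpy structured array; the field names and dtypes here are invented, not actual ert data:

```python
import numpy as np

# A structured array is a lightweight table: named, typed columns
# without the extra machinery of a pandas DataFrame.
records = np.array(
    [(1, 0.35, 120.0), (2, 0.28, 95.5)],
    dtype=[("realization", "i4"), ("porosity", "f8"), ("permeability", "f8")],
)

print(records["porosity"])  # column access -> array([0.35, 0.28])
print(records[0])           # row access    -> (1, 0.35, 120.)
```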
> I'm thinking that most likely we want to skip using pydantic for the data itself at some point. For numerical data there are a vast number of alternatives from numpy, pandas and others that are far superior for this. I would prefer to keep the structure somewhat simple, and a personal favourite at the moment is the structured arrays from numpy. It is like the simplistic little brother of pandas DataFrames 🤷🏻
Using numpy makes a lot of sense, considering the application domain.
> tldr: I'm guessing we actively want to drop pydantic for a better alternative at some point...
I do like pydantic to store configuration data, but it seems a bit of overkill for data validation of core data structures.
> I do like pydantic to store configuration data, but it seems a bit of overkill for data validation of core data structures.
Yes, so we need two objects representing the record in these two different domains. The record should be a pydantic-derived type in the configuration domain, and another type in the arithmetical domain. A function can map between these two domains. Shoehorning these two types into a hierarchical structure doesn't really make sense.
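A rough sketch of that separation, with illustrative names (nothing here is existing ert code): a pydantic model for the configuration domain, a plain numpy-backed type for the arithmetical domain, and a function mapping between them.

```python
from typing import List

import numpy as np
from pydantic import BaseModel


class RecordConfig(BaseModel):
    """Configuration-domain record: pydantic validates on construction."""
    name: str
    values: List[float]


class NumericalRecord:
    """Arithmetical-domain record: a thin wrapper around an ndarray."""
    def __init__(self, name: str, data: np.ndarray) -> None:
        self.name = name
        self.data = data


def to_numerical(config: RecordConfig) -> NumericalRecord:
    """Map a record from the configuration domain to the arithmetical one."""
    return NumericalRecord(config.name, np.asarray(config.values, dtype=float))
```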
I was a bit unaware of this discussion, and I already proposed a small change to the `Record` class: mainly removing the need for the `Record` class to be a pure abstract class, and moving the functionality of the `make_record` helper function inside the `Record` class. https://github.com/equinor/ert/commit/4092406c87c01f020e393a077f94ea05646a4b86
When rewriting the `RecordCollection` class I also noted that pydantic did not add much, and did not play too well with mypy. So, the same considerations apply.
For the time being, I think this is blocked by https://github.com/equinor/ert/issues/2009
I picked this up and implemented the `NumericalRecord` and `BlobRecord` as subclasses of a pure ABC `Record`, without using pydantic here.
We used pydantic in a lot of ways: conversion, validation, dunder methods, and runtime enforcement of type annotations.
It was fairly trivial to replace the conversion, the validators (they were few and trivial) as well as the dunder methods. Runtime enforcement of type annotations was less trivial to replace, and pydantic's internal type checker is sadly not publicly exported. I used the constant-time type checker beartype, which does a random-walk check of datasets to ensure correctness. False negatives are possible (a violation in a large container can go undetected), but as the author says, not very likely for scientific data. All ert3 tests pass (locally at least), and instantiation of a record is now 2-3 orders of magnitude faster.
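In rough strokes, the replacement looks something like this (a simplified sketch with illustrative names, not the actual implementation):

```python
from typing import List

from beartype import beartype


class NumericalRecord:
    @beartype
    def __init__(self, data: List[float]) -> None:
        # beartype checks the annotation at call time; for containers it
        # samples elements in O(1) instead of walking the whole structure,
        # which is what keeps instantiation cheap for large datasets.
        self.data = data


NumericalRecord([1.0, 2.0, 3.0])   # fine
# NumericalRecord(["oops"])        # would raise a beartype type-violation error
```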
As you have stated earlier, there are benefits to a pure-ABC record where the `data` property is defined. A lot of code won't care what specific type of record it is, and it is easier to reason about a concrete (uh) abstract contract.
I've also looked at introducing xarray through composition in order to implement meta-records, as well as playing with what is discussed above: letting numerical data be ndarrays. An ABC will allow the implementation to vary, since it's likely going to be very volatile in the start.
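To make that concrete, a sketch of how the two ideas could combine (illustrative only, not the current implementation): the ABC fixes the contract via an abstract `data` property, while the concrete class delegates storage to xarray through composition.

```python
from abc import ABC, abstractmethod
from typing import Any

import numpy as np
import xarray as xr


class Record(ABC):
    @property
    @abstractmethod
    def data(self) -> Any:
        """The record payload; the concrete type is left to the subclass."""


class NumericalRecord(Record):
    def __init__(self, values) -> None:
        # Composition: storage is delegated to xarray, so the backing
        # implementation can change without touching the Record contract.
        self._data = xr.DataArray(np.asarray(values, dtype=float))

    @property
    def data(self) -> xr.DataArray:
        return self._data
```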
I had a quick look at beartype, which seems pretty impressive. It seems to be very low-cost to use, so using it may have a real benefit. However, do we really need to have runtime validation for records? Assuming reasonable converters when we create them, should that not be sufficient? After all, these are internal data structures that we handle only ourselves, or get from our own storage, not from an external source. Not saying that using beartype is wrong, just wondering if it may be a bit of overkill.
> However, do we really need to have runtime validation for records?
The unit tests currently indicate that we do.
> that we handle only ourselves, or get from our own storage, not from an external source
We do get record data from external sources, so the validation of that data needs to lie somewhere. Probably not in the record, but in the converters/transformers; do they exist currently?
> We do get record data from external sources, so the validation of that data needs to lie somewhere. Probably not in the record, but in the converters/transformers; do they exist currently?
They don't, I think, because we relied on pydantic. You do have a point: we load data into records, so that needs to be validated. I guess our issue was with pydantic being not so suitable for large data volumes. So if pydantic is now replaced with something more efficient, that'll do the trick.
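Perhaps something along these lines; a hypothetical transformer (none of these names exist in ert today) that validates external input before a record is ever constructed:

```python
from typing import Any, List


def transform_numerical(raw: Any) -> List[float]:
    """Validate externally sourced data and convert it to record-ready values."""
    if not isinstance(raw, (list, tuple)):
        raise ValueError(f"expected a sequence of numbers, got {type(raw).__name__}")
    try:
        return [float(x) for x in raw]
    except (TypeError, ValueError) as err:
        raise ValueError("sequence contains non-numeric entries") from err
```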
Currently the `Record` class hierarchy is somewhat flawed: it is an abstract data class, but it misses an abstract `data` property, which is in fact its most important property. The main reason for this is that the derived classes use pydantic and define concrete types for the data field; having an existing field in the base would interfere with pydantic. Pydantic has a solution involving renaming fields, but using that interferes with the abstract class scheme of Python. The solution of not defining a `data` member in the base class has two problems, one of which is that code accessing the field through the base type needs `type: ignore` comments, which defeats the purpose of type hints.
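A simplified, hypothetical illustration of that `type: ignore` problem:

```python
class Record:  # abstract base without a `data` field
    ...


class NumericalRecord(Record):
    data: float = 0.0


def payload(record: Record) -> float:
    # mypy: "Record" has no attribute "data" -- hence the ignore comment
    return record.data  # type: ignore[attr-defined]
```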
There are a few options to consider: