flexcompute / tidy3d

Fast electromagnetic solver (FDTD) at scale.
https://docs.flexcompute.com/projects/tidy3d/en/latest/
GNU Lesser General Public License v2.1

Separate `RawDataValidation` from `DataArray` #1787

Open daquinteroflex opened 4 months ago

daquinteroflex commented 4 months ago

Hi @tylerflex, @caseyflex and I were chatting recently. I'm writing this issue as a record of our conversation before I forget the concept. We seemed happy with the idea, so I thought I'd write it down. The names below are not final.

The Problem:

The Concept:

It'd work something like this:

Backend Raw Data -> RawDataValidatorClass -> to_datarray_function_or_similar -> DataArrayMixin

In the example above:

Backend Raw Data = the raw output of the backend (e.g. a numpy array)
RawDataValidatorClass = a pure pydantic BaseModel representation containing the basic data fields and their validators
to_datarray_function_or_similar = a standalone conversion function, independent of any classmethod
DataArrayMixin = a mixin that composes the DataArray class with the custom convenience functionality we provide for users
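
A minimal sketch of how this pipeline might look, assuming pydantic for the validation layer and xarray for the enriched class (all names here are placeholders, not a settled design):

import numpy as np
import pydantic.v1 as pd
import xarray as xr

class RawDataValidator(pd.BaseModel):
    """Pure pydantic container: data fields and validators, nothing else."""

    class Config:
        arbitrary_types_allowed = True  # permit raw numpy fields

    data: np.ndarray
    coords: dict
    dims: tuple

    @pd.validator("dims")
    def _dims_match_coords(cls, val, values):
        # example validator: every dim needs matching coordinates
        if "coords" in values and set(val) != set(values["coords"]):
            raise ValueError("dims and coords do not match")
        return val

class DataArrayMixin:
    """Mixin holding the custom user-facing functionality."""

    def some_convenience_method(self):
        ...

class EnrichedDataArray(DataArrayMixin, xr.DataArray):
    __slots__ = ()  # xarray subclasses must declare __slots__

def to_dataarray(raw: RawDataValidator) -> EnrichedDataArray:
    # standalone conversion function, independent of any classmethod
    return EnrichedDataArray(raw.data, coords=raw.coords, dims=raw.dims)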

Implications:

Affected Classes:

tylerflex commented 4 months ago

I think Casey mentioned that validating the DataArray was a big pain, and slow as well.

Can you give some specifics?

Basically this is like creating some object that stores values, coords, and dims and then constructs a DataArray with various methods? Also, maybe a minimal example could be helpful; I'm having trouble visualizing it.

We would just need to weigh the benefits against what sounds like added complexity in defining new data array classes. Somewhat orthogonal to this, but I could also be OK with allowing use of a regular (unvalidated) td.DataArray with unvalidated dims and coords? Or at least maybe doing this with something like a metaclass, similar to how ArrayLike is handled.

tylerflex commented 4 months ago

So for example,

class CustomMedium:
    permittivity: td.DataArray[('x', 'y', 'z', 'f')]

and then the allowed dims are set on the DataArray type of the permittivity field
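
One hypothetical way such a dims-constrained annotation could work, loosely inspired by the ArrayLike handling mentioned above (a sketch, not the existing tidy3d mechanism):

import xarray as xr

class DataArray(xr.DataArray):
    __slots__ = ()
    _allowed_dims = None  # subclasses set this to a tuple of dim names

    def __class_getitem__(cls, dims):
        # DataArray[('x', 'y')] creates a subclass that only accepts those dims
        dims = tuple(dims)
        return type(f"DataArray_{'_'.join(dims)}", (cls,), {"__slots__": (), "_allowed_dims": dims})

    @classmethod
    def check_dims(cls, val):
        # in practice this would be hooked into pydantic validation
        if cls._allowed_dims is not None and tuple(val.dims) != cls._allowed_dims:
            raise ValueError(f"expected dims {cls._allowed_dims}, got {tuple(val.dims)}")
        return val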

caseyflex commented 4 months ago

Thanks @daquinteroflex for the proposal. I had previously found this a bit abstract and hard to follow, but I think I understand it better after our discussions. I think the general suggestion was that a lot of current pain points could be avoided by changing the way we use pydantic and xarray, and by separating (validated) data from functionality. I'll try to summarize below.

Currently, we have a hierarchy of data classes: data arrays, datasets, monitor data, simulation data. All of these are pydantic classes, and all of these have validators and utility functions. Data arrays are special in that they inherit their validators and utility functions from xarray.

Under the new proposal, we have a hierarchy of raw data classes: raw data arrays, raw datasets, raw monitor data, raw simulation data. These are all pydantic classes which store and validate data, but they do not have any additional functionality. As part of this, the raw data arrays no longer inherit from xarray; instead, we use pydantic to validate the coordinates of a raw data array directly.

In addition to the raw data classes, we can still have the enriched data classes that we are used to. These can be constructed from the raw data by mixing in the desired utility functions. Xarray functionality can also be introduced at this level.

We would also have raw versions of other classes, like Simulation. This is not a data class in the usual sense, but in a way it is, since it stores the fields and parameters that define the simulation. So we would have RawSimulation, which holds the fields, parameters, and validation, and Simulation, which mixes in all the other functionality.
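
A minimal sketch of that split, assuming pydantic; the fields here are invented purely for illustration:

import pydantic.v1 as pd

class RawSimulation(pd.BaseModel):
    """Fields, parameters, and validation only."""

    size: tuple  # hypothetical field
    run_time: float  # hypothetical field

    @pd.validator("run_time")
    def _run_time_positive(cls, val):
        if val <= 0:
            raise ValueError("run_time must be positive")
        return val

class SimulationMixin:
    """All the non-data functionality: plotting, export, etc."""

    def plot(self):
        ...

class Simulation(RawSimulation, SimulationMixin):
    pass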

Some pain points that this proposal should improve (each is quoted and discussed in the replies below):

- Issues with circular imports.
- Issues with pydantic validators and missing fields.
- The fact that our datasets don't have xarray-like functionality (isel, sel, ...).
- The inefficiencies associated with using xarray for raw data storage (as encountered in outer_dot and smatrix_in_basis).
- Limited flexibility in the data array structures (e.g. the sweep_index coordinate in the EME data arrays cannot cleanly be made optional).

I was thinking about how we could try this out without a major refactor. It seems like we could just pick one data class and implement a raw version of it, separating out the data and validation from the other functionality, and changing the enriched data class to use mixins. I'm not sure about the details, but it seems like @daquinteroflex has a clearer idea. One difficulty I imagine is the way to_file and from_file work for data arrays; this might require some reworking for a raw data array that is based on pydantic plus raw data (say, numpy arrays) rather than on xarray.
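
For example, a raw, numpy-backed data array might need its own serialization along these lines (a sketch using h5py, assuming the raw class exposes data, dims, and coords fields; the file layout is invented):

import h5py
import numpy as np

def to_file(raw, fname: str) -> None:
    # write the data, dims, and per-dimension coordinates as hdf5 datasets
    with h5py.File(fname, "w") as f:
        f.create_dataset("data", data=raw.data)
        f.attrs["dims"] = list(raw.dims)
        for dim, coord in raw.coords.items():
            f.create_dataset(f"coords/{dim}", data=np.asarray(coord))

def from_file(fname: str) -> dict:
    # inverse of to_file; returns kwargs for the raw data array constructor
    with h5py.File(fname, "r") as f:
        dims = tuple(f.attrs["dims"])
        return dict(
            data=f["data"][:],
            dims=dims,
            coords={dim: f[f"coords/{dim}"][:] for dim in dims},
        )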

tylerflex commented 4 months ago

This seems like potentially a pretty big change, and it also introduces the need to implement two separate classes for every user-facing class? So I'm kind of skeptical about the benefits, but we can discuss more.

I think some of this makes sense for the xarray part; originally I had played around with a design like this:

from functools import cached_property
from typing import Any

import numpy as np
import xarray as xr

from tidy3d.components.base import Tidy3dBaseModel

class DataArray(Tidy3dBaseModel):
    data: np.ndarray
    coords: dict[str, Any]
    dims: tuple[str, ...]

    @cached_property
    def xarray(self) -> xr.DataArray:
        # build the xarray view lazily and cache it
        return xr.DataArray(self.data, coords=self.coords, dims=self.dims)

    def sel(self, *args, **kwargs):
        # delegate to the cached xarray object
        return self.xarray.sel(*args, **kwargs)

    ...

This way we can use xarray functionality but still have everything be a pydantic base model. It does require wrapping all of the xarray methods, which could be a bit tedious. We can also potentially help with this by just defining things like:

class FieldData(Tidy3dBaseModel):
    Ex: DataArray
    ...

    @property
    def field_components(self) -> dict[str, xr.DataArray]:
        # expose the stored fields as xarray objects
        return dict(
            Ex=self.Ex.xarray,
            ...
        )

Would a limited refactor of DataArray accomplish some of the same goals without needing a complete re-do of the entire codebase?

As for the more specific discussion points:

> Issues with circular imports. We often want to convert between a pair of data types. For example (although we decided not to do this particular one), you could imagine Simulation having to_eme_simulation and from_eme_simulation, and EMESimulation having to_simulation and from_simulation. Currently we would have to pick one side, say have EMESimulation import Simulation and define to_simulation and from_simulation, but we could not also have the corresponding methods on Simulation because that would introduce circular imports. On the other hand, if we had RawSimulation and RawEMESimulation, it seems like Simulation could import RawEMESimulation and EMESimulation could import RawSimulation without difficulty.

Yea, so basically if the user-facing methods add dependencies, then separating like this could help. But we still have to worry about circular dependencies in the methods? For example, it seems like in this proposal to_simulation could only return a RawSimulation / RawEMESimulation, whereas users probably want to get an actual EMESimulation or Simulation? So the circular import issue still remains.

Maybe this can be resolved by sticking these conversion methods at the abstract class level, for example,

AbstractSimulation.convert(sim_type: type, **kwargs) -> AbstractSimulation

something to essentially let the user convert between types of their choice, passing optional arguments to the constructor of sim_type?
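
A minimal sketch of what such a generic converter could look like (the field-filtering logic here is an assumption for illustration, not an agreed design):

import pydantic.v1 as pd

class AbstractSimulation(pd.BaseModel):
    def convert(self, sim_type: type, **kwargs) -> "AbstractSimulation":
        # reuse the fields the target type also declares; kwargs override them
        shared = {key: val for key, val in self.dict().items() if key in sim_type.__fields__}
        shared.update(kwargs)
        return sim_type(**shared)

class Simulation(AbstractSimulation):
    run_time: float = 0.0

class EMESimulation(AbstractSimulation):
    freqs: tuple = ()

# convert a Simulation into an EMESimulation, supplying the extra field
eme_sim = Simulation(run_time=1e-12).convert(EMESimulation, freqs=(2e14,))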

> Issues with pydantic validators and missing fields. I don't have anything concrete here at the moment, but it seems like if we only use the validators for the raw data class, this would avoid some of the missing-field issues that arise when we manipulate a data class in certain ways.

I kind of feel like we could instead deal with this by changing how pydantic behaves when validators fail. For example, maybe it should just error right away?

> The fact that our datasets don't have xarray-like functionality (isel, sel, ...). The enriched datasets could have this functionality under the new proposal.

We could also implement an xr.Dataset subclass. I agree something like this could be nice, especially for FieldData. However, the main issue is that the field components are generally not co-located. When you make a Dataset out of them, it will basically take the outer product of all of the coordinates and fill the gaps with NaN, so we end up with a huge dataset that is mostly NaNs. But a co-located FieldDataset could be nice to have, especially for the Poynting vector and related quantities.
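
A small illustration of the co-location problem (the grids here are made up; the Dataset constructor aligns coordinates with an outer join, padding with NaN):

import numpy as np
import xarray as xr

# two field components sampled on staggered, non-co-located grids
ex = xr.DataArray(np.ones(3), coords={"x": [0.0, 1.0, 2.0]}, dims="x")
ey = xr.DataArray(np.ones(3), coords={"x": [0.5, 1.5, 2.5]}, dims="x")

# the union grid has 6 points, so each variable is half NaN
ds = xr.Dataset({"Ex": ex, "Ey": ey})
print(int(ds["Ex"].isnull().sum()))  # 3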

> The inefficiencies associated with using xarray for raw data storage (like I encountered in outer_dot and smatrix_in_basis). If we commit to a precise format for the raw data arrays, rather than trusting xarray to do this for us, then we can have our own efficient but also well-designed accessors.

Not sure this is xarray's fault? It seems like even if we wrote our own data array containers, we'd still deal with this issue? It sounds like more of an implementation detail in how e.g. the tensor dot is performed (e.g. can we loop over certain variables?). I also still wonder if there are ways to do more memory-efficient data array manipulations in xarray that we just don't know about; might be worth looking into more deeply. We can also try using some of the dask functionality from xarray, which I know focuses on this specific thing. This is worth a read: https://docs.xarray.dev/en/stable/user-guide/dask.html
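
A minimal example of the dask-backed chunking that page describes (requires dask installed; array sizes are arbitrary):

import numpy as np
import xarray as xr

# chunking turns the DataArray into a lazy, dask-backed array
da = xr.DataArray(np.random.rand(2000, 2000), dims=("x", "f")).chunk({"x": 200})

# nothing is computed until .compute(); chunks are processed piecewise,
# which bounds peak memory for large reductions and contractions
result = da.mean(dim="f").compute()
print(result.shape)  # (2000,)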

> More flexibility in the data array structures. For example, a bunch of the EME data arrays have sweep_index as a coordinate. It would be nice for this coordinate to be optional. This works currently, but it is a bit hacky, sometimes requiring the dimension to be dropped in xarray. If our raw data array is not based on xarray, we can support this directly.

Yea, the DataArrays are a bit rigid right now; I could be interested in relaxing some of the validation.

Could be worth generally thinking about what DataArrays should look like, maybe for 3.0, and then working towards that. I feel they are one of the areas that definitely needs some work.

daquinteroflex commented 4 months ago

Hello! Still thinking about all these ideas before discussing further, but I thought this could be an interesting connection to @dbochkov-flexcompute's ideas on the solver data interactions, in terms of structuring the relationships between imports.