Feature: DataDescription object

Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.

Apache License 2.0

608 stars, 62 forks

Is your feature request related to a problem? Please describe. In certain applications, we may have mixed-type input data (e.g., a mix of categorical and real-valued data), which may even come from different sources. We need an efficient way to differentiate these features and apply different preprocessing to each.

Furthermore, some transformations may be conditional on hyperparameters. This implies that the size of our feature vector may vary depending on the hyperparameters, so a user should, as much as possible, avoid hardcoding values that represent feature indexes or the feature vector length. Hence the need to track individual features through transformations.

Describe the solution you'd like. I believe a good way to handle this would be a DataDescription object, or something similar, within the DataContainer (or the Context?). This DataDescription object would contain, for every feature, a reference to its type, its source, and a name. It would be accessible to every step along the way and would evolve with every transformation applied.
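To make the idea concrete, here is a minimal sketch of what such an object could look like. Everything in it is hypothetical and not part of Neuraxle: the `FeatureInfo` record, the `indices_of` and `replace` methods, and the example feature names are assumptions chosen only to illustrate tracking type, source and name per feature, and updating the description when a transformation changes the feature vector length.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FeatureInfo:
    """Hypothetical per-feature record: name, type and source."""
    name: str
    kind: str    # e.g. "categorical", "real" or "binary"
    source: str  # e.g. the dataset or sensor the feature came from


class DataDescription:
    """Hypothetical description of each column of a feature vector."""

    def __init__(self, features: List[FeatureInfo]):
        self.features = list(features)

    def __len__(self) -> int:
        return len(self.features)

    def indices_of(self, kind: str) -> List[int]:
        """Look up column indexes by type instead of hardcoding them."""
        return [i for i, f in enumerate(self.features) if f.kind == kind]

    def replace(self, index: int, new_features: List[FeatureInfo]) -> "DataDescription":
        """Return a new description where one column is replaced by several,
        as a one-hot encoder would do to a categorical column."""
        updated = self.features[:index] + list(new_features) + self.features[index + 1:]
        return DataDescription(updated)


description = DataDescription([
    FeatureInfo("age", "real", "survey"),
    FeatureInfo("color", "categorical", "catalog"),
    FeatureInfo("price", "real", "catalog"),
])

# A one-hot encoding step expands "color" into two binary columns.
one_hot = [FeatureInfo(f"color={c}", "binary", "catalog") for c in ("red", "blue")]
encoded = description.replace(description.indices_of("categorical")[0], one_hot)
```

With this, a downstream step can ask `encoded.indices_of("real")` rather than assuming that "price" still lives at index 2 after the encoding.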

Considerations:

- Manipulating a DataDescription may introduce computation overhead if many transformations are applied, so its usage should not be automatic. Because of this, I think it should be either a service within the context or a special type of DataContainer.
- Steps that transform the data may need to implement some extra processing to update the DataDescription object with their transformation. This, again, should only be required if a user chooses to use the DataDescription service.
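The opt-in behaviour described in these considerations could be sketched as follows. This is only an illustration under assumptions: the context is modelled as a plain dict, `DESCRIPTION_KEY` and the `DropColumn` step are hypothetical names, and the description is reduced to a list of feature names. None of this is Neuraxle's actual context or step API.

```python
from typing import Any, Dict, List

DESCRIPTION_KEY = "data_description"  # hypothetical service name


class DropColumn:
    """Hypothetical step that removes one column from every row."""

    def __init__(self, column: int):
        self.column = column

    def transform(self, rows: List[list], context: Dict[str, Any]) -> List[list]:
        kept = [row[:self.column] + row[self.column + 1:] for row in rows]
        names = context.get(DESCRIPTION_KEY)
        if names is not None:
            # Opt-in bookkeeping: only done when the user registered a description.
            context[DESCRIPTION_KEY] = names[:self.column] + names[self.column + 1:]
        return kept


rows = [[1.0, "red", 9.5], [2.0, "blue", 7.0]]

plain_context: Dict[str, Any] = {}  # no description: no extra work
tracked_context: Dict[str, Any] = {DESCRIPTION_KEY: ["age", "color", "price"]}

step = DropColumn(column=1)
plain_result = step.transform(rows, plain_context)
tracked_result = step.transform(rows, tracked_context)
```

Both calls return the same data; only the second one pays for updating the description, which ends up as `["age", "price"]`.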

Describe alternatives you've considered. I'm open to alternative suggestions. My suggestion feels like it would be a heavy modification. I remember having a conversation with @guillaume-chevalier about this a while ago, but I couldn't find any issue that covers it.

A DataDescriptor (dd) contains:
- A list of DataHeaders, or nested DataHeaders (if not contiguous), or partially nested DataHeaders (e.g.: first two dims are nested, but last inner dim is shared)
- A wrapped array in .values (like Pandas) that could be lists of lists, a NumPy array, or something else (use inheritance for method overrides). For instance, people could override it to hold TensorFlow tensors as .values.
- Methods for averaging, means, standard deviations, and others. It could implement most of NumPy's ndarray interface, e.g.: dd.mean(axis=-1), dd.expand_dims(axis=2), __getitem__: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

The getitem method would look like:

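from typing import Any, List

def __getitem__(self, key):
    # Work on a copy so the original descriptor is left untouched.
    new_self = self.copy()
    # Apply the same key to values and headers so they stay aligned.
    new_self.values: Any = self.values[key]
    new_self.headers: List[DataHeader] = self.headers[key]
    return new_self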

Note that the List[DataHeader] type above could itself be an object managing the lists of normal and/or nested combined headers for contiguous and non-contiguous DataHeaders.
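For reference, a self-contained version of that idea could look like the sketch below. It is deliberately restricted to the simplest case, a 1-D feature vector stored as a plain Python list with one header per value, and it assumes a `DataHeader` that only carries a name. Nested or non-contiguous headers, other array backends, and the rest of the ndarray-like interface are left out.

```python
import copy
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class DataHeader:
    """Assumed minimal header: just a feature name."""
    name: str


class DataDescriptor:
    """Sketch of a descriptor for the 1-D case: one header per value."""

    def __init__(self, values: List[float], headers: List[DataHeader]):
        if len(values) != len(headers):
            raise ValueError("values and headers must have the same length")
        self.values = values
        self.headers = headers

    def copy(self) -> "DataDescriptor":
        # Shallow copy: attributes are reassigned afterwards, never mutated.
        return copy.copy(self)

    def __getitem__(self, key: Union[int, slice]) -> "DataDescriptor":
        if isinstance(key, int):
            # Convert an int to a length-1 slice so both attributes stay lists.
            # `key + 1 or None` maps -1 to slice(-1, None) rather than slice(-1, 0).
            key = slice(key, key + 1 or None)
        new_self = self.copy()
        new_self.values = self.values[key]
        new_self.headers = self.headers[key]
        return new_self

    def mean(self) -> float:
        return sum(self.values) / len(self.values)


dd = DataDescriptor(
    [1.0, 2.0, 6.0],
    [DataHeader("a"), DataHeader("b"), DataHeader("c")],
)
tail = dd[1:]   # values and headers are sliced together
last = dd[-1]   # integer keys also keep the header attached
```

Here `tail` keeps the headers "b" and "c" next to the values `[2.0, 6.0]`, and `dd` itself is unchanged.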

That would be the Pandas of Machine Learning arrays. Pandas is just a wrapper of numpy after all. We'd be such a wrapper by doing so.

Feature: DataDescription object #453