Neuraxio / Neuraxle

The world's cleanest AutoML library ✨ - Do hyperparameter tuning with the right pipeline abstractions to write clean deep learning production pipelines. Let your pipeline steps have hyperparameter spaces. Design steps in your pipeline like components. Compatible with Scikit-Learn, TensorFlow, and most other libraries, frameworks and MLOps environments.
https://www.neuraxle.org/
Apache License 2.0

Feature: DataDescription object #453

Closed vincent-antaki closed 1 year ago

vincent-antaki commented 3 years ago

Is your feature request related to a problem? Please describe. In certain applications, we may have mixed-type data inputs (e.g., a mix of categorical and real-valued data) which may even come from different sources. We need an efficient way to differentiate them and apply different preprocessing.

Furthermore, some transformations may be conditional on hyperparameters. This implies that our feature vector may vary in size depending on the hyperparameters, so a user should, as much as possible, avoid hardcoding values that represent feature indexes or the feature vector length. From this arises the need to track individual features through transformations.

Describe the solution you'd like I believe a good way to handle this would be a DataDescription object, or something similar, within the DataContainer (or the Context?). This DataDescription object would contain, for every feature, a reference to its type, its source, and a name. This object would be accessible to every step along the way and would evolve with every transformation applied.
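A minimal sketch of what such an object could look like, assuming a per-feature record of name, kind and source (all class and field names below are hypothetical, not an existing Neuraxle API):

from dataclasses import dataclass, field
from typing import List

@dataclass
class FeatureDescription:
    # Hypothetical per-feature metadata: a name, a kind (e.g. "categorical" or "real"),
    # and the source the feature came from.
    name: str
    kind: str
    source: str

@dataclass
class DataDescription:
    # Hypothetical container tracking one FeatureDescription per column of the feature vector.
    features: List[FeatureDescription] = field(default_factory=list)

    def select(self, indexes: List[int]) -> "DataDescription":
        # A transformation that keeps only some columns would also narrow the description,
        # so downstream steps never have to hardcode column indexes.
        return DataDescription([self.features[i] for i in indexes])

    def indexes_of_kind(self, kind: str) -> List[int]:
        # Find the column indexes of, e.g., all categorical features so they can be
        # routed to a dedicated preprocessing branch.
        return [i for i, f in enumerate(self.features) if f.kind == kind]

A step that encodes the categorical columns could then call indexes_of_kind("categorical") to locate its inputs and return an updated DataDescription describing the columns it produced, instead of relying on hardcoded indexes or vector lengths.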

Considerations:

Describe alternatives you've considered I'm open to alternative suggestions. My suggestion feels like it would be a heavy modification. I remember having a conversation with @guillaume-chevalier about this a while ago, but I couldn't find any issue that covers it.

guillaume-chevalier commented 3 years ago

The __getitem__ method would look like:

from typing import Any, List

def __getitem__(self, key):
    # Return a sliced copy so that the values and their headers stay aligned.
    new_self = self.copy()
    new_self.values: Any = self.values[key]
    new_self.headers: List[DataHeader] = self.headers[key]
    return new_self

Note that the List[DataHeader] type above could itself be an object managing lists of normal and/or nested combined headers, for both contiguous and non-contiguous DataHeaders.
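For illustration, here is a sketch of what such a header collection could look like; DataHeader and DataHeaders are hypothetical names, and the only assumption is that the same key used to index the values (an int, a slice, or a list of column indexes) is also applied to the headers:

from dataclasses import dataclass
from typing import List, Union

@dataclass
class DataHeader:
    # Minimal placeholder for a single feature's header (name only, for illustration).
    name: str

class DataHeaders:
    # Hypothetical collection of DataHeader objects that accepts the same kinds of keys
    # as the wrapped values array: contiguous (slice) or non-contiguous (list of indexes).
    def __init__(self, headers: List[DataHeader]):
        self.headers = list(headers)

    def __getitem__(self, key: Union[int, slice, List[int]]) -> "DataHeaders":
        if isinstance(key, slice):
            return DataHeaders(self.headers[key])  # contiguous selection
        if isinstance(key, int):
            return DataHeaders([self.headers[key]])  # single column stays wrapped
        return DataHeaders([self.headers[i] for i in key])  # non-contiguous selection

The __getitem__ sketched above could then slice new_self.headers with the same key it applies to new_self.values, whether that key is contiguous or not.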

That would make this the Pandas of machine learning arrays. Pandas is just a wrapper around NumPy, after all; we'd be such a wrapper by doing so.

stale[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs in the next 180 days. Thank you for your contributions.