Closed vincent-antaki closed 1 year ago
`DataDescriptor` (dd) contains:
- `DataHeader`s, or nested DataHeaders (if not contiguous), or partially nested DataHeaders (e.g.: first two dims are nested, but last inner dim is shared).
- `.values` (like Pandas) that could be Lists of Lists, or Numpy, or other (use inheritance for method overrides). For instance, people could override that with tensorflow arrays as a `.values`.
- An `ndarray` interface. E.g.: `dd.mean(axis=-1)`, `dd.expand_dims(axis=2)`, `__getitem__`: https://numpy.org/doc/stable/reference/generated/numpy.ndarray.html

The getitem method would look like:
def __getitem__(self, key):
    new_self = self.copy()
    new_self.values: Any = self.values[key]
    new_self.headers: List[DataHeader] = self.headers[key]
    return new_self
Note that the `List[DataHeader]` type above could itself be an object managing the lists of normal and/or nested combined headers for contiguous and non-contiguous DataHeaders.
That would be the Pandas of Machine Learning arrays. Pandas is largely a wrapper around NumPy after all, and by doing so we'd be a similar wrapper.
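To make the idea more concrete, here is a minimal sketch of how such a wrapper could keep headers in sync with the values when indexing or reducing. Everything in it is an assumption for illustration: the `name` and `labels` fields of `DataHeader`, the rule of one header per axis, and the restriction to int and slice keys are not taken from Neuraxle.

```python
import copy
from dataclasses import dataclass
from typing import List, Sequence

import numpy as np


@dataclass
class DataHeader:
    """Hypothetical header: one label per position along a single axis."""
    name: str
    labels: List[str]


class DataDescriptor:
    """Hypothetical wrapper pairing an array with one DataHeader per axis."""

    def __init__(self, values, headers: Sequence[DataHeader]):
        self.values = np.asarray(values)
        if len(headers) != self.values.ndim:
            raise ValueError("expected one DataHeader per axis")
        self.headers = list(headers)

    def copy(self) -> "DataDescriptor":
        return copy.copy(self)

    def __getitem__(self, key) -> "DataDescriptor":
        keys = key if isinstance(key, tuple) else (key,)
        # Axes not mentioned in the key are kept whole.
        keys = keys + (slice(None),) * (self.values.ndim - len(keys))
        new_headers = []
        for header, k in zip(self.headers, keys):
            if isinstance(k, slice):
                new_headers.append(DataHeader(header.name, header.labels[k]))
            elif not isinstance(k, (int, np.integer)):
                raise TypeError("this sketch supports only ints and slices")
            # An int index drops the axis, so its header is dropped as well.
        new_self = self.copy()
        new_self.values = self.values[keys]
        new_self.headers = new_headers
        return new_self

    def mean(self, axis: int = -1) -> "DataDescriptor":
        # Reducing over an axis removes it, along with its header.
        axis = axis % self.values.ndim
        new_self = self.copy()
        new_self.values = self.values.mean(axis=axis)
        new_self.headers = [h for i, h in enumerate(self.headers) if i != axis]
        return new_self


dd = DataDescriptor(
    np.arange(6.0).reshape(2, 3),
    [
        DataHeader("sample", ["s0", "s1"]),
        DataHeader("feature", ["age", "height", "weight"]),
    ],
)
sub = dd[:, 1:]          # keeps both axes, trims the feature labels
row = dd[0]              # drops the sample axis and its header
avg = dd.mean(axis=-1)   # drops the feature axis and its header
```

Nested or non-contiguous headers, fancy indexing, and other backends for `.values` are left out here; they would probably live in the object that manages the header list.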
Is your feature request related to a problem? Please describe.
In certain applications, we may have mixed-type input data (e.g., a mix of categorical and real-valued features), which may even come from different sources. We need an efficient way to tell them apart and apply different preprocessing to each.
Furthermore, some transformations may be conditional on hyperparameters. This implies that our feature vector may vary in size depending on the hyperparameters, so a user should, as much as possible, avoid hardcoding values that represent feature indexes or the feature vector length. From this arises the need to track individual features through transformations.
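A small illustration of the problem, as a sketch only: the feature names, the one-hot naming scheme, and the helper functions below are made up for the example and are not part of any existing API.

```python
from typing import List


def one_hot_names(feature: str, categories: List[str]) -> List[str]:
    """Names of the columns produced by one-hot encoding a single feature."""
    return [f"{feature}={c}" for c in categories]


def output_feature_names(color_categories: List[str]) -> List[str]:
    # The set of categories kept is treated as a hyperparameter here, so the
    # width of the output feature vector changes with it.
    return one_hot_names("color", color_categories) + ["height", "weight"]


names_small = output_feature_names(["red", "blue"])
names_large = output_feature_names(["red", "blue", "green", "other"])

# A hardcoded index silently points at a different feature once the
# hyperparameter changes.
hardcoded = 2
feature_small = names_small[hardcoded]
feature_large = names_large[hardcoded]

# Looking the feature up by name keeps working for both settings.
height_small = names_small.index("height")
height_large = names_large.index("height")
```

With two categories, index 2 is `height`; with four, the same index lands on one of the one-hot columns, while the lookup by name still finds `height`.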
Describe the solution you'd like
I believe a good way to handle this would be a DataDescription object, or something similar, within the DataContainer (or the Context?). This DataDescription object would contain, for every feature, a reference to its type, its source, and a name. It would be accessible to every step along the way and would evolve with every transformation applied.
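As a rough sketch of what I have in mind, assuming a flat list of features: the class names, the `feature_type` and `source` strings, and the `indexes_of_type` and `replace_feature` methods are all hypothetical, and how this would attach to the DataContainer is left open.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FeatureDescription:
    name: str
    feature_type: str  # e.g. "categorical", "real", "binary"
    source: str        # e.g. the file or sensor the feature came from


class DataDescription:
    """Hypothetical per-feature description carried alongside the data."""

    def __init__(self, features: Sequence[FeatureDescription]):
        self.features = list(features)

    def indexes_of_type(self, feature_type: str) -> List[int]:
        """Column indexes of every feature with the given type."""
        return [
            i for i, f in enumerate(self.features)
            if f.feature_type == feature_type
        ]

    def replace_feature(
        self, name: str, new_features: Sequence[FeatureDescription]
    ) -> "DataDescription":
        """Return a new description where `name` is replaced in place."""
        if all(f.name != name for f in self.features):
            raise KeyError(name)
        updated: List[FeatureDescription] = []
        for f in self.features:
            if f.name == name:
                updated.extend(new_features)
            else:
                updated.append(f)
        return DataDescription(updated)


before = DataDescription([
    FeatureDescription("color", "categorical", "survey.csv"),
    FeatureDescription("height", "real", "sensor"),
])

# A one-hot encoding step would update the description it passes on.
after = before.replace_feature("color", [
    FeatureDescription("color=red", "binary", "survey.csv"),
    FeatureDescription("color=blue", "binary", "survey.csv"),
])

real_before = before.indexes_of_type("real")
real_after = after.indexes_of_type("real")
```

A step that only handles real-valued features could then ask the description for its columns instead of assuming fixed positions.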
Considerations:
Describe alternatives you've considered
I'm open to alternative suggestions. My suggestion feels like it would be a heavy modification. I remember having a conversation with @guillaume-chevalier about this a while ago, but I couldn't find any issue that covers it.