aeon-toolkit / aeon

A toolkit for machine learning from time series
https://aeon-toolkit.org/
BSD 3-Clause "New" or "Revised" License

[ENH] Transformer interaction with 2D arrays/datatypes #149

Closed MatthewMiddlehurst closed 11 months ago

MatthewMiddlehurst commented 1 year ago

The transformations module does not currently interact with 2D arrays in an intuitive way for the classification/regression/clustering tasks. For example, a 2D numpy array is currently treated as a single multivariate series of shape (n_timepoints, n_series), but someone coming from sklearn and familiar with its framework will assume it holds multiple univariate series of shape (n_cases, n_timepoints).
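To make the ambiguity concrete, here is a minimal NumPy sketch. The array sizes are arbitrary and no aeon API is called; it only shows that the same 2D array can be read under either convention, and that the shape alone cannot tell them apart.

```python
import numpy as np

# A 2D array with 5 rows and 100 columns (sizes chosen only for illustration).
X = np.zeros((5, 100))

# Reading 1, sklearn-style: rows are cases, columns are time points.
n_cases, n_timepoints = X.shape
sklearn_reading = {"n_cases": n_cases, "n_timepoints": n_timepoints}

# Reading 2, the single-series convention described above:
# rows are time points, columns are series.
n_timepoints_alt, n_series = X.shape
series_reading = {"n_timepoints": n_timepoints_alt, "n_series": n_series}

print(sklearn_reading)  # {'n_cases': 5, 'n_timepoints': 100}
print(series_reading)   # {'n_timepoints': 5, 'n_series': 100}
```

Both readings are valid descriptions of the same array, which is why a silent conversion cannot know which one the user meant.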

If this mistake is made, there may be no indication of any problem, because the base class will convert the input to a usable format regardless of intention. For example, this can result in multiple TSFreshRelevantFeatureExtractor objects being fitted on many single series, which makes no sense at all. Even in cases where the output is not affected, e.g. ROCKET, it still makes the transformation grossly inefficient.

In my opinion, the growth and usability of the module are currently constrained by trying to force two distinct learning tasks into a single framework. It is not sensible to have the class infer which task the input is for when the tasks share valid input datatypes but use them in different ways.
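As a rough illustration of what "explicit rather than inferred" could look like, the sketch below rejects anything that is not already a 3D collection instead of guessing. The helper name `require_collection` is hypothetical and is not part of aeon; it is only one possible shape for such a check, not the design that was adopted.

```python
import numpy as np


def require_collection(X):
    """Hypothetical check: accept only (n_cases, n_channels, n_timepoints).

    Raises instead of guessing how a 2D array should be interpreted.
    """
    X = np.asarray(X)
    if X.ndim != 3:
        raise ValueError(
            "Expected a 3D array of shape (n_cases, n_channels, n_timepoints), "
            f"but got a {X.ndim}D array with shape {X.shape}."
        )
    return X


# A 3D collection passes through unchanged.
ok = require_collection(np.zeros((5, 1, 100)))

# An ambiguous 2D array is rejected with an explicit error message.
try:
    require_collection(np.zeros((5, 100)))
    rejected = False
except ValueError:
    rejected = True
```

The trade-off is that users must reshape their data themselves, but a wrong interpretation can no longer pass silently.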

This still needs further discussion on what actions to take (if any). In the last developer meeting, there was general agreement that the current implicit conversion of 2D data is not the design we want. A few options:

TonyBagnall commented 1 year ago

Good summary, I will form lists of affected transformers. We could take the opportunity to change the name panel too ...

hadifawaz1999 commented 1 year ago

The same constraint applies when using the SAX algorithm. If the user wants to use the algorithm with a 2D numpy array, they have to add an axis themselves so that the shape becomes (n_cases, n_channels, time_series_length). For the deep learners, I remember there was an automated way to do that internally. Also, the use of pandas Series could be removed; I think most people will use numpy here, especially since there is no option to return a numpy array.
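For reference, adding the channel axis by hand is a one-liner in NumPy. The sketch below assumes univariate data stored as (n_cases, time_series_length), with arbitrary sizes; it does not call SAX or any other aeon code.

```python
import numpy as np

# Univariate collection stored as 2D: (n_cases, time_series_length).
X_2d = np.zeros((5, 100))

# Insert a channel axis in the middle: (n_cases, n_channels, time_series_length).
X_3d = X_2d[:, np.newaxis, :]

print(X_3d.shape)  # (5, 1, 100)
```

`np.expand_dims(X_2d, axis=1)` gives the same result.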

TonyBagnall commented 11 months ago

This has been resolved by #709. Any residual issues related to this should be raised as specific, separate issues, I think.